* [PATCH RFC 00/12] IOMMUFD Generic interface
@ 2022-03-18 17:27 ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

iommufd is the user API to control the IOMMU subsystem as it relates to
managing IO page tables that point at user space memory.

It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
container), which is the VFIO-specific interface for a similar idea.

We see a broad need for extended features, some being highly IOMMU device
specific:
 - Binding iommu_domain's to PASID/SSID
 - Userspace page tables, for ARM, x86 and S390
 - Kernel bypass'd invalidation of user page tables
 - Re-use of the KVM page table in the IOMMU
 - Dirty page tracking in the IOMMU
 - Runtime Increase/Decrease of IOPTE size
 - PRI support with faults resolved in userspace

There is also a need to access these features beyond just VFIO: VDPA, for
instance, and other classes of accelerator HW are touching on these areas
now too.

The v1 series proposed re-using the VFIO type 1 data structure; however, it
was suggested that if we are doing this big update then we should also
come up with a data structure that solves the limitations that VFIO type 1
has. Notably this addresses:

 - Multiple IOAS/'containers' and multiple domains inside a single FD

 - Single-pin operation no matter how many domains and containers use
   a page

 - A fine-grained locking scheme supporting user-managed concurrency for
   multi-threaded map/unmap

 - A pre-registration mechanism to optimize vIOMMU use cases by
   pre-pinning pages

 - Extended ioctl API that can manage these new objects and exposes
   domains directly to user space

 - Domains are sharable between subsystems, e.g. VFIO and VDPA

The bulk of this code is a new data structure design to track how the
IOVAs are mapped to PFNs.

iommufd intends to be general and consumable by any driver that wants to
DMA to userspace. From a driver perspective it can largely be dropped in
in place of iommu_attach_device() and provides a uniform full feature set
to all consumers.
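
As a sketch of that driver-facing flow (the helper names below come from the
kAPI patch later in this series; treat the exact signatures and flags as
illustrative assumptions rather than the final interface):

	/*
	 * Hypothetical probe path of a consuming driver, replacing a direct
	 * iommu_attach_device() call. iommufd_fd and ioas_id would arrive
	 * from userspace through the driver's own uAPI.
	 */
	struct iommufd_device *idev;
	u32 pt_id = ioas_id;
	int rc;

	idev = iommufd_bind_pci_device(iommufd_fd, pdev, &device_id);
	if (IS_ERR(idev))
		return PTR_ERR(idev);

	rc = iommufd_device_attach(idev, &pt_id, 0);
	if (rc)
		iommufd_unbind_device(idev);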

As this is a larger project, this series is the first step. It provides
the iommufd "generic interface", which is designed to be suitable for
applications like DPDK and VMM flows that are not optimized to specific HW
scenarios. It is close to being a drop-in replacement for the existing
VFIO type 1.
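
For a DPDK/VMM style user the flow is roughly the below. IOMMU_IOAS_ALLOC is
part of the uAPI added here; the map ioctl name and the field names are
assumptions for illustration, not a verbatim copy of the new uapi header:

	#include <fcntl.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	static int map_buffer(void *buf, size_t len)
	{
		int iommufd = open("/dev/iommu", O_RDWR);
		struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
		struct iommu_ioas_map map = { .size = sizeof(map) };

		if (iommufd < 0 || ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc))
			return -1;

		/* Assumed fields: map buf into the IOAS, kernel picks the IOVA */
		map.ioas_id = alloc.out_ioas_id;
		map.user_va = (uintptr_t)buf;
		map.length = len;
		return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
	}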

This is part two of three in an initial sequence:
 - Move IOMMU Group security into the iommu layer
   https://lore.kernel.org/linux-iommu/20220218005521.172832-1-baolu.lu@linux.intel.com/
 * Generic IOMMUFD implementation
 - VFIO ability to consume IOMMUFD
   An early exploration of this is available here:
    https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6

Various parts of the above extended features are currently in WIP stages
to define how their IOCTL interfaces should work.

At this point, using the draft VFIO series, unmodified qemu has been
tested to operate using iommufd on x86 and ARM systems.

Several people have contributed directly to this work: Eric Auger, Kevin
Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have participated in the
discussions that led here and provided ideas. Thanks to all!

This is on github: https://github.com/jgunthorpe/linux/commits/iommufd

# S390 in-kernel page table walker
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>
# AMD Dirty page tracking
Cc: Joao Martins <joao.m.martins@oracle.com>
# ARM SMMU Dirty page tracking
Cc: Keqian Zhu <zhukeqian1@huawei.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
# ARM SMMU nesting
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
# Map/unmap performance
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
# VDPA
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
# Power
Cc: David Gibson <david@gibson.dropbear.id.au>
# vfio
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
# iommu
Cc: iommu@lists.linux-foundation.org
# Collaborators
Cc: "Chaitanya Kulkarni" <chaitanyak@nvidia.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Jason Gunthorpe (11):
  interval-tree: Add a utility to iterate over spans in an interval tree
  iommufd: File descriptor, context, kconfig and makefiles
  kernel/user: Allow user::locked_vm to be usable for iommufd
  iommufd: PFN handling for iopt_pages
  iommufd: Algorithms for PFN storage
  iommufd: Data structure to provide IOVA to PFN mapping
  iommufd: IOCTLs for the io_pagetable
  iommufd: Add a HW pagetable object
  iommufd: Add kAPI toward external drivers
  iommufd: vfio container FD ioctl compatibility
  iommufd: Add a selftest

Kevin Tian (1):
  iommufd: Overview documentation

 Documentation/userspace-api/index.rst         |    1 +
 .../userspace-api/ioctl/ioctl-number.rst      |    1 +
 Documentation/userspace-api/iommufd.rst       |  224 +++
 MAINTAINERS                                   |   10 +
 drivers/iommu/Kconfig                         |    1 +
 drivers/iommu/Makefile                        |    2 +-
 drivers/iommu/iommufd/Kconfig                 |   22 +
 drivers/iommu/iommufd/Makefile                |   13 +
 drivers/iommu/iommufd/device.c                |  274 ++++
 drivers/iommu/iommufd/hw_pagetable.c          |  142 ++
 drivers/iommu/iommufd/io_pagetable.c          |  890 +++++++++++
 drivers/iommu/iommufd/io_pagetable.h          |  170 +++
 drivers/iommu/iommufd/ioas.c                  |  252 ++++
 drivers/iommu/iommufd/iommufd_private.h       |  231 +++
 drivers/iommu/iommufd/iommufd_test.h          |   65 +
 drivers/iommu/iommufd/main.c                  |  346 +++++
 drivers/iommu/iommufd/pages.c                 | 1321 +++++++++++++++++
 drivers/iommu/iommufd/selftest.c              |  495 ++++++
 drivers/iommu/iommufd/vfio_compat.c           |  401 +++++
 include/linux/interval_tree.h                 |   41 +
 include/linux/iommufd.h                       |   50 +
 include/linux/sched/user.h                    |    2 +-
 include/uapi/linux/iommufd.h                  |  223 +++
 kernel/user.c                                 |    1 +
 lib/interval_tree.c                           |   98 ++
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/iommu/.gitignore      |    2 +
 tools/testing/selftests/iommu/Makefile        |   11 +
 tools/testing/selftests/iommu/config          |    2 +
 tools/testing/selftests/iommu/iommufd.c       | 1225 +++++++++++++++
 30 files changed, 6515 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/userspace-api/iommufd.rst
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/device.c
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/ioas.c
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 drivers/iommu/iommufd/pages.c
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c
 create mode 100644 include/linux/iommufd.h
 create mode 100644 include/uapi/linux/iommufd.h
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c


base-commit: d1c716ed82a6bf4c35ba7be3741b9362e84cd722
-- 
2.35.1


* [PATCH RFC 01/12] interval-tree: Add a utility to iterate over spans in an interval tree
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The span iterator travels over the indexes of the interval_tree, not the
nodes, and classifies spans of indexes as either 'used' or 'hole'.

'used' spans are fully covered by nodes in the tree and 'hole' spans have
no node intersecting the span.

This is done greedily such that spans are maximally sized and every
iteration step switches between used/hole.

As an example, a trivial allocator can be written as:

	for (interval_tree_span_iter_first(&span, itree, 0, ULONG_MAX);
	     !interval_tree_span_iter_done(&span);
	     interval_tree_span_iter_next(&span))
		if (span.is_hole &&
		    span.last_hole - span.start_hole >= allocation_size - 1)
			return span.start_hole;

All of the tricky boundary conditions are handled by the library code.

The following iommufd patches have several algorithms for two of iommufd's
overlapping-node interval trees that are significantly simplified with
this kind of iteration primitive. As it seems generally useful, put it
into lib/.
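
For the reverse question, a caller can total up how much of a range is already
covered using only the fields this patch exports; a minimal sketch:

	unsigned long covered = 0;

	for (interval_tree_span_iter_first(&span, itree, first, last);
	     !interval_tree_span_iter_done(&span);
	     interval_tree_span_iter_next(&span))
		if (!span.is_hole)
			covered += span.last_used - span.start_used + 1;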

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/linux/interval_tree.h | 41 +++++++++++++++
 lib/interval_tree.c           | 98 +++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/include/linux/interval_tree.h b/include/linux/interval_tree.h
index 288c26f50732d7..3817af0fa54028 100644
--- a/include/linux/interval_tree.h
+++ b/include/linux/interval_tree.h
@@ -27,4 +27,45 @@ extern struct interval_tree_node *
 interval_tree_iter_next(struct interval_tree_node *node,
 			unsigned long start, unsigned long last);
 
+/*
+ * This iterator travels over spans in an interval tree. It does not return
+ * nodes but classifies each span as either a hole, where no nodes intersect, or
+ * a used, which is fully covered by nodes. Each iteration step toggles between
+ * hole and used until the entire range is covered. The returned spans always
+ * fully cover the requested range.
+ *
+ * The iterator is greedy, it always returns the largest hole or used possible,
+ * consolidating all consecutive nodes.
+ *
+ * Only is_hole, start_hole/used and last_hole/used are part of the external
+ * interface.
+ */
+struct interval_tree_span_iter {
+	struct interval_tree_node *nodes[2];
+	unsigned long first_index;
+	unsigned long last_index;
+	union {
+		unsigned long start_hole;
+		unsigned long start_used;
+	};
+	union {
+		unsigned long last_hole;
+		unsigned long last_used;
+	};
+	/* 0 == used, 1 == is_hole, -1 == done iteration */
+	int is_hole;
+};
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *state,
+				   struct rb_root_cached *itree,
+				   unsigned long first_index,
+				   unsigned long last_index);
+void interval_tree_span_iter_next(struct interval_tree_span_iter *state);
+
+static inline bool
+interval_tree_span_iter_done(struct interval_tree_span_iter *state)
+{
+	return state->is_hole == -1;
+}
+
 #endif	/* _LINUX_INTERVAL_TREE_H */
diff --git a/lib/interval_tree.c b/lib/interval_tree.c
index 593ce56ece5050..5dff0da020923f 100644
--- a/lib/interval_tree.c
+++ b/lib/interval_tree.c
@@ -15,3 +15,101 @@ EXPORT_SYMBOL_GPL(interval_tree_insert);
 EXPORT_SYMBOL_GPL(interval_tree_remove);
 EXPORT_SYMBOL_GPL(interval_tree_iter_first);
 EXPORT_SYMBOL_GPL(interval_tree_iter_next);
+
+static void
+interval_tree_span_iter_next_gap(struct interval_tree_span_iter *state)
+{
+	struct interval_tree_node *cur = state->nodes[1];
+
+	/*
+	 * Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a
+	 * contiguous span of nodes. This makes nodes[0]->last the end of that
+	 * contiguous span of valid indexes that started at the original
+	 * nodes[1]->start. nodes[1] is now the next node and a hole is between
+	 * nodes[0] and [1].
+	 */
+	state->nodes[0] = cur;
+	do {
+		if (cur->last > state->nodes[0]->last)
+			state->nodes[0] = cur;
+		cur = interval_tree_iter_next(cur, state->first_index,
+					      state->last_index);
+	} while (cur && (state->nodes[0]->last >= cur->start ||
+			 state->nodes[0]->last + 1 == cur->start));
+	state->nodes[1] = cur;
+}
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *iter,
+				   struct rb_root_cached *itree,
+				   unsigned long first_index,
+				   unsigned long last_index)
+{
+	iter->first_index = first_index;
+	iter->last_index = last_index;
+	iter->nodes[0] = NULL;
+	iter->nodes[1] =
+		interval_tree_iter_first(itree, first_index, last_index);
+	if (!iter->nodes[1]) {
+		/* No nodes intersect the span, whole span is hole */
+		iter->start_hole = first_index;
+		iter->last_hole = last_index;
+		iter->is_hole = 1;
+		return;
+	}
+	if (iter->nodes[1]->start > first_index) {
+		/* Leading hole on first iteration */
+		iter->start_hole = first_index;
+		iter->last_hole = iter->nodes[1]->start - 1;
+		iter->is_hole = 1;
+		interval_tree_span_iter_next_gap(iter);
+		return;
+	}
+
+	/* Starting inside a used */
+	iter->start_used = first_index;
+	iter->is_hole = 0;
+	interval_tree_span_iter_next_gap(iter);
+	iter->last_used = iter->nodes[0]->last;
+	if (iter->last_used >= last_index) {
+		iter->last_used = last_index;
+		iter->nodes[0] = NULL;
+		iter->nodes[1] = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_first);
+
+void interval_tree_span_iter_next(struct interval_tree_span_iter *iter)
+{
+	if (!iter->nodes[0] && !iter->nodes[1]) {
+		iter->is_hole = -1;
+		return;
+	}
+
+	if (iter->is_hole) {
+		iter->start_used = iter->last_hole + 1;
+		iter->last_used = iter->nodes[0]->last;
+		if (iter->last_used >= iter->last_index) {
+			iter->last_used = iter->last_index;
+			iter->nodes[0] = NULL;
+			iter->nodes[1] = NULL;
+		}
+		iter->is_hole = 0;
+		return;
+	}
+
+	if (!iter->nodes[1]) {
+		/* Trailing hole */
+		iter->start_hole = iter->nodes[0]->last + 1;
+		iter->last_hole = iter->last_index;
+		iter->nodes[0] = NULL;
+		iter->is_hole = 1;
+		return;
+	}
+
+	/* must have both nodes[0] and [1], interior hole */
+	iter->start_hole = iter->nodes[0]->last + 1;
+	iter->last_hole = iter->nodes[1]->start - 1;
+	iter->is_hole = 1;
+	interval_tree_span_iter_next_gap(iter);
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_next);
-- 
2.35.1


* [PATCH RFC 02/12] iommufd: Overview documentation
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

From: Kevin Tian <kevin.tian@intel.com>

Add iommufd to the documentation tree.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 224 ++++++++++++++++++++++++
 2 files changed, 225 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index a61eac0c73f825..3815f013e4aebd 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
    ebpf/index
    ioctl/index
    iommu
+   iommufd
    media/index
    sysfs-platform_profile
    vduse
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 00000000000000..38035b3822fd23
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,224 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=======
+IOMMUFD
+=======
+
+:Author: Jason Gunthorpe
+:Author: Kevin Tian
+
+Overview
+========
+
+IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
+IO page tables that point at user space memory. It intends to be general and
+consumable by any driver that wants to DMA to userspace. Those drivers are
+expected to deprecate any proprietary IOMMU logic, if existing (e.g.
+vfio_iommu_type1.c).
+
+At a minimum iommufd provides universal support for managing I/O address spaces
+and I/O page tables for all IOMMUs, with room in the design to add non-generic
+features to cater to specific hardware functionality.
+
+In this context the capital letter (IOMMUFD) refers to the subsystem while the
+small letter (iommufd) refers to the file descriptors created via /dev/iommu to
+run the user API over.
+
+Key Concepts
+============
+
+User Visible Objects
+--------------------
+
+Following IOMMUFD objects are exposed to userspace:
+
+- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS) allowing map/unmap
+  of user space memory into ranges of I/O Virtual Address (IOVA).
+
+  The IOAS is a functional replacement for the VFIO container, and like the VFIO
+  container copies its IOVA map to a list of iommu_domains held within it.
+
+- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
+  external driver.
+
+- IOMMUFD_OBJ_HW_PAGETABLE, wrapping an actual hardware I/O page table (i.e. a
+  single struct iommu_domain) managed by the iommu driver.
+
+  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and the
+  IOAS will synchronize its mapping with each member HW_PAGETABLE.
+
+All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
+
+Linkages between user-visible objects and external kernel datastructures are
+reflected by dotted line arrows below, with numbers referring to certain
+operations creating the objects and links::
+
+  _________________________________________________________
+ |                         iommufd                         |
+ |       [1]                                               |
+ |  _________________                                      |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |        [3]                 [2]      |
+ | |                 |    ____________         __________  |
+ | |      IOAS       |<--|            |<------|          | |
+ | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
+ | |                 |   |____________|       |__________| |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |_________________|         |                   |       |
+ |         |                   |                   |       |
+ |_________|___________________|___________________|_______|
+           |                   |                   |
+           |              _____v______      _______v_____
+           | PFN storage |            |    |             |
+           |------------>|iommu_domain|    |struct device|
+                         |____________|    |_____________|
+
+1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. One iommufd can
+   hold multiple IOAS objects. IOAS is the most generic object and does not
+   expose interfaces that are specific to single IOMMU drivers. All operations
+   on the IOAS must operate equally on each of the iommu_domains that are inside
+   it.
+
+2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
+   to bind a device to an iommufd. The external driver is expected to implement
+   proper uAPI for userspace to initiate the binding operation. Successful
+   completion of this operation establishes the desired DMA ownership over the
+   device. The external driver must set driver_managed_dma flag and must not
+   touch the device until this operation succeeds.
+
+3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
+   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
+   allows userspace to initiate the attaching operation. If a compatible
+   pagetable already exists then it is reused for the attachment. Otherwise a
+   new pagetable object (and a new iommu_domain) is created. Successful
+   completion of this operation sets up the linkages among an IOAS, a device and
+   an iommu_domain. Once this completes the device could do DMA.
+
+   Every iommu_domain inside the IOAS is also represented to userspace as a
+   HW_PAGETABLE object.
+
+   NOTE: Future additions to IOMMUFD will provide an API to create and
+   manipulate the HW_PAGETABLE directly.
+
+One device can only bind to one iommufd (due to DMA ownership claim) and attach
+to at most one IOAS object (no support of PASID yet).
+
+Currently only PCI devices are allowed.
+
+Kernel Datastructure
+--------------------
+
+User visible objects are backed by following datastructures:
+
+- iommufd_ioas for IOMMUFD_OBJ_IOAS.
+- iommufd_device for IOMMUFD_OBJ_DEVICE.
+- iommufd_hw_pagetable for IOMMUFD_OBJ_HW_PAGETABLE.
+
+Several terminologies when looking at these datastructures:
+
+- Automatic domain, referring to an iommu domain created automatically when
+  attaching a device to an IOAS object. This is compatible with the semantics of
+  VFIO type1.
+
+- Manual domain, referring to an iommu domain designated by the user as the
+  target pagetable to be attached to by a device. Though currently no user API
+  for userspace to directly create such domain, the datastructure and algorithms
+  are ready for that usage.
+
+- In-kernel user, referring to something like a VFIO mdev that is accessing the
+  IOAS and using a 'struct page \*' for CPU based access. Such users require an
+  isolation granularity smaller than what an iommu domain can afford. They must
+  manually enforce the IOAS constraints on DMA buffers before those buffers can
+  be accessed by mdev. Though no kernel API for an external driver to bind a
+  mdev, the datastructure and algorithms are ready for such usage.
+
+iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are
+mapped to memory pages, composed of:
+
+- struct io_pagetable holding the IOVA map
+- struct iopt_areas representing populated portions of IOVA
+- struct iopt_pages representing the storage of PFNs
+- struct iommu_domain representing the IO page table in the IOMMU
+- struct iopt_pages_user representing in-kernel users of PFNs
+- struct xarray pinned_pfns holding a list of pages pinned by
+   in-kernel Users
+
+The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
+represents a logical linear array of full PFNs. PFNs are stored in a tiered
+scheme:
+
+ 1) iopt_pages::pinned_pfns xarray
+ 2) An iommu_domain
+ 3) The origin of the PFNs, i.e. the userspace pointer
+
+PFNs have to be copied between all combinations of tiers, depending on the
+configuration (i.e. attached domains and in-kernel users).
+
+An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a
+list of iommu_domains that mirror the IOVA to PFN map.
+
+Multiple io_pagetable's, through their iopt_area's, can share a single
+iopt_pages which avoids multi-pinning and double accounting of page consumption.
+
+iommufd_ioas is sharable between subsystems, e.g. VFIO and VDPA, as long as
+devices managed by different subsystems are bound to a same iommufd.
+
+IOMMUFD User API
+================
+
+.. kernel-doc:: include/uapi/linux/iommufd.h
+
+IOMMUFD Kernel API
+==================
+
+The IOMMUFD kAPI is device-centric with group-related tricks managed behind the
+scene. This allows the external driver calling such kAPI to implement a simple
+device-centric uAPI for connecting its device to an iommufd, instead of
+explicitly imposing the group semantics in its uAPI (as VFIO does).
+
+.. kernel-doc:: drivers/iommu/iommufd/device.c
+   :export:
+
+VFIO and IOMMUFD
+----------------
+
+Connecting VFIO device to iommufd can be done in two approaches.
+
+First is a VFIO compatible way by directly implementing the /dev/vfio/vfio
+container IOCTLs by mapping them into io_pagetable operations. Doing so allows
+the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to
+/dev/iommufd or extending VFIO to SET_CONTAINER using an iommufd instead of a
+container fd.
+
+The second approach directly extends VFIO to support a new set of device-centric
+user APIs based on the aforementioned IOMMUFD kernel API. It requires userspace
+changes but better matches the IOMMUFD API semantics and makes it easier to
+support new iommufd features than the first approach.
+
+Currently both approaches are still work-in-progress.
+
+There are still a few gaps to be resolved to catch up with VFIO type1, as
+documented in iommufd_vfio_check_extension().
+
+Future TODOs
+============
+
+Currently IOMMUFD supports only kernel-managed I/O page table, similar to VFIO
+type1. New features on the radar include:
+
+ - Binding iommu_domain's to PASID/SSID
+ - Userspace page tables, for ARM, x86 and S390
+ - Kernel bypass'd invalidation of user page tables
+ - Re-use of the KVM page table in the IOMMU
+ - Dirty page tracking in the IOMMU
+ - Runtime Increase/Decrease of IOPTE size
+ - PRI support with faults resolved in userspace
-- 
2.35.1


* [PATCH RFC 03/12] iommufd: File descriptor, context, kconfig and makefiles
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the basic infrastructure of a new miscdevice to hold the iommufd
IOCTL API.

It provides:
 - A miscdevice to create file descriptors to run the IOCTL interface over

 - A table based ioctl dispatch and centralized extendable pre-validation
   step

 - An xarray mapping user IDs to kernel objects. The design has multiple
   inter-related objects held within a single IOMMUFD fd

 - A simple usage count to build a graph of object relations and protect
   against hostile userspace racing ioctls

The only IOCTL provided in this patch is the generic 'destroy any object
by handle' operation.
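
From userspace the operation looks roughly like the below; struct iommu_destroy
is defined in the new include/uapi/linux/iommufd.h, and the field names here
are an assumption for illustration:

	struct iommu_destroy cmd = {
		.size = sizeof(cmd),
		.id = object_id,	/* handle previously returned by the kernel */
	};

	/* Fails if the handle is unknown or the object is still in use */
	if (ioctl(iommufd, IOMMU_DESTROY, &cmd))
		return -1;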

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  10 +
 drivers/iommu/Kconfig                         |   1 +
 drivers/iommu/Makefile                        |   2 +-
 drivers/iommu/iommufd/Kconfig                 |  13 +
 drivers/iommu/iommufd/Makefile                |   5 +
 drivers/iommu/iommufd/iommufd_private.h       |  95 ++++++
 drivers/iommu/iommufd/main.c                  | 305 ++++++++++++++++++
 include/uapi/linux/iommufd.h                  |  55 ++++
 9 files changed, 486 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 include/uapi/linux/iommufd.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index e6fce2cbd99ed4..4a041dfc61fe95 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -105,6 +105,7 @@ Code  Seq#    Include File                                           Comments
 '8'   all                                                            SNP8023 advanced NIC card
                                                                      <mailto:mcr@solidum.com>
 ';'   64-7F  linux/vfio.h
+';'   80-FF  linux/iommufd.h
 '='   00-3f  uapi/linux/ptp_clock.h                                  <mailto:richardcochran@gmail.com>
 '@'   00-0F  linux/radeonfb.h                                        conflict!
 '@'   00-0F  drivers/video/aty/aty128fb.c                            conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 1ba1e4af2cbc80..23a9c631051ee8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10038,6 +10038,16 @@ L:	linux-mips@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/sgi/ioc3-eth.c
 
+IOMMU FD
+M:	Jason Gunthorpe <jgg@nvidia.com>
+M:	Kevin Tian <kevin.tian@intel.com>
+L:	iommu@lists.linux-foundation.org
+S:	Maintained
+F:	Documentation/userspace-api/iommufd.rst
+F:	drivers/iommu/iommufd/
+F:	include/uapi/linux/iommufd.h
+F:	include/linux/iommufd.h
+
 IOMAP FILESYSTEM LIBRARY
 M:	Christoph Hellwig <hch@infradead.org>
 M:	Darrick J. Wong <djwong@kernel.org>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 3eb68fa1b8cc02..754d2a9ff64623 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -177,6 +177,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
 	bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index bc7f730edbb0be..6b38d12692b213 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/
+obj-y += amd/ intel/ arm/ iommufd/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 00000000000000..fddd453bb0e764
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+	tristate "IOMMU Userspace API"
+	select INTERVAL_TREE
+	select IOMMU_API
+	default n
+	help
+	  Provides /dev/iommu the user API to control the IOMMU subsystem as
+	  it relates to managing IO page tables that point at user space memory.
+
+	  This would commonly be used in combination with VFIO.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index 00000000000000..a07a8cffe937c6
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+iommufd-y := \
+	main.o
+
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
new file mode 100644
index 00000000000000..2d0bba3965be1a
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __IOMMUFD_PRIVATE_H
+#define __IOMMUFD_PRIVATE_H
+
+#include <linux/rwsem.h>
+#include <linux/xarray.h>
+#include <linux/refcount.h>
+#include <linux/uaccess.h>
+
+struct iommufd_ctx {
+	struct file *filp;
+	struct xarray objects;
+};
+
+struct iommufd_ctx *iommufd_fget(int fd);
+
+struct iommufd_ucmd {
+	struct iommufd_ctx *ictx;
+	void __user *ubuffer;
+	u32 user_size;
+	void *cmd;
+};
+
+/* Copy the response in ucmd->cmd back to userspace. */
+static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
+				       size_t cmd_len)
+{
+	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
+			 min_t(size_t, ucmd->user_size, cmd_len)))
+		return -EFAULT;
+	return 0;
+}
+
+/*
+ * The objects form an acyclic graph through the users refcount. This enum must
+ * be sorted by type depth first so that destruction completes lower objects and
+ * releases the users refcount before reaching higher objects in the graph.
+ */
+enum iommufd_object_type {
+	IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_MAX,
+};
+
+/* Base struct for all objects with a userspace ID handle. */
+struct iommufd_object {
+	struct rw_semaphore destroy_rwsem;
+	refcount_t users;
+	enum iommufd_object_type type;
+	unsigned int id;
+};
+
+static inline bool iommufd_lock_obj(struct iommufd_object *obj)
+{
+	if (!down_read_trylock(&obj->destroy_rwsem))
+		return false;
+	if (!refcount_inc_not_zero(&obj->users)) {
+		up_read(&obj->destroy_rwsem);
+		return false;
+	}
+	return true;
+}
+
+struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
+					  enum iommufd_object_type type);
+static inline void iommufd_put_object(struct iommufd_object *obj)
+{
+	refcount_dec(&obj->users);
+	up_read(&obj->destroy_rwsem);
+}
+static inline void iommufd_put_object_keep_user(struct iommufd_object *obj)
+{
+	up_read(&obj->destroy_rwsem);
+}
+void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj);
+void iommufd_object_finalize(struct iommufd_ctx *ictx,
+			     struct iommufd_object *obj);
+bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
+				 struct iommufd_object *obj);
+struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
+					     size_t size,
+					     enum iommufd_object_type type);
+
+#define iommufd_object_alloc(ictx, ptr, type)                                  \
+	container_of(_iommufd_object_alloc(                                    \
+			     ictx,                                             \
+			     sizeof(*(ptr)) + BUILD_BUG_ON_ZERO(               \
+						      offsetof(typeof(*(ptr)), \
+							       obj) != 0),     \
+			     type),                                            \
+		     typeof(*(ptr)), obj)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
new file mode 100644
index 00000000000000..ae8db2f663004f
--- /dev/null
+++ b/drivers/iommu/iommufd/main.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ *
+ * iommufd provides control over the IOMMU HW objects created by IOMMU kernel
+ * drivers. IOMMU HW objects revolve around IO page tables that map incoming DMA
+ * addresses (IOVA) to CPU addresses.
+ *
+ * The API is divided into a general portion that is intended to work with any
+ * kernel IOMMU driver, and a device specific portion that is intended to be
+ * used with a userspace HW driver paired with the specific kernel driver. This
+ * mechanism allows all the unique functionalities in individual IOMMUs to be
+ * exposed to userspace control.
+ */
+#define pr_fmt(fmt) "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/bug.h>
+#include <uapi/linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+struct iommufd_object_ops {
+	void (*destroy)(struct iommufd_object *obj);
+};
+static struct iommufd_object_ops iommufd_object_ops[];
+
+struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
+					     size_t size,
+					     enum iommufd_object_type type)
+{
+	struct iommufd_object *obj;
+	int rc;
+
+	obj = kzalloc(size, GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+	obj->type = type;
+	init_rwsem(&obj->destroy_rwsem);
+	refcount_set(&obj->users, 1);
+
+	/*
+	 * Reserve an ID in the xarray but do not publish the pointer yet since
+	 * the caller hasn't initialized it yet. Once the pointer is published
+	 * in the xarray and visible to other threads we can't reliably destroy
+	 * it anymore, so the caller must complete all errorable operations
+	 * before calling iommufd_object_finalize().
+	 */
+	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
+		      xa_limit_32b, GFP_KERNEL);
+	if (rc)
+		goto out_free;
+	return obj;
+out_free:
+	kfree(obj);
+	return ERR_PTR(rc);
+}
+
+/*
+ * Allow concurrent access to the object. This should only be done once the
+ * system call that created the object is guaranteed to succeed.
+ */
+void iommufd_object_finalize(struct iommufd_ctx *ictx,
+			     struct iommufd_object *obj)
+{
+	void *old;
+
+	old = xa_store(&ictx->objects, obj->id, obj, GFP_KERNEL);
+	/* obj->id was returned from xa_alloc() so the xa_store() cannot fail */
+	WARN_ON(old);
+}
+
+/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */
+void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj)
+{
+	void *old;
+
+	old = xa_erase(&ictx->objects, obj->id);
+	WARN_ON(old);
+	kfree(obj);
+}
+
+struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
+					  enum iommufd_object_type type)
+{
+	struct iommufd_object *obj;
+
+	xa_lock(&ictx->objects);
+	obj = xa_load(&ictx->objects, id);
+	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
+	    !iommufd_lock_obj(obj))
+		obj = ERR_PTR(-ENOENT);
+	xa_unlock(&ictx->objects);
+	return obj;
+}
+
+/*
+ * The caller holds a users refcount and wants to destroy the object. Returns
+ * true if the object was destroyed. In all cases the caller no longer has a
+ * reference on obj.
+ */
+bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
+				 struct iommufd_object *obj)
+{
+	/*
+	 * The purpose of the destroy_rwsem is to ensure deterministic
+	 * destruction of objects used by external drivers and destroyed by this
+	 * function. Any temporary increment of the refcount must hold the read
+	 * side of this, such as during ioctl execution.
+	 */
+	down_write(&obj->destroy_rwsem);
+	xa_lock(&ictx->objects);
+	refcount_dec(&obj->users);
+	if (!refcount_dec_if_one(&obj->users)) {
+		xa_unlock(&ictx->objects);
+		up_write(&obj->destroy_rwsem);
+		return false;
+	}
+	__xa_erase(&ictx->objects, obj->id);
+	xa_unlock(&ictx->objects);
+
+	iommufd_object_ops[obj->type].destroy(obj);
+	up_write(&obj->destroy_rwsem);
+	kfree(obj);
+	return true;
+}
+
+static int iommufd_destroy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_destroy *cmd = ucmd->cmd;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	iommufd_put_object_keep_user(obj);
+	if (!iommufd_object_destroy_user(ucmd->ictx, obj))
+		return -EBUSY;
+	return 0;
+}
+
+static int iommufd_fops_open(struct inode *inode, struct file *filp)
+{
+	struct iommufd_ctx *ictx;
+
+	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL);
+	if (!ictx)
+		return -ENOMEM;
+
+	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1);
+	ictx->filp = filp;
+	filp->private_data = ictx;
+	return 0;
+}
+
+static int iommufd_fops_release(struct inode *inode, struct file *filp)
+{
+	struct iommufd_ctx *ictx = filp->private_data;
+	struct iommufd_object *obj;
+	unsigned long index = 0;
+	int cur = 0;
+
+	/* Destroy the graph from depth first */
+	while (cur < IOMMUFD_OBJ_MAX) {
+		xa_for_each(&ictx->objects, index, obj) {
+			if (obj->type != cur)
+				continue;
+			xa_erase(&ictx->objects, index);
+			if (WARN_ON(!refcount_dec_and_test(&obj->users)))
+				continue;
+			iommufd_object_ops[obj->type].destroy(obj);
+			kfree(obj);
+		}
+		cur++;
+	}
+	WARN_ON(!xa_empty(&ictx->objects));
+	kfree(ictx);
+	return 0;
+}
+
+union ucmd_buffer {
+	struct iommu_destroy destroy;
+};
+
+struct iommufd_ioctl_op {
+	unsigned int size;
+	unsigned int min_size;
+	unsigned int ioctl_num;
+	int (*execute)(struct iommufd_ucmd *ucmd);
+};
+
+#define IOCTL_OP(_ioctl, _fn, _struct, _last)                                  \
+	[_IOC_NR(_ioctl) - IOMMUFD_CMD_BASE] = {                               \
+		.size = sizeof(_struct) +                                      \
+			BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) <          \
+					  sizeof(_struct)),                    \
+		.min_size = offsetofend(_struct, _last),                       \
+		.ioctl_num = _ioctl,                                           \
+		.execute = _fn,                                                \
+	}
+static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
+	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+};
+
+static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct iommufd_ucmd ucmd = {};
+	struct iommufd_ioctl_op *op;
+	union ucmd_buffer buf;
+	unsigned int nr;
+	int ret;
+
+	ucmd.ictx = filp->private_data;
+	ucmd.ubuffer = (void __user *)arg;
+	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
+	if (ret)
+		return ret;
+
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return -ENOIOCTLCMD;
+	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
+	if (op->ioctl_num != cmd)
+		return -ENOIOCTLCMD;
+	if (ucmd.user_size < op->min_size)
+		return -EOPNOTSUPP;
+
+	ucmd.cmd = &buf;
+	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
+				    ucmd.user_size);
+	if (ret)
+		return ret;
+	ret = op->execute(&ucmd);
+	return ret;
+}
+
+static const struct file_operations iommufd_fops = {
+	.owner = THIS_MODULE,
+	.open = iommufd_fops_open,
+	.release = iommufd_fops_release,
+	.unlocked_ioctl = iommufd_fops_ioctl,
+};
+
+/**
+ * iommufd_fget - Acquires a reference to the iommufd file.
+ * @fd: file descriptor
+ *
+ * Returns a pointer to the iommufd_ctx, otherwise NULL.
+ */
+struct iommufd_ctx *iommufd_fget(int fd)
+{
+	struct file *filp;
+
+	filp = fget(fd);
+	if (!filp)
+		return NULL;
+
+	if (filp->f_op != &iommufd_fops) {
+		fput(filp);
+		return NULL;
+	}
+	return filp->private_data;
+}
+
+static struct iommufd_object_ops iommufd_object_ops[] = {
+};
+
+static struct miscdevice iommu_misc_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "iommu",
+	.fops = &iommufd_fops,
+	.nodename = "iommu",
+	.mode = 0660,
+};
+
+static int __init iommufd_init(void)
+{
+	int ret;
+
+	ret = misc_register(&iommu_misc_dev);
+	if (ret) {
+		pr_err("Failed to register misc device\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit iommufd_exit(void)
+{
+	misc_deregister(&iommu_misc_dev);
+}
+
+module_init(iommufd_init);
+module_exit(iommufd_exit);
+
+MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
new file mode 100644
index 00000000000000..2f7f76ec6db4cb
--- /dev/null
+++ b/include/uapi/linux/iommufd.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_H
+#define _UAPI_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl mechanism follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    a non-zero value in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics overflowed.
+ *
+ * As well as additional errnos within specific ioctls.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+#endif
-- 
2.35.1
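
To sketch how later patches are expected to use the allocation helpers above
(the object type, struct and setup function shown here are hypothetical, not
something this patch adds): an object embeds struct iommufd_object as its first
member, reserves an ID with iommufd_object_alloc(), and is only published with
iommufd_object_finalize() once no further step can fail; on error
iommufd_object_abort() unwinds the reservation so a half-built object is never
visible to userspace.

#include "iommufd_private.h"

/* Hypothetical object type; a real one would also get its own enum value and
 * a destroy op in iommufd_object_ops[].
 */
struct iommufd_foo {
	struct iommufd_object obj;	/* must be the first member */
	int example_setting;
};

static int iommufd_foo_create(struct iommufd_ucmd *ucmd, int setting)
{
	struct iommufd_foo *foo;
	int rc;

	/* Reserves an xarray ID but does not publish the pointer yet */
	foo = iommufd_object_alloc(ucmd->ictx, foo, IOMMUFD_OBJ_FOO);
	if (IS_ERR(foo))
		return PTR_ERR(foo);

	foo->example_setting = setting;
	rc = iommufd_foo_setup(foo);	/* hypothetical fallible setup step */
	if (rc)
		goto out_abort;

	/* A real command would return foo->obj.id via iommufd_ucmd_respond() */
	iommufd_object_finalize(ucmd->ictx, &foo->obj);
	return 0;

out_abort:
	iommufd_object_abort(ucmd->ictx, &foo->obj);
	return rc;
}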


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Following the pattern of io_uring, perf, skb, and bpf, iommufd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the correct accounting to use for ulimit because it is
per-user, and the ulimit is not supposed to be per-process. Other
places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
mm->locked_vm for accounting pinned pages, but this is only per-process
and inconsistent with the majority of the kernel.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
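For context, a sketch of the kind of accounting this export enables from a
module, loosely following the io_uring pattern; the helpers below are
illustrative only, and the iommufd code that actually does this (including the
CAP_IPC_LOCK bypass) arrives in a later patch of this series.

#include <linux/sched/user.h>
#include <linux/sched/signal.h>	/* rlimit() */

/* Illustrative only: charge pinned pages against the per-user MEMLOCK limit */
static int example_account_pinned(struct user_struct *user, unsigned long npages)
{
	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long cur_pages, new_pages;

	do {
		cur_pages = atomic_long_read(&user->locked_vm);
		new_pages = cur_pages + npages;
		if (new_pages > lock_limit)
			return -ENOMEM;
	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
				     new_pages) != cur_pages);
	return 0;
}

static void example_unaccount_pinned(struct user_struct *user, unsigned long npages)
{
	atomic_long_sub(npages, &user->locked_vm);
}

A module takes a reference on the user_struct with get_uid() while it holds
pinned pages and drops it with the newly exported free_uid() when it is done,
which is why the export is needed for a modular iommufd.
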
 include/linux/sched/user.h | 2 +-
 kernel/user.c              | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 00ed419dd46413..c47dae71dad3c8 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -24,7 +24,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING) || IS_ENABLED(CONFIG_IOMMUFD)
 	atomic_long_t locked_vm;
 #endif
 #ifdef CONFIG_WATCH_QUEUE
diff --git a/kernel/user.c b/kernel/user.c
index e2cf8c22b539a7..d667debeafd609 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
 		free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The top of the data structure provides an IO Address Space (IOAS) that is
similar to a VFIO container. The IOAS allows map/unmap of memory into
ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO
mdevs) can be attached to the IOAS to access the PFNs that those IOVA
areas cover.

The IO Address Space (IOAS) datastructure is composed of:
 - struct io_pagetable holding the IOVA map
 - struct iopt_areas representing populated portions of IOVA
 - struct iopt_pages representing the storage of PFNs
 - struct iommu_domain representing the IO page table in the system IOMMU
 - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
   mdevs)
 - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
   users

This patch introduces the lowest part of the datastructure - the movement
of PFNs in a tiered storage scheme:
 1) iopt_pages::pinned_pfns xarray
 2) An iommu_domain
 3) The origin of the PFNs, i.e. the userspace pointer

PFNs have to be copied between all combinations of tiers, depending on the
configuration.

The interface is an iterator called a 'pfn_reader' which determines which
tier each PFN is stored in and loads it into a list of PFNs held in a struct
pfn_batch.

Each step of the iterator will fill up the pfn_batch, then the caller can
use the pfn_batch to send the PFNs to the required destination. Repeating
this loop will read all the PFNs in an IOVA range.

The pfn_reader and pfn_batch also keep track of the pinned page accounting.

While PFNs are always stored and accessed as full PAGE_SIZE units the
iommu_domain tier can store with a sub-page offset/length to support
IOMMUs with a smaller IOPTE size than PAGE_SIZE.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
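To make the iteration contract concrete, a sketch of how a caller drains an
IOVA range through a pfn_batch using the helpers added below, written as if it
lived in pages.c next to them; the wrapper function itself is hypothetical.
batch_from_domain() stops when the batch fills or last_index is reached, and
batch->total_pfns tells the caller where to resume.

/* Hypothetical helper: walk the PFNs backing [start_index, last_index] */
static int example_drain_range(struct iommu_domain *domain,
			       struct iopt_area *area,
			       unsigned long start_index,
			       unsigned long last_index)
{
	struct pfn_batch batch;
	int rc;

	rc = batch_init(&batch, last_index - start_index + 1);
	if (rc)
		return rc;

	while (start_index <= last_index) {
		/* Fills as many contiguous PFN runs as fit in the batch */
		batch_from_domain(&batch, domain, area, start_index,
				  last_index);
		/* ... consume batch.pfns[i]/batch.npfns[i] for i < batch.end ... */
		start_index += batch.total_pfns;
	}

	batch_destroy(&batch, NULL);
	return 0;
}
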
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/io_pagetable.h    | 101 ++++
 drivers/iommu/iommufd/iommufd_private.h |  20 +
 drivers/iommu/iommufd/pages.c           | 723 ++++++++++++++++++++++++
 4 files changed, 846 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/pages.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index a07a8cffe937c6..05a0e91e30afad 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
-	main.o
+	main.o \
+	pages.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
new file mode 100644
index 00000000000000..94ca8712722d31
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ */
+#ifndef __IO_PAGETABLE_H
+#define __IO_PAGETABLE_H
+
+#include <linux/interval_tree.h>
+#include <linux/mutex.h>
+#include <linux/kref.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+
+struct iommu_domain;
+
+/*
+ * Each io_pagetable is composed of intervals of areas which cover regions of
+ * the iova that are backed by something. iova not covered by areas is not
+ * populated in the page table. Each area is fully populated with pages.
+ *
+ * iovas are in byte units, but must be iopt->iova_alignment aligned.
+ *
+ * pages can be NULL, this means some other thread is still working on setting
+ * up the area. When observed under the write side of the domain_rwsem a NULL
+ * pages must mean no domains are filled.
+ *
+ * storage_domain points at an arbitrary iommu_domain that is holding the PFNs
+ * for this area. It is locked by the pages->mutex. This simplifies the locking
+ * as the pages code can rely on the storage_domain without having to get the
+ * iopt->domains_rwsem.
+ *
+ * The io_pagetable::iova_rwsem protects node
+ * The iopt_pages::mutex protects pages_node
+ * iopt and iommu_prot are immutable
+ */
+struct iopt_area {
+	struct interval_tree_node node;
+	struct interval_tree_node pages_node;
+	/* How many bytes into the first page the area starts */
+	unsigned int page_offset;
+	struct io_pagetable *iopt;
+	struct iopt_pages *pages;
+	struct iommu_domain *storage_domain;
+	/* IOMMU_READ, IOMMU_WRITE, etc */
+	int iommu_prot;
+	atomic_t num_users;
+};
+
+static inline unsigned long iopt_area_index(struct iopt_area *area)
+{
+	return area->pages_node.start;
+}
+
+static inline unsigned long iopt_area_last_index(struct iopt_area *area)
+{
+	return area->pages_node.last;
+}
+
+static inline unsigned long iopt_area_iova(struct iopt_area *area)
+{
+	return area->node.start;
+}
+
+static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
+{
+	return area->node.last;
+}
+
+/*
+ * This holds a pinned page list for multiple areas of IO address space. The
+ * pages always originate from a linear chunk of userspace VA. Multiple
+ * io_pagetable's, through their iopt_area's, can share a single iopt_pages
+ * which avoids multi-pinning and double accounting of page consumption.
+ *
+ * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
+ * the start of the uptr and extend to npages. pages are pinned dynamically
+ * according to the intervals in the users_itree and domains_itree, npages
+ * records the current number of pages pinned.
+ */
+struct iopt_pages {
+	struct kref kref;
+	struct mutex mutex;
+	size_t npages;
+	size_t npinned;
+	size_t last_npinned;
+	struct task_struct *source_task;
+	struct mm_struct *source_mm;
+	struct user_struct *source_user;
+	void __user *uptr;
+	bool writable:1;
+	bool has_cap_ipc_lock:1;
+
+	struct xarray pinned_pfns;
+	/* Of iopt_pages_user::node */
+	struct rb_root_cached users_itree;
+	/* Of iopt_area::pages_node */
+	struct rb_root_cached domains_itree;
+};
+
+#endif
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2d0bba3965be1a..2f1301d39bba7c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,26 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+/*
+ * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
+ * domains and permits sharing of PFNs between io_pagetable instances. This
+ * supports both a design where IOAS's are 1:1 with a domain (eg because the
+ * domain is HW customized), or where the IOAS is 1:N with multiple generic
+ * domains.  The io_pagetable holds an interval tree of iopt_areas which point
+ * to shared iopt_pages which hold the pfns mapped to the page table.
+ *
+ * The locking order is domains_rwsem -> iova_rwsem -> pages::mutex
+ */
+struct io_pagetable {
+	struct rw_semaphore domains_rwsem;
+	struct xarray domains;
+	unsigned int next_domain_id;
+
+	struct rw_semaphore iova_rwsem;
+	struct rb_root_cached area_itree;
+	struct rb_root_cached reserved_iova_itree;
+};
+
 struct iommufd_ctx {
 	struct file *filp;
 	struct xarray objects;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
new file mode 100644
index 00000000000000..a75e1c73527920
--- /dev/null
+++ b/drivers/iommu/iommufd/pages.c
@@ -0,0 +1,723 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The iopt_pages is the center of the storage and motion of PFNs. Each
+ * iopt_pages represents a logical linear array of full PFNs. The array is 0
+ * based and has npages in it. Accessors use 'index' to refer to the entry in
+ * this logical array, regardless of its storage location.
+ *
+ * PFNs are stored in a tiered scheme:
+ *  1) iopt_pages::pinned_pfns xarray
+ *  2) An iommu_domain
+ *  3) The origin of the PFNs, i.e. the userspace pointer
+ *
+ * PFNs have to be copied between all combinations of tiers, depending on the
+ * configuration.
+ *
+ * When a PFN is taken out of the userspace pointer it is pinned exactly once.
+ * The storage locations of the PFN's index are tracked in the two interval
+ * trees. If no interval includes the index then it is not pinned.
+ *
+ * If users_itree includes the PFN's index then an in-kernel user has requested
+ * the page. The PFN is stored in the xarray so other requestors can continue to
+ * find it.
+ *
+ * If the domains_itree includes the PFN's index then an iommu_domain is storing
+ * the PFN and it can be read back using iommu_iova_to_phys(). To avoid
+ * duplicating storage the xarray is not used if only iommu_domains are using
+ * the PFN's index.
+ *
+ * As a general principle this is designed so that destroy never fails. This
+ * means removing an iommu_domain or releasing an in-kernel user will not fail
+ * due to insufficient memory. In practice this means some cases have to hold
+ * PFNs in the xarray even though they are also being stored in an iommu_domain.
+ *
+ * While the iopt_pages can use an iommu_domain as storage, it does not have an
+ * IOVA itself. Instead the iopt_area represents a range of IOVA and uses the
+ * iopt_pages as the PFN provider. Multiple iopt_areas can share the iopt_pages
+ * and reference their own slice of the PFN array, with sub page granularity.
+ *
+ * In this file the term 'last' indicates an inclusive and closed interval, eg
+ * [0,0] refers to a single PFN. 'end' means an open range, eg [0,0) refers to
+ * no PFNs.
+ */
+#include <linux/overflow.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+
+#include "io_pagetable.h"
+
+#define TEMP_MEMORY_LIMIT 65536
+#define BATCH_BACKUP_SIZE 32
+
+/*
+ * More memory makes pin_user_pages() and the batching more efficient, but as
+ * this is only a performance optimization don't try too hard to get it. A 64k
+ * allocation can hold about 26M of 4k pages and 13G of 2M pages in a
+ * pfn_batch. Various destroy paths cannot fail and provide a small amount of
+ * stack memory as a backup contingency. If backup_len is given this cannot
+ * fail.
+ */
+static void *temp_kmalloc(size_t *size, void *backup, size_t backup_len)
+{
+	void *res;
+
+	if (*size < backup_len)
+		return backup;
+	*size = min_t(size_t, *size, TEMP_MEMORY_LIMIT);
+	res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+	if (res)
+		return res;
+	*size = PAGE_SIZE;
+	if (backup_len) {
+		res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (res)
+			return res;
+		*size = backup_len;
+		return backup;
+	}
+	return kmalloc(*size, GFP_KERNEL);
+}
+
+static void iopt_pages_add_npinned(struct iopt_pages *pages, size_t npages)
+{
+	int rc;
+
+	rc = check_add_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
+}
+
+static void iopt_pages_sub_npinned(struct iopt_pages *pages, size_t npages)
+{
+	int rc;
+
+	rc = check_sub_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
+}
+
+/*
+ * index is the number of PAGE_SIZE units from the start of the area's
+ * iopt_pages. If the iova is sub page-size then the area has an iova that
+ * covers a portion of the first and last pages in the range.
+ */
+static unsigned long iopt_area_index_to_iova(struct iopt_area *area,
+					     unsigned long index)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
+	index -= iopt_area_index(area);
+	if (index == 0)
+		return iopt_area_iova(area);
+	return iopt_area_iova(area) - area->page_offset + index * PAGE_SIZE;
+}
+
+static unsigned long iopt_area_index_to_iova_last(struct iopt_area *area,
+						  unsigned long index)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
+	if (index == iopt_area_last_index(area))
+		return iopt_area_last_iova(area);
+	return iopt_area_iova(area) - area->page_offset +
+	       (index - iopt_area_index(area) + 1) * PAGE_SIZE - 1;
+}
+
+static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
+			       size_t size)
+{
+	size_t ret;
+
+	ret = iommu_unmap(domain, iova, size);
+	/*
+	 * It is a logic error in this code or a driver bug if the IOMMU unmaps
+	 * something other than exactly as requested.
+	 */
+	WARN_ON(ret != size);
+}
+
+static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
+						     unsigned long index)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(&pages->domains_itree, index, index);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, pages_node);
+}
+
+/*
+ * A simple datastructure to hold a vector of PFNs, optimized for contiguous
+ * PFNs. This is used as a temporary holding memory for shuttling pfns from one
+ * place to another. Generally everything is made more efficient if operations
+ * work on the largest possible grouping of pfns. eg fewer lock/unlock cycles,
+ * better cache locality, etc
+ */
+struct pfn_batch {
+	unsigned long *pfns;
+	u16 *npfns;
+	unsigned int array_size;
+	unsigned int end;
+	unsigned int total_pfns;
+};
+
+static void batch_clear(struct pfn_batch *batch)
+{
+	batch->total_pfns = 0;
+	batch->end = 0;
+	batch->pfns[0] = 0;
+	batch->npfns[0] = 0;
+}
+
+static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
+			size_t backup_len)
+{
+	const size_t elmsz = sizeof(*batch->pfns) + sizeof(*batch->npfns);
+	size_t size = max_pages * elmsz;
+
+	batch->pfns = temp_kmalloc(&size, backup, backup_len);
+	if (!batch->pfns)
+		return -ENOMEM;
+	batch->array_size = size / elmsz;
+	batch->npfns = (u16 *)(batch->pfns + batch->array_size);
+	batch_clear(batch);
+	return 0;
+}
+
+static int batch_init(struct pfn_batch *batch, size_t max_pages)
+{
+	return __batch_init(batch, max_pages, NULL, 0);
+}
+
+static void batch_init_backup(struct pfn_batch *batch, size_t max_pages,
+			      void *backup, size_t backup_len)
+{
+	__batch_init(batch, max_pages, backup, backup_len);
+}
+
+static void batch_destroy(struct pfn_batch *batch, void *backup)
+{
+	if (batch->pfns != backup)
+		kfree(batch->pfns);
+}
+
+/* true if the pfn could be added, false otherwise */
+static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+{
+	/* FIXME: U16 is too small */
+	if (batch->end &&
+	    pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
+	    batch->npfns[batch->end - 1] != U16_MAX) {
+		batch->npfns[batch->end - 1]++;
+		batch->total_pfns++;
+		return true;
+	}
+	if (batch->end == batch->array_size)
+		return false;
+	batch->total_pfns++;
+	batch->pfns[batch->end] = pfn;
+	batch->npfns[batch->end] = 1;
+	batch->end++;
+	return true;
+}
+
+/*
+ * Fill the batch with pfns from the domain. When the batch is full, or it
+ * reaches last_index, the function will return. The caller should use
+ * batch->total_pfns to determine the starting point for the next iteration.
+ */
+static void batch_from_domain(struct pfn_batch *batch,
+			      struct iommu_domain *domain,
+			      struct iopt_area *area, unsigned long index,
+			      unsigned long last_index)
+{
+	unsigned int page_offset = 0;
+	unsigned long iova;
+	phys_addr_t phys;
+
+	batch_clear(batch);
+	iova = iopt_area_index_to_iova(area, index);
+	if (index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	while (index <= last_index) {
+		/*
+		 * This is pretty slow, it would be nice to get the page size
+		 * back from the driver, or have the driver directly fill the
+		 * batch.
+		 */
+		phys = iommu_iova_to_phys(domain, iova) - page_offset;
+		if (!batch_add_pfn(batch, PHYS_PFN(phys)))
+			return;
+		iova += PAGE_SIZE - page_offset;
+		page_offset = 0;
+		index++;
+	}
+}
+
+static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
+			   struct iopt_area *area, unsigned long start_index)
+{
+	unsigned long last_iova = iopt_area_last_iova(area);
+	unsigned int page_offset = 0;
+	unsigned long start_iova;
+	unsigned long next_iova;
+	unsigned int cur = 0;
+	unsigned long iova;
+	int rc;
+
+	/* The first index might be a partial page */
+	if (start_index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	next_iova = iova = start_iova =
+		iopt_area_index_to_iova(area, start_index);
+	while (cur < batch->end) {
+		next_iova = min(last_iova + 1,
+				next_iova + batch->npfns[cur] * PAGE_SIZE -
+					page_offset);
+		rc = iommu_map(domain, iova,
+			       PFN_PHYS(batch->pfns[cur]) + page_offset,
+			       next_iova - iova, area->iommu_prot);
+		if (rc)
+			goto out_unmap;
+		iova = next_iova;
+		page_offset = 0;
+		cur++;
+	}
+	return 0;
+out_unmap:
+	if (start_iova != iova)
+		iommu_unmap_nofail(domain, start_iova, iova - start_iova);
+	return rc;
+}
+
+static void batch_from_xarray(struct pfn_batch *batch, struct xarray *xa,
+			      unsigned long start_index,
+			      unsigned long last_index)
+{
+	XA_STATE(xas, xa, start_index);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		if (!batch_add_pfn(batch, xa_to_value(entry)) ||
+		    start_index == last_index)
+			break;
+		start_index++;
+	}
+	rcu_read_unlock();
+}
+
+static void clear_xarray(struct xarray *xa, unsigned long index,
+			 unsigned long last)
+{
+	XA_STATE(xas, xa, index);
+	void *entry;
+
+	xas_lock(&xas);
+	xas_for_each (&xas, entry, last)
+		xas_store(&xas, NULL);
+	xas_unlock(&xas);
+}
+
+static int batch_to_xarray(struct pfn_batch *batch, struct xarray *xa,
+			   unsigned long start_index)
+{
+	XA_STATE(xas, xa, start_index);
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	do {
+		xas_lock(&xas);
+		while (cur < batch->end) {
+			void *old;
+
+			old = xas_store(&xas,
+					xa_mk_value(batch->pfns[cur] + npage));
+			if (xas_error(&xas))
+				break;
+			WARN_ON(old);
+			npage++;
+			if (npage == batch->npfns[cur]) {
+				npage = 0;
+				cur++;
+			}
+			xas_next(&xas);
+		}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	if (xas_error(&xas)) {
+		if (xas.xa_index != start_index)
+			clear_xarray(xa, start_index, xas.xa_index - 1);
+		return xas_error(&xas);
+	}
+	return 0;
+}
+
+static void batch_to_pages(struct pfn_batch *batch, struct page **pages)
+{
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	while (cur < batch->end) {
+		*pages++ = pfn_to_page(batch->pfns[cur] + npage);
+		npage++;
+		if (npage == batch->npfns[cur]) {
+			npage = 0;
+			cur++;
+		}
+	}
+}
+
+static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
+			     size_t npages)
+{
+	struct page **end = pages + npages;
+
+	for (; pages != end; pages++)
+		if (!batch_add_pfn(batch, page_to_pfn(*pages)))
+			break;
+}
+
+static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
+			unsigned int offset, size_t npages)
+{
+	unsigned int cur = 0;
+
+	while (offset) {
+		if (batch->npfns[cur] > offset)
+			break;
+		offset -= batch->npfns[cur];
+		cur++;
+	}
+
+	while (npages) {
+		size_t to_unpin =
+			min_t(size_t, npages, batch->npfns[cur] - offset);
+
+		unpin_user_page_range_dirty_lock(
+			pfn_to_page(batch->pfns[cur] + offset), to_unpin,
+			pages->writable);
+		iopt_pages_sub_npinned(pages, to_unpin);
+		cur++;
+		offset = 0;
+		npages -= to_unpin;
+	}
+}
+
+/*
+ * PFNs are stored in three places, in order of preference:
+ * - The iopt_pages xarray. This is only populated if there is an
+ *   iopt_pages_user
+ * - The iommu_domain under an area
+ * - The original PFN source, ie pages->source_mm
+ *
+ * This iterator reads the pfns optimizing to load according to the
+ * above order.
+ */
+struct pfn_reader {
+	struct iopt_pages *pages;
+	struct interval_tree_span_iter span;
+	struct pfn_batch batch;
+	unsigned long batch_start_index;
+	unsigned long batch_end_index;
+	unsigned long last_index;
+
+	struct page **upages;
+	size_t upages_len;
+	unsigned long upages_start;
+	unsigned long upages_end;
+
+	unsigned int gup_flags;
+};
+
+static void update_unpinned(struct iopt_pages *pages)
+{
+	unsigned long npages = pages->last_npinned - pages->npinned;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->has_cap_ipc_lock) {
+		pages->last_npinned = pages->npinned;
+		return;
+	}
+
+	if (WARN_ON(pages->npinned > pages->last_npinned) ||
+	    WARN_ON(atomic_long_read(&pages->source_user->locked_vm) < npages))
+		return;
+	atomic_long_sub(npages, &pages->source_user->locked_vm);
+	atomic64_sub(npages, &pages->source_mm->pinned_vm);
+	pages->last_npinned = pages->npinned;
+}
+
+/*
+ * Changes in the number of pages pinned are done after the pages have been
+ * read and processed. If the limit would be exceeded then the error unwind
+ * will unpin everything that was just pinned.
+ */
+static int update_pinned(struct iopt_pages *pages)
+{
+	unsigned long lock_limit;
+	unsigned long cur_pages;
+	unsigned long new_pages;
+	unsigned long npages;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->has_cap_ipc_lock) {
+		pages->last_npinned = pages->npinned;
+		return 0;
+	}
+
+	if (pages->npinned == pages->last_npinned)
+		return 0;
+
+	if (pages->npinned < pages->last_npinned) {
+		update_unpinned(pages);
+		return 0;
+	}
+
+	lock_limit =
+		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	npages = pages->npinned - pages->last_npinned;
+	do {
+		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
+		new_pages = cur_pages + npages;
+		if (new_pages > lock_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+	atomic64_add(npages, &pages->source_mm->pinned_vm);
+	pages->last_npinned = pages->npinned;
+	return 0;
+}
+
+static int pfn_reader_pin_pages(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+	unsigned long npages;
+	long rc;
+
+	if (!pfns->upages) {
+		/* All undone in pfn_reader_destroy() */
+		pfns->upages_len =
+			(pfns->last_index - pfns->batch_end_index + 1) *
+			sizeof(*pfns->upages);
+		pfns->upages = temp_kmalloc(&pfns->upages_len, NULL, 0);
+		if (!pfns->upages)
+			return -ENOMEM;
+
+		if (!mmget_not_zero(pages->source_mm)) {
+			kfree(pfns->upages);
+			pfns->upages = NULL;
+			return -EINVAL;
+		}
+		mmap_read_lock(pages->source_mm);
+	}
+
+	npages = min_t(unsigned long,
+		       pfns->span.last_hole - pfns->batch_end_index + 1,
+		       pfns->upages_len / sizeof(*pfns->upages));
+
+	/* FIXME use pin_user_pages_fast() if current == source_mm */
+	rc = pin_user_pages_remote(
+		pages->source_mm,
+		(uintptr_t)(pages->uptr + pfns->batch_end_index * PAGE_SIZE),
+		npages, pfns->gup_flags, pfns->upages, NULL, NULL);
+	if (rc < 0)
+		return rc;
+	if (WARN_ON(!rc))
+		return -EFAULT;
+	iopt_pages_add_npinned(pages, rc);
+	pfns->upages_start = pfns->batch_end_index;
+	pfns->upages_end = pfns->batch_end_index + rc;
+	return 0;
+}
+
+/*
+ * The batch can contain a mixture of pages that are still in use and pages that
+ * need to be unpinned. Unpin only pages that are not held anywhere else.
+ */
+static void iopt_pages_unpin(struct iopt_pages *pages, struct pfn_batch *batch,
+			     unsigned long index, unsigned long last)
+{
+	struct interval_tree_span_iter user_span;
+	struct interval_tree_span_iter area_span;
+
+	lockdep_assert_held(&pages->mutex);
+
+	for (interval_tree_span_iter_first(&user_span, &pages->users_itree, 0,
+					   last);
+	     !interval_tree_span_iter_done(&user_span);
+	     interval_tree_span_iter_next(&user_span)) {
+		if (!user_span.is_hole)
+			continue;
+
+		for (interval_tree_span_iter_first(
+			     &area_span, &pages->domains_itree,
+			     user_span.start_hole, user_span.last_hole);
+		     !interval_tree_span_iter_done(&area_span);
+		     interval_tree_span_iter_next(&area_span)) {
+			if (!area_span.is_hole)
+				continue;
+
+			batch_unpin(batch, pages, area_span.start_hole - index,
+				    area_span.last_hole - area_span.start_hole +
+					    1);
+		}
+	}
+}
+
+/* Process a single span in the users_itree */
+static int pfn_reader_fill_span(struct pfn_reader *pfns)
+{
+	struct interval_tree_span_iter *span = &pfns->span;
+	struct iopt_area *area;
+	int rc;
+
+	if (!span->is_hole) {
+		batch_from_xarray(&pfns->batch, &pfns->pages->pinned_pfns,
+				  pfns->batch_end_index, span->last_used);
+		return 0;
+	}
+
+	/* FIXME: This should consider the entire hole remaining */
+	area = iopt_pages_find_domain_area(pfns->pages, pfns->batch_end_index);
+	if (area) {
+		unsigned int last_index;
+
+		last_index = min(iopt_area_last_index(area), span->last_hole);
+		/* The storage_domain cannot change without the pages mutex */
+		batch_from_domain(&pfns->batch, area->storage_domain, area,
+				  pfns->batch_end_index, last_index);
+		return 0;
+	}
+
+	if (pfns->batch_end_index >= pfns->upages_end) {
+		rc = pfn_reader_pin_pages(pfns);
+		if (rc)
+			return rc;
+	}
+
+	batch_from_pages(&pfns->batch,
+			 pfns->upages +
+				 (pfns->batch_end_index - pfns->upages_start),
+			 pfns->upages_end - pfns->batch_end_index);
+	return 0;
+}
+
+static bool pfn_reader_done(struct pfn_reader *pfns)
+{
+	return pfns->batch_start_index == pfns->last_index + 1;
+}
+
+static int pfn_reader_next(struct pfn_reader *pfns)
+{
+	int rc;
+
+	batch_clear(&pfns->batch);
+	pfns->batch_start_index = pfns->batch_end_index;
+	while (pfns->batch_end_index != pfns->last_index + 1) {
+		rc = pfn_reader_fill_span(pfns);
+		if (rc)
+			return rc;
+		pfns->batch_end_index =
+			pfns->batch_start_index + pfns->batch.total_pfns;
+		if (pfns->batch_end_index != pfns->span.last_used + 1)
+			return 0;
+		interval_tree_span_iter_next(&pfns->span);
+	}
+	return 0;
+}
+
+/*
+ * Adjust the pfn_reader to start at an externally determined hole span in the
+ * users_itree.
+ */
+static int pfn_reader_seek_hole(struct pfn_reader *pfns,
+				struct interval_tree_span_iter *span)
+{
+	pfns->batch_start_index = span->start_hole;
+	pfns->batch_end_index = span->start_hole;
+	pfns->last_index = span->last_hole;
+	pfns->span = *span;
+	return pfn_reader_next(pfns);
+}
+
+static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
+			   unsigned long index, unsigned long last)
+{
+	int rc;
+
+	lockdep_assert_held(&pages->mutex);
+
+	rc = batch_init(&pfns->batch, last - index + 1);
+	if (rc)
+		return rc;
+	pfns->pages = pages;
+	pfns->batch_start_index = index;
+	pfns->batch_end_index = index;
+	pfns->last_index = last;
+	pfns->upages = NULL;
+	pfns->upages_start = 0;
+	pfns->upages_end = 0;
+	interval_tree_span_iter_first(&pfns->span, &pages->users_itree, index,
+				      last);
+
+	if (pages->writable) {
+		pfns->gup_flags = FOLL_LONGTERM | FOLL_WRITE;
+	} else {
+		/* Still need to break COWs on read */
+		pfns->gup_flags = FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE;
+	}
+	return 0;
+}
+
+static void pfn_reader_destroy(struct pfn_reader *pfns)
+{
+	if (pfns->upages) {
+		size_t npages = pfns->upages_end - pfns->batch_end_index;
+
+		mmap_read_unlock(pfns->pages->source_mm);
+		mmput(pfns->pages->source_mm);
+
+		/* Any pages not transferred to the batch are just unpinned */
+		unpin_user_pages(pfns->upages + (pfns->batch_end_index -
+						 pfns->upages_start),
+				 npages);
+		kfree(pfns->upages);
+		pfns->upages = NULL;
+	}
+
+	if (pfns->batch_start_index != pfns->batch_end_index)
+		iopt_pages_unpin(pfns->pages, &pfns->batch,
+				 pfns->batch_start_index,
+				 pfns->batch_end_index - 1);
+	batch_destroy(&pfns->batch, NULL);
+	WARN_ON(pfns->pages->last_npinned != pfns->pages->npinned);
+}
+
+static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
+			    unsigned long index, unsigned long last)
+{
+	int rc;
+
+	rc = pfn_reader_init(pfns, pages, index, last);
+	if (rc)
+		return rc;
+	rc = pfn_reader_next(pfns);
+	if (rc) {
+		pfn_reader_destroy(pfns);
+		return rc;
+	}
+	return 0;
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

The top of the data structure provides an IO Address Space (IOAS) that is
similar to a VFIO container. The IOAS allows map/unmap of memory into
ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO
mdevs) can be attached to the IOAS to access the PFNs that those IOVA
areas cover.

The IO Address Space (IOAS) datastructure is composed of:
 - struct io_pagetable holding the IOVA map
 - struct iopt_areas representing populated portions of IOVA
 - struct iopt_pages representing the storage of PFNs
 - struct iommu_domain representing the IO page table in the system IOMMU
 - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
   mdevs)
 - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
   users

This patch introduces the lowest part of the datastructure - the movement
of PFNs in a tiered storage scheme:
 1) iopt_pages::pinned_pfns xarray
 2) An iommu_domain
 3) The origin of the PFNs, i.e. the userspace pointer

PFNs have to be copied between all combinations of tiers, depending on the
configuration.

The interface is an iterator called a 'pfn_reader' which determines which
tier each PFN is stored in and loads it into a list of PFNs held in a struct
pfn_batch.

Each step of the iterator will fill up the pfn_batch, then the caller can
use the pfn_batch to send the PFNs to the required destination. Repeating
this loop will read all the PFNs in an IOVA range.
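
As a rough sketch of the intended calling pattern (not part of this patch;
the pinned page accounting done by update_pinned() and the partial-unmap
error unwind are left out, and 'pages', 'area', 'domain' and the index range
stand in for the caller's context):

	struct pfn_reader pfns;
	int rc;

	rc = pfn_reader_first(&pfns, pages, start_index, last_index);
	if (rc)
		return rc;
	while (!pfn_reader_done(&pfns)) {
		/* Send the current batch to its destination, eg an iommu_domain */
		rc = batch_to_domain(&pfns.batch, domain, area,
				     pfns.batch_start_index);
		if (rc)
			break;
		/* Load the next chunk of PFNs into the batch */
		rc = pfn_reader_next(&pfns);
		if (rc)
			break;
	}
	pfn_reader_destroy(&pfns);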

The pfn_reader and pfn_batch also keep track of the pinned page accounting.

While PFNs are always stored and accessed as full PAGE_SIZE units, the
iommu_domain tier can store them with a sub-page offset/length to support
IOMMUs with a smaller IOPTE size than PAGE_SIZE.
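
For example, assuming a 4k PAGE_SIZE and an IOMMU capable of 2k IOPTEs, an
area whose backing memory begins 0x800 bytes into its first page gets
page_offset = 0x800; batch_to_domain() then maps the first chunk with a
0x800 physical offset and a length shortened by 0x800, while every later
index maps full pages.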

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/io_pagetable.h    | 101 ++++
 drivers/iommu/iommufd/iommufd_private.h |  20 +
 drivers/iommu/iommufd/pages.c           | 723 ++++++++++++++++++++++++
 4 files changed, 846 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/pages.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index a07a8cffe937c6..05a0e91e30afad 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
-	main.o
+	main.o \
+	pages.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
new file mode 100644
index 00000000000000..94ca8712722d31
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ */
+#ifndef __IO_PAGETABLE_H
+#define __IO_PAGETABLE_H
+
+#include <linux/interval_tree.h>
+#include <linux/mutex.h>
+#include <linux/kref.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+
+struct iommu_domain;
+
+/*
+ * Each io_pagetable is composed of intervals of areas which cover regions of
+ * the iova that are backed by something. iova not covered by areas is not
+ * populated in the page table. Each area is fully populated with pages.
+ *
+ * iovas are in byte units, but must be iopt->iova_alignment aligned.
+ *
+ * pages can be NULL, this means some other thread is still working on setting
+ * up the area. When observed under the write side of the domain_rwsem a NULL
+ * pages must mean no domains are filled.
+ *
+ * storage_domain points at an arbitrary iommu_domain that is holding the PFNs
+ * for this area. It is locked by the pages->mutex. This simplifies the locking
+ * as the pages code can rely on the storage_domain without having to get the
+ * iopt->domains_rwsem.
+ *
+ * The io_pagetable::iova_rwsem protects node
+ * The iopt_pages::mutex protects pages_node
+ * iopt and iommu_prot are immutable
+ */
+struct iopt_area {
+	struct interval_tree_node node;
+	struct interval_tree_node pages_node;
+	/* How many bytes into the first page the area starts */
+	unsigned int page_offset;
+	struct io_pagetable *iopt;
+	struct iopt_pages *pages;
+	struct iommu_domain *storage_domain;
+	/* IOMMU_READ, IOMMU_WRITE, etc */
+	int iommu_prot;
+	atomic_t num_users;
+};
+
+static inline unsigned long iopt_area_index(struct iopt_area *area)
+{
+	return area->pages_node.start;
+}
+
+static inline unsigned long iopt_area_last_index(struct iopt_area *area)
+{
+	return area->pages_node.last;
+}
+
+static inline unsigned long iopt_area_iova(struct iopt_area *area)
+{
+	return area->node.start;
+}
+
+static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
+{
+	return area->node.last;
+}
+
+/*
+ * This holds a pinned page list for multiple areas of IO address space. The
+ * pages always originate from a linear chunk of userspace VA. Multiple
+ * io_pagetable's, through their iopt_area's, can share a single iopt_pages
+ * which avoids multi-pinning and double accounting of page consumption.
+ *
+ * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
+ * the start of the uptr and extend to npages. pages are pinned dynamically
+ * according to the intervals in the users_itree and domains_itree, npages
+ * records the current number of pages pinned.
+ */
+struct iopt_pages {
+	struct kref kref;
+	struct mutex mutex;
+	size_t npages;
+	size_t npinned;
+	size_t last_npinned;
+	struct task_struct *source_task;
+	struct mm_struct *source_mm;
+	struct user_struct *source_user;
+	void __user *uptr;
+	bool writable:1;
+	bool has_cap_ipc_lock:1;
+
+	struct xarray pinned_pfns;
+	/* Of iopt_pages_user::node */
+	struct rb_root_cached users_itree;
+	/* Of iopt_area::pages_node */
+	struct rb_root_cached domains_itree;
+};
+
+#endif
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2d0bba3965be1a..2f1301d39bba7c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,26 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+/*
+ * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
+ * domains and permits sharing of PFNs between io_pagetable instances. This
+ * supports both a design where IOAS's are 1:1 with a domain (eg because the
+ * domain is HW customized), or where the IOAS is 1:N with multiple generic
+ * domains.  The io_pagetable holds an interval tree of iopt_areas which point
+ * to shared iopt_pages which hold the pfns mapped to the page table.
+ *
+ * The locking order is domains_rwsem -> iova_rwsem -> pages::mutex
+ */
+struct io_pagetable {
+	struct rw_semaphore domains_rwsem;
+	struct xarray domains;
+	unsigned int next_domain_id;
+
+	struct rw_semaphore iova_rwsem;
+	struct rb_root_cached area_itree;
+	struct rb_root_cached reserved_iova_itree;
+};
+
 struct iommufd_ctx {
 	struct file *filp;
 	struct xarray objects;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
new file mode 100644
index 00000000000000..a75e1c73527920
--- /dev/null
+++ b/drivers/iommu/iommufd/pages.c
@@ -0,0 +1,723 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The iopt_pages is the center of the storage and motion of PFNs. Each
+ * iopt_pages represents a logical linear array of full PFNs. The array is 0
+ * based and has npages in it. Accessors use 'index' to refer to the entry in
+ * this logical array, regardless of its storage location.
+ *
+ * PFNs are stored in a tiered scheme:
+ *  1) iopt_pages::pinned_pfns xarray
+ *  2) An iommu_domain
+ *  3) The origin of the PFNs, i.e. the userspace pointer
+ *
+ * PFNs have to be copied between all combinations of tiers, depending on the
+ * configuration.
+ *
+ * When a PFN is taken out of the userspace pointer it is pinned exactly once.
+ * The storage locations of the PFN's index are tracked in the two interval
+ * trees. If no interval includes the index then it is not pinned.
+ *
+ * If users_itree includes the PFN's index then an in-kernel user has requested
+ * the page. The PFN is stored in the xarray so other requestors can continue to
+ * find it.
+ *
+ * If the domains_itree includes the PFN's index then an iommu_domain is storing
+ * the PFN and it can be read back using iommu_iova_to_phys(). To avoid
+ * duplicating storage the xarray is not used if only iommu_domains are using
+ * the PFN's index.
+ *
+ * As a general principle this is designed so that destroy never fails. This
+ * means removing an iommu_domain or releasing an in-kernel user will not fail
+ * due to insufficient memory. In practice this means some cases have to hold
+ * PFNs in the xarray even though they are also being stored in an iommu_domain.
+ *
+ * While the iopt_pages can use an iommu_domain as storage, it does not have an
+ * IOVA itself. Instead the iopt_area represents a range of IOVA and uses the
+ * iopt_pages as the PFN provider. Multiple iopt_areas can share the iopt_pages
+ * and reference their own slice of the PFN array, with sub page granularity.
+ *
+ * In this file the term 'last' indicates an inclusive and closed interval, eg
+ * [0,0] refers to a single PFN. 'end' means an open range, eg [0,0) refers to
+ * no PFNs.
+ */
+#include <linux/overflow.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+
+#include "io_pagetable.h"
+
+#define TEMP_MEMORY_LIMIT 65536
+#define BATCH_BACKUP_SIZE 32
+
+/*
+ * More memory makes pin_user_pages() and the batching more efficient, but as
+ * this is only a performance optimization don't try too hard to get it. A 64k
+ * allocation can hold about 26M of 4k pages and 13G of 2M pages in a
+ * pfn_batch. Various destroy paths cannot fail and provide a small amount of
+ * stack memory as a backup contingency. If backup_len is given this cannot
+ * fail.
+ */
+static void *temp_kmalloc(size_t *size, void *backup, size_t backup_len)
+{
+	void *res;
+
+	if (*size < backup_len)
+		return backup;
+	*size = min_t(size_t, *size, TEMP_MEMORY_LIMIT);
+	res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+	if (res)
+		return res;
+	*size = PAGE_SIZE;
+	if (backup_len) {
+		res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (res)
+			return res;
+		*size = backup_len;
+		return backup;
+	}
+	return kmalloc(*size, GFP_KERNEL);
+}
+
+static void iopt_pages_add_npinned(struct iopt_pages *pages, size_t npages)
+{
+	int rc;
+
+	rc = check_add_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
+}
+
+static void iopt_pages_sub_npinned(struct iopt_pages *pages, size_t npages)
+{
+	int rc;
+
+	rc = check_sub_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
+}
+
+/*
+ * index is the number of PAGE_SIZE units from the start of the area's
+ * iopt_pages. If the iova is sub page-size then the area has an iova that
+ * covers a portion of the first and last pages in the range.
+ */
+static unsigned long iopt_area_index_to_iova(struct iopt_area *area,
+					     unsigned long index)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
+	index -= iopt_area_index(area);
+	if (index == 0)
+		return iopt_area_iova(area);
+	return iopt_area_iova(area) - area->page_offset + index * PAGE_SIZE;
+}
+
+static unsigned long iopt_area_index_to_iova_last(struct iopt_area *area,
+						  unsigned long index)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
+	if (index == iopt_area_last_index(area))
+		return iopt_area_last_iova(area);
+	return iopt_area_iova(area) - area->page_offset +
+	       (index - iopt_area_index(area) + 1) * PAGE_SIZE - 1;
+}
+
+static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
+			       size_t size)
+{
+	size_t ret;
+
+	ret = iommu_unmap(domain, iova, size);
+	/*
+	 * It is a logic error in this code or a driver bug if the IOMMU unmaps
+	 * something other than exactly as requested.
+	 */
+	WARN_ON(ret != size);
+}
+
+static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
+						     unsigned long index)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(&pages->domains_itree, index, index);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, pages_node);
+}
+
+/*
+ * A simple datastructure to hold a vector of PFNs, optimized for contiguous
+ * PFNs. This is used as a temporary holding memory for shuttling pfns from one
+ * place to another. Generally everything is made more efficient if operations
+ * work on the largest possible grouping of pfns. eg fewer lock/unlock cycles,
+ * better cache locality, etc
+ */
+struct pfn_batch {
+	unsigned long *pfns;
+	u16 *npfns;
+	unsigned int array_size;
+	unsigned int end;
+	unsigned int total_pfns;
+};
+
+static void batch_clear(struct pfn_batch *batch)
+{
+	batch->total_pfns = 0;
+	batch->end = 0;
+	batch->pfns[0] = 0;
+	batch->npfns[0] = 0;
+}
+
+static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
+			size_t backup_len)
+{
+	const size_t elmsz = sizeof(*batch->pfns) + sizeof(*batch->npfns);
+	size_t size = max_pages * elmsz;
+
+	batch->pfns = temp_kmalloc(&size, backup, backup_len);
+	if (!batch->pfns)
+		return -ENOMEM;
+	batch->array_size = size / elmsz;
+	batch->npfns = (u16 *)(batch->pfns + batch->array_size);
+	batch_clear(batch);
+	return 0;
+}
+
+static int batch_init(struct pfn_batch *batch, size_t max_pages)
+{
+	return __batch_init(batch, max_pages, NULL, 0);
+}
+
+static void batch_init_backup(struct pfn_batch *batch, size_t max_pages,
+			      void *backup, size_t backup_len)
+{
+	__batch_init(batch, max_pages, backup, backup_len);
+}
+
+static void batch_destroy(struct pfn_batch *batch, void *backup)
+{
+	if (batch->pfns != backup)
+		kfree(batch->pfns);
+}
+
+/* true if the pfn could be added, false otherwise */
+static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+{
+	/* FIXME: U16 is too small */
+	if (batch->end &&
+	    pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
+	    batch->npfns[batch->end - 1] != U16_MAX) {
+		batch->npfns[batch->end - 1]++;
+		batch->total_pfns++;
+		return true;
+	}
+	if (batch->end == batch->array_size)
+		return false;
+	batch->total_pfns++;
+	batch->pfns[batch->end] = pfn;
+	batch->npfns[batch->end] = 1;
+	batch->end++;
+	return true;
+}
+
+/*
+ * Fill the batch with pfns from the domain. When the batch is full, or it
+ * reaches last_index, the function will return. The caller should use
+ * batch->total_pfns to determine the starting point for the next iteration.
+ */
+static void batch_from_domain(struct pfn_batch *batch,
+			      struct iommu_domain *domain,
+			      struct iopt_area *area, unsigned long index,
+			      unsigned long last_index)
+{
+	unsigned int page_offset = 0;
+	unsigned long iova;
+	phys_addr_t phys;
+
+	batch_clear(batch);
+	iova = iopt_area_index_to_iova(area, index);
+	if (index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	while (index <= last_index) {
+		/*
+		 * This is pretty slow, it would be nice to get the page size
+		 * back from the driver, or have the driver directly fill the
+		 * batch.
+		 */
+		phys = iommu_iova_to_phys(domain, iova) - page_offset;
+		if (!batch_add_pfn(batch, PHYS_PFN(phys)))
+			return;
+		iova += PAGE_SIZE - page_offset;
+		page_offset = 0;
+		index++;
+	}
+}
+
+static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
+			   struct iopt_area *area, unsigned long start_index)
+{
+	unsigned long last_iova = iopt_area_last_iova(area);
+	unsigned int page_offset = 0;
+	unsigned long start_iova;
+	unsigned long next_iova;
+	unsigned int cur = 0;
+	unsigned long iova;
+	int rc;
+
+	/* The first index might be a partial page */
+	if (start_index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	next_iova = iova = start_iova =
+		iopt_area_index_to_iova(area, start_index);
+	while (cur < batch->end) {
+		next_iova = min(last_iova + 1,
+				next_iova + batch->npfns[cur] * PAGE_SIZE -
+					page_offset);
+		rc = iommu_map(domain, iova,
+			       PFN_PHYS(batch->pfns[cur]) + page_offset,
+			       next_iova - iova, area->iommu_prot);
+		if (rc)
+			goto out_unmap;
+		iova = next_iova;
+		page_offset = 0;
+		cur++;
+	}
+	return 0;
+out_unmap:
+	if (start_iova != iova)
+		iommu_unmap_nofail(domain, start_iova, iova - start_iova);
+	return rc;
+}
+
+static void batch_from_xarray(struct pfn_batch *batch, struct xarray *xa,
+			      unsigned long start_index,
+			      unsigned long last_index)
+{
+	XA_STATE(xas, xa, start_index);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		if (!batch_add_pfn(batch, xa_to_value(entry)) ||
+		    start_index == last_index)
+			break;
+		start_index++;
+	}
+	rcu_read_unlock();
+}
+
+static void clear_xarray(struct xarray *xa, unsigned long index,
+			 unsigned long last)
+{
+	XA_STATE(xas, xa, index);
+	void *entry;
+
+	xas_lock(&xas);
+	xas_for_each (&xas, entry, last)
+		xas_store(&xas, NULL);
+	xas_unlock(&xas);
+}
+
+static int batch_to_xarray(struct pfn_batch *batch, struct xarray *xa,
+			   unsigned long start_index)
+{
+	XA_STATE(xas, xa, start_index);
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	do {
+		xas_lock(&xas);
+		while (cur < batch->end) {
+			void *old;
+
+			old = xas_store(&xas,
+					xa_mk_value(batch->pfns[cur] + npage));
+			if (xas_error(&xas))
+				break;
+			WARN_ON(old);
+			npage++;
+			if (npage == batch->npfns[cur]) {
+				npage = 0;
+				cur++;
+			}
+			xas_next(&xas);
+		}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	if (xas_error(&xas)) {
+		if (xas.xa_index != start_index)
+			clear_xarray(xa, start_index, xas.xa_index - 1);
+		return xas_error(&xas);
+	}
+	return 0;
+}
+
+static void batch_to_pages(struct pfn_batch *batch, struct page **pages)
+{
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	while (cur < batch->end) {
+		*pages++ = pfn_to_page(batch->pfns[cur] + npage);
+		npage++;
+		if (npage == batch->npfns[cur]) {
+			npage = 0;
+			cur++;
+		}
+	}
+}
+
+static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
+			     size_t npages)
+{
+	struct page **end = pages + npages;
+
+	for (; pages != end; pages++)
+		if (!batch_add_pfn(batch, page_to_pfn(*pages)))
+			break;
+}
+
+static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
+			unsigned int offset, size_t npages)
+{
+	unsigned int cur = 0;
+
+	while (offset) {
+		if (batch->npfns[cur] > offset)
+			break;
+		offset -= batch->npfns[cur];
+		cur++;
+	}
+
+	while (npages) {
+		size_t to_unpin =
+			min_t(size_t, npages, batch->npfns[cur] - offset);
+
+		unpin_user_page_range_dirty_lock(
+			pfn_to_page(batch->pfns[cur] + offset), to_unpin,
+			pages->writable);
+		iopt_pages_sub_npinned(pages, to_unpin);
+		cur++;
+		offset = 0;
+		npages -= to_unpin;
+	}
+}
+
+/*
+ * PFNs are stored in three places, in order of preference:
+ * - The iopt_pages xarray. This is only populated if there is an
+ *   iopt_pages_user
+ * - The iommu_domain under an area
+ * - The original PFN source, ie pages->source_mm
+ *
+ * This iterator reads the pfns optimizing to load according to the
+ * above order.
+ */
+struct pfn_reader {
+	struct iopt_pages *pages;
+	struct interval_tree_span_iter span;
+	struct pfn_batch batch;
+	unsigned long batch_start_index;
+	unsigned long batch_end_index;
+	unsigned long last_index;
+
+	struct page **upages;
+	size_t upages_len;
+	unsigned long upages_start;
+	unsigned long upages_end;
+
+	unsigned int gup_flags;
+};
+
+static void update_unpinned(struct iopt_pages *pages)
+{
+	unsigned long npages = pages->last_npinned - pages->npinned;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->has_cap_ipc_lock) {
+		pages->last_npinned = pages->npinned;
+		return;
+	}
+
+	if (WARN_ON(pages->npinned > pages->last_npinned) ||
+	    WARN_ON(atomic_long_read(&pages->source_user->locked_vm) < npages))
+		return;
+	atomic_long_sub(npages, &pages->source_user->locked_vm);
+	atomic64_sub(npages, &pages->source_mm->pinned_vm);
+	pages->last_npinned = pages->npinned;
+}
+
+/*
+ * Changes in the number of pages pinned are done after the pages have been
+ * read and processed. If the limit would be exceeded then the error unwind
+ * will unpin everything that was just pinned.
+ */
+static int update_pinned(struct iopt_pages *pages)
+{
+	unsigned long lock_limit;
+	unsigned long cur_pages;
+	unsigned long new_pages;
+	unsigned long npages;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->has_cap_ipc_lock) {
+		pages->last_npinned = pages->npinned;
+		return 0;
+	}
+
+	if (pages->npinned == pages->last_npinned)
+		return 0;
+
+	if (pages->npinned < pages->last_npinned) {
+		update_unpinned(pages);
+		return 0;
+	}
+
+	lock_limit =
+		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	npages = pages->npinned - pages->last_npinned;
+	do {
+		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
+		new_pages = cur_pages + npages;
+		if (new_pages > lock_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+	atomic64_add(npages, &pages->source_mm->pinned_vm);
+	pages->last_npinned = pages->npinned;
+	return 0;
+}
+
+static int pfn_reader_pin_pages(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+	unsigned long npages;
+	long rc;
+
+	if (!pfns->upages) {
+		/* All undone in pfn_reader_destroy() */
+		pfns->upages_len =
+			(pfns->last_index - pfns->batch_end_index + 1) *
+			sizeof(*pfns->upages);
+		pfns->upages = temp_kmalloc(&pfns->upages_len, NULL, 0);
+		if (!pfns->upages)
+			return -ENOMEM;
+
+		if (!mmget_not_zero(pages->source_mm)) {
+			kfree(pfns->upages);
+			pfns->upages = NULL;
+			return -EINVAL;
+		}
+		mmap_read_lock(pages->source_mm);
+	}
+
+	npages = min_t(unsigned long,
+		       pfns->span.last_hole - pfns->batch_end_index + 1,
+		       pfns->upages_len / sizeof(*pfns->upages));
+
+	/* FIXME use pin_user_pages_fast() if current == source_mm */
+	rc = pin_user_pages_remote(
+		pages->source_mm,
+		(uintptr_t)(pages->uptr + pfns->batch_end_index * PAGE_SIZE),
+		npages, pfns->gup_flags, pfns->upages, NULL, NULL);
+	if (rc < 0)
+		return rc;
+	if (WARN_ON(!rc))
+		return -EFAULT;
+	iopt_pages_add_npinned(pages, rc);
+	pfns->upages_start = pfns->batch_end_index;
+	pfns->upages_end = pfns->batch_end_index + rc;
+	return 0;
+}
+
+/*
+ * The batch can contain a mixture of pages that are still in use and pages that
+ * need to be unpinned. Unpin only pages that are not held anywhere else.
+ */
+static void iopt_pages_unpin(struct iopt_pages *pages, struct pfn_batch *batch,
+			     unsigned long index, unsigned long last)
+{
+	struct interval_tree_span_iter user_span;
+	struct interval_tree_span_iter area_span;
+
+	lockdep_assert_held(&pages->mutex);
+
+	for (interval_tree_span_iter_first(&user_span, &pages->users_itree, 0,
+					   last);
+	     !interval_tree_span_iter_done(&user_span);
+	     interval_tree_span_iter_next(&user_span)) {
+		if (!user_span.is_hole)
+			continue;
+
+		for (interval_tree_span_iter_first(
+			     &area_span, &pages->domains_itree,
+			     user_span.start_hole, user_span.last_hole);
+		     !interval_tree_span_iter_done(&area_span);
+		     interval_tree_span_iter_next(&area_span)) {
+			if (!area_span.is_hole)
+				continue;
+
+			batch_unpin(batch, pages, area_span.start_hole - index,
+				    area_span.last_hole - area_span.start_hole +
+					    1);
+		}
+	}
+}
+
+/* Process a single span in the users_itree */
+static int pfn_reader_fill_span(struct pfn_reader *pfns)
+{
+	struct interval_tree_span_iter *span = &pfns->span;
+	struct iopt_area *area;
+	int rc;
+
+	if (!span->is_hole) {
+		batch_from_xarray(&pfns->batch, &pfns->pages->pinned_pfns,
+				  pfns->batch_end_index, span->last_used);
+		return 0;
+	}
+
+	/* FIXME: This should consider the entire hole remaining */
+	area = iopt_pages_find_domain_area(pfns->pages, pfns->batch_end_index);
+	if (area) {
+		unsigned int last_index;
+
+		last_index = min(iopt_area_last_index(area), span->last_hole);
+		/* The storage_domain cannot change without the pages mutex */
+		batch_from_domain(&pfns->batch, area->storage_domain, area,
+				  pfns->batch_end_index, last_index);
+		return 0;
+	}
+
+	if (pfns->batch_end_index >= pfns->upages_end) {
+		rc = pfn_reader_pin_pages(pfns);
+		if (rc)
+			return rc;
+	}
+
+	batch_from_pages(&pfns->batch,
+			 pfns->upages +
+				 (pfns->batch_end_index - pfns->upages_start),
+			 pfns->upages_end - pfns->batch_end_index);
+	return 0;
+}
+
+static bool pfn_reader_done(struct pfn_reader *pfns)
+{
+	return pfns->batch_start_index == pfns->last_index + 1;
+}
+
+static int pfn_reader_next(struct pfn_reader *pfns)
+{
+	int rc;
+
+	batch_clear(&pfns->batch);
+	pfns->batch_start_index = pfns->batch_end_index;
+	while (pfns->batch_end_index != pfns->last_index + 1) {
+		rc = pfn_reader_fill_span(pfns);
+		if (rc)
+			return rc;
+		pfns->batch_end_index =
+			pfns->batch_start_index + pfns->batch.total_pfns;
+		if (pfns->batch_end_index != pfns->span.last_used + 1)
+			return 0;
+		interval_tree_span_iter_next(&pfns->span);
+	}
+	return 0;
+}
+
+/*
+ * Adjust the pfn_reader to start at an externally determined hole span in the
+ * users_itree.
+ */
+static int pfn_reader_seek_hole(struct pfn_reader *pfns,
+				struct interval_tree_span_iter *span)
+{
+	pfns->batch_start_index = span->start_hole;
+	pfns->batch_end_index = span->start_hole;
+	pfns->last_index = span->last_hole;
+	pfns->span = *span;
+	return pfn_reader_next(pfns);
+}
+
+static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
+			   unsigned long index, unsigned long last)
+{
+	int rc;
+
+	lockdep_assert_held(&pages->mutex);
+
+	rc = batch_init(&pfns->batch, last - index + 1);
+	if (rc)
+		return rc;
+	pfns->pages = pages;
+	pfns->batch_start_index = index;
+	pfns->batch_end_index = index;
+	pfns->last_index = last;
+	pfns->upages = NULL;
+	pfns->upages_start = 0;
+	pfns->upages_end = 0;
+	interval_tree_span_iter_first(&pfns->span, &pages->users_itree, index,
+				      last);
+
+	if (pages->writable) {
+		pfns->gup_flags = FOLL_LONGTERM | FOLL_WRITE;
+	} else {
+		/* Still need to break COWs on read */
+		pfns->gup_flags = FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE;
+	}
+	return 0;
+}
+
+static void pfn_reader_destroy(struct pfn_reader *pfns)
+{
+	if (pfns->upages) {
+		size_t npages = pfns->upages_end - pfns->batch_end_index;
+
+		mmap_read_unlock(pfns->pages->source_mm);
+		mmput(pfns->pages->source_mm);
+
+		/* Any pages not transferred to the batch are just unpinned */
+		unpin_user_pages(pfns->upages + (pfns->batch_end_index -
+						 pfns->upages_start),
+				 npages);
+		kfree(pfns->upages);
+		pfns->upages = NULL;
+	}
+
+	if (pfns->batch_start_index != pfns->batch_end_index)
+		iopt_pages_unpin(pfns->pages, &pfns->batch,
+				 pfns->batch_start_index,
+				 pfns->batch_end_index - 1);
+	batch_destroy(&pfns->batch, NULL);
+	WARN_ON(pfns->pages->last_npinned != pfns->pages->npinned);
+}
+
+static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
+			    unsigned long index, unsigned long last)
+{
+	int rc;
+
+	rc = pfn_reader_init(pfns, pages, index, last);
+	if (rc)
+		return rc;
+	rc = pfn_reader_next(pfns);
+	if (rc) {
+		pfn_reader_destroy(pfns);
+		return rc;
+	}
+	return 0;
+}
-- 
2.35.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 06/12] iommufd: Algorithms for PFN storage
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The iopt_pages represents a logical linear list of PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and users.

The three storage tiers are managed to meet these objectives:

 - If no iommu_domain or user exists then minimal memory should be
   consumed by iommufd
 - If a page has been pinned then an iopt_pages will not pin it again
 - If an in-kernel user exists then the xarray must provide the backing
   storage to avoid allocations on domain removals
 - Otherwise any iommu_domain will be used for storage

In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.

The external interface for pages has several logical operations:

  iopt_area_fill_domain() will load the PFNs from storage into a single
  domain. This is used when attaching a new domain to an existing IOAS.

  iopt_area_fill_domains() will load the PFNs from storage into multiple
  domains. This is used when creating a new IOVA map in an existing IOAS.

  iopt_pages_add_user() creates an iopt_pages_user that tracks an in-kernel
  user of PFNs. This is some external driver that might be accessing the
  IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO
  mdev.

  iopt_pages_fill_xarray() will load PFNs into the xarray and return a
  'struct page *' array. It is used by iopt_pages_user's to extract PFNs
  for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
  is known the xarray is already filled.
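
As a hedged sketch of the in-kernel user flow described above (the index
range, the page array and its size NPAGES are placeholders, and in the full
series this is normally reached through a higher level accessor rather than
called directly):

	struct page *upages[NPAGES];	/* NPAGES == last_index - start_index + 1 */
	int rc;

	/* Pin the range once and publish the PFNs into the xarray */
	rc = iopt_pages_add_user(pages, start_index, last_index, upages, true);
	if (rc)
		return rc;

	/* ... access the memory through upages[] with the CPU or DMA API ... */

	/* Drop the user; the PFNs are unpinned if nothing else holds them */
	iopt_pages_remove_user(pages, start_index, last_index);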

As an iopt_pages can be referred to in slices by many areas and users it
uses interval trees to keep track of which storage tiers currently hold
the PFNs. On a page-by-page basis any request for a PFN will be satisfied
from one of the storage tiers and the PFN copied to target domain/array.
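
A condensed restatement of the per-span tier selection (this simplifies
pfn_reader_fill_span() from the previous patch; index variables and batching
details are elided):

	if (!span->is_hole) {
		/* An in-kernel user covers the span: read the pinned_pfns xarray */
		batch_from_xarray(&pfns->batch, &pages->pinned_pfns,
				  start_index, last_index);
	} else if ((area = iopt_pages_find_domain_area(pages, start_index))) {
		/* A domain stores it: read back with iommu_iova_to_phys() */
		batch_from_domain(&pfns->batch, area->storage_domain, area,
				  start_index, last_index);
	} else {
		/*
		 * Nothing stores it yet: pin from pages->source_mm, then
		 * batch_from_pages() copies the new pins into the batch.
		 */
		rc = pfn_reader_pin_pages(pfns);
	}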

Unfill actions are similar: on a page-by-page basis domains are unmapped,
xarray entries are freed, or struct pages are fully put back.

Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs always
generates the largest contiguous PFN range possible, and this gathering can
cross storage tier boundaries. For cases like 'fill
domains' care is taken to avoid duplicated work and PFNs are read once and
pushed into all domains.

The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.

This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the pfn lists instead of working with one driver call per page to
read, and one driver call per contiguous range to store.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/io_pagetable.h |  71 +++-
 drivers/iommu/iommufd/pages.c        | 594 +++++++++++++++++++++++++++
 2 files changed, 664 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 94ca8712722d31..c8b6a60ff24c94 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -47,6 +47,14 @@ struct iopt_area {
 	atomic_t num_users;
 };
 
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain);
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain);
+
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
 	return area->pages_node.start;
@@ -67,6 +75,37 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
 	return area->node.last;
 }
 
+static inline size_t iopt_area_length(struct iopt_area *area)
+{
+	return (area->node.last - area->node.start) + 1;
+}
+
+static inline struct iopt_area *iopt_area_iter_first(struct io_pagetable *iopt,
+						     unsigned long start,
+						     unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	node = interval_tree_iter_first(&iopt->area_itree, start, last);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, node);
+}
+
+static inline struct iopt_area *iopt_area_iter_next(struct iopt_area *area,
+						    unsigned long start,
+						    unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_next(&area->node, start, last);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, node);
+}
+
 /*
  * This holds a pinned page list for multiple areas of IO address space. The
  * pages always originate from a linear chunk of userspace VA. Multiple
@@ -75,7 +114,7 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
  *
  * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
  * the start of the uptr and extend to npages. pages are pinned dynamically
- * according to the intervals in the users_itree and domains_itree, npages
+ * according to the intervals in the users_itree and domains_itree, npinned
  * records the current number of pages pinned.
  */
 struct iopt_pages {
@@ -98,4 +137,34 @@ struct iopt_pages {
 	struct rb_root_cached domains_itree;
 };
 
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable);
+void iopt_release_pages(struct kref *kref);
+static inline void iopt_put_pages(struct iopt_pages *pages)
+{
+	kref_put(&pages->kref, iopt_release_pages);
+}
+
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages);
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages);
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last);
+
+int iopt_pages_add_user(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages,
+			bool write);
+void iopt_pages_remove_user(struct iopt_pages *pages, unsigned long start,
+			    unsigned long last);
+
+/*
+ * Each interval represents an active iopt_access_pages(), it acts as an
+ * interval lock that keeps the PFNs pinned and stored in the xarray.
+ */
+struct iopt_pages_user {
+	struct interval_tree_node node;
+	refcount_t refcount;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index a75e1c73527920..8e6a8cc8b20ad1 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -140,6 +140,18 @@ static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iopt_area_unmap_domain_range(struct iopt_area *area,
+					 struct iommu_domain *domain,
+					 unsigned long start_index,
+					 unsigned long last_index)
+{
+	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
+
+	iommu_unmap_nofail(domain, start_iova,
+			   iopt_area_index_to_iova_last(area, last_index) -
+				   start_iova + 1);
+}
+
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
 						     unsigned long index)
 {
@@ -721,3 +733,585 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
 	}
 	return 0;
 }
+
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable)
+{
+	struct iopt_pages *pages;
+
+	/*
+	 * The iommu API uses size_t as the length, so protect the DIV_ROUND_UP
+	 * below from overflow
+	 */
+	if (length > SIZE_MAX - PAGE_SIZE || length == 0)
+		return ERR_PTR(-EINVAL);
+
+	pages = kzalloc(sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&pages->kref);
+	xa_init(&pages->pinned_pfns);
+	mutex_init(&pages->mutex);
+	pages->source_mm = current->mm;
+	mmgrab(pages->source_mm);
+	pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+	pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+	pages->users_itree = RB_ROOT_CACHED;
+	pages->domains_itree = RB_ROOT_CACHED;
+	pages->writable = writable;
+	pages->has_cap_ipc_lock = capable(CAP_IPC_LOCK);
+	pages->source_task = current->group_leader;
+	get_task_struct(current->group_leader);
+	pages->source_user = get_uid(current_user());
+	return pages;
+}
+
+void iopt_release_pages(struct kref *kref)
+{
+	struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
+
+	WARN_ON(!RB_EMPTY_ROOT(&pages->users_itree.rb_root));
+	WARN_ON(!RB_EMPTY_ROOT(&pages->domains_itree.rb_root));
+	WARN_ON(pages->npinned);
+	WARN_ON(!xa_empty(&pages->pinned_pfns));
+	mmdrop(pages->source_mm);
+	mutex_destroy(&pages->mutex);
+	put_task_struct(pages->source_task);
+	free_uid(pages->source_user);
+	kfree(pages);
+}
+
+/* Quickly guess if the interval tree might fully cover the given interval */
+static bool interval_tree_fully_covers(struct rb_root_cached *root,
+				       unsigned long index, unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(root, index, last);
+	if (!node)
+		return false;
+	return node->start <= index && node->last >= last;
+}
+
+static bool interval_tree_fully_covers_area(struct rb_root_cached *root,
+					    struct iopt_area *area)
+{
+	return interval_tree_fully_covers(root, iopt_area_index(area),
+					  iopt_area_last_index(area));
+}
+
+static void __iopt_area_unfill_domain(struct iopt_area *area,
+				      struct iopt_pages *pages,
+				      struct iommu_domain *domain,
+				      unsigned long last_index)
+{
+	unsigned long unmapped_index = iopt_area_index(area);
+	unsigned long cur_index = unmapped_index;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* Fast path if there is obviously something else using every pfn */
+	if (interval_tree_fully_covers_area(&pages->domains_itree, area) ||
+	    interval_tree_fully_covers_area(&pages->users_itree, area)) {
+		iopt_area_unmap_domain_range(area, domain,
+					     iopt_area_index(area), last_index);
+		return;
+	}
+
+	/*
+	 * unmaps must always 'cut' at a place where the pfns are not contiguous
+	 * to pair with the maps that always install contiguous pages. This
+	 * algorithm is efficient in the expected case of few pinners.
+	 */
+	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
+	while (cur_index != last_index + 1) {
+		unsigned long batch_index = cur_index;
+
+		batch_from_domain(&batch, domain, area, cur_index, last_index);
+		cur_index += batch.total_pfns;
+		iopt_area_unmap_domain_range(area, domain, unmapped_index,
+					     cur_index - 1);
+		unmapped_index = cur_index;
+		iopt_pages_unpin(pages, &batch, batch_index, cur_index - 1);
+		batch_clear(&batch);
+	}
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+static void iopt_area_unfill_partial_domain(struct iopt_area *area,
+					    struct iopt_pages *pages,
+					    struct iommu_domain *domain,
+					    unsigned long end_index)
+{
+	if (end_index != iopt_area_index(area))
+		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+}
+
+/**
+ * iopt_unmap_domain() - Unmap without unpinning PFNs in a domain
+ * @iopt: The iopt the domain is part of
+ * @domain: The domain to unmap
+ *
+ * The caller must know that unpinning is not required, usually because there
+ * are other domains in the iopt.
+ */
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain)
+{
+	struct interval_tree_span_iter span;
+
+	for (interval_tree_span_iter_first(&span, &iopt->area_itree, 0,
+					   ULONG_MAX);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span))
+		if (!span.is_hole)
+			iommu_unmap_nofail(domain, span.start_used,
+					   span.last_used - span.start_used +
+						   1);
+}
+
+/**
+ * iopt_area_unfill_domain() - Unmap and unpin PFNs in a domain
+ * @area: IOVA area to use
+ * @pages: page supplier for the area (area->pages is NULL)
+ * @domain: Domain to unmap from
+ *
+ * The domain should be removed from the domains_itree before calling. The
+ * domain will always be unmapped, but the PFNs may not be unpinned if there are
+ * still users.
+ */
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain)
+{
+	__iopt_area_unfill_domain(area, pages, domain,
+				  iopt_area_last_index(area));
+}
+
+/**
+ * iopt_area_fill_domain() - Map PFNs from the area into a domain
+ * @area: IOVA area to use
+ * @domain: Domain to load PFNs into
+ *
+ * Read the pfns from the area's underlying iopt_pages and map them into the
+ * given domain. Called when attaching a new domain to an io_pagetable.
+ */
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
+{
+	struct pfn_reader pfns;
+	int rc;
+
+	lockdep_assert_held(&area->pages->mutex);
+
+	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		return rc;
+
+	while (!pfn_reader_done(&pfns)) {
+		rc = batch_to_domain(&pfns.batch, domain, area,
+				     pfns.batch_start_index);
+		if (rc)
+			goto out_unmap;
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+
+	rc = update_pinned(area->pages);
+	if (rc)
+		goto out_unmap;
+	goto out_destroy;
+
+out_unmap:
+	iopt_area_unfill_partial_domain(area, area->pages, domain,
+					pfns.batch_start_index);
+out_destroy:
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+/**
+ * iopt_area_fill_domains() - Install PFNs into the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area creation. The area is freshly created and not inserted in
+ * the domains_itree yet. PFNs are read and loaded into every domain held in the
+ * area's io_pagetable and the area is installed in the domains_itree.
+ *
+ * On failure all domains are left unchanged.
+ */
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct pfn_reader pfns;
+	struct iommu_domain *domain;
+	unsigned long unmap_index;
+	unsigned long index;
+	int rc;
+
+	lockdep_assert_held(&area->iopt->domains_rwsem);
+
+	if (xa_empty(&area->iopt->domains))
+		return 0;
+
+	mutex_lock(&pages->mutex);
+	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		xa_for_each (&area->iopt->domains, index, domain) {
+			rc = batch_to_domain(&pfns.batch, domain, area,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_unmap;
+		}
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+	rc = update_pinned(pages);
+	if (rc)
+		goto out_unmap;
+
+	area->storage_domain = xa_load(&area->iopt->domains, 0);
+	interval_tree_insert(&area->pages_node, &pages->domains_itree);
+	goto out_destroy;
+
+out_unmap:
+	xa_for_each (&area->iopt->domains, unmap_index, domain) {
+		unsigned long end_index = pfns.batch_start_index;
+
+		if (unmap_index <= index)
+			end_index = pfns.batch_end_index;
+
+		/*
+		 * The area is not yet part of the domains_itree so we have to
+		 * manage the unpinning specially. The last domain does the
+		 * unpin, every other domain is just unmapped.
+		 */
+		if (unmap_index != area->iopt->next_domain_id - 1) {
+			if (end_index != iopt_area_index(area))
+				iopt_area_unmap_domain_range(
+					area, domain, iopt_area_index(area),
+					end_index - 1);
+		} else {
+			iopt_area_unfill_partial_domain(area, pages, domain,
+							end_index);
+		}
+	}
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_area_unfill_domains() - unmap PFNs from the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area destruction. This unmaps the iova's covered by all the
+ * area's domains and releases the PFNs.
+ */
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct io_pagetable *iopt = area->iopt;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	mutex_lock(&pages->mutex);
+	if (!area->storage_domain)
+		goto out_unlock;
+
+	xa_for_each (&iopt->domains, index, domain)
+		if (domain != area->storage_domain)
+			iopt_area_unmap_domain_range(
+				area, domain, iopt_area_index(area),
+				iopt_area_last_index(area));
+
+	interval_tree_remove(&area->pages_node, &pages->domains_itree);
+	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	area->storage_domain = NULL;
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
+
+/*
+ * Erase entries in the pinned_pfns xarray that are not covered by any users.
+ * This does not unpin the pages; the caller is responsible for the pin
+ * reference. The main purpose of this action is to save memory in the xarray.
+ */
+static void iopt_pages_clean_xarray(struct iopt_pages *pages,
+				    unsigned long index, unsigned long last)
+{
+	struct interval_tree_span_iter span;
+
+	lockdep_assert_held(&pages->mutex);
+
+	for (interval_tree_span_iter_first(&span, &pages->users_itree, index,
+					   last);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span))
+		if (span.is_hole)
+			clear_xarray(&pages->pinned_pfns, span.start_hole,
+				     span.last_hole);
+}
+
+/**
+ * iopt_pages_unfill_xarray() - Update the xarray after removing a user
+ * @pages: The pages to act on
+ * @start: Starting PFN index
+ * @last: Last PFN index
+ *
+ * Called when an iopt_pages_user is removed. PFNs with no remaining user are
+ * released from the xarray. The user should already be out of the users_itree.
+ */
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last)
+{
+	struct interval_tree_span_iter span;
+	struct pfn_batch batch;
+	u64 backup[BATCH_BACKUP_SIZE];
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (interval_tree_fully_covers(&pages->domains_itree, start, last))
+		return iopt_pages_clean_xarray(pages, start, last);
+
+	batch_init_backup(&batch, last + 1, backup, sizeof(backup));
+	for (interval_tree_span_iter_first(&span, &pages->users_itree, start,
+					   last);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		unsigned long cur_index;
+
+		if (!span.is_hole)
+			continue;
+		cur_index = span.start_hole;
+		while (cur_index != span.last_hole + 1) {
+			batch_from_xarray(&batch, &pages->pinned_pfns,
+					  cur_index, span.last_hole);
+			iopt_pages_unpin(pages, &batch, cur_index,
+					 cur_index + batch.total_pfns - 1);
+			cur_index += batch.total_pfns;
+			batch_clear(&batch);
+		}
+		clear_xarray(&pages->pinned_pfns, span.start_hole,
+			     span.last_hole);
+	}
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+/**
+ * iopt_pages_fill_from_xarray() - Fast path for reading PFNs
+ * @pages: The pages to act on
+ * @start: The first page index in the range
+ * @last: The last page index in the range
+ * @out_pages: The output array to return the pages
+ *
+ * This can be called if the caller is holding a refcount on an iopt_pages_user
+ * that is known to have already been filled. It quickly reads the pages
+ * directly from the xarray.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages)
+{
+	XA_STATE(xas, &pages->pinned_pfns, start);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		*(out_pages++) = pfn_to_page(xa_to_value(entry));
+		if (start == last)
+			break;
+		start++;
+	}
+	rcu_read_unlock();
+}
+
+/**
+ * iopt_pages_fill_xarray() - Read PFNs
+ * @pages: The pages to act on
+ * @start: The first page index in the range
+ * @last: The last page index in the range
+ * @out_pages: The output array to return the pages, may be NULL
+ *
+ * This populates the xarray and returns the pages in out_pages. As the slow
+ * path this is able to copy pages from other storage tiers into the xarray.
+ *
+ * On failure the xarray is left unchanged.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages)
+{
+	struct interval_tree_span_iter span;
+	unsigned long xa_end = start;
+	struct pfn_reader pfns;
+	int rc;
+
+	rc = pfn_reader_init(&pfns, pages, start, last);
+	if (rc)
+		return rc;
+
+	for (interval_tree_span_iter_first(&span, &pages->users_itree, start,
+					   last);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		if (!span.is_hole) {
+			if (out_pages)
+				iopt_pages_fill_from_xarray(
+					pages + (span.start_used - start),
+					span.start_used, span.last_used,
+					out_pages);
+			continue;
+		}
+
+		rc = pfn_reader_seek_hole(&pfns, &span);
+		if (rc)
+			goto out_clean_xa;
+
+		while (!pfn_reader_done(&pfns)) {
+			rc = batch_to_xarray(&pfns.batch, &pages->pinned_pfns,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_clean_xa;
+			batch_to_pages(&pfns.batch, out_pages);
+			xa_end += pfns.batch.total_pfns;
+			out_pages += pfns.batch.total_pfns;
+			rc = pfn_reader_next(&pfns);
+			if (rc)
+				goto out_clean_xa;
+		}
+	}
+
+	rc = update_pinned(pages);
+	if (rc)
+		goto out_clean_xa;
+	pfn_reader_destroy(&pfns);
+	return 0;
+
+out_clean_xa:
+	if (start != xa_end)
+		iopt_pages_unfill_xarray(pages, start, xa_end - 1);
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+static struct iopt_pages_user *
+iopt_pages_get_exact_user(struct iopt_pages *pages, unsigned long index,
+			  unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* There can be overlapping ranges in this interval tree */
+	for (node = interval_tree_iter_first(&pages->users_itree, index, last);
+	     node; node = interval_tree_iter_next(node, index, last))
+		if (node->start == index && node->last == last)
+			return container_of(node, struct iopt_pages_user, node);
+	return NULL;
+}
+
+/**
+ * iopt_pages_add_user() - Record an in-kernel user for PFNs
+ * @pages: The source of PFNs
+ * @start: First page index
+ * @last: Inclusive last page index
+ * @out_pages: Output list of struct page's representing the PFNs
+ * @write: True if the user will write to the pages
+ *
+ * Record that an in-kernel user will be accessing the pages, ensure they are
+ * pinned, and return the PFNs as a simple list of 'struct page *'.
+ *
+ * This should be undone through a matching call to iopt_pages_remove_user()
+ */
+int iopt_pages_add_user(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages, bool write)
+{
+	struct iopt_pages_user *user;
+	int rc;
+
+	if (pages->writable != write)
+		return -EPERM;
+
+	mutex_lock(&pages->mutex);
+	user = iopt_pages_get_exact_user(pages, start, last);
+	if (user) {
+		refcount_inc(&user->refcount);
+		mutex_unlock(&pages->mutex);
+		iopt_pages_fill_from_xarray(pages, start, last, out_pages);
+		return 0;
+	}
+
+	user = kzalloc(sizeof(*user), GFP_KERNEL);
+	if (!user) {
+		rc = -ENOMEM;
+		goto out_unlock;
+	}
+
+	rc = iopt_pages_fill_xarray(pages, start, last, out_pages);
+	if (rc)
+		goto out_free;
+
+	user->node.start = start;
+	user->node.last = last;
+	refcount_set(&user->refcount, 1);
+	interval_tree_insert(&user->node, &pages->users_itree);
+	mutex_unlock(&pages->mutex);
+	return 0;
+
+out_free:
+	kfree(user);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_pages_remove_user() - Release an in-kernel user for PFNs
+ * @pages: The source of PFNs
+ * @start: First page index
+ * @last: Inclusive last page index
+ *
+ * Undo iopt_pages_add_user() and unpin the pages if necessary. The caller must
+ * stop using the PFNs before calling this.
+ */
+void iopt_pages_remove_user(struct iopt_pages *pages, unsigned long start,
+			    unsigned long last)
+{
+	struct iopt_pages_user *user;
+
+	mutex_lock(&pages->mutex);
+	user = iopt_pages_get_exact_user(pages, start, last);
+	if (WARN_ON(!user))
+		goto out_unlock;
+
+	if (!refcount_dec_and_test(&user->refcount))
+		goto out_unlock;
+
+	interval_tree_remove(&user->node, &pages->users_itree);
+	iopt_pages_unfill_xarray(pages, start, last);
+	kfree(user);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 06/12] iommufd: Algorithms for PFN storage
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

The iopt_pages structure represents a logical linear list of PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and users.

The three storage tiers are managed to meet these objectives:

 - If no iommu_domain or user exists then minimal memory should be
   consumed by iommufd
 - If a page has been pinned then an iopt_pages will not pin it again
 - If an in-kernel user exists then the xarray must provide the backing
   storage to avoid allocations on domain removals
 - Otherwise any iommu_domain will be used for storage

In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.

The external interface for pages has several logical operations:

  iopt_area_fill_domain() will load the PFNs from storage into a single
  domain. This is used when attaching a new domain to an existing IOAS.

  iopt_area_fill_domains() will load the PFNs from storage into multiple
  domains. This is used when creating a new IOVA map in an existing IOAS.

  iopt_pages_add_user() creates an iopt_pages_user that tracks an in-kernel
  user of PFNs. This is some external driver that might be accessing the
  IOVA using the CPU, or programming PFNs with the DMA API, e.g. a VFIO
  mdev (a usage sketch follows this list).

  iopt_pages_fill_xarray() will load PFNs into the xarray and return a
  'struct page *' array. It is used by iopt_pages_users to extract PFNs
  for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
  is known the xarray is already filled.
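
As a rough usage sketch (not part of this patch; the iopt_pages pointer
and the 'start'/'last' index range are placeholders for whatever lookup
the caller already performed), an in-kernel user pairs these calls like
so:

	static int example_touch_range(struct iopt_pages *pages,
				       unsigned long start, unsigned long last)
	{
		struct page **ppages;
		int rc;

		ppages = kcalloc(last - start + 1, sizeof(*ppages), GFP_KERNEL);
		if (!ppages)
			return -ENOMEM;

		/* Pin the range and get it back as struct page pointers */
		rc = iopt_pages_add_user(pages, start, last, ppages, true);
		if (!rc) {
			/* ... kmap_local_page(ppages[i]) to touch the memory ... */
			iopt_pages_remove_user(pages, start, last);
		}
		kfree(ppages);
		return rc;
	}

Note that write=true in the sketch assumes the iopt_pages was created
writable; otherwise iopt_pages_add_user() returns -EPERM.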

As an iopt_pages can be referred to in slices by many areas and users, it
uses interval trees to keep track of which storage tiers currently hold
the PFNs. On a page-by-page basis any request for a PFN will be satisfied
from one of the storage tiers and the PFN copied to the target domain/array.

Unfill actions are similar: on a page-by-page basis domains are unmapped,
xarray entries are freed, or struct pages are fully put back.

Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs
always generates the largest contiguous PFN range possible, which
optimizes the map calls, and this gathering can cross storage tier
boundaries. For cases like 'fill domains', care is taken to avoid
duplicated work: PFNs are read once and pushed into all domains.

The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.
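
For illustration only (the function and parameter names below are mine,
not from the patch), each contiguous run that the batching accumulates
ends up as a single map call into the domain instead of one call per
page, roughly:

	/* Sketch: map 'npages' physically contiguous PFNs with one call */
	static int map_pfn_run(struct iommu_domain *domain, unsigned long iova,
			       unsigned long start_pfn, unsigned long npages,
			       int iommu_prot)
	{
		return iommu_map(domain, iova, PFN_PHYS(start_pfn),
				 npages * PAGE_SIZE, iommu_prot);
	}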

This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the PFN lists instead of working with one driver call per page to
read, and one driver call per contiguous range to store.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/io_pagetable.h |  71 +++-
 drivers/iommu/iommufd/pages.c        | 594 +++++++++++++++++++++++++++
 2 files changed, 664 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 94ca8712722d31..c8b6a60ff24c94 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -47,6 +47,14 @@ struct iopt_area {
 	atomic_t num_users;
 };
 
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain);
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain);
+
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
 	return area->pages_node.start;
@@ -67,6 +75,37 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
 	return area->node.last;
 }
 
+static inline size_t iopt_area_length(struct iopt_area *area)
+{
+	return (area->node.last - area->node.start) + 1;
+}
+
+static inline struct iopt_area *iopt_area_iter_first(struct io_pagetable *iopt,
+						     unsigned long start,
+						     unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	node = interval_tree_iter_first(&iopt->area_itree, start, last);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, node);
+}
+
+static inline struct iopt_area *iopt_area_iter_next(struct iopt_area *area,
+						    unsigned long start,
+						    unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_next(&area->node, start, last);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, node);
+}
+
 /*
  * This holds a pinned page list for multiple areas of IO address space. The
  * pages always originate from a linear chunk of userspace VA. Multiple
@@ -75,7 +114,7 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
  *
  * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
  * the start of the uptr and extend to npages. pages are pinned dynamically
- * according to the intervals in the users_itree and domains_itree, npages
+ * according to the intervals in the users_itree and domains_itree, npinned
  * records the current number of pages pinned.
  */
 struct iopt_pages {
@@ -98,4 +137,34 @@ struct iopt_pages {
 	struct rb_root_cached domains_itree;
 };
 
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable);
+void iopt_release_pages(struct kref *kref);
+static inline void iopt_put_pages(struct iopt_pages *pages)
+{
+	kref_put(&pages->kref, iopt_release_pages);
+}
+
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages);
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages);
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last);
+
+int iopt_pages_add_user(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages,
+			bool write);
+void iopt_pages_remove_user(struct iopt_pages *pages, unsigned long start,
+			    unsigned long last);
+
+/*
+ * Each interval represents an active iopt_access_pages(); it acts as an
+ * interval lock that keeps the PFNs pinned and stored in the xarray.
+ */
+struct iopt_pages_user {
+	struct interval_tree_node node;
+	refcount_t refcount;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index a75e1c73527920..8e6a8cc8b20ad1 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -140,6 +140,18 @@ static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iopt_area_unmap_domain_range(struct iopt_area *area,
+					 struct iommu_domain *domain,
+					 unsigned long start_index,
+					 unsigned long last_index)
+{
+	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
+
+	iommu_unmap_nofail(domain, start_iova,
+			   iopt_area_index_to_iova_last(area, last_index) -
+				   start_iova + 1);
+}
+
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
 						     unsigned long index)
 {
@@ -721,3 +733,585 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
 	}
 	return 0;
 }
+
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable)
+{
+	struct iopt_pages *pages;
+
+	/*
+	 * The iommu API uses size_t as the length; also protect the DIV_ROUND_UP
+	 * below from overflow
+	 */
+	if (length > SIZE_MAX - PAGE_SIZE || length == 0)
+		return ERR_PTR(-EINVAL);
+
+	pages = kzalloc(sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&pages->kref);
+	xa_init(&pages->pinned_pfns);
+	mutex_init(&pages->mutex);
+	pages->source_mm = current->mm;
+	mmgrab(pages->source_mm);
+	pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+	pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+	pages->users_itree = RB_ROOT_CACHED;
+	pages->domains_itree = RB_ROOT_CACHED;
+	pages->writable = writable;
+	pages->has_cap_ipc_lock = capable(CAP_IPC_LOCK);
+	pages->source_task = current->group_leader;
+	get_task_struct(current->group_leader);
+	pages->source_user = get_uid(current_user());
+	return pages;
+}
+
+void iopt_release_pages(struct kref *kref)
+{
+	struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
+
+	WARN_ON(!RB_EMPTY_ROOT(&pages->users_itree.rb_root));
+	WARN_ON(!RB_EMPTY_ROOT(&pages->domains_itree.rb_root));
+	WARN_ON(pages->npinned);
+	WARN_ON(!xa_empty(&pages->pinned_pfns));
+	mmdrop(pages->source_mm);
+	mutex_destroy(&pages->mutex);
+	put_task_struct(pages->source_task);
+	free_uid(pages->source_user);
+	kfree(pages);
+}
+
+/* Quickly guess if the interval tree might fully cover the given interval */
+static bool interval_tree_fully_covers(struct rb_root_cached *root,
+				       unsigned long index, unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(root, index, last);
+	if (!node)
+		return false;
+	return node->start <= index && node->last >= last;
+}
+
+static bool interval_tree_fully_covers_area(struct rb_root_cached *root,
+					    struct iopt_area *area)
+{
+	return interval_tree_fully_covers(root, iopt_area_index(area),
+					  iopt_area_last_index(area));
+}
+
+static void __iopt_area_unfill_domain(struct iopt_area *area,
+				      struct iopt_pages *pages,
+				      struct iommu_domain *domain,
+				      unsigned long last_index)
+{
+	unsigned long unmapped_index = iopt_area_index(area);
+	unsigned long cur_index = unmapped_index;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* Fast path if there is obviously something else using every pfn */
+	if (interval_tree_fully_covers_area(&pages->domains_itree, area) ||
+	    interval_tree_fully_covers_area(&pages->users_itree, area)) {
+		iopt_area_unmap_domain_range(area, domain,
+					     iopt_area_index(area), last_index);
+		return;
+	}
+
+	/*
+	 * unmaps must always 'cut' at a place where the pfns are not contiguous
+	 * to pair with the maps that always install contiguous pages. This
+	 * algorithm is efficient in the expected case of few pinners.
+	 */
+	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
+	while (cur_index != last_index + 1) {
+		unsigned long batch_index = cur_index;
+
+		batch_from_domain(&batch, domain, area, cur_index, last_index);
+		cur_index += batch.total_pfns;
+		iopt_area_unmap_domain_range(area, domain, unmapped_index,
+					     cur_index - 1);
+		unmapped_index = cur_index;
+		iopt_pages_unpin(pages, &batch, batch_index, cur_index - 1);
+		batch_clear(&batch);
+	}
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+static void iopt_area_unfill_partial_domain(struct iopt_area *area,
+					    struct iopt_pages *pages,
+					    struct iommu_domain *domain,
+					    unsigned long end_index)
+{
+	if (end_index != iopt_area_index(area))
+		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+}
+
+/**
+ * iopt_unmap_domain() - Unmap without unpinning PFNs in a domain
+ * @iopt: The iopt the domain is part of
+ * @domain: The domain to unmap
+ *
+ * The caller must know that unpinning is not required, usually because there
+ * are other domains in the iopt.
+ */
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain)
+{
+	struct interval_tree_span_iter span;
+
+	for (interval_tree_span_iter_first(&span, &iopt->area_itree, 0,
+					   ULONG_MAX);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span))
+		if (!span.is_hole)
+			iommu_unmap_nofail(domain, span.start_used,
+					   span.last_used - span.start_used +
+						   1);
+}
+
+/**
+ * iopt_area_unfill_domain() - Unmap and unpin PFNs in a domain
+ * @area: IOVA area to use
+ * @pages: page supplier for the area (area->pages is NULL)
+ * @domain: Domain to unmap from
+ *
+ * The domain should be removed from the domains_itree before calling. The
+ * domain will always be unmapped, but the PFNs may not be unpinned if there are
+ * still users.
+ */
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain)
+{
+	__iopt_area_unfill_domain(area, pages, domain,
+				  iopt_area_last_index(area));
+}
+
+/**
+ * iopt_area_fill_domain() - Map PFNs from the area into a domain
+ * @area: IOVA area to use
+ * @domain: Domain to load PFNs into
+ *
+ * Read the pfns from the area's underlying iopt_pages and map them into the
+ * given domain. Called when attaching a new domain to an io_pagetable.
+ */
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
+{
+	struct pfn_reader pfns;
+	int rc;
+
+	lockdep_assert_held(&area->pages->mutex);
+
+	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		return rc;
+
+	while (!pfn_reader_done(&pfns)) {
+		rc = batch_to_domain(&pfns.batch, domain, area,
+				     pfns.batch_start_index);
+		if (rc)
+			goto out_unmap;
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+
+	rc = update_pinned(area->pages);
+	if (rc)
+		goto out_unmap;
+	goto out_destroy;
+
+out_unmap:
+	iopt_area_unfill_partial_domain(area, area->pages, domain,
+					pfns.batch_start_index);
+out_destroy:
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+/**
+ * iopt_area_fill_domains() - Install PFNs into the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area creation. The area is freshly created and not inserted in
+ * the domains_itree yet. PFNs are read and loaded into every domain held in the
+ * area's io_pagetable and the area is installed in the domains_itree.
+ *
+ * On failure all domains are left unchanged.
+ */
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct pfn_reader pfns;
+	struct iommu_domain *domain;
+	unsigned long unmap_index;
+	unsigned long index;
+	int rc;
+
+	lockdep_assert_held(&area->iopt->domains_rwsem);
+
+	if (xa_empty(&area->iopt->domains))
+		return 0;
+
+	mutex_lock(&pages->mutex);
+	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		xa_for_each (&area->iopt->domains, index, domain) {
+			rc = batch_to_domain(&pfns.batch, domain, area,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_unmap;
+		}
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+	rc = update_pinned(pages);
+	if (rc)
+		goto out_unmap;
+
+	area->storage_domain = xa_load(&area->iopt->domains, 0);
+	interval_tree_insert(&area->pages_node, &pages->domains_itree);
+	goto out_destroy;
+
+out_unmap:
+	xa_for_each (&area->iopt->domains, unmap_index, domain) {
+		unsigned long end_index = pfns.batch_start_index;
+
+		if (unmap_index <= index)
+			end_index = pfns.batch_end_index;
+
+		/*
+		 * The area is not yet part of the domains_itree so we have to
+		 * manage the unpinning specially. The last domain does the
+		 * unpin, every other domain is just unmapped.
+		 */
+		if (unmap_index != area->iopt->next_domain_id - 1) {
+			if (end_index != iopt_area_index(area))
+				iopt_area_unmap_domain_range(
+					area, domain, iopt_area_index(area),
+					end_index - 1);
+		} else {
+			iopt_area_unfill_partial_domain(area, pages, domain,
+							end_index);
+		}
+	}
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_area_unfill_domains() - unmap PFNs from the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area destruction. This unmaps the IOVAs covered by all the
+ * area's domains and releases the PFNs.
+ */
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct io_pagetable *iopt = area->iopt;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	mutex_lock(&pages->mutex);
+	if (!area->storage_domain)
+		goto out_unlock;
+
+	xa_for_each (&iopt->domains, index, domain)
+		if (domain != area->storage_domain)
+			iopt_area_unmap_domain_range(
+				area, domain, iopt_area_index(area),
+				iopt_area_last_index(area));
+
+	interval_tree_remove(&area->pages_node, &pages->domains_itree);
+	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	area->storage_domain = NULL;
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
+
+/*
+ * Erase entries in the pinned_pfns xarray that are not covered by any users.
+ * This does not unpin the pages; the caller is responsible for the pin
+ * reference. The main purpose of this action is to save memory in the xarray.
+ */
+static void iopt_pages_clean_xarray(struct iopt_pages *pages,
+				    unsigned long index, unsigned long last)
+{
+	struct interval_tree_span_iter span;
+
+	lockdep_assert_held(&pages->mutex);
+
+	for (interval_tree_span_iter_first(&span, &pages->users_itree, index,
+					   last);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span))
+		if (span.is_hole)
+			clear_xarray(&pages->pinned_pfns, span.start_hole,
+				     span.last_hole);
+}
+
+/**
+ * iopt_pages_unfill_xarray() - Update the xarray after removing a user
+ * @pages: The pages to act on
+ * @start: Starting PFN index
+ * @last: Last PFN index
+ *
+ * Called when an iopt_pages_user is removed. PFNs with no remaining user are
+ * released from the xarray. The user should already be out of the users_itree.
+ */
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last)
+{
+	struct interval_tree_span_iter span;
+	struct pfn_batch batch;
+	u64 backup[BATCH_BACKUP_SIZE];
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (interval_tree_fully_covers(&pages->domains_itree, start, last))
+		return iopt_pages_clean_xarray(pages, start, last);
+
+	batch_init_backup(&batch, last + 1, backup, sizeof(backup));
+	for (interval_tree_span_iter_first(&span, &pages->users_itree, start,
+					   last);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		unsigned long cur_index;
+
+		if (!span.is_hole)
+			continue;
+		cur_index = span.start_hole;
+		while (cur_index != span.last_hole + 1) {
+			batch_from_xarray(&batch, &pages->pinned_pfns,
+					  cur_index, span.last_hole);
+			iopt_pages_unpin(pages, &batch, cur_index,
+					 cur_index + batch.total_pfns - 1);
+			cur_index += batch.total_pfns;
+			batch_clear(&batch);
+		}
+		clear_xarray(&pages->pinned_pfns, span.start_hole,
+			     span.last_hole);
+	}
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+/**
+ * iopt_pages_fill_from_xarray() - Fast path for reading PFNs
+ * @pages: The pages to act on
+ * @start: The first page index in the range
+ * @last: The last page index in the range
+ * @out_pages: The output array to return the pages
+ *
+ * This can be called if the caller is holding a refcount on an iopt_pages_user
+ * that is known to have already been filled. It quickly reads the pages
+ * directly from the xarray.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages)
+{
+	XA_STATE(xas, &pages->pinned_pfns, start);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		*(out_pages++) = pfn_to_page(xa_to_value(entry));
+		if (start == last)
+			break;
+		start++;
+	}
+	rcu_read_unlock();
+}
+
+/**
+ * iopt_pages_fill_xarray() - Read PFNs
+ * @pages: The pages to act on
+ * @start: The first page index in the range
+ * @last: The last page index in the range
+ * @out_pages: The output array to return the pages, may be NULL
+ *
+ * This populates the xarray and returns the pages in out_pages. As the slow
+ * path this is able to copy pages from other storage tiers into the xarray.
+ *
+ * On failure the xarray is left unchanged.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages)
+{
+	struct interval_tree_span_iter span;
+	unsigned long xa_end = start;
+	struct pfn_reader pfns;
+	int rc;
+
+	rc = pfn_reader_init(&pfns, pages, start, last);
+	if (rc)
+		return rc;
+
+	for (interval_tree_span_iter_first(&span, &pages->users_itree, start,
+					   last);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		if (!span.is_hole) {
+			if (out_pages)
+				iopt_pages_fill_from_xarray(
+					pages + (span.start_used - start),
+					span.start_used, span.last_used,
+					out_pages);
+			continue;
+		}
+
+		rc = pfn_reader_seek_hole(&pfns, &span);
+		if (rc)
+			goto out_clean_xa;
+
+		while (!pfn_reader_done(&pfns)) {
+			rc = batch_to_xarray(&pfns.batch, &pages->pinned_pfns,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_clean_xa;
+			batch_to_pages(&pfns.batch, out_pages);
+			xa_end += pfns.batch.total_pfns;
+			out_pages += pfns.batch.total_pfns;
+			rc = pfn_reader_next(&pfns);
+			if (rc)
+				goto out_clean_xa;
+		}
+	}
+
+	rc = update_pinned(pages);
+	if (rc)
+		goto out_clean_xa;
+	pfn_reader_destroy(&pfns);
+	return 0;
+
+out_clean_xa:
+	if (start != xa_end)
+		iopt_pages_unfill_xarray(pages, start, xa_end - 1);
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+static struct iopt_pages_user *
+iopt_pages_get_exact_user(struct iopt_pages *pages, unsigned long index,
+			  unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* There can be overlapping ranges in this interval tree */
+	for (node = interval_tree_iter_first(&pages->users_itree, index, last);
+	     node; node = interval_tree_iter_next(node, index, last))
+		if (node->start == index && node->last == last)
+			return container_of(node, struct iopt_pages_user, node);
+	return NULL;
+}
+
+/**
+ * iopt_pages_add_user() - Record an in-kernel user for PFNs
+ * @pages: The source of PFNs
+ * @start: First page index
+ * @last: Inclusive last page index
+ * @out_pages: Output list of struct page's representing the PFNs
+ * @write: True if the user will write to the pages
+ *
+ * Record that an in-kernel user will be accessing the pages, ensure they are
+ * pinned, and return the PFNs as a simple list of 'struct page *'.
+ *
+ * This should be undone through a matching call to iopt_pages_remove_user()
+ */
+int iopt_pages_add_user(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages, bool write)
+{
+	struct iopt_pages_user *user;
+	int rc;
+
+	if (pages->writable != write)
+		return -EPERM;
+
+	mutex_lock(&pages->mutex);
+	user = iopt_pages_get_exact_user(pages, start, last);
+	if (user) {
+		refcount_inc(&user->refcount);
+		mutex_unlock(&pages->mutex);
+		iopt_pages_fill_from_xarray(pages, start, last, out_pages);
+		return 0;
+	}
+
+	user = kzalloc(sizeof(*user), GFP_KERNEL);
+	if (!user) {
+		rc = -ENOMEM;
+		goto out_unlock;
+	}
+
+	rc = iopt_pages_fill_xarray(pages, start, last, out_pages);
+	if (rc)
+		goto out_free;
+
+	user->node.start = start;
+	user->node.last = last;
+	refcount_set(&user->refcount, 1);
+	interval_tree_insert(&user->node, &pages->users_itree);
+	mutex_unlock(&pages->mutex);
+	return 0;
+
+out_free:
+	kfree(user);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_pages_remove_user() - Release an in-kernel user for PFNs
+ * @pages: The source of PFNs
+ * @start: First page index
+ * @last: Inclusive last page index
+ *
+ * Undo iopt_pages_add_user() and unpin the pages if necessary. The caller must
+ * stop using the PFNs before calling this.
+ */
+void iopt_pages_remove_user(struct iopt_pages *pages, unsigned long start,
+			    unsigned long last)
+{
+	struct iopt_pages_user *user;
+
+	mutex_lock(&pages->mutex);
+	user = iopt_pages_get_exact_user(pages, start, last);
+	if (WARN_ON(!user))
+		goto out_unlock;
+
+	if (!refcount_dec_and_test(&user->refcount))
+		goto out_unlock;
+
+	interval_tree_remove(&user->node, &pages->users_itree);
+	iopt_pages_unfill_xarray(pages, start, last);
+	kfree(user);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
-- 
2.35.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the remainder of the IOAS data structure. Provide an object called
an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
along with a list of iommu_domains that mirror the IOVA to PFN map.

At the top this is a simple interval tree of iopt_areas indicating the map
of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
on the attached domains there is a minimum alignment for areas (which may
be smaller than PAGE_SIZE) and an interval tree of reserved IOVA that
can't be mapped.
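
The struct definition itself is not quoted in this commit message; pieced
together from the fields io_pagetable.c below touches, it looks roughly
like:

	struct io_pagetable {
		struct rw_semaphore	domains_rwsem;
		struct xarray		domains;	/* of struct iommu_domain */
		unsigned int		next_domain_id;

		struct rw_semaphore	iova_rwsem;
		struct rb_root_cached	area_itree;	/* of struct iopt_area */
		struct rb_root_cached	reserved_iova_itree;
		unsigned long		iova_alignment;
	};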

The concept of a 'user' refers to something like a VFIO mdev that is
accessing the IOVA and using a 'struct page *' for CPU based access.

Externally an API is provided that matches the requirements of the IOCTL
interface for map/unmap and domain attachment.

The API provides a 'copy' primitive to establish a new IOVA map in a
different IOAS from an existing mapping.

This is designed to support a pre-registration flow where userspace would
set up a dummy IOAS with no domains, map in memory, and then establish a
user to pin all PFNs into the xarray.

Copy can then be used to create new IOVA mappings in a different IOAS,
with iommu_domains attached. Upon copy the PFNs will be read out of the
xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
overheads.
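
A rough sketch of that flow in terms of the internal API added here
(error unwind trimmed; src_iopt, dst_iopt, uptr, length and page_list are
placeholder arguments, and the real user-facing path is the IOCTL layer):

	static int sketch_preregister_and_copy(struct io_pagetable *src_iopt,
					       struct io_pagetable *dst_iopt,
					       void __user *uptr,
					       unsigned long length,
					       struct page **page_list)
	{
		unsigned long src_iova = 0, dst_iova = 0, start_byte = 0;
		struct iopt_pages *pages;
		int rc;

		/* Map into the domain-less IOAS; nothing is pinned yet */
		rc = iopt_map_user_pages(src_iopt, &src_iova, uptr, length,
					 IOMMU_READ | IOMMU_WRITE,
					 IOPT_ALLOC_IOVA);
		if (rc)
			return rc;

		/* The access acts as the user that pins PFNs into the xarray */
		rc = iopt_access_pages(src_iopt, src_iova, length, page_list,
				       true);
		if (rc)
			return rc;

		/* 'Copy': share the same iopt_pages with a domain-backed IOAS */
		pages = iopt_get_pages(src_iopt, src_iova, &start_byte, length);
		if (IS_ERR(pages))
			return PTR_ERR(pages);

		rc = iopt_map_pages(dst_iopt, pages, &dst_iova, start_byte,
				    length, IOMMU_READ | IOMMU_WRITE,
				    IOPT_ALLOC_IOVA);
		if (rc)
			iopt_put_pages(pages);
		return rc;
	}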

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/io_pagetable.c    | 890 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  35 +
 3 files changed, 926 insertions(+)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 05a0e91e30afad..b66a8c47ff55ec 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	io_pagetable.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
new file mode 100644
index 00000000000000..f9f3b06946bfb9
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -0,0 +1,890 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The io_pagetable is the top of datastructure that maps IOVA's to PFNs. The
+ * PFNs can be placed into an iommu_domain, or returned to the caller as a page
+ * list for access by an in-kernel user.
+ *
+ * The datastructure uses the iopt_pages to optimize the storage of the PFNs
+ * between the domains and xarray.
+ */
+#include <linux/lockdep.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+
+#include "io_pagetable.h"
+
+static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
+					     unsigned long iova)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(iova < iopt_area_iova(area) ||
+			iova > iopt_area_last_iova(area));
+	return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
+}
+
+static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
+					      unsigned long iova,
+					      unsigned long last_iova)
+{
+	struct iopt_area *area;
+
+	area = iopt_area_iter_first(iopt, iova, last_iova);
+	if (!area || !area->pages || iopt_area_iova(area) != iova ||
+	    iopt_area_last_iova(area) != last_iova)
+		return NULL;
+	return area;
+}
+
+static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
+		return false;
+
+	span->start_hole =
+		ALIGN(span->start_hole, iova_alignment) | page_offset;
+	if (span->start_hole > span->last_hole ||
+	    span->last_hole - span->start_hole < length - 1)
+		return false;
+	return true;
+}
+
+/*
+ * Automatically find a block of IOVA that is not being used and not reserved.
+ * Does not return a 0 IOVA even if it is valid.
+ */
+static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
+			   unsigned long uptr, unsigned long length)
+{
+	struct interval_tree_span_iter reserved_span;
+	unsigned long page_offset = uptr % PAGE_SIZE;
+	struct interval_tree_span_iter area_span;
+	unsigned long iova_alignment;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	/* Protect roundup_pow_of_two() from overflow */
+	if (length == 0 || length >= ULONG_MAX / 2)
+		return -EOVERFLOW;
+
+	/*
+	 * Keep alignment present in the uptr when building the IOVA, this
+	 * increases the chance we can map a THP.
+	 */
+	if (!uptr)
+		iova_alignment = roundup_pow_of_two(length);
+	else
+		iova_alignment =
+			min_t(unsigned long, roundup_pow_of_two(length),
+			      1UL << __ffs64(uptr));
+
+	if (iova_alignment < iopt->iova_alignment)
+		return -EINVAL;
+	for (interval_tree_span_iter_first(&area_span, &iopt->area_itree,
+					   PAGE_SIZE, ULONG_MAX - PAGE_SIZE);
+	     !interval_tree_span_iter_done(&area_span);
+	     interval_tree_span_iter_next(&area_span)) {
+		if (!__alloc_iova_check_hole(&area_span, length, iova_alignment,
+					     page_offset))
+			continue;
+
+		for (interval_tree_span_iter_first(
+			     &reserved_span, &iopt->reserved_iova_itree,
+			     area_span.start_hole, area_span.last_hole);
+		     !interval_tree_span_iter_done(&reserved_span);
+		     interval_tree_span_iter_next(&reserved_span)) {
+			if (!__alloc_iova_check_hole(&reserved_span, length,
+						     iova_alignment,
+						     page_offset))
+				continue;
+
+			*iova = reserved_span.start_hole;
+			return 0;
+		}
+	}
+	return -ENOSPC;
+}
+
+/*
+ * The area takes a slice of the pages from start_byte to start_byte + length
+ */
+static struct iopt_area *
+iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
+		unsigned long iova, unsigned long start_byte,
+		unsigned long length, int iommu_prot, unsigned int flags)
+{
+	struct iopt_area *area;
+	int rc;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (!area)
+		return ERR_PTR(-ENOMEM);
+
+	area->iopt = iopt;
+	area->iommu_prot = iommu_prot;
+	area->page_offset = start_byte % PAGE_SIZE;
+	area->pages_node.start = start_byte / PAGE_SIZE;
+	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
+		return ERR_PTR(-EOVERFLOW);
+	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
+	if (WARN_ON(area->pages_node.last >= pages->npages))
+		return ERR_PTR(-EOVERFLOW);
+
+	down_write(&iopt->iova_rwsem);
+	if (flags & IOPT_ALLOC_IOVA) {
+		rc = iopt_alloc_iova(iopt, &iova,
+				     (uintptr_t)pages->uptr + start_byte,
+				     length);
+		if (rc)
+			goto out_unlock;
+	}
+
+	if (check_add_overflow(iova, length - 1, &area->node.last)) {
+		rc = -EOVERFLOW;
+		goto out_unlock;
+	}
+
+	if (!(flags & IOPT_ALLOC_IOVA)) {
+		if ((iova & (iopt->iova_alignment - 1)) ||
+		    (length & (iopt->iova_alignment - 1)) || !length) {
+			rc = -EINVAL;
+			goto out_unlock;
+		}
+
+		/* No reserved IOVA intersects the range */
+		if (interval_tree_iter_first(&iopt->reserved_iova_itree, iova,
+					     area->node.last)) {
+			rc = -ENOENT;
+			goto out_unlock;
+		}
+
+		/* Check that there is not already a mapping in the range */
+		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
+			rc = -EADDRINUSE;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The area is inserted with a NULL pages indicating it is not fully
+	 * initialized yet.
+	 */
+	area->node.start = iova;
+	interval_tree_insert(&area->node, &area->iopt->area_itree);
+	up_write(&iopt->iova_rwsem);
+	return area;
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	kfree(area);
+	return ERR_PTR(rc);
+}
+
+static void iopt_abort_area(struct iopt_area *area)
+{
+	down_write(&area->iopt->iova_rwsem);
+	interval_tree_remove(&area->node, &area->iopt->area_itree);
+	up_write(&area->iopt->iova_rwsem);
+	kfree(area);
+}
+
+static int iopt_finalize_area(struct iopt_area *area, struct iopt_pages *pages)
+{
+	int rc;
+
+	down_read(&area->iopt->domains_rwsem);
+	rc = iopt_area_fill_domains(area, pages);
+	if (!rc) {
+		/*
+		 * area->pages must be set inside the domains_rwsem to ensure
+		 * any newly added domains will get filled. Moves the reference
+		 * in from the caller
+		 */
+		down_write(&area->iopt->iova_rwsem);
+		area->pages = pages;
+		up_write(&area->iopt->iova_rwsem);
+	}
+	up_read(&area->iopt->domains_rwsem);
+	return rc;
+}
+
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+		   unsigned long *dst_iova, unsigned long start_bytes,
+		   unsigned long length, int iommu_prot, unsigned int flags)
+{
+	struct iopt_area *area;
+	int rc;
+
+	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
+		return -EPERM;
+
+	area = iopt_alloc_area(iopt, pages, *dst_iova, start_bytes, length,
+			       iommu_prot, flags);
+	if (IS_ERR(area))
+		return PTR_ERR(area);
+	*dst_iova = iopt_area_iova(area);
+
+	rc = iopt_finalize_area(area, pages);
+	if (rc) {
+		iopt_abort_area(area);
+		return rc;
+	}
+	return 0;
+}
+
+/**
+ * iopt_map_user_pages() - Map a user VA to an iova in the io page table
+ * @iopt: io_pagetable to act on
+ * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
+ *        the chosen iova on output. Otherwise it is the iova to map to on input
+ * @uptr: User VA to map
+ * @length: Number of bytes to map
+ * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
+ * @flags: IOPT_ALLOC_IOVA or zero
+ *
+ * iova, uptr, and length must be aligned to iova_alignment. For domain backed
+ * page tables this will pin the pages and load them into the domain at iova.
+ * For non-domain page tables this will only setup a lazy reference and the
+ * caller must use iopt_access_pages() to touch them.
+ *
+ * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
+ * destroyed.
+ */
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+			void __user *uptr, unsigned long length, int iommu_prot,
+			unsigned int flags)
+{
+	struct iopt_pages *pages;
+	int rc;
+
+	pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	rc = iopt_map_pages(iopt, pages, iova, uptr - pages->uptr, length,
+			    iommu_prot, flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		return rc;
+	}
+	return 0;
+}
+
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+				  unsigned long *start_byte,
+				  unsigned long length)
+{
+	unsigned long iova_end;
+	struct iopt_pages *pages;
+	struct iopt_area *area;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return ERR_PTR(-EOVERFLOW);
+
+	down_read(&iopt->iova_rwsem);
+	area = iopt_find_exact_area(iopt, iova, iova_end);
+	if (!area) {
+		up_read(&iopt->iova_rwsem);
+		return ERR_PTR(-ENOENT);
+	}
+	pages = area->pages;
+	*start_byte = area->page_offset + iopt_area_index(area) * PAGE_SIZE;
+	kref_get(&pages->kref);
+	up_read(&iopt->iova_rwsem);
+
+	return pages;
+}
+
+static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
+			     struct iopt_pages *pages)
+{
+	/* Drivers have to unpin on notification. */
+	if (WARN_ON(atomic_read(&area->num_users)))
+		return -EBUSY;
+
+	iopt_area_unfill_domains(area, pages);
+	WARN_ON(atomic_read(&area->num_users));
+	iopt_abort_area(area);
+	iopt_put_pages(pages);
+	return 0;
+}
+
+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ *
+ * The requested range must exactly match an existing range.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length)
+{
+	struct iopt_pages *pages;
+	struct iopt_area *area;
+	unsigned long iova_end;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return -EOVERFLOW;
+
+	down_read(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	area = iopt_find_exact_area(iopt, iova, iova_end);
+	if (!area) {
+		up_write(&iopt->iova_rwsem);
+		up_read(&iopt->domains_rwsem);
+		return -ENOENT;
+	}
+	pages = area->pages;
+	area->pages = NULL;
+	up_write(&iopt->iova_rwsem);
+
+	rc = __iopt_unmap_iova(iopt, area, pages);
+	up_read(&iopt->domains_rwsem);
+	return rc;
+}
+
+int iopt_unmap_all(struct io_pagetable *iopt)
+{
+	struct iopt_area *area;
+	int rc;
+
+	down_read(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	while ((area = iopt_area_iter_first(iopt, 0, ULONG_MAX))) {
+		struct iopt_pages *pages;
+
+		/* Userspace should not race unmap all and map */
+		if (!area->pages) {
+			rc = -EBUSY;
+			goto out_unlock_iova;
+		}
+		pages = area->pages;
+		area->pages = NULL;
+		up_write(&iopt->iova_rwsem);
+
+		rc = __iopt_unmap_iova(iopt, area, pages);
+		if (rc)
+			goto out_unlock_domains;
+
+		down_write(&iopt->iova_rwsem);
+	}
+	rc = 0;
+
+out_unlock_iova:
+	up_write(&iopt->iova_rwsem);
+out_unlock_domains:
+	up_read(&iopt->domains_rwsem);
+	return rc;
+}
+
+/**
+ * iopt_access_pages() - Return a list of pages under the iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to access
+ * @out_pages: Output page list
+ * @write: True if access is for writing
+ *
+ * Reads the pages under @iova and returns the struct page * pointers. These
+ * can be kmap'd by the caller for CPU access.
+ *
+ * The caller must perform iopt_unaccess_pages() when done to balance this.
+ *
+ * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
+ * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
+ * touch memory outside the requested iova slice.
+ *
+ * FIXME: callers that need a DMA mapping via a sgl should create another
+ * interface to build the SGL efficiently
+ */
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+		      unsigned long length, struct page **out_pages, bool write)
+{
+	unsigned long cur_iova = iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
+	     area = iopt_area_iter_next(area, iova, last_iova)) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+		unsigned long last_index;
+		unsigned long index;
+
+		/* Need contiguous areas in the access */
+		if (iopt_area_iova(area) < cur_iova || !area->pages) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		index = iopt_area_iova_to_index(area, cur_iova);
+		last_index = iopt_area_iova_to_index(area, last);
+		rc = iopt_pages_add_user(area->pages, index, last_index,
+					 out_pages, write);
+		if (rc)
+			goto out_remove;
+		if (last == last_iova)
+			break;
+		/*
+		 * Can't cross areas that are not aligned to the system page
+		 * size with this API.
+		 */
+		if (cur_iova % PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+		cur_iova = last + 1;
+		out_pages += last_index - index;
+		atomic_inc(&area->num_users);
+	}
+
+	up_read(&iopt->iova_rwsem);
+	return 0;
+
+out_remove:
+	if (cur_iova != iova)
+		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
+	up_read(&iopt->iova_rwsem);
+	return rc;
+}
+
+/**
+ * iopt_unaccess_pages() - Undo iopt_access_pages
+ * @iopt: io_pagetable to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to access
+ *
+ * Undo iopt_access_pages(). The caller must stop accessing the pages before
+ * this. The iova/length must exactly match the one provided to access_pages.
+ */
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+			 size_t length)
+{
+	unsigned long cur_iova = iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+
+	if (WARN_ON(!length) ||
+	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
+		return;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
+	     area = iopt_area_iter_next(area, iova, last_iova)) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+		int num_users;
+
+		iopt_pages_remove_user(area->pages,
+				       iopt_area_iova_to_index(area, cur_iova),
+				       iopt_area_iova_to_index(area, last));
+		if (last == last_iova)
+			break;
+		cur_iova = last + 1;
+		num_users = atomic_dec_return(&area->num_users);
+		WARN_ON(num_users < 0);
+	}
+	up_read(&iopt->iova_rwsem);
+}
+
+struct iopt_reserved_iova {
+	struct interval_tree_node node;
+	void *owner;
+};
+
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner)
+{
+	struct iopt_reserved_iova *reserved;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	if (iopt_area_iter_first(iopt, start, last))
+		return -EADDRINUSE;
+
+	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL);
+	if (!reserved)
+		return -ENOMEM;
+	reserved->node.start = start;
+	reserved->node.last = last;
+	reserved->owner = owner;
+	interval_tree_insert(&reserved->node, &iopt->reserved_iova_itree);
+	return 0;
+}
+
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+
+	struct interval_tree_node *node;
+
+	for (node = interval_tree_iter_first(&iopt->reserved_iova_itree, 0,
+					     ULONG_MAX);
+	     node;) {
+		struct iopt_reserved_iova *reserved =
+			container_of(node, struct iopt_reserved_iova, node);
+
+		node = interval_tree_iter_next(node, 0, ULONG_MAX);
+
+		if (reserved->owner == owner) {
+			interval_tree_remove(&reserved->node,
+					     &iopt->reserved_iova_itree);
+			kfree(reserved);
+		}
+	}
+}
+
+int iopt_init_table(struct io_pagetable *iopt)
+{
+	init_rwsem(&iopt->iova_rwsem);
+	init_rwsem(&iopt->domains_rwsem);
+	iopt->area_itree = RB_ROOT_CACHED;
+	iopt->reserved_iova_itree = RB_ROOT_CACHED;
+	xa_init(&iopt->domains);
+
+	/*
+	 * iopts start as SW tables that can use the entire size_t IOVA space
+	 * due to the use of size_t in the APIs. They have no alignment
+	 * restriction.
+	 */
+	iopt->iova_alignment = 1;
+
+	return 0;
+}
+
+void iopt_destroy_table(struct io_pagetable *iopt)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		iopt_remove_reserved_iova(iopt, NULL);
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_iova_itree.rb_root));
+	WARN_ON(!xa_empty(&iopt->domains));
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
+}
+
+/**
+ * iopt_unfill_domain() - Unfill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to unfill
+ *
+ * This is used when removing a domain from the iopt. Every area in the iopt
+ * will be unmapped from the domain. The domain must already be removed from the
+ * domains xarray.
+ */
+static void iopt_unfill_domain(struct io_pagetable *iopt,
+			       struct iommu_domain *domain)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	/*
+	 * Some other domain is holding all the pfns still, rapidly unmap this
+	 * domain.
+	 */
+	if (iopt->next_domain_id != 0) {
+		/* Pick an arbitrary remaining domain to act as storage */
+		struct iommu_domain *storage_domain =
+			xa_load(&iopt->domains, 0);
+
+		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			struct iopt_pages *pages = area->pages;
+
+			if (WARN_ON(!pages))
+				continue;
+
+			mutex_lock(&pages->mutex);
+			if (area->storage_domain != domain) {
+				mutex_unlock(&pages->mutex);
+				continue;
+			}
+			area->storage_domain = storage_domain;
+			mutex_unlock(&pages->mutex);
+		}
+
+
+		iopt_unmap_domain(iopt, domain);
+		return;
+	}
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (WARN_ON(!pages))
+			continue;
+
+		mutex_lock(&pages->mutex);
+		interval_tree_remove(&area->pages_node,
+				     &area->pages->domains_itree);
+		WARN_ON(area->storage_domain != domain);
+		area->storage_domain = NULL;
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+}
+
+/**
+ * iopt_fill_domain() - Fill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to fill
+ *
+ * Fill the domain with PFNs from every area in the iopt. On failure the domain
+ * is left unchanged.
+ */
+static int iopt_fill_domain(struct io_pagetable *iopt,
+			    struct iommu_domain *domain)
+{
+	struct iopt_area *end_area;
+	struct iopt_area *area;
+	int rc;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (WARN_ON(!pages))
+			continue;
+
+		mutex_lock(&pages->mutex);
+		rc = iopt_area_fill_domain(area, domain);
+		if (rc) {
+			mutex_unlock(&pages->mutex);
+			goto out_unfill;
+		}
+		if (!area->storage_domain) {
+			WARN_ON(iopt->next_domain_id != 0);
+			area->storage_domain = domain;
+			interval_tree_insert(&area->pages_node,
+					     &pages->domains_itree);
+		}
+		mutex_unlock(&pages->mutex);
+	}
+	return 0;
+
+out_unfill:
+	end_area = area;
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (area == end_area)
+			break;
+		if (WARN_ON(!pages))
+			continue;
+		mutex_lock(&pages->mutex);
+		if (iopt->next_domain_id == 0) {
+			interval_tree_remove(&area->pages_node,
+					     &pages->domains_itree);
+			area->storage_domain = NULL;
+		}
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+	return rc;
+}
+
+/* Check that all existing areas conform to an increased page size */
+static int iopt_check_iova_alignment(struct io_pagetable *iopt,
+				     unsigned long new_iova_alignment)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
+		if ((iopt_area_iova(area) % new_iova_alignment) ||
+		    (iopt_area_length(area) % new_iova_alignment))
+			return -EADDRINUSE;
+	return 0;
+}
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain)
+{
+	const struct iommu_domain_geometry *geometry = &domain->geometry;
+	struct iommu_domain *iter_domain;
+	unsigned long new_iova_alignment;
+	unsigned long index;
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each (&iopt->domains, index, iter_domain) {
+		if (WARN_ON(iter_domain == domain)) {
+			rc = -EEXIST;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The io page size drives the iova_alignment. Internally the iopt_pages
+	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
+	 * objects into the iommu_domain.
+	 *
+	 * An iommu_domain must always be able to accept PAGE_SIZE to be
+	 * compatible as we can't guarantee higher contiguity.
+	 */
+	new_iova_alignment =
+		max_t(unsigned long, 1UL << __ffs(domain->pgsize_bitmap),
+		      iopt->iova_alignment);
+	if (new_iova_alignment > PAGE_SIZE) {
+		rc = -EINVAL;
+		goto out_unlock;
+	}
+	if (new_iova_alignment != iopt->iova_alignment) {
+		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
+		if (rc)
+			goto out_unlock;
+	}
+
+	/* No area exists that is outside the allowed domain aperture */
+	if (geometry->aperture_start != 0) {
+		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
+				       domain);
+		if (rc)
+			goto out_reserved;
+	}
+	if (geometry->aperture_end != ULONG_MAX) {
+		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
+				       ULONG_MAX, domain);
+		if (rc)
+			goto out_reserved;
+	}
+
+	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
+	if (rc)
+		goto out_reserved;
+
+	rc = iopt_fill_domain(iopt, domain);
+	if (rc)
+		goto out_release;
+
+	iopt->iova_alignment = new_iova_alignment;
+	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
+	iopt->next_domain_id++;
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return 0;
+out_release:
+	xa_release(&iopt->domains, iopt->next_domain_id);
+out_reserved:
+	iopt_remove_reserved_iova(iopt, domain);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain)
+{
+	struct iommu_domain *iter_domain = NULL;
+	unsigned long new_iova_alignment;
+	unsigned long index;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each (&iopt->domains, index, iter_domain)
+		if (iter_domain == domain)
+			break;
+	if (WARN_ON(iter_domain != domain) || index >= iopt->next_domain_id)
+		goto out_unlock;
+
+	/*
+	 * Compress the xarray to keep it linear by swapping the entry to erase
+	 * with the tail entry and shrinking the tail.
+	 */
+	iopt->next_domain_id--;
+	iter_domain = xa_erase(&iopt->domains, iopt->next_domain_id);
+	if (index != iopt->next_domain_id)
+		xa_store(&iopt->domains, index, iter_domain, GFP_KERNEL);
+
+	iopt_unfill_domain(iopt, domain);
+	iopt_remove_reserved_iova(iopt, domain);
+
+	/* Recalculate the iova alignment without the domain */
+	new_iova_alignment = 1;
+	xa_for_each (&iopt->domains, index, iter_domain)
+		new_iova_alignment = max_t(unsigned long,
+					   1UL << __ffs(iter_domain->pgsize_bitmap),
+					   new_iova_alignment);
+	if (!WARN_ON(new_iova_alignment > iopt->iova_alignment))
+		iopt->iova_alignment = new_iova_alignment;
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
+/* Exclude a group's reserved regions from the usable IOVA space */
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start)
+{
+	struct iommu_resv_region *resv;
+	struct iommu_resv_region *tmp;
+	LIST_HEAD(group_resv_regions);
+	int rc;
+
+	down_write(&iopt->iova_rwsem);
+	rc = iommu_get_group_resv_regions(group, &group_resv_regions);
+	if (rc)
+		goto out_unlock;
+
+	list_for_each_entry (resv, &group_resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+
+		/*
+		 * The presence of any 'real' MSI regions should take precedence
+		 * over the software-managed one if the IOMMU driver happens to
+		 * advertise both types.
+		 */
+		if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
+			*sw_msi_start = 0;
+			sw_msi_start = NULL;
+		}
+		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
+			*sw_msi_start = resv->start;
+
+		rc = iopt_reserve_iova(iopt, resv->start,
+				       resv->length - 1 + resv->start, group);
+		if (rc)
+			goto out_reserved;
+	}
+	rc = 0;
+	goto out_free_resv;
+
+out_reserved:
+	iopt_remove_reserved_iova(iopt, group);
+out_free_resv:
+	list_for_each_entry_safe (resv, tmp, &group_resv_regions, list)
+		kfree(resv);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2f1301d39bba7c..bcf08e61bc87e9 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,9 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+struct iommu_domain;
+struct iommu_group;
+
 /*
  * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
  * domains and permits sharing of PFNs between io_pagetable instances. This
@@ -27,8 +30,40 @@ struct io_pagetable {
 	struct rw_semaphore iova_rwsem;
 	struct rb_root_cached area_itree;
 	struct rb_root_cached reserved_iova_itree;
+	unsigned long iova_alignment;
 };
 
+int iopt_init_table(struct io_pagetable *iopt);
+void iopt_destroy_table(struct io_pagetable *iopt);
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+				  unsigned long *start_byte,
+				  unsigned long length);
+enum { IOPT_ALLOC_IOVA = 1 << 0 };
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+			void __user *uptr, unsigned long length, int iommu_prot,
+			unsigned int flags);
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+		   unsigned long *dst_iova, unsigned long start_byte,
+		   unsigned long length, int iommu_prot, unsigned int flags);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length);
+int iopt_unmap_all(struct io_pagetable *iopt);
+
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+		      unsigned long length, struct page **out_pages, bool write);
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+			 size_t length);
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain);
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain);
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start);
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner);
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
+
 struct iommufd_ctx {
 	struct file *filp;
 	struct xarray objects;
-- 
2.35.1
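
As a rough sketch of how the io_pagetable API exported through
iommufd_private.h above is meant to be consumed (illustration only, not part
of the patch; the attach path, the iommu_group handling and the unwind order
here are assumptions), a kernel user attaching a device's group to an
io_pagetable would do roughly:

static int example_attach(struct io_pagetable *iopt, struct device *dev,
			  struct iommu_group *group)
{
	struct iommu_domain *domain;
	phys_addr_t sw_msi_start = 0;
	int rc;

	domain = iommu_domain_alloc(dev->bus);
	if (!domain)
		return -ENOMEM;

	/* Exclude the group's reserved regions from the usable IOVA space */
	rc = iopt_table_enforce_group_resv_regions(iopt, group, &sw_msi_start);
	if (rc)
		goto out_free_domain;

	/* Replicate every existing IOVA -> PFN mapping into the new domain */
	rc = iopt_table_add_domain(iopt, domain);
	if (rc)
		goto out_unreserve;

	rc = iommu_attach_group(domain, group);
	if (rc)
		goto out_remove_domain;
	return 0;

out_remove_domain:
	iopt_table_remove_domain(iopt, domain);
out_unreserve:
	iopt_remove_reserved_iova(iopt, group);
out_free_domain:
	iommu_domain_free(domain);
	return rc;
}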


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple users that are not
virtual machine monitors, such as DPDK:
 - Universal simple support for all IOMMUs (no PPC special path)
 - An IOVA allocator that considers the aperture and the reserved ranges
 - io_pagetable allows any number of iommu_domains to be connected to the
   IOAS

Along with room in the design to add non-generic features to cater to
specific HW functionality.
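
As an illustration of that flow (a hedged sketch, not part of this patch: the
/dev/iommu node name is assumed since the misc device name is not shown here,
and error unwinding is trimmed), a DPDK-style user would allocate an IOAS and
map a buffer along these lines:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int ioas_map_buffer(void *buf, size_t len, __u64 *iova)
{
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_map map = { .size = sizeof(map) };
	int fd = open("/dev/iommu", O_RDWR);

	if (fd < 0 || ioctl(fd, IOMMU_IOAS_ALLOC, &alloc))
		return -1;

	map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
	map.ioas_id = alloc.out_ioas_id;
	map.user_va = (uintptr_t)buf;
	map.length = len;
	/* No FIXED_IOVA flag: the kernel picks a suitable IOVA */
	if (ioctl(fd, IOMMU_IOAS_MAP, &map))
		return -1;
	*iova = map.iova;
	return fd;
}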

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/ioas.c            | 248 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  27 +++
 drivers/iommu/iommufd/main.c            |  17 ++
 include/uapi/linux/iommufd.h            | 132 +++++++++++++
 5 files changed, 425 insertions(+)
 create mode 100644 drivers/iommu/iommufd/ioas.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
 	io_pagetable.o \
+	ioas.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
new file mode 100644
index 00000000000000..c530b2ba74b06b
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/interval_tree.h>
+#include <linux/iommufd.h>
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+	int rc;
+
+	rc = iopt_unmap_all(&ioas->iopt);
+	WARN_ON(rc);
+	iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(ioas))
+		return ioas;
+
+	rc = iopt_init_table(&ioas->iopt);
+	if (rc)
+		goto out_abort;
+	return ioas;
+
+out_abort:
+	iommufd_object_abort(ictx, &ioas->obj);
+	return ERR_PTR(rc);
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_alloc *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	if (cmd->flags)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_ioas_alloc(ucmd->ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	cmd->out_ioas_id = ioas->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_table;
+	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+	return 0;
+
+out_table:
+	iommufd_ioas_destroy(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
+	struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	struct interval_tree_span_iter span;
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	max_iovas = cmd->size - sizeof(*cmd);
+	if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
+		return -EINVAL;
+	max_iovas /= sizeof(cmd->out_valid_iovas[0]);
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	cmd->out_num_iovas = 0;
+	for (interval_tree_span_iter_first(
+		     &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		if (!span.is_hole)
+			continue;
+		if (cmd->out_num_iovas < max_iovas) {
+			rc = put_user((u64)span.start_hole,
+				      &uptr->out_valid_iovas[cmd->out_num_iovas]
+					       .start);
+			if (rc)
+				goto out_put;
+			rc = put_user(
+				(u64)span.last_hole,
+				&uptr->out_valid_iovas[cmd->out_num_iovas].last);
+			if (rc)
+				goto out_put;
+		}
+		cmd->out_num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->out_num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int conv_iommu_prot(u32 map_flags)
+{
+	int iommu_prot;
+
+	/*
+	 * We provide no manual cache coherency ioctls to userspace and most
+	 * architectures make the CPU ops for cache flushing privileged.
+	 * Therefore we require the underlying IOMMU to support CPU coherent
+	 * operation.
+	 */
+	iommu_prot = IOMMU_CACHE;
+	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
+		iommu_prot |= IOMMU_WRITE;
+	if (map_flags & IOMMU_IOAS_MAP_READABLE)
+		iommu_prot |= IOMMU_READ;
+	return iommu_prot;
+}
+
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_map *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	unsigned int flags = 0;
+	unsigned long iova;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)) ||
+	    cmd->__reserved)
+		return -EOPNOTSUPP;
+	if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->iova;
+	rc = iopt_map_user_pages(&ioas->iopt, &iova,
+				 u64_to_user_ptr(cmd->user_va), cmd->length,
+				 conv_iommu_prot(cmd->flags), flags);
+	if (rc)
+		goto out_put;
+
+	cmd->iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_copy *cmd = ucmd->cmd;
+	struct iommufd_ioas *src_ioas;
+	struct iommufd_ioas *dst_ioas;
+	struct iopt_pages *pages;
+	unsigned int flags = 0;
+	unsigned long iova;
+	unsigned long start_byte;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)))
+		return -EOPNOTSUPP;
+	if (cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
+	if (IS_ERR(src_ioas))
+		return PTR_ERR(src_ioas);
+	/* FIXME: copy is not limited to an exact match anymore */
+	pages = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, &start_byte,
+			       cmd->length);
+	iommufd_put_object(&src_ioas->obj);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
+	if (IS_ERR(dst_ioas)) {
+		iopt_put_pages(pages);
+		return PTR_ERR(dst_ioas);
+	}
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->dst_iova;
+	rc = iopt_map_pages(&dst_ioas->iopt, pages, &iova, start_byte,
+			    cmd->length, conv_iommu_prot(cmd->flags), flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		goto out_put_dst;
+	}
+
+	cmd->dst_iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put_dst:
+	iommufd_put_object(&dst_ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (cmd->iova == 0 && cmd->length == U64_MAX) {
+		rc = iopt_unmap_all(&ioas->iopt);
+	} else {
+		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
+			rc = -EOVERFLOW;
+			goto out_put;
+		}
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+	}
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index bcf08e61bc87e9..d24c9dac5a82a9 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
 };
 
@@ -147,4 +148,30 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 			     type),                                            \
 		     typeof(*(ptr)), obj)
 
+/*
+ * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
+ * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
+ * mapping is copied into all of the associated domains and made available to
+ * in-kernel users.
+ */
+struct iommufd_ioas {
+	struct iommufd_object obj;
+	struct io_pagetable iopt;
+};
+
+static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
+						    u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_IOAS),
+			    struct iommufd_ioas, obj);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_ioas_destroy(struct iommufd_object *obj);
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index ae8db2f663004f..e506f493b54cfe 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -184,6 +184,10 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
 }
 
 union ucmd_buffer {
+	struct iommu_ioas_alloc alloc;
+	struct iommu_ioas_iova_ranges iova_ranges;
+	struct iommu_ioas_map map;
+	struct iommu_ioas_unmap unmap;
 	struct iommu_destroy destroy;
 };
 
@@ -205,6 +209,16 @@ struct iommufd_ioctl_op {
 	}
 static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
+		 struct iommu_ioas_alloc, out_ioas_id),
+	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
+		 src_iova),
+	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
+		 struct iommu_ioas_iova_ranges, __reserved),
+	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
+		 __reserved),
+	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
+		 length),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -270,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
 }
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_IOAS] = {
+		.destroy = iommufd_ioas_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2f7f76ec6db4cb..ba7b17ec3002e3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -37,6 +37,11 @@
 enum {
 	IOMMUFD_CMD_BASE = 0x80,
 	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_UNMAP,
 };
 
 /**
@@ -52,4 +57,131 @@ struct iommu_destroy {
 };
 #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
 
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
+ *                   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 out_num_iovas;
+	__u32 __reserved;
+	struct iommu_valid_iovas {
+		__aligned_u64 start;
+		__aligned_u64 last;
+	} out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear, the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap
+ *
+ * Unmap an IOVA range. The iova/length must exactly match a range
+ * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
 #endif
-- 
2.35.1
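
A sketch of the two-step IOMMU_IOAS_IOVA_RANGES query implied by the flex
array layout documented above (illustration only; it assumes fd is an open
iommufd and ioas_id came from IOMMU_IOAS_ALLOC, and it relies on the kernel
still filling out_num_iovas when it returns -EMSGSIZE, as the ioas.c code in
this patch does):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int print_iova_ranges(int fd, __u32 ioas_id)
{
	struct iommu_ioas_iova_ranges *cmd;
	__u32 cap = 4, i;

	for (;;) {
		size_t size = sizeof(*cmd) +
			      cap * sizeof(cmd->out_valid_iovas[0]);

		cmd = calloc(1, size);
		if (!cmd)
			return -1;
		cmd->size = size;
		cmd->ioas_id = ioas_id;
		/* On -EMSGSIZE out_num_iovas still holds the needed count */
		if (ioctl(fd, IOMMU_IOAS_IOVA_RANGES, cmd) &&
		    errno != EMSGSIZE) {
			free(cmd);
			return -1;
		}
		if (cmd->out_num_iovas <= cap)
			break;
		cap = cmd->out_num_iovas;
		free(cmd);
	}

	for (i = 0; i != cmd->out_num_iovas; i++)
		printf("allowed: 0x%llx - 0x%llx\n",
		       (unsigned long long)cmd->out_valid_iovas[i].start,
		       (unsigned long long)cmd->out_valid_iovas[i].last);
	free(cmd);
	return 0;
}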


^ permalink raw reply related	[flat|nested] 244+ messages in thread

+	struct iommufd_ioas *src_ioas;
+	struct iommufd_ioas *dst_ioas;
+	struct iopt_pages *pages;
+	unsigned int flags = 0;
+	unsigned long iova;
+	unsigned long start_byte;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)))
+		return -EOPNOTSUPP;
+	if (cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
+	if (IS_ERR(src_ioas))
+		return PTR_ERR(src_ioas);
+	/* FIXME: copy is not limited to an exact match anymore */
+	pages = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, &start_byte,
+			       cmd->length);
+	iommufd_put_object(&src_ioas->obj);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
+	if (IS_ERR(dst_ioas)) {
+		iopt_put_pages(pages);
+		return PTR_ERR(dst_ioas);
+	}
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->dst_iova;
+	rc = iopt_map_pages(&dst_ioas->iopt, pages, &iova, start_byte,
+			    cmd->length, conv_iommu_prot(cmd->flags), flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		goto out_put_dst;
+	}
+
+	cmd->dst_iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put_dst:
+	iommufd_put_object(&dst_ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (cmd->iova == 0 && cmd->length == U64_MAX) {
+		rc = iopt_unmap_all(&ioas->iopt);
+	} else {
+		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
+			rc = -EOVERFLOW;
+			goto out_put;
+		}
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+	}
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index bcf08e61bc87e9..d24c9dac5a82a9 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
 };
 
@@ -147,4 +148,30 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 			     type),                                            \
 		     typeof(*(ptr)), obj)
 
+/*
+ * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
+ * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
+ * mapping is copied into all of the associated domains and made available to
+ * in-kernel users.
+ */
+struct iommufd_ioas {
+	struct iommufd_object obj;
+	struct io_pagetable iopt;
+};
+
+static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
+						    u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_IOAS),
+			    struct iommufd_ioas, obj);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_ioas_destroy(struct iommufd_object *obj);
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index ae8db2f663004f..e506f493b54cfe 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -184,6 +184,10 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
 }
 
 union ucmd_buffer {
+	struct iommu_ioas_alloc alloc;
+	struct iommu_ioas_iova_ranges iova_ranges;
+	struct iommu_ioas_map map;
+	struct iommu_ioas_unmap unmap;
 	struct iommu_destroy destroy;
 };
 
@@ -205,6 +209,16 @@ struct iommufd_ioctl_op {
 	}
 static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
+		 struct iommu_ioas_alloc, out_ioas_id),
+	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
+		 src_iova),
+	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
+		 struct iommu_ioas_iova_ranges, __reserved),
+	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
+		 __reserved),
+	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
+		 length),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -270,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
 }
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_IOAS] = {
+		.destroy = iommufd_ioas_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2f7f76ec6db4cb..ba7b17ec3002e3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -37,6 +37,11 @@
 enum {
 	IOMMUFD_CMD_BASE = 0x80,
 	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_UNMAP,
 };
 
 /**
@@ -52,4 +57,131 @@ struct iommu_destroy {
 };
 #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
 
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
+ *                   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 out_num_iovas;
+	__u32 __reserved;
+	struct iommu_valid_iovas {
+		__aligned_u64 start;
+		__aligned_u64 last;
+	} out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap
+ *
+ * Unmap an IOVA range. The iova/length must exactly match a range
+ * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
 #endif
-- 
2.35.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 09/12] iommufd: Add a HW pagetable object
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The hw_pagetable object exposes internal struct iommu_domain objects to
userspace. An iommu_domain is required when any DMA device attaches to an
IOAS to control the IO page table through the iommu driver.

For compatibility with VFIO the hw_pagetable is automatically created when
a DMA device is attached to the IOAS. If a compatible iommu_domain already
exists then the hw_pagetable associated with it is used for the
attachment.

In the initial series there is no iommufd uAPI for the hw_pagetable
object. The next patch provides driver facing APIs for IO page table
attachment that allows drivers to accept either an IOAS or a hw_pagetable
ID and for the driver to return the hw_pagetable ID that was auto-selected
from an IOAS. The expectation is that the driver will provide uAPI through
its own FD for attaching its device to iommufd. This allows userspace to
learn the mapping of devices to iommu_domains and to override the automatic
attachment.
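
A minimal sketch of that expected driver flow, using the
iommufd_device_attach() call introduced in the next patch (idev and
id_from_userspace are illustrative placeholders):

	u32 pt_id = id_from_userspace;	/* IOAS ID or hw_pagetable ID */
	int rc;

	rc = iommufd_device_attach(idev, &pt_id, 0);
	if (rc)
		return rc;
	/*
	 * pt_id is now always an IOMMUFD_OBJ_HW_PAGETABLE ID; returning it
	 * through the driver's own uAPI lets userspace learn, and later
	 * override, the device to iommu_domain mapping.
	 */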

The future HW specific interface will allow userspace to create
hw_pagetable objects using iommu_domains with IOMMU driver specific
parameters. This infrastructure will allow linking those domains to IOASes
and devices.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/hw_pagetable.c    | 142 ++++++++++++++++++++++++
 drivers/iommu/iommufd/ioas.c            |   4 +
 drivers/iommu/iommufd/iommufd_private.h |  35 ++++++
 drivers/iommu/iommufd/main.c            |   3 +
 5 files changed, 185 insertions(+)
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2b4f36f1b72f9d..e13e971aa28c60 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
new file mode 100644
index 00000000000000..bafd7d07918bfd
--- /dev/null
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommu.h>
+
+#include "iommufd_private.h"
+
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_hw_pagetable *hwpt =
+		container_of(obj, struct iommufd_hw_pagetable, obj);
+	struct iommufd_ioas *ioas = hwpt->ioas;
+
+	WARN_ON(!list_empty(&hwpt->devices));
+	mutex_lock(&ioas->mutex);
+	list_del(&hwpt->auto_domains_item);
+	mutex_unlock(&ioas->mutex);
+
+	iommu_domain_free(hwpt->domain);
+	refcount_dec(&hwpt->ioas->obj.users);
+	mutex_destroy(&hwpt->devices_lock);
+}
+
+/*
+ * When automatically managing the domains we search for a compatible domain in
+ * the iopt and if one is found use it, otherwise create a new domain.
+ * Automatic domain selection will never pick a manually created domain.
+ */
+static struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_auto_get(struct iommufd_ctx *ictx,
+			      struct iommufd_ioas *ioas, struct device *dev)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	/*
+	 * There is no differentiation when domains are allocated, so any domain
+	 * from the right ops is interchangeable with any other.
+	 */
+	mutex_lock(&ioas->mutex);
+	list_for_each_entry (hwpt, &ioas->auto_domains, auto_domains_item) {
+		/*
+		 * FIXME: We really need an op from the driver to test if a
+		 * device is compatible with a domain. This thing from VFIO
+		 * works sometimes.
+		 */
+		if (hwpt->domain->ops == dev_iommu_ops(dev)->default_domain_ops) {
+			if (refcount_inc_not_zero(&hwpt->obj.users)) {
+				mutex_unlock(&ioas->mutex);
+				return hwpt;
+			}
+		}
+	}
+
+	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_unlock;
+	}
+
+	hwpt->domain = iommu_domain_alloc(dev->bus);
+	if (!hwpt->domain) {
+		rc = -ENOMEM;
+		goto out_abort;
+	}
+
+	INIT_LIST_HEAD(&hwpt->devices);
+	mutex_init(&hwpt->devices_lock);
+	hwpt->ioas = ioas;
+	/* The calling driver is a user until iommufd_hw_pagetable_put() */
+	refcount_inc(&ioas->obj.users);
+
+	list_add_tail(&hwpt->auto_domains_item, &ioas->auto_domains);
+	/*
+	 * iommufd_object_finalize() consumes the refcount, get one for the
+	 * caller. This pairs with the first put in
+	 * iommufd_object_destroy_user()
+	 */
+	refcount_inc(&hwpt->obj.users);
+	iommufd_object_finalize(ictx, &hwpt->obj);
+
+	mutex_unlock(&ioas->mutex);
+	return hwpt;
+
+out_abort:
+	iommufd_object_abort(ictx, &hwpt->obj);
+out_unlock:
+	mutex_unlock(&ioas->mutex);
+	return ERR_PTR(rc);
+}
+
+/**
+ * iommufd_hw_pagetable_from_id() - Get an iommu_domain for a device
+ * @ictx: iommufd context
+ * @pt_id: ID of the IOAS or hw_pagetable object
+ * @dev: Device to get an iommu_domain for
+ *
+ * Turn a general page table ID into an iommu_domain contained in a
+ * iommufd_hw_pagetable object. If a hw_pagetable ID is specified then that
+ * iommu_domain is used, otherwise a suitable iommu_domain in the IOAS is found
+ * for the device, creating one automatically if necessary.
+ */
+struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
+			     struct device *dev)
+{
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ictx, pt_id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+
+	switch (obj->type) {
+	case IOMMUFD_OBJ_HW_PAGETABLE:
+		iommufd_put_object_keep_user(obj);
+		return container_of(obj, struct iommufd_hw_pagetable, obj);
+	case IOMMUFD_OBJ_IOAS: {
+		struct iommufd_ioas *ioas =
+			container_of(obj, struct iommufd_ioas, obj);
+		struct iommufd_hw_pagetable *hwpt;
+
+		hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);
+		iommufd_put_object(obj);
+		return hwpt;
+	}
+	default:
+		iommufd_put_object(obj);
+		return ERR_PTR(-EINVAL);
+	}
+}
+
+void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
+			      struct iommufd_hw_pagetable *hwpt)
+{
+	if (list_empty(&hwpt->auto_domains_item)) {
+		/* Manually created hw_pagetables just keep going */
+		refcount_dec(&hwpt->obj.users);
+		return;
+	}
+	iommufd_object_destroy_user(ictx, &hwpt->obj);
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index c530b2ba74b06b..48149988c84bbc 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -17,6 +17,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
 	rc = iopt_unmap_all(&ioas->iopt);
 	WARN_ON(rc);
 	iopt_destroy_table(&ioas->iopt);
+	mutex_destroy(&ioas->mutex);
 }
 
 struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
@@ -31,6 +32,9 @@ struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
 	rc = iopt_init_table(&ioas->iopt);
 	if (rc)
 		goto out_abort;
+
+	INIT_LIST_HEAD(&ioas->auto_domains);
+	mutex_init(&ioas->mutex);
 	return ioas;
 
 out_abort:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d24c9dac5a82a9..c5c9650cc86818 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
 };
@@ -153,10 +154,20 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
  * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
  * mapping is copied into all of the associated domains and made available to
  * in-kernel users.
+ *
+ * Every iommu_domain that is created is wrapped in an iommufd_hw_pagetable
+ * object. When we go to attach a device to an IOAS we need to get an
+ * iommu_domain and its wrapping iommufd_hw_pagetable for it.
+ *
+ * An iommu_domain & iommufd_hw_pagetable will be automatically selected
+ * for a device based on the auto_domains list. If no suitable iommu_domain
+ * is found a new iommu_domain will be created.
  */
 struct iommufd_ioas {
 	struct iommufd_object obj;
 	struct io_pagetable iopt;
+	struct mutex mutex;
+	struct list_head auto_domains;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
@@ -174,4 +185,28 @@ int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+
+/*
+ * A HW pagetable is called an iommu_domain inside the kernel. This user object
+ * allows directly creating and inspecting the domains. Domains that have kernel
+ * owned page tables will be associated with an iommufd_ioas that provides the
+ * IOVA to PFN map.
+ */
+struct iommufd_hw_pagetable {
+	struct iommufd_object obj;
+	struct iommufd_ioas *ioas;
+	struct iommu_domain *domain;
+	/* Head at iommufd_ioas::auto_domains */
+	struct list_head auto_domains_item;
+	struct mutex devices_lock;
+	struct list_head devices;
+};
+
+struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
+			     struct device *dev);
+void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
+			      struct iommufd_hw_pagetable *hwpt);
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
+
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index e506f493b54cfe..954cde173c86fc 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -287,6 +287,9 @@ static struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
+	[IOMMUFD_OBJ_HW_PAGETABLE] = {
+		.destroy = iommufd_hw_pagetable_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:

iommufd_bind_pci_device() / iommufd_unbind_device()
  Register the device with iommufd and establish security isolation.

iommufd_device_attach() / iommufd_device_detach()
  Connect a bound device to a page table

Binding a device creates a device object ID in the uAPI; however, the
generic API provides no IOCTLs to manipulate it.

An API to support the VFIO mdevs is a WIP at this point, but likely
involves requesting a struct iommufd_device without providing any struct
device, and then using the pin/unpin/rw operations on that iommufd_device.
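
A sketch of the expected lifecycle in a consuming driver, assuming the
<linux/iommufd.h> header added by this patch; everything other than the
four exported iommufd_*() calls is illustrative:

	static int example_probe(int iommufd_fd, struct pci_dev *pdev,
				 u32 ioas_id, struct iommufd_device **out_idev,
				 u32 *out_hwpt_id)
	{
		struct iommufd_device *idev;
		u32 device_id, pt_id = ioas_id;
		int rc;

		/* Take ownership of the whole RID */
		idev = iommufd_bind_pci_device(iommufd_fd, pdev, &device_id);
		if (IS_ERR(idev))
			return PTR_ERR(idev);

		/* Attach to the IOAS; DMA is possible after this succeeds */
		rc = iommufd_device_attach(idev, &pt_id, 0);
		if (rc) {
			iommufd_unbind_device(idev);
			return rc;
		}

		*out_idev = idev;
		*out_hwpt_id = pt_id;	/* report back through the driver uAPI */
		return 0;
	}

	static void example_remove(struct iommufd_device *idev)
	{
		iommufd_device_detach(idev);
		iommufd_unbind_device(idev);
	}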

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/device.c          | 274 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |   4 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  50 +++++
 5 files changed, 332 insertions(+)
 create mode 100644 drivers/iommu/iommufd/device.c
 create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index e13e971aa28c60..ca28a135b9675f 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	device.o \
 	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
new file mode 100644
index 00000000000000..c20bc9eab07e13
--- /dev/null
+++ b/drivers/iommu/iommufd/device.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/pci.h>
+#include <linux/irqdomain.h>
+#include <linux/dma-iommu.h>
+
+#include "iommufd_private.h"
+
+/*
+ * An iommufd_device object represents the binding relationship between a
+ * consuming driver and the iommufd. These objects are created/destroyed by
+ * external drivers, not by userspace.
+ */
+struct iommufd_device {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_hw_pagetable *hwpt;
+	/* Head at iommufd_hw_pagetable::devices */
+	struct list_head devices_item;
+	/* always the physical device */
+	struct device *dev;
+	struct iommu_group *group;
+};
+
+void iommufd_device_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_device *idev =
+		container_of(obj, struct iommufd_device, obj);
+
+	iommu_group_release_dma_owner(idev->group);
+	iommu_group_put(idev->group);
+	fput(idev->ictx->filp);
+}
+
+/**
+ * iommufd_bind_pci_device - Bind a physical device to an iommu fd
+ * @fd: iommufd file descriptor.
+ * @pdev: Pointer to a physical PCI device struct
+ * @id: Output ID number to return to userspace for this device
+ *
+ * A successful bind establishes ownership over the device and returns
+ * struct iommufd_device pointer, otherwise returns error pointer.
+ *
+ * A driver using this API must set driver_managed_dma and must not touch
+ * the device until this routine succeeds and establishes ownership.
+ *
+ * Binding a PCI device places the entire RID under iommufd control.
+ *
+ * The caller must undo this with iommufd_unbind_device()
+ */
+struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
+					       u32 *id)
+{
+	struct iommufd_device *idev;
+	struct iommufd_ctx *ictx;
+	struct iommu_group *group;
+	int rc;
+
+	ictx = iommufd_fget(fd);
+	if (!ictx)
+		return ERR_PTR(-EINVAL);
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group) {
+		rc = -ENODEV;
+		goto out_file_put;
+	}
+
+	/*
+	 * FIXME: Use a device-centric iommu api and this won't work with
+	 * multi-device groups
+	 */
+	rc = iommu_group_claim_dma_owner(group, ictx->filp);
+	if (rc)
+		goto out_group_put;
+
+	idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_release_owner;
+	}
+	idev->ictx = ictx;
+	idev->dev = &pdev->dev;
+	/* The calling driver is a user until iommufd_unbind_device() */
+	refcount_inc(&idev->obj.users);
+	/* group refcount moves into iommufd_device */
+	idev->group = group;
+
+	/*
+	 * If the caller fails after this success it must call
+	 * iommufd_unbind_device() which is safe since we hold this refcount.
+	 * This also means the device is a leaf in the graph and no other object
+	 * can take a reference on it.
+	 */
+	iommufd_object_finalize(ictx, &idev->obj);
+	*id = idev->obj.id;
+	return idev;
+
+out_release_owner:
+	iommu_group_release_dma_owner(group);
+out_group_put:
+	iommu_group_put(group);
+out_file_put:
+	fput(ictx->filp);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(iommufd_bind_pci_device);
+
+void iommufd_unbind_device(struct iommufd_device *idev)
+{
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(idev->ictx, &idev->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_GPL(iommufd_unbind_device);
+
+static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
+					   struct iommu_group *group)
+{
+	struct iommufd_device *cur_dev;
+
+	list_for_each_entry (cur_dev, &hwpt->devices, devices_item)
+		if (cur_dev->group == group)
+			return true;
+	return false;
+}
+
+static int iommufd_device_setup_msi(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *hwpt,
+				    phys_addr_t sw_msi_start,
+				    unsigned int flags)
+{
+	int rc;
+
+	/*
+	 * IOMMU_CAP_INTR_REMAP means that the platform is isolating MSI,
+	 * nothing further to do.
+	 */
+	if (iommu_capable(idev->dev->bus, IOMMU_CAP_INTR_REMAP))
+		return 0;
+
+	/*
+	 * On ARM systems that set the global IRQ_DOMAIN_FLAG_MSI_REMAP every
+	 * allocated iommu_domain will block interrupts by default and this
+	 * special flow is needed to turn them back on.
+	 */
+	if (irq_domain_check_msi_remap()) {
+		if (WARN_ON(!sw_msi_start))
+			return -EPERM;
+		/*
+		 * iommu_get_msi_cookie() can only be called once per domain,
+		 * it returns -EBUSY on later calls.
+		 */
+		if (hwpt->msi_cookie)
+			return 0;
+		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
+		if (rc && rc != -ENODEV)
+			return rc;
+		hwpt->msi_cookie = true;
+		return 0;
+	}
+
+	/*
+	 * Otherwise the platform has an MSI window that is not isolated. For
+	 * historical compat with VFIO allow a module parameter to ignore the
+	 * insecurity.
+	 */
+	if (!(flags & IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT))
+		return -EPERM;
+	return 0;
+}
+
+/**
+ * iommufd_device_attach - Connect a device to an iommu_domain
+ * @idev: device to attach
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
+ *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
+ * @flags: Optional flags
+ *
+ * This connects the device to an iommu_domain, either automatically or manually
+ * selected. Once this completes the device could do DMA.
+ *
+ * The caller should return the resulting pt_id back to userspace.
+ * This function is undone by calling iommufd_device_detach().
+ */
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	refcount_inc(&idev->obj.users);
+
+	hwpt = iommufd_hw_pagetable_from_id(idev->ictx, *pt_id, idev->dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_users;
+	}
+
+	mutex_lock(&hwpt->devices_lock);
+	/* FIXME: Use a device-centric iommu api. For now check if the
+	 * hw_pagetable already has a device of the same group joined to tell if
+	 * we are the first and need to attach the group. */
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		phys_addr_t sw_msi_start = 0;
+
+		rc = iommu_attach_group(hwpt->domain, idev->group);
+		if (rc)
+			goto out_unlock;
+
+		/*
+		 * hwpt is now the exclusive owner of the group so this is the
+		 * first time enforce is called for this group.
+		 */
+		rc = iopt_table_enforce_group_resv_regions(
+			&hwpt->ioas->iopt, idev->group, &sw_msi_start);
+		if (rc)
+			goto out_detach;
+		rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
+		if (rc)
+			goto out_iova;
+	}
+
+	idev->hwpt = hwpt;
+	if (list_empty(&hwpt->devices)) {
+		rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+		if (rc)
+			goto out_iova;
+	}
+	list_add(&idev->devices_item, &hwpt->devices);
+	mutex_unlock(&hwpt->devices_lock);
+
+	*pt_id = idev->hwpt->obj.id;
+	return 0;
+
+out_iova:
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+out_detach:
+	iommu_detach_group(hwpt->domain, idev->group);
+out_unlock:
+	mutex_unlock(&hwpt->devices_lock);
+	iommufd_hw_pagetable_put(idev->ictx, hwpt);
+out_users:
+	refcount_dec(&idev->obj.users);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(iommufd_device_attach);
+
+void iommufd_device_detach(struct iommufd_device *idev)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+		iommu_detach_group(hwpt->domain, idev->group);
+	}
+	if (list_empty(&hwpt->devices))
+		iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+	mutex_unlock(&hwpt->devices_lock);
+
+	iommufd_hw_pagetable_put(idev->ictx, hwpt);
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_detach);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index c5c9650cc86818..e5c717231f851e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
@@ -196,6 +197,7 @@ struct iommufd_hw_pagetable {
 	struct iommufd_object obj;
 	struct iommufd_ioas *ioas;
 	struct iommu_domain *domain;
+	bool msi_cookie;
 	/* Head at iommufd_ioas::auto_domains */
 	struct list_head auto_domains_item;
 	struct mutex devices_lock;
@@ -209,4 +211,6 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 			      struct iommufd_hw_pagetable *hwpt);
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
+void iommufd_device_destroy(struct iommufd_object *obj);
+
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 954cde173c86fc..6a895489fb5b82 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -284,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
 }
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_DEVICE] = {
+		.destroy = iommufd_device_destroy,
+	},
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 00000000000000..6caac05475e39f
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/device.h>
+
+struct pci_dev;
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
+					       u32 *id);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+enum {
+	IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT = 1 << 0,
+};
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags);
+void iommufd_device_detach(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_pci_device(int fd, struct pci_dev *pdev, u32 *id)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+
+static inline int iommufd_device_attach(struct iommufd_device *idev,
+					u32 *pt_id, unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommufd_device_detach(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:

iommufd_bind_pci_device() / iommufd_unbind_device()
  Register the device with iommufd and establish security isolation.

iommufd_device_attach() / iommufd_device_detach()
  Connect a bound device to a page table

Binding a device creates a device object ID in the uAPI; however, the
generic API provides no IOCTLs to manipulate it.

An API to support the VFIO mdevs is a WIP at this point, but likely
involves requesting a struct iommufd_device without providing any struct
device, and then using the pin/unpin/rw operations on that iommufd_device.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/device.c          | 274 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |   4 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  50 +++++
 5 files changed, 332 insertions(+)
 create mode 100644 drivers/iommu/iommufd/device.c
 create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index e13e971aa28c60..ca28a135b9675f 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	device.o \
 	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
new file mode 100644
index 00000000000000..c20bc9eab07e13
--- /dev/null
+++ b/drivers/iommu/iommufd/device.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/pci.h>
+#include <linux/irqdomain.h>
+#include <linux/dma-iommu.h>
+
+#include "iommufd_private.h"
+
+/*
+ * An iommufd_device object represents the binding relationship between a
+ * consuming driver and the iommufd. These objects are created/destroyed by
+ * external drivers, not by userspace.
+ */
+struct iommufd_device {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_hw_pagetable *hwpt;
+	/* Head at iommufd_hw_pagetable::devices */
+	struct list_head devices_item;
+	/* always the physical device */
+	struct device *dev;
+	struct iommu_group *group;
+};
+
+void iommufd_device_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_device *idev =
+		container_of(obj, struct iommufd_device, obj);
+
+	iommu_group_release_dma_owner(idev->group);
+	iommu_group_put(idev->group);
+	fput(idev->ictx->filp);
+}
+
+/**
+ * iommufd_bind_pci_device - Bind a physical device to an iommu fd
+ * @fd: iommufd file descriptor.
+ * @pdev: Pointer to a physical PCI device struct
+ * @id: Output ID number to return to userspace for this device
+ *
+ * A successful bind establishes ownership over the device and returns
+ * struct iommufd_device pointer, otherwise returns error pointer.
+ *
+ * A driver using this API must set driver_managed_dma and must not touch
+ * the device until this routine succeeds and establishes ownership.
+ *
+ * Binding a PCI device places the entire RID under iommufd control.
+ *
+ * The caller must undo this with iommufd_unbind_device()
+ */
+struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
+					       u32 *id)
+{
+	struct iommufd_device *idev;
+	struct iommufd_ctx *ictx;
+	struct iommu_group *group;
+	int rc;
+
+	ictx = iommufd_fget(fd);
+	if (!ictx)
+		return ERR_PTR(-EINVAL);
+
+	group = iommu_group_get(&pdev->dev);
+	if (!group) {
+		rc = -ENODEV;
+		goto out_file_put;
+	}
+
+	/*
+	 * FIXME: Use a device-centric iommu api and this won't work with
+	 * multi-device groups
+	 */
+	rc = iommu_group_claim_dma_owner(group, ictx->filp);
+	if (rc)
+		goto out_group_put;
+
+	idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_release_owner;
+	}
+	idev->ictx = ictx;
+	idev->dev = &pdev->dev;
+	/* The calling driver is a user until iommufd_unbind_device() */
+	refcount_inc(&idev->obj.users);
+	/* group refcount moves into iommufd_device */
+	idev->group = group;
+
+	/*
+	 * If the caller fails after this success it must call
+	 * iommufd_unbind_device() which is safe since we hold this refcount.
+	 * This also means the device is a leaf in the graph and no other object
+	 * can take a reference on it.
+	 */
+	iommufd_object_finalize(ictx, &idev->obj);
+	*id = idev->obj.id;
+	return idev;
+
+out_release_owner:
+	iommu_group_release_dma_owner(group);
+out_group_put:
+	iommu_group_put(group);
+out_file_put:
+	fput(ictx->filp);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(iommufd_bind_pci_device);
+
+void iommufd_unbind_device(struct iommufd_device *idev)
+{
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(idev->ictx, &idev->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_GPL(iommufd_unbind_device);
+
+static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
+					   struct iommu_group *group)
+{
+	struct iommufd_device *cur_dev;
+
+	list_for_each_entry (cur_dev, &hwpt->devices, devices_item)
+		if (cur_dev->group == group)
+			return true;
+	return false;
+}
+
+static int iommufd_device_setup_msi(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *hwpt,
+				    phys_addr_t sw_msi_start,
+				    unsigned int flags)
+{
+	int rc;
+
+	/*
+	 * IOMMU_CAP_INTR_REMAP means that the platform is isolating MSI,
+	 * nothing further to do.
+	 */
+	if (iommu_capable(idev->dev->bus, IOMMU_CAP_INTR_REMAP))
+		return 0;
+
+	/*
+	 * On ARM systems that set the global IRQ_DOMAIN_FLAG_MSI_REMAP every
+	 * allocated iommu_domain will block interrupts by default and this
+	 * special flow is needed to turn them back on.
+	 */
+	if (irq_domain_check_msi_remap()) {
+		if (WARN_ON(!sw_msi_start))
+			return -EPERM;
+		/*
+		 * iommu_get_msi_cookie() can only be called once per domain,
+		 * it returns -EBUSY on later calls.
+		 */
+		if (hwpt->msi_cookie)
+			return 0;
+		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
+		if (rc && rc != -ENODEV)
+			return rc;
+		hwpt->msi_cookie = true;
+		return 0;
+	}
+
+	/*
+	 * Otherwise the platform has an MSI window that is not isolated. For
+	 * historical compat with VFIO allow a module parameter to ignore the
+	 * insecurity.
+	 */
+	if (!(flags & IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT))
+		return -EPERM;
+	return 0;
+}
+
+/**
+ * iommufd_device_attach - Connect a device to an iommu_domain
+ * @idev: device to attach
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
+ *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
+ * @flags: Optional flags
+ *
+ * This connects the device to an iommu_domain, either automatically or manually
+ * selected. Once this completes the device could do DMA.
+ *
+ * The caller should return the resulting pt_id back to userspace.
+ * This function is undone by calling iommufd_device_detach().
+ */
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	refcount_inc(&idev->obj.users);
+
+	hwpt = iommufd_hw_pagetable_from_id(idev->ictx, *pt_id, idev->dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_users;
+	}
+
+	mutex_lock(&hwpt->devices_lock);
+	/* FIXME: Use a device-centric iommu api. For now check if the
+	 * hw_pagetable already has a device of the same group joined to tell if
+	 * we are the first and need to attach the group. */
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		phys_addr_t sw_msi_start = 0;
+
+		rc = iommu_attach_group(hwpt->domain, idev->group);
+		if (rc)
+			goto out_unlock;
+
+		/*
+		 * hwpt is now the exclusive owner of the group so this is the
+		 * first time enforce is called for this group.
+		 */
+		rc = iopt_table_enforce_group_resv_regions(
+			&hwpt->ioas->iopt, idev->group, &sw_msi_start);
+		if (rc)
+			goto out_detach;
+		rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
+		if (rc)
+			goto out_iova;
+	}
+
+	idev->hwpt = hwpt;
+	if (list_empty(&hwpt->devices)) {
+		rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+		if (rc)
+			goto out_iova;
+	}
+	list_add(&idev->devices_item, &hwpt->devices);
+	mutex_unlock(&hwpt->devices_lock);
+
+	*pt_id = idev->hwpt->obj.id;
+	return 0;
+
+out_iova:
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+out_detach:
+	iommu_detach_group(hwpt->domain, idev->group);
+out_unlock:
+	mutex_unlock(&hwpt->devices_lock);
+	iommufd_hw_pagetable_put(idev->ictx, hwpt);
+out_users:
+	refcount_dec(&idev->obj.users);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(iommufd_device_attach);
+
+void iommufd_device_detach(struct iommufd_device *idev)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+		iommu_detach_group(hwpt->domain, idev->group);
+	}
+	if (list_empty(&hwpt->devices))
+		iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+	mutex_unlock(&hwpt->devices_lock);
+
+	iommufd_hw_pagetable_put(idev->ictx, hwpt);
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_detach);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index c5c9650cc86818..e5c717231f851e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
@@ -196,6 +197,7 @@ struct iommufd_hw_pagetable {
 	struct iommufd_object obj;
 	struct iommufd_ioas *ioas;
 	struct iommu_domain *domain;
+	bool msi_cookie;
 	/* Head at iommufd_ioas::auto_domains */
 	struct list_head auto_domains_item;
 	struct mutex devices_lock;
@@ -209,4 +211,6 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 			      struct iommufd_hw_pagetable *hwpt);
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
+void iommufd_device_destroy(struct iommufd_object *obj);
+
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 954cde173c86fc..6a895489fb5b82 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -284,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
 }
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_DEVICE] = {
+		.destroy = iommufd_device_destroy,
+	},
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 00000000000000..6caac05475e39f
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/device.h>
+
+struct pci_dev;
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
+					       u32 *id);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+enum {
+	IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT = 1 << 0,
+};
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags);
+void iommufd_device_detach(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_pci_device(int fd, struct pci_dev *pdev, u32 *id)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+
+static inline int iommufd_device_attach(struct iommufd_device *idev,
+					u32 ioas_id)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommufd_device_detach(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif
-- 
2.35.1
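
To make the driver-facing API added above concrete, here is a minimal
consumer-side sketch (illustrative only, not part of this patch; the
function name and error handling are invented, and it assumes an iommufd
fd and an IOAS ID were already obtained from userspace):

  #include <linux/iommufd.h>
  #include <linux/pci.h>

  static int example_enable_dma(int iommufd_fd, struct pci_dev *pdev,
                                u32 ioas_id, struct iommufd_device **out_idev)
  {
          struct iommufd_device *idev;
          u32 device_id;
          u32 pt_id = ioas_id;  /* in: IOAS or hwpt ID, out: attached hwpt ID */
          int rc;

          idev = iommufd_bind_pci_device(iommufd_fd, pdev, &device_id);
          if (IS_ERR(idev))
                  return PTR_ERR(idev);

          rc = iommufd_device_attach(idev, &pt_id, 0);
          if (rc) {
                  iommufd_unbind_device(idev);
                  return rc;
          }
          /* The device can now DMA; pt_id is reported back to userspace */
          *out_idev = idev;
          return 0;
  }

Teardown would be the mirror image: iommufd_device_detach() followed by
iommufd_unbind_device().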

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations. Doing so allows the use of
iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
SET_CONTAINER using an iommufd instead of a container fd is a follow-up
series.
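
As an illustration only (not part of this patch), an unmodified type1
user keeps working through the compat path once a compatibility IOAS
exists. Something like the following, with the group attach (which needs
the follow-up VFIO series) and most error handling omitted:

  #include <fcntl.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int legacy_map(void *buf, size_t len, uint64_t iova)
  {
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = (uintptr_t)buf,
                  .iova = iova,
                  .size = len,
          };
          /* The symlink makes this open an iommufd, not a vfio container */
          int fd = open("/dev/vfio/vfio", O_RDWR);

          if (fd < 0)
                  return -1;
          if (ioctl(fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
                  return -1;
          if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
                  return -1;
          /* group SET_CONTAINER and VFIO_SET_IOMMU would happen here */
          return ioctl(fd, VFIO_IOMMU_MAP_DMA, &map);
  }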

Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.

Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd-only
features while still using the VFIO-style map/unmap ioctls.
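
For example (a sketch only, using the uapi added by this patch), a VMM
can fetch the compat IOAS ID and then target it with the native
interface:

  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  static int get_compat_ioas_id(int iommufd, __u32 *ioas_id)
  {
          struct iommu_vfio_ioas cmd = {
                  .size = sizeof(cmd),
                  .op = IOMMU_VFIO_IOAS_GET,
          };

          /* Fails with ENODEV until the compat IOAS has been created */
          if (ioctl(iommufd, IOMMU_VFIO_IOAS, &cmd))
                  return -1;
          *ioas_id = cmd.ioas_id;  /* usable with IOMMU_IOAS_MAP and friends */
          return 0;
  }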

While this is enough to operate qemu, it is still a bit of a WIP with a
few gaps to be resolved:

 - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
   split areas. The old mode can be implemented with a new operation to
   split an iopt_area into two without disturbing the iopt_pages or the
   domains, then unmapping a whole area as normal.

 - Resource limits rely on memory cgroups to bound what userspace can do
   instead of the module parameter dma_entry_limit.

 - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
   require some additional work to properly expose PFN lifecycle between
   VFIO and iommufd.

 - Various components of the mdev API are not completed yet

 - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
   implemented.

 - The 'dirty tracking' is not implemented

 - A full audit for pedantic compatibility details (e.g. errnos) has
   not yet been done

 - powerpc SPAPR is left out, as it is not connected to the iommu_domain
   framework. My hope is that SPAPR will be moved into the iommu_domain
   framework as a special HW-specific type, and I would expect POWER to
   support the generic interface through a normal iommu_domain.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/iommufd_private.h |   6 +
 drivers/iommu/iommufd/main.c            |  16 +-
 drivers/iommu/iommufd/vfio_compat.c     | 401 ++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  36 +++
 5 files changed, 456 insertions(+), 6 deletions(-)
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index ca28a135b9675f..2fdff04000b326 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -5,6 +5,7 @@ iommufd-y := \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
-	pages.o
+	pages.o \
+	vfio_compat.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index e5c717231f851e..31628591591c17 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -67,6 +67,8 @@ void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
 struct iommufd_ctx {
 	struct file *filp;
 	struct xarray objects;
+
+	struct iommufd_ioas *vfio_ioas;
 };
 
 struct iommufd_ctx *iommufd_fget(int fd);
@@ -78,6 +80,9 @@ struct iommufd_ucmd {
 	void *cmd;
 };
 
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg);
+
 /* Copy the response in ucmd->cmd back to userspace. */
 static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 				       size_t cmd_len)
@@ -186,6 +191,7 @@ int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 6a895489fb5b82..f746fcff8145cb 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -122,6 +122,8 @@ bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
 		return false;
 	}
 	__xa_erase(&ictx->objects, obj->id);
+	if (ictx->vfio_ioas && &ictx->vfio_ioas->obj == obj)
+		ictx->vfio_ioas = NULL;
 	xa_unlock(&ictx->objects);
 
 	iommufd_object_ops[obj->type].destroy(obj);
@@ -219,27 +221,31 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 		 length),
+	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
+		 __reserved),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
 			       unsigned long arg)
 {
+	struct iommufd_ctx *ictx = filp->private_data;
 	struct iommufd_ucmd ucmd = {};
 	struct iommufd_ioctl_op *op;
 	union ucmd_buffer buf;
 	unsigned int nr;
 	int ret;
 
-	ucmd.ictx = filp->private_data;
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return iommufd_vfio_ioctl(ictx, cmd, arg);
+
+	ucmd.ictx = ictx;
 	ucmd.ubuffer = (void __user *)arg;
 	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
 	if (ret)
 		return ret;
 
-	nr = _IOC_NR(cmd);
-	if (nr < IOMMUFD_CMD_BASE ||
-	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
-		return -ENOIOCTLCMD;
 	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
 	if (op->ioctl_num != cmd)
 		return -ENOIOCTLCMD;
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
new file mode 100644
index 00000000000000..5c996bc9b44d48
--- /dev/null
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -0,0 +1,401 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/file.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <uapi/linux/vfio.h>
+#include <uapi/linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+static struct iommufd_ioas *get_compat_ioas(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas = ERR_PTR(-ENODEV);
+
+	xa_lock(&ictx->objects);
+	if (!ictx->vfio_ioas || !iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		goto out_unlock;
+	ioas = ictx->vfio_ioas;
+out_unlock:
+	xa_unlock(&ictx->objects);
+	return ioas;
+}
+
+/*
+ * Only attaching a group should cause a default creation of the internal ioas;
+ * this returns the existing ioas if it has already been assigned somehow.
+ * FIXME: maybe_unused
+ */
+static __maybe_unused struct iommufd_ioas *
+create_compat_ioas(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas = NULL;
+	struct iommufd_ioas *out_ioas;
+
+	ioas = iommufd_ioas_alloc(ictx);
+	if (IS_ERR(ioas))
+		return ioas;
+
+	xa_lock(&ictx->objects);
+	if (ictx->vfio_ioas && iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		out_ioas = ictx->vfio_ioas;
+	else
+		out_ioas = ioas;
+	xa_unlock(&ictx->objects);
+
+	if (out_ioas != ioas) {
+		iommufd_object_abort(ictx, &ioas->obj);
+		return out_ioas;
+	}
+	if (!iommufd_lock_obj(&ioas->obj))
+		WARN_ON(true);
+	iommufd_object_finalize(ictx, &ioas->obj);
+	return ioas;
+}
+
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_vfio_ioas *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+	switch (cmd->op) {
+	case IOMMU_VFIO_IOAS_GET:
+		ioas = get_compat_ioas(ucmd->ictx);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		cmd->ioas_id = ioas->obj.id;
+		iommufd_put_object(&ioas->obj);
+		return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+	case IOMMU_VFIO_IOAS_SET:
+		ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = ioas;
+		xa_unlock(&ucmd->ictx->objects);
+		iommufd_put_object(&ioas->obj);
+		return 0;
+
+	case IOMMU_VFIO_IOAS_CLEAR:
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = NULL;
+		xa_unlock(&ucmd->ictx->objects);
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int iommufd_vfio_map_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				void __user *arg)
+{
+	u32 supported_flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+	struct vfio_iommu_type1_dma_map map;
+	int iommu_prot = IOMMU_CACHE;
+	struct iommufd_ioas *ioas;
+	unsigned long iova;
+	int rc;
+
+	if (copy_from_user(&map, arg, minsz))
+		return -EFAULT;
+
+	if (map.argsz < minsz || map.flags & ~supported_flags)
+		return -EINVAL;
+
+	if (map.flags & VFIO_DMA_MAP_FLAG_READ)
+		iommu_prot |= IOMMU_READ;
+	if (map.flags & VFIO_DMA_MAP_FLAG_WRITE)
+		iommu_prot |= IOMMU_WRITE;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	iova = map.iova;
+	rc = iopt_map_user_pages(&ioas->iopt, &iova,
+				 u64_to_user_ptr(map.vaddr), map.size,
+				 iommu_prot, 0);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				  void __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	struct vfio_iommu_type1_dma_unmap unmap;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	if (copy_from_user(&unmap, arg, minsz))
+		return -EFAULT;
+
+	if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
+		return -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
+		rc = iopt_unmap_all(&ioas->iopt);
+	else
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_check_extension(unsigned long type)
+{
+	switch (type) {
+	case VFIO_TYPE1v2_IOMMU:
+	case VFIO_UNMAP_ALL:
+		return 1;
+	/*
+	 * FIXME: The type1 iommu allows splitting of maps, which can fail.
+	 * This is doable but is a bunch of extra code that is only for
+	 * supporting this case.
+	 */
+	case VFIO_TYPE1_IOMMU:
+	/*
+	 * FIXME: No idea what VFIO_TYPE1_NESTING_IOMMU does as far as the uAPI
+	 * is concerned. Seems like it was never completed, it only does
+	 * something on ARM, but I can't figure out what or how to use it. Can't
+	 * find any user implementation either.
+	 */
+	case VFIO_TYPE1_NESTING_IOMMU:
+	/*
+	 * FIXME: Easy to support, but needs rework in the Intel iommu driver
+	 * to expose the no snoop squashing to iommufd
+	 */
+	case VFIO_DMA_CC_IOMMU:
+	/*
+	 * FIXME: VFIO_DMA_MAP_FLAG_VADDR
+	 * https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
+	 * Wow, what a wild feature. This should have been implemented by
+	 * allowing an iopt_pages to be associated with a memfd. It can then
+	 * source mapping requests directly from a memfd without going through
+	 * a mm_struct and thus doesn't care that the original qemu exec'd
+	 * itself. The idea that userspace can flip a flag and cause kernel
+	 * users to block indefinitely is unacceptable.
+	 *
+	 * For VFIO compat we should implement this in a slightly different
+	 * way: creating an access_user that spans the whole area will
+	 * immediately stop new faults as they will be handled from the
+	 * xarray. We can then reparent the iopt_pages to the new mm_struct
+	 * and undo the access_user. No blockage of kernel users is required,
+	 * though it does require filling the xarray with pages.
+	 */
+	case VFIO_UPDATE_VADDR:
+	default:
+		return 0;
+	}
+
+	/*
+	 * FIXME: VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP: I think everything
+	 * with dirty tracking should be in its own ioctl, not muddled into
+	 * unmap. If we want to atomically unmap and get the dirty bitmap it
+	 * should be a flag in the dirty tracking ioctl, not here in unmap.
+	 * Overall dirty tracking needs a careful review alongside HW drivers
+	 * implementing it.
+	 */
+}
+
+static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type)
+{
+	struct iommufd_ioas *ioas = NULL;
+
+	if (type != VFIO_TYPE1v2_IOMMU)
+		return -EINVAL;
+
+	/* VFIO fails the set_iommu if there is no group */
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	iommufd_put_object(&ioas->obj);
+	return 0;
+}
+
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+	/*
+	 * FIXME: See vfio_update_pgsize_bitmap(); for compat this should
+	 * return the high bits too, and we need to decide if we should report
+	 * that iommufd supports less than PAGE_SIZE alignment or stick to
+	 * strict compatibility. qemu only cares about the first set bit.
+	 */
+	return ioas->iopt.iova_alignment;
+}
+
+static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas,
+				 struct vfio_info_cap_header __user *cur,
+				 size_t avail)
+{
+	struct vfio_iommu_type1_info_cap_iova_range __user *ucap_iovas =
+		container_of(cur, struct vfio_iommu_type1_info_cap_iova_range,
+			     header);
+	struct vfio_iommu_type1_info_cap_iova_range cap_iovas = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE,
+			.version = 1,
+		},
+	};
+	struct interval_tree_span_iter span;
+
+	for (interval_tree_span_iter_first(
+		     &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
+	     !interval_tree_span_iter_done(&span);
+	     interval_tree_span_iter_next(&span)) {
+		struct vfio_iova_range range;
+
+		if (!span.is_hole)
+			continue;
+		range.start = span.start_hole;
+		range.end = span.last_hole;
+		if (avail >= struct_size(&cap_iovas, iova_ranges,
+					 cap_iovas.nr_iovas + 1) &&
+		    copy_to_user(&ucap_iovas->iova_ranges[cap_iovas.nr_iovas],
+				 &range, sizeof(range)))
+			return -EFAULT;
+		cap_iovas.nr_iovas++;
+	}
+	if (avail >= struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas) &&
+	    copy_to_user(ucap_iovas, &cap_iovas, sizeof(cap_iovas)))
+		return -EFAULT;
+	return struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas);
+}
+
+static int iommufd_fill_cap_dma_avail(struct iommufd_ioas *ioas,
+				      struct vfio_info_cap_header __user *cur,
+				      size_t avail)
+{
+	struct vfio_iommu_type1_info_dma_avail cap_dma = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL,
+			.version = 1,
+		},
+		/* iommufd has no limit, return the same value as VFIO. */
+		.avail = U16_MAX,
+	};
+
+	if (avail >= sizeof(cap_dma) &&
+	    copy_to_user(cur, &cap_dma, sizeof(cap_dma)))
+		return -EFAULT;
+	return sizeof(cap_dma);
+}
+
+static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
+				       void __user *arg)
+{
+	typedef int (*fill_cap_fn)(struct iommufd_ioas *ioas,
+				   struct vfio_info_cap_header __user *cur,
+				   size_t avail);
+	static const fill_cap_fn fill_fns[] = {
+		iommufd_fill_cap_iova,
+		iommufd_fill_cap_dma_avail,
+	};
+	size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+	struct vfio_info_cap_header __user *last_cap = NULL;
+	struct vfio_iommu_type1_info info;
+	struct iommufd_ioas *ioas;
+	size_t total_cap_size;
+	int rc;
+	int i;
+
+	if (copy_from_user(&info, arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+	minsz = min_t(size_t, info.argsz, sizeof(info));
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	info.flags = VFIO_IOMMU_INFO_PGSIZES;
+	info.iova_pgsizes = iommufd_get_pagesizes(ioas);
+	info.cap_offset = 0;
+
+	total_cap_size = sizeof(info);
+	for (i = 0; i != ARRAY_SIZE(fill_fns); i++) {
+		int cap_size;
+
+		if (info.argsz > total_cap_size)
+			cap_size = fill_fns[i](ioas, arg + total_cap_size,
+					       info.argsz - total_cap_size);
+		else
+			cap_size = fill_fns[i](ioas, NULL, 0);
+		if (cap_size < 0) {
+			rc = cap_size;
+			goto out_put;
+		}
+		if (last_cap && info.argsz >= total_cap_size &&
+		    put_user(total_cap_size, &last_cap->next)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		last_cap = arg + total_cap_size;
+		total_cap_size += cap_size;
+	}
+
+	/*
+	 * If the user did not provide enough space then only some caps are
+	 * returned and the argsz will be updated to the correct amount to get
+	 * all caps.
+	 */
+	if (info.argsz >= total_cap_size)
+		info.cap_offset = sizeof(info);
+	info.argsz = total_cap_size;
+	info.flags |= VFIO_IOMMU_INFO_CAPS;
+	if (copy_to_user(arg, &info, minsz))
+		rc = -EFAULT;
+	else
+		rc = 0;
+
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* FIXME TODO:
+PowerPC SPAPR only:
+#define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+*/
+
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		return VFIO_API_VERSION;
+	case VFIO_SET_IOMMU:
+		return iommufd_vfio_set_iommu(ictx, arg);
+	case VFIO_CHECK_EXTENSION:
+		return iommufd_vfio_check_extension(arg);
+	case VFIO_IOMMU_GET_INFO:
+		return iommufd_vfio_iommu_get_info(ictx, uarg);
+	case VFIO_IOMMU_MAP_DMA:
+		return iommufd_vfio_map_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_UNMAP_DMA:
+		return iommufd_vfio_unmap_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_DIRTY_PAGES:
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return -ENOIOCTLCMD;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index ba7b17ec3002e3..2c0f5ced417335 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -42,6 +42,7 @@ enum {
 	IOMMUFD_CMD_IOAS_MAP,
 	IOMMUFD_CMD_IOAS_COPY,
 	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_VFIO_IOAS,
 };
 
 /**
@@ -184,4 +185,39 @@ struct iommu_ioas_unmap {
 	__aligned_u64 length;
 };
 #define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_vfio_ioas_op
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not affected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 #endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 12/12] iommufd: Add a selftest
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Cover the essential functionality of the iommufd with a directed
test. This aims to achieve reasonable functional coverage using the
in-kernel selftest framework.

It provides a mock for the iommu_domain that allows it to run without any
HW and the mocking provides a way to directly validate that the PFNs
loaded into the iommu_domain are correct.
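
For instance (sketch only; it assumes the test dispatcher routes the
'id' field to the target object, as the kernel side below suggests), a
test can turn an IOAS into a mock domain and then ask the kernel to
cross-check an IOVA range against the backing user memory:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/types.h>
  /* plus the struct iommu_test_cmd definitions from iommufd_test.h */

  static int check_mock_mapping(int iommufd, __u32 ioas_id, __u64 iova,
                                void *buf, __u64 len)
  {
          struct iommu_test_cmd cmd = {
                  .size = sizeof(cmd),
                  .op = IOMMU_TEST_OP_MOCK_DOMAIN,
                  .id = ioas_id,
          };
          __u32 hwpt_id;

          if (ioctl(iommufd, IOMMU_TEST_CMD, &cmd))
                  return -1;
          hwpt_id = cmd.id;       /* the mock hw_pagetable's object id */

          /* ... an IOMMU_IOAS_MAP of buf at iova would go here ... */

          cmd = (struct iommu_test_cmd){
                  .size = sizeof(cmd),
                  .op = IOMMU_TEST_OP_MD_CHECK_MAP,
                  .id = hwpt_id,
                  .check_map = { .iova = iova, .length = len,
                                 .uptr = (uintptr_t)buf },
          };
          return ioctl(iommufd, IOMMU_TEST_CMD, &cmd);
  }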

The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will typically operate at a 2K iommu page size. This allows
exercising all of the calculations to support this mismatch.
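
Roughly (illustrative numbers only, mirroring the constants in
selftest.c below): with 4KiB CPU pages the mock IO page size is 2KiB, so
pinning a single CPU page must produce two mock IOPTEs, stored as
shifted pfns in an xarray:

  /* Sketch of the mock's xarray encoding for one 4KiB CPU page at 'iova' */
  #define EXAMPLE_IO_PAGE_SIZE (4096 / 2)  /* MOCK_IO_PAGE_SIZE on 4K kernels */

  static void example_mock_entries(unsigned long iova, unsigned long paddr,
                                   unsigned long idx[2], unsigned long val[2])
  {
          idx[0] = iova / EXAMPLE_IO_PAGE_SIZE;   /* first 2KiB IOPTE slot */
          idx[1] = idx[0] + 1;                    /* second 2KiB IOPTE slot */
          /*
           * Values hold the shifted pfn; mock_domain_map_pages() also ORs in
           * MOCK_PFN_START_IOVA/MOCK_PFN_LAST_IOVA for the first/last IOPTE
           * of the whole map.
           */
          val[0] = paddr / EXAMPLE_IO_PAGE_SIZE;
          val[1] = (paddr + EXAMPLE_IO_PAGE_SIZE) / EXAMPLE_IO_PAGE_SIZE;
  }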

This allows achieving high coverage of the corner cases in the iopt_pages.

However, enabling all of this requires an unusually invasive config
option, one that should never be enabled in a production kernel.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Kconfig            |    9 +
 drivers/iommu/iommufd/Makefile           |    2 +
 drivers/iommu/iommufd/iommufd_private.h  |    9 +
 drivers/iommu/iommufd/iommufd_test.h     |   65 ++
 drivers/iommu/iommufd/main.c             |   12 +
 drivers/iommu/iommufd/pages.c            |    4 +
 drivers/iommu/iommufd/selftest.c         |  495 +++++++++
 tools/testing/selftests/Makefile         |    1 +
 tools/testing/selftests/iommu/.gitignore |    2 +
 tools/testing/selftests/iommu/Makefile   |   11 +
 tools/testing/selftests/iommu/config     |    2 +
 tools/testing/selftests/iommu/iommufd.c  | 1225 ++++++++++++++++++++++
 12 files changed, 1837 insertions(+)
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index fddd453bb0e764..9b41fde7c839c5 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -11,3 +11,12 @@ config IOMMUFD
 	  This would commonly be used in combination with VFIO.
 
 	  If you don't know what to do here, say N.
+
+config IOMMUFD_TEST
+	bool "IOMMU Userspace API Test support"
+	depends on IOMMUFD
+	depends on RUNTIME_TESTING_MENU
+	default n
+	help
+	  This is dangerous; do not enable it unless you are running
+	  tools/testing/selftests/iommu
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2fdff04000b326..8aeba81800c512 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,4 +8,6 @@ iommufd-y := \
 	pages.o \
 	vfio_compat.o
 
+iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
+
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 31628591591c17..6f11470c8ea677 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -102,6 +102,9 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_DEVICE,
+#ifdef CONFIG_IOMMUFD_TEST
+	IOMMUFD_OBJ_SELFTEST,
+#endif
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
@@ -219,4 +222,10 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
 void iommufd_device_destroy(struct iommufd_object *obj);
 
+#ifdef CONFIG_IOMMUFD_TEST
+int iommufd_test(struct iommufd_ucmd *ucmd);
+void iommufd_selftest_destroy(struct iommufd_object *obj);
+extern size_t iommufd_test_memory_limit;
+#endif
+
 #endif
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
new file mode 100644
index 00000000000000..d22ef484af1a90
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_TEST_H
+#define _UAPI_IOMMUFD_TEST_H
+
+#include <linux/types.h>
+#include <linux/iommufd.h>
+
+enum {
+	IOMMU_TEST_OP_ADD_RESERVED,
+	IOMMU_TEST_OP_MOCK_DOMAIN,
+	IOMMU_TEST_OP_MD_CHECK_MAP,
+	IOMMU_TEST_OP_MD_CHECK_REFS,
+	IOMMU_TEST_OP_ACCESS_PAGES,
+	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+};
+
+enum {
+	MOCK_APERTURE_START = 1UL << 24,
+	MOCK_APERTURE_LAST = (1UL << 31) - 1,
+};
+
+enum {
+	MOCK_FLAGS_ACCESS_WRITE = 1 << 0,
+};
+
+struct iommu_test_cmd {
+	__u32 size;
+	__u32 op;
+	__u32 id;
+	union {
+		struct {
+			__u32 device_id;
+		} mock_domain;
+		struct {
+			__aligned_u64 start;
+			__aligned_u64 length;
+		} add_reserved;
+		struct {
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} check_map;
+		struct {
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+			__u32 refs;
+		} check_refs;
+		struct {
+			__u32 flags;
+			__u32 out_access_id;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} access_pages;
+		struct {
+			__u32 limit;
+		} memory_limit;
+	};
+	__u32 last;
+};
+#define IOMMU_TEST_CMD _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE + 32)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index f746fcff8145cb..8c820bb90caa72 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -24,6 +24,7 @@
 #include <uapi/linux/iommufd.h>
 
 #include "iommufd_private.h"
+#include "iommufd_test.h"
 
 struct iommufd_object_ops {
 	void (*destroy)(struct iommufd_object *obj);
@@ -191,6 +192,9 @@ union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_destroy destroy;
+#ifdef CONFIG_IOMMUFD_TEST
+	struct iommu_test_cmd test;
+#endif
 };
 
 struct iommufd_ioctl_op {
@@ -223,6 +227,9 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+#ifdef CONFIG_IOMMUFD_TEST
+	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
+#endif
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -299,6 +306,11 @@ static struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_HW_PAGETABLE] = {
 		.destroy = iommufd_hw_pagetable_destroy,
 	},
+#ifdef CONFIG_IOMMUFD_TEST
+	[IOMMUFD_OBJ_SELFTEST] = {
+		.destroy = iommufd_selftest_destroy,
+	},
+#endif
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 8e6a8cc8b20ad1..3fd39e0201f542 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -48,7 +48,11 @@
 
 #include "io_pagetable.h"
 
+#ifndef CONFIG_IOMMUFD_TEST
 #define TEMP_MEMORY_LIMIT 65536
+#else
+#define TEMP_MEMORY_LIMIT iommufd_test_memory_limit
+#endif
 #define BATCH_BACKUP_SIZE 32
 
 /*
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
new file mode 100644
index 00000000000000..a665719b493ec5
--- /dev/null
+++ b/drivers/iommu/iommufd/selftest.c
@@ -0,0 +1,495 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * Kernel side components to support tools/testing/selftests/iommu
+ */
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+#include "iommufd_test.h"
+
+size_t iommufd_test_memory_limit = 65536;
+
+enum {
+	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
+
+	/*
+	 * Like a real page table, alignment requires the low bits of the address
+	 * to be zero. xarray also requires the high bit to be zero, so we store
+	 * the pfns shifted. The upper bits are used for metadata.
+	 */
+	MOCK_PFN_MASK = ULONG_MAX / MOCK_IO_PAGE_SIZE,
+
+	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
+	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+};
+
+struct mock_iommu_domain {
+	struct iommu_domain domain;
+	struct xarray pfns;
+};
+
+enum selftest_obj_type {
+	TYPE_ACCESS,
+	TYPE_IDEV,
+};
+
+struct selftest_obj {
+	struct iommufd_object obj;
+	enum selftest_obj_type type;
+
+	union {
+		struct {
+			struct iommufd_ioas *ioas;
+			unsigned long iova;
+			size_t length;
+		} access;
+		struct {
+			struct iommufd_hw_pagetable *hwpt;
+			struct iommufd_ctx *ictx;
+		} idev;
+	};
+};
+
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+{
+	struct mock_iommu_domain *mock;
+
+	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
+		return NULL;
+
+	mock = kzalloc(sizeof(*mock), GFP_KERNEL);
+	if (!mock)
+		return NULL;
+	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
+	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
+	xa_init(&mock->pfns);
+	return &mock->domain;
+}
+
+static void mock_domain_free(struct iommu_domain *domain)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+
+	WARN_ON(!xa_empty(&mock->pfns));
+	kfree(mock);
+}
+
+static int mock_domain_map_pages(struct iommu_domain *domain,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t pgsize, size_t pgcount, int prot,
+				 gfp_t gfp, size_t *mapped)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = MOCK_PFN_START_IOVA;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			void *old;
+
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				flags = MOCK_PFN_LAST_IOVA;
+			old = xa_store(&mock->pfns, iova / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value((paddr / MOCK_IO_PAGE_SIZE) | flags),
+				       GFP_KERNEL);
+			if (xa_is_err(old))
+				return xa_err(old);
+			WARN_ON(old);
+			iova += MOCK_IO_PAGE_SIZE;
+			paddr += MOCK_IO_PAGE_SIZE;
+			*mapped += MOCK_IO_PAGE_SIZE;
+			flags = 0;
+		}
+	}
+	return 0;
+}
+
+static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
+				      unsigned long iova, size_t pgsize,
+				      size_t pgcount,
+				      struct iommu_iotlb_gather *iotlb_gather)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	bool first = true;
+	size_t ret = 0;
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+			WARN_ON(!ent);
+			/*
+			 * iommufd generates unmaps that must be a strict
+			 * superset of the maps performed, so every starting
+			 * IOVA should have been an iova passed to map. The
+			 * first IOVA must be present and have been a first
+			 * IOVA passed to map_pages.
+			 */
+			if (first) {
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_START_IOVA));
+				first = false;
+			}
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_LAST_IOVA));
+
+			iova += MOCK_IO_PAGE_SIZE;
+			ret += MOCK_IO_PAGE_SIZE;
+		}
+	}
+	return ret;
+}
+
+static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
+					    dma_addr_t iova)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	WARN_ON(!ent);
+	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
+}
+
+static const struct iommu_ops mock_ops = {
+	.owner = THIS_MODULE,
+	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
+	.domain_alloc = mock_domain_alloc,
+	.default_domain_ops =
+		&(struct iommu_domain_ops){
+			.free = mock_domain_free,
+			.map_pages = mock_domain_map_pages,
+			.unmap_pages = mock_domain_unmap_pages,
+			.iova_to_phys = mock_domain_iova_to_phys,
+		},
+};
+
+static inline struct iommufd_hw_pagetable *
+get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id,
+		 struct mock_iommu_domain **mock)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, mockpt_id,
+				 IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+	hwpt = container_of(obj, struct iommufd_hw_pagetable, obj);
+	if (hwpt->domain->ops != mock_ops.default_domain_ops) {
+		iommufd_put_object(&hwpt->obj);
+		return ERR_PTR(-EINVAL);
+	}
+	*mock = container_of(hwpt->domain, struct mock_iommu_domain, domain);
+	return hwpt;
+}
+
+/* Create an hw_pagetable with the mock domain so we can test the domain ops */
+static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
+				    struct iommu_test_cmd *cmd)
+{
+	struct bus_type mock_bus = { .iommu_ops = &mock_ops };
+	struct device mock_dev = { .bus = &mock_bus };
+	struct iommufd_hw_pagetable *hwpt;
+	struct selftest_obj *sobj;
+	int rc;
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
+	sobj->idev.ictx = ucmd->ictx;
+	sobj->type = TYPE_IDEV;
+
+	hwpt = iommufd_hw_pagetable_from_id(ucmd->ictx, cmd->id, &mock_dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_sobj;
+	}
+	if (WARN_ON(refcount_read(&hwpt->obj.users) != 2)) {
+		rc = -EINVAL;
+		goto out_hwpt;
+	}
+	sobj->idev.hwpt = hwpt;
+
+	/* Creating a real iommufd_device is too hard, fake one */
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		goto out_hwpt;
+
+	/* Convert auto domain to user created */
+	list_del_init(&hwpt->auto_domains_item);
+	cmd->id = hwpt->obj.id;
+	cmd->mock_domain.device_id = sobj->obj.id;
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_hwpt:
+	iommufd_hw_pagetable_put(ucmd->ictx, hwpt);
+out_sobj:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+	return rc;
+}
+
+/* Add an additional reserved IOVA to the IOAS */
+static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd,
+				     unsigned int mockpt_id,
+				     unsigned long start, size_t length)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, mockpt_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	down_write(&ioas->iopt.iova_rwsem);
+	rc = iopt_reserve_iova(&ioas->iopt, start, start + length - 1, NULL);
+	up_write(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Check that every pfn under each iova matches the pfn under a user VA */
+static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd,
+				    unsigned int mockpt_id, unsigned long iova,
+				    size_t length, void __user *uptr)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc;
+
+	if (iova % MOCK_IO_PAGE_SIZE || length % MOCK_IO_PAGE_SIZE ||
+	    (uintptr_t)uptr % MOCK_IO_PAGE_SIZE)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	for (; length; length -= MOCK_IO_PAGE_SIZE) {
+		struct page *pages[1];
+		unsigned long pfn;
+		long npages;
+		void *ent;
+
+		npages = get_user_pages_fast((uintptr_t)uptr & PAGE_MASK, 1, 0,
+					     pages);
+		if (npages < 0) {
+			rc = npages;
+			goto out_put;
+		}
+		if (WARN_ON(npages != 1)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		pfn = page_to_pfn(pages[0]);
+		put_page(pages[0]);
+
+		ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+		if (!ent ||
+		    (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE !=
+			    pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {
+			rc = -EINVAL;
+			goto out_put;
+		}
+		iova += MOCK_IO_PAGE_SIZE;
+		uptr += MOCK_IO_PAGE_SIZE;
+	}
+	rc = 0;
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+/* Check that the page ref count matches, to look for missing pin/unpins */
+static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
+				      void __user *uptr, size_t length,
+				      unsigned int refs)
+{
+	if (length % PAGE_SIZE || (uintptr_t)uptr % PAGE_SIZE)
+		return -EINVAL;
+
+	for (; length; length -= PAGE_SIZE) {
+		struct page *pages[1];
+		long npages;
+
+		npages = get_user_pages_fast((uintptr_t)uptr, 1, 0, pages);
+		if (npages < 0)
+			return npages;
+		if (WARN_ON(npages != 1))
+			return -EFAULT;
+		if (!PageCompound(pages[0])) {
+			unsigned int count;
+
+			count = page_ref_count(pages[0]);
+			if (count / GUP_PIN_COUNTING_BIAS != refs) {
+				put_page(pages[0]);
+				return -EIO;
+			}
+		}
+		put_page(pages[0]);
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+/* Check that the pages in a page array match the pages in the user VA */
+static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
+				    size_t npages)
+{
+	for (; npages; npages--) {
+		struct page *tmp_pages[1];
+		long rc;
+
+		rc = get_user_pages_fast((uintptr_t)uptr, 1, 0, tmp_pages);
+		if (rc < 0)
+			return rc;
+		if (WARN_ON(rc != 1))
+			return -EFAULT;
+		put_page(tmp_pages[0]);
+		if (tmp_pages[0] != *pages)
+			return -EBADE;
+		pages++;
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+/* Test iopt_access_pages() by checking it returns the correct pages */
+static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
+				     unsigned int ioas_id, unsigned long iova,
+				     size_t length, void __user *uptr,
+				     u32 flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_obj *sobj;
+	struct page **pages;
+	size_t npages;
+	int rc;
+
+	if (flags & ~MOCK_FLAGS_ACCESS_WRITE)
+		return -EOPNOTSUPP;
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
+	sobj->type = TYPE_ACCESS;
+
+	npages = (ALIGN(iova + length, PAGE_SIZE) -
+		  ALIGN_DOWN(iova, PAGE_SIZE)) /
+		 PAGE_SIZE;
+	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages) {
+		rc = -ENOMEM;
+		goto out_abort;
+	}
+
+	sobj->access.ioas = iommufd_get_ioas(ucmd, ioas_id);
+	if (IS_ERR(sobj->access.ioas)) {
+		rc = PTR_ERR(sobj->access.ioas);
+		goto out_free;
+	}
+
+	sobj->access.iova = iova;
+	sobj->access.length = length;
+	rc = iopt_access_pages(&sobj->access.ioas->iopt, iova, length, pages,
+			       flags & MOCK_FLAGS_ACCESS_WRITE);
+	if (rc)
+		goto out_put;
+
+	rc = iommufd_test_check_pages(
+		uptr - (iova - ALIGN_DOWN(iova, PAGE_SIZE)), pages, npages);
+	if (rc)
+		goto out_unaccess;
+
+	cmd->access_pages.out_access_id = sobj->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_unaccess;
+
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	iommufd_put_object_keep_user(&sobj->access.ioas->obj);
+	kvfree(pages);
+	return 0;
+out_unaccess:
+	iopt_unaccess_pages(&sobj->access.ioas->iopt, iova, length);
+out_put:
+	iommufd_put_object(&sobj->access.ioas->obj);
+out_free:
+	kvfree(pages);
+out_abort:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+	return rc;
+}
+
+void iommufd_selftest_destroy(struct iommufd_object *obj)
+{
+	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
+
+	switch (sobj->type) {
+	case TYPE_IDEV:
+		iopt_table_remove_domain(&sobj->idev.hwpt->ioas->iopt,
+					 sobj->idev.hwpt->domain);
+		iommufd_hw_pagetable_put(sobj->idev.ictx, sobj->idev.hwpt);
+		break;
+	case TYPE_ACCESS:
+		iopt_unaccess_pages(&sobj->access.ioas->iopt,
+				    sobj->access.iova, sobj->access.length);
+		refcount_dec(&sobj->access.ioas->obj.users);
+		break;
+	}
+}
+
+int iommufd_test(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+
+	switch (cmd->op) {
+	case IOMMU_TEST_OP_ADD_RESERVED:
+		return iommufd_test_add_reserved(ucmd, cmd->id,
+						 cmd->add_reserved.start,
+						 cmd->add_reserved.length);
+	case IOMMU_TEST_OP_MOCK_DOMAIN:
+		return iommufd_test_mock_domain(ucmd, cmd);
+	case IOMMU_TEST_OP_MD_CHECK_MAP:
+		return iommufd_test_md_check_pa(
+			ucmd, cmd->id, cmd->check_map.iova,
+			cmd->check_map.length,
+			u64_to_user_ptr(cmd->check_map.uptr));
+	case IOMMU_TEST_OP_MD_CHECK_REFS:
+		return iommufd_test_md_check_refs(
+			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
+			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_ACCESS_PAGES:
+		return iommufd_test_access_pages(
+			ucmd, cmd->id, cmd->access_pages.iova,
+			cmd->access_pages.length,
+			u64_to_user_ptr(cmd->access_pages.uptr),
+			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
+		iommufd_test_memory_limit = cmd->memory_limit.limit;
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d08fe4cfe81152..5533a3b2e8af51 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -21,6 +21,7 @@ TARGETS += ftrace
 TARGETS += futex
 TARGETS += gpio
 TARGETS += intel_pstate
+TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
 TARGETS += kcmp
diff --git a/tools/testing/selftests/iommu/.gitignore b/tools/testing/selftests/iommu/.gitignore
new file mode 100644
index 00000000000000..c6bd07e7ff59b3
--- /dev/null
+++ b/tools/testing/selftests/iommu/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/iommufd
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
new file mode 100644
index 00000000000000..7bc38b3beaeb20
--- /dev/null
+++ b/tools/testing/selftests/iommu/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+CFLAGS += -D_GNU_SOURCE
+
+TEST_GEN_PROGS :=
+TEST_GEN_PROGS += iommufd
+
+include ../lib.mk
diff --git a/tools/testing/selftests/iommu/config b/tools/testing/selftests/iommu/config
new file mode 100644
index 00000000000000..6c4f901d6fed3c
--- /dev/null
+++ b/tools/testing/selftests/iommu/config
@@ -0,0 +1,2 @@
+CONFIG_IOMMUFD
+CONFIG_IOMMUFD_TEST
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
new file mode 100644
index 00000000000000..5c47d706ed9449
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -0,0 +1,1225 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/fcntl.h>
+#include <sys/ioctl.h>
+#include <assert.h>
+#include <stddef.h>
+
+#include "../kselftest_harness.h"
+
+#define __EXPORTED_HEADERS__
+#include <linux/iommufd.h>
+#include <linux/vfio.h>
+#include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+
+static void *buffer;
+
+static unsigned long PAGE_SIZE;
+static unsigned long HUGEPAGE_SIZE;
+static unsigned long BUFFER_SIZE;
+
+#define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
+
+static unsigned long get_huge_page_size(void)
+{
+	char buf[80];
+	int ret;
+	int fd;
+
+	fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
+		  O_RDONLY);
+	if (fd < 0)
+		return 2 * 1024 * 1024;
+
+	ret = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0 || ret == sizeof(buf))
+		return 2 * 1024 * 1024;
+	buf[ret] = 0;
+	return strtoul(buf, NULL, 10);
+}
+
+static __attribute__((constructor)) void setup_sizes(void)
+{
+	int rc;
+
+	PAGE_SIZE = sysconf(_SC_PAGE_SIZE);
+	HUGEPAGE_SIZE = get_huge_page_size();
+
+	BUFFER_SIZE = PAGE_SIZE * 16;
+	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
+	assert(!rc && buffer && (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+}
+
+/*
+ * Have the kernel check the refcount on pages. I don't know why a freshly
+ * mmap'd anon non-compound page starts out with a ref of 3
+ */
+#define check_refs(_ptr, _length, _refs)                                       \
+	({                                                                     \
+		struct iommu_test_cmd test_cmd = {                             \
+			.size = sizeof(test_cmd),                              \
+			.op = IOMMU_TEST_OP_MD_CHECK_REFS,                     \
+			.check_refs = { .length = _length,                     \
+					.uptr = (uintptr_t)(_ptr),             \
+					.refs = _refs },                       \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),  \
+				&test_cmd));                                   \
+	})
+
+/* Hack to make assertions more readable */
+#define _IOMMU_TEST_CMD(x) IOMMU_TEST_CMD
+
+#define EXPECT_ERRNO(expected_errno, cmd)                                      \
+	({                                                                     \
+		ASSERT_EQ(-1, cmd);                                            \
+		EXPECT_EQ(expected_errno, errno);                              \
+	})
+
+FIXTURE(iommufd) {
+	int fd;
+};
+
+FIXTURE_SETUP(iommufd) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(iommufd) {
+	ASSERT_EQ(0, close(self->fd));
+}
+
+TEST_F(iommufd, simple_close)
+{
+}
+
+TEST_F(iommufd, cmd_fail)
+{
+	struct iommu_destroy cmd = { .size = sizeof(cmd), .id = 0 };
+
+	/* object id is invalid */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Bad pointer */
+	EXPECT_ERRNO(EFAULT, ioctl(self->fd, IOMMU_DESTROY, NULL));
+	/* Unknown ioctl */
+	EXPECT_ERRNO(ENOTTY,
+		     ioctl(self->fd, _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE - 1),
+			   &cmd));
+}
+
+TEST_F(iommufd, cmd_ex_fail)
+{
+	struct {
+		struct iommu_destroy cmd;
+		__u64 future;
+	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
+
+	/* object id is invalid and command is longer */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* future area is non-zero */
+	cmd.future = 1;
+	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Original command "works" */
+	cmd.cmd.size = sizeof(cmd.cmd);
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Short command fails */
+	cmd.cmd.size = sizeof(cmd.cmd) - 1;
+	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+}
+
+FIXTURE(iommufd_ioas) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint64_t base_iova;
+};
+
+FIXTURE_VARIANT(iommufd_ioas) {
+	unsigned int mock_domains;
+	unsigned int memory_limit;
+};
+
+FIXTURE_SETUP(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = variant->memory_limit},
+	};
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	unsigned int i;
+
+	if (!variant->memory_limit)
+		memlimit_cmd.memory_limit.limit = 65536;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	self->ioas_id = alloc_cmd.out_ioas_id;
+
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		struct iommu_test_cmd test_cmd = {
+			.size = sizeof(test_cmd),
+			.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+			.id = self->ioas_id,
+		};
+
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_id = test_cmd.id;
+		self->base_iova = MOCK_APERTURE_START;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = 65536},
+	};
+
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+	ASSERT_EQ(0, close(self->fd));
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	check_refs(buffer, BUFFER_SIZE, 0);
+	ASSERT_EQ(0, close(self->fd));
+}
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, no_domain) {
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain) {
+	.mock_domains = 1,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain) {
+	.mock_domains = 2,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_limit) {
+	.mock_domains = 1,
+	.memory_limit = 16,
+};
+
+TEST_F(iommufd_ioas, ioas_auto_destroy)
+{
+}
+
+TEST_F(iommufd_ioas, ioas_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+
+	if (self->domain_id) {
+		/* IOAS cannot be freed while a domain is on it */
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	} else {
+		/* Can allocate and manually free an IOAS table */
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, ioas_area_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+		.iova = self->base_iova,
+	};
+
+	/* Adding an area does not change ability to destroy */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	if (self->domain_id)
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	else
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
+TEST_F(iommufd_ioas, ioas_area_auto_destroy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Can allocate and automatically free an IOAS table with many areas */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Unmap fails if nothing is mapped */
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = i * PAGE_SIZE;
+		EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+	}
+
+	/* Unmap works */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Split fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	unmap_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	unmap_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Over map fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 3;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* unmap all works */
+	unmap_cmd.iova = 0;
+	unmap_cmd.length = UINT64_MAX;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+}
+
+TEST_F(iommufd_ioas, area_auto_iova)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	uint64_t iovas[10];
+	int i;
+
+	/* Simple 4k pages */
+	for (i = 0; i != 10; i++) {
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Kernel automatically aligns IOVAs properly */
+	if (self->domain_id)
+		map_cmd.user_va = (uintptr_t)buffer;
+	else
+		map_cmd.user_va = 1UL << 31;
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Avoids a reserved region */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+		EXPECT_EQ(false,
+			  map_cmd.iova > test_cmd.add_reserved.start &&
+				  map_cmd.iova <
+					  test_cmd.add_reserved.start +
+						  test_cmd.add_reserved.length);
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, copy_area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+
+	map_cmd.iova = self->base_iova;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* Copy inside a single IOAS */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = self->base_iova + PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+
+	/* Copy between IOAS's */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = 0;
+	copy_cmd.dst_ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+}
+
+TEST_F(iommufd_ioas, iova_ranges)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE },
+	};
+	struct iommu_ioas_iova_ranges *cmd = (void *)buffer;
+
+	*cmd = (struct iommu_ioas_iova_ranges){
+		.size = BUFFER_SIZE,
+		.ioas_id = self->ioas_id,
+	};
+
+	/* Range can be read */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	if (!self->domain_id) {
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[0].last);
+	} else {
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd);
+	EXPECT_ERRNO(EMSGSIZE,
+		     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].last);
+
+	/* 2 ranges */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	cmd->size = BUFFER_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	if (!self->domain_id) {
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+		EXPECT_EQ(PAGE_SIZE * 2, cmd->out_valid_iovas[1].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[1].last);
+	} else {
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd) + sizeof(cmd->out_valid_iovas[0]);
+	if (!self->domain_id) {
+		EXPECT_ERRNO(EMSGSIZE,
+			     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+	} else {
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
+}
+
+TEST_F(iommufd_ioas, access)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.id = self->ioas_id,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_test_cmd check_map_cmd = {
+		.size = sizeof(check_map_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_MAP,
+		.check_map = { .iova = MOCK_APERTURE_START,
+			       .length = BUFFER_SIZE,
+			       .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+	uint32_t id;
+
+	/* Single map/unmap */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+
+	/* Double user */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+
+	/* Add/remove a domain with a user */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	check_map_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),
+			&check_map_cmd));
+	destroy_cmd.id = mock_cmd.mock_domain.device_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
+FIXTURE(iommufd_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint32_t domain_ids[2];
+	int mmap_flags;
+	size_t mmap_buf_size;
+};
+
+FIXTURE_VARIANT(iommufd_mock_domain) {
+	unsigned int mock_domains;
+	bool hugepages;
+};
+
+FIXTURE_SETUP(iommufd_mock_domain)
+{
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	unsigned int i;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	self->ioas_id = alloc_cmd.out_ioas_id;
+
+	ASSERT_GE(ARRAY_SIZE(self->domain_ids), variant->mock_domains);
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		test_cmd.id = self->ioas_id;
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_ids[i] = test_cmd.id;
+	}
+	self->domain_id = self->domain_ids[0];
+
+	self->mmap_flags = MAP_SHARED | MAP_ANONYMOUS;
+	self->mmap_buf_size = PAGE_SIZE * 8;
+	if (variant->hugepages) {
+		/*
+		 * MAP_POPULATE will cause the kernel to fail mmap if THPs are
+		 * not available.
+		 */
+		self->mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
+		self->mmap_buf_size = HUGEPAGE_SIZE * 2;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_mock_domain) {
+	ASSERT_EQ(0, close(self->fd));
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	check_refs(buffer, BUFFER_SIZE, 0);
+	ASSERT_EQ(0, close(self->fd));
+}
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain) {
+	.mock_domains = 1,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains) {
+	.mock_domains = 2,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_hugepage) {
+	.mock_domains = 1,
+	.hugepages = true,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains_hugepage) {
+	.mock_domains = 2,
+	.hugepages = true,
+};
+
+/* Have the kernel check that the user pages made it to the iommu_domain */
+#define check_mock_iova(_ptr, _iova, _length)                                  \
+	({                                                                     \
+		struct iommu_test_cmd check_map_cmd = {                        \
+			.size = sizeof(check_map_cmd),                         \
+			.op = IOMMU_TEST_OP_MD_CHECK_MAP,                      \
+			.id = self->domain_id,                                 \
+			.check_map = { .iova = _iova,                          \
+				       .length = _length,                      \
+				       .uptr = (uintptr_t)(_ptr) },            \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),   \
+				&check_map_cmd));                              \
+		if (self->domain_ids[1]) {                                     \
+			check_map_cmd.id = self->domain_ids[1];                \
+			ASSERT_EQ(0,                                           \
+				  ioctl(self->fd,                              \
+					_IOMMU_TEST_CMD(                       \
+						IOMMU_TEST_OP_MD_CHECK_MAP),   \
+					&check_map_cmd));                      \
+		}                                                              \
+	})
+
+TEST_F(iommufd_mock_domain, basic)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t buf_size = self->mmap_buf_size;
+	uint8_t *buf;
+
+	/* Simple one page map */
+	map_cmd.user_va = (uintptr_t)buffer;
+	map_cmd.length = PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	check_mock_iova(buffer, map_cmd.iova, map_cmd.length);
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+
+	/* EFAULT half way through mapping */
+	ASSERT_EQ(0, munmap(buf + buf_size / 2, buf_size / 2));
+	map_cmd.user_va = (uintptr_t)buf;
+	map_cmd.length = buf_size;
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* EFAULT on first page */
+	ASSERT_EQ(0, munmap(buf, buf_size / 2));
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t test_step =
+		variant->hugepages ? (self->mmap_buf_size / 16) : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of length and alignment within a big region,
+	 * reduced for the hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end <= buf_size; end += MOCK_PAGE_SIZE) {
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns_copy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_test_cmd add_mock_pt = {
+		.size = sizeof(add_mock_pt),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+	};
+	size_t test_step =
+		variant->hugepages ? self->mmap_buf_size / 16 : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of length and alignment within a big region,
+	 * reduced for the hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end <= buf_size; end += MOCK_PAGE_SIZE) {
+			unsigned int old_id;
+
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+
+			/* Add and destroy a domain while the area exists */
+			add_mock_pt.id = self->ioas_id;
+			ASSERT_EQ(0, ioctl(self->fd,
+					   _IOMMU_TEST_CMD(
+						   IOMMU_TEST_OP_MOCK_DOMAIN),
+					   &add_mock_pt));
+			old_id = self->domain_ids[1];
+			self->domain_ids[1] = add_mock_pt.id;
+
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			destroy_cmd.id = add_mock_pt.mock_domain.device_id;
+			ASSERT_EQ(0,
+				  ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+			destroy_cmd.id = add_mock_pt.id;
+			ASSERT_EQ(0,
+				  ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+			self->domain_ids[1] = old_id;
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, user_copy)
+{
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.id = self->ioas_id,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_iova = MOCK_APERTURE_START,
+		.dst_iova = MOCK_APERTURE_START,
+		.length = BUFFER_SIZE,
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+
+	/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	map_cmd.ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	access_cmd.id = alloc_cmd.out_ioas_id;
+
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	copy_cmd.src_ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+	check_mock_iova(buffer, map_cmd.iova, BUFFER_SIZE);
+
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
+FIXTURE(vfio_compat_nodev) {
+	int fd;
+};
+
+FIXTURE_SETUP(vfio_compat_nodev) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(vfio_compat_nodev) {
+	ASSERT_EQ(0, close(self->fd));
+}
+
+TEST_F(vfio_compat_nodev, simple_ioctls)
+{
+	ASSERT_EQ(VFIO_API_VERSION, ioctl(self->fd, VFIO_GET_API_VERSION));
+	ASSERT_EQ(1, ioctl(self->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, unmap_cmd)
+{
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+	};
+
+	unmap_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.argsz = sizeof(unmap_cmd);
+	unmap_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.flags = 0;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+}
+
+TEST_F(vfio_compat_nodev, map_cmd)
+{
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+		.vaddr = (__u64)buffer,
+	};
+
+	map_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	map_cmd.argsz = sizeof(map_cmd);
+	map_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	/* Requires a domain to be attached */
+	map_cmd.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+}
+
+TEST_F(vfio_compat_nodev, info_cmd)
+{
+	struct vfio_iommu_type1_info info_cmd = {};
+
+	/* Invalid argsz */
+	info_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+
+	info_cmd.argsz = sizeof(info_cmd);
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+}
+
+TEST_F(vfio_compat_nodev, set_iommu_cmd)
+{
+	/* Requires a domain to be attached */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, vfio_ioas)
+{
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_GET,
+	};
+
+	/* ENODEV if there is no compat ioas */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Invalid id for set */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_SET;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Valid id for set */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	vfio_ioas_cmd.ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Same id comes back from get */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(alloc_cmd.out_ioas_id, vfio_ioas_cmd.ioas_id);
+
+	/* Clear works */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_CLEAR;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+}
+
+FIXTURE(vfio_compat_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+};
+
+FIXTURE_SETUP(vfio_compat_mock_domain) {
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_SET,
+	};
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	/* Create what VFIO would consider a group */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	self->ioas_id = alloc_cmd.out_ioas_id;
+	test_cmd.id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &test_cmd));
+	EXPECT_NE(0, test_cmd.id);
+
+	/* Attach it to the vfio compat */
+	vfio_ioas_cmd.ioas_id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+FIXTURE_TEARDOWN(vfio_compat_mock_domain) {
+	ASSERT_EQ(0, close(self->fd));
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	check_refs(buffer, BUFFER_SIZE, 0);
+	ASSERT_EQ(0, close(self->fd));
+}
+
+TEST_F(vfio_compat_mock_domain, simple_close)
+{
+}
+
+static bool is_filled(const void *buf, uint8_t c, size_t len)
+{
+	const uint8_t *cbuf = buf;
+
+	for (; len; cbuf++, len--)
+		if (*cbuf != c)
+			return false;
+	return true;
+}
+
+/*
+ * Execute an ioctl command stored in buffer and check that the result does not
+ * overflow memory.
+ */
+#define ioctl_check_buf(fd, cmd)                                               \
+	({                                                                     \
+		size_t _cmd_len = *(__u32 *)buffer;                            \
+									       \
+		memset(buffer + _cmd_len, 0xAA, BUFFER_SIZE - _cmd_len);       \
+		ASSERT_EQ(0, ioctl(fd, cmd, buffer));                          \
+		ASSERT_EQ(true, is_filled(buffer + _cmd_len, 0xAA,             \
+					  BUFFER_SIZE - _cmd_len));            \
+	})
+
+static void check_vfio_info_cap_chain(struct __test_metadata *_metadata,
+				      struct vfio_iommu_type1_info *info_cmd)
+{
+	const struct vfio_info_cap_header *cap;
+
+	ASSERT_GE(info_cmd->argsz, info_cmd->cap_offset + sizeof(*cap));
+	cap = buffer + info_cmd->cap_offset;
+	while (true) {
+		size_t cap_size;
+
+		if (cap->next)
+			cap_size = (buffer + cap->next) - (void *)cap;
+		else
+			cap_size = (buffer + info_cmd->argsz) - (void *)cap;
+
+		switch (cap->id) {
+		case VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE: {
+			struct vfio_iommu_type1_info_cap_iova_range *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(1, data->nr_iovas);
+			EXPECT_EQ(MOCK_APERTURE_START,
+				  data->iova_ranges[0].start);
+			EXPECT_EQ(MOCK_APERTURE_LAST,
+				  data->iova_ranges[0].end);
+			break;
+		}
+		case VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL: {
+			struct vfio_iommu_type1_info_dma_avail *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(sizeof(*data), cap_size);
+			break;
+		}
+		default:
+			ASSERT_EQ(false, true);
+			break;
+		}
+		if (!cap->next)
+			break;
+
+		ASSERT_GE(info_cmd->argsz, cap->next + sizeof(*cap));
+		ASSERT_GE(buffer + cap->next, (void *)cap);
+		cap = buffer + cap->next;
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, get_info)
+{
+	struct vfio_iommu_type1_info *info_cmd = buffer;
+	unsigned int i;
+	size_t caplen;
+
+	/* Pre-cap ABI */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = offsetof(struct vfio_iommu_type1_info, cap_offset),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+
+	/* Read the cap chain size */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = sizeof(*info_cmd),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+	ASSERT_EQ(0, info_cmd->cap_offset);
+	ASSERT_LT(sizeof(*info_cmd), info_cmd->argsz);
+
+	/* Read the caps, kernel should never create a corrupted caps */
+	caplen = info_cmd->argsz;
+	for (i = sizeof(*info_cmd); i < caplen; i++) {
+		*info_cmd = (struct vfio_iommu_type1_info){
+			.argsz = i,
+		};
+		ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+		ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+			  info_cmd->flags);
+		if (!info_cmd->cap_offset)
+			continue;
+		check_vfio_info_cap_chain(_metadata, info_cmd);
+	}
+}
+
+/* FIXME use fault injection to test memory failure paths */
+/* FIXME test VFIO_IOMMU_MAP_DMA */
+/* FIXME test VFIO_IOMMU_UNMAP_DMA */
+/* FIXME test 2k iova alignment */
+
+TEST_HARNESS_MAIN
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* [PATCH RFC 12/12] iommufd: Add a selftest
@ 2022-03-18 17:27   ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-18 17:27 UTC (permalink / raw)
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

Cover the essential functionality of iommufd with a directed test. This
aims to achieve reasonable functional coverage using the kselftest
framework.

It provides a mock iommu_domain that allows the test to run without any
HW, and the mock provides a way to directly validate that the PFNs
loaded into the iommu_domain are correct.
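
A minimal sketch of how a test drives the mock (assuming an open iommufd
fd, an IOAS that already has 'buf' mapped at 'iova', and the same headers
the selftest includes; error handling omitted, see the full selftest for
the real thing):

	static void check_mock_domain(int fd, __u32 ioas_id, __u64 iova,
				      __u64 length, void *buf)
	{
		struct iommu_test_cmd mock_cmd = {
			.size = sizeof(mock_cmd),
			.op = IOMMU_TEST_OP_MOCK_DOMAIN,
			.id = ioas_id,
		};
		struct iommu_test_cmd check_map_cmd = {
			.size = sizeof(check_map_cmd),
			.op = IOMMU_TEST_OP_MD_CHECK_MAP,
			.check_map = { .iova = iova, .length = length,
				       .uptr = (uintptr_t)buf },
		};

		/* Create a hw_pagetable backed by the mock domain */
		ioctl(fd, IOMMU_TEST_CMD, &mock_cmd);
		/* Have the kernel verify the PFNs under [iova, iova + length) */
		check_map_cmd.id = mock_cmd.id;
		ioctl(fd, IOMMU_TEST_CMD, &check_map_cmd);
	}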

The mock also simulates the rare case of PAGE_SIZE > IOMMU page size, as
the mock typically operates at a 2K IOMMU page size. This exercises all
of the calculations needed to support the mismatch.
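
For example, with 4K CPU pages the mock's IO page size is PAGE_SIZE / 2 =
2048, so pinning a single CPU page has to be translated into two IOPTEs.
A worked example of the encoding the mock stores in its pfn xarray
(constants from selftest.c; the iova/paddr values are only illustrative):

	Mapping one CPU page at iova 0x1000000 with paddr 0x40000000 stores:
	  index 0x1000000 / 2048 = 0x2000: (0x40000000 / 2048) | MOCK_PFN_START_IOVA
	  index 0x1000800 / 2048 = 0x2001: (0x40000800 / 2048) | MOCK_PFN_LAST_IOVA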

This achieves high coverage of the corner cases in the iopt_pages code.

However, enabling all of this requires an unusually invasive config
option, which should never be enabled in a production kernel.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Kconfig            |    9 +
 drivers/iommu/iommufd/Makefile           |    2 +
 drivers/iommu/iommufd/iommufd_private.h  |    9 +
 drivers/iommu/iommufd/iommufd_test.h     |   65 ++
 drivers/iommu/iommufd/main.c             |   12 +
 drivers/iommu/iommufd/pages.c            |    4 +
 drivers/iommu/iommufd/selftest.c         |  495 +++++++++
 tools/testing/selftests/Makefile         |    1 +
 tools/testing/selftests/iommu/.gitignore |    2 +
 tools/testing/selftests/iommu/Makefile   |   11 +
 tools/testing/selftests/iommu/config     |    2 +
 tools/testing/selftests/iommu/iommufd.c  | 1225 ++++++++++++++++++++++
 12 files changed, 1837 insertions(+)
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index fddd453bb0e764..9b41fde7c839c5 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -11,3 +11,12 @@ config IOMMUFD
 	  This would commonly be used in combination with VFIO.
 
 	  If you don't know what to do here, say N.
+
+config IOMMUFD_TEST
+	bool "IOMMU Userspace API Test support"
+	depends on IOMMUFD
+	depends on RUNTIME_TESTING_MENU
+	default n
+	help
+	  This is dangerous; do not enable it unless you are running
+	  tools/testing/selftests/iommu
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2fdff04000b326..8aeba81800c512 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,4 +8,6 @@ iommufd-y := \
 	pages.o \
 	vfio_compat.o
 
+iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
+
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 31628591591c17..6f11470c8ea677 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -102,6 +102,9 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_DEVICE,
+#ifdef CONFIG_IOMMUFD_TEST
+	IOMMUFD_OBJ_SELFTEST,
+#endif
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_MAX,
@@ -219,4 +222,10 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
 void iommufd_device_destroy(struct iommufd_object *obj);
 
+#ifdef CONFIG_IOMMUFD_TEST
+int iommufd_test(struct iommufd_ucmd *ucmd);
+void iommufd_selftest_destroy(struct iommufd_object *obj);
+extern size_t iommufd_test_memory_limit;
+#endif
+
 #endif
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
new file mode 100644
index 00000000000000..d22ef484af1a90
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_TEST_H
+#define _UAPI_IOMMUFD_TEST_H
+
+#include <linux/types.h>
+#include <linux/iommufd.h>
+
+enum {
+	IOMMU_TEST_OP_ADD_RESERVED,
+	IOMMU_TEST_OP_MOCK_DOMAIN,
+	IOMMU_TEST_OP_MD_CHECK_MAP,
+	IOMMU_TEST_OP_MD_CHECK_REFS,
+	IOMMU_TEST_OP_ACCESS_PAGES,
+	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+};
+
+enum {
+	MOCK_APERTURE_START = 1UL << 24,
+	MOCK_APERTURE_LAST = (1UL << 31) - 1,
+};
+
+enum {
+	MOCK_FLAGS_ACCESS_WRITE = 1 << 0,
+};
+
+struct iommu_test_cmd {
+	__u32 size;
+	__u32 op;
+	__u32 id;
+	union {
+		struct {
+			__u32 device_id;
+		} mock_domain;
+		struct {
+			__aligned_u64 start;
+			__aligned_u64 length;
+		} add_reserved;
+		struct {
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} check_map;
+		struct {
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+			__u32 refs;
+		} check_refs;
+		struct {
+			__u32 flags;
+			__u32 out_access_id;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} access_pages;
+		struct {
+			__u32 limit;
+		} memory_limit;
+	};
+	__u32 last;
+};
+#define IOMMU_TEST_CMD _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE + 32)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index f746fcff8145cb..8c820bb90caa72 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -24,6 +24,7 @@
 #include <uapi/linux/iommufd.h>
 
 #include "iommufd_private.h"
+#include "iommufd_test.h"
 
 struct iommufd_object_ops {
 	void (*destroy)(struct iommufd_object *obj);
@@ -191,6 +192,9 @@ union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_destroy destroy;
+#ifdef CONFIG_IOMMUFD_TEST
+	struct iommu_test_cmd test;
+#endif
 };
 
 struct iommufd_ioctl_op {
@@ -223,6 +227,9 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+#ifdef CONFIG_IOMMUFD_TEST
+	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
+#endif
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -299,6 +306,11 @@ static struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_HW_PAGETABLE] = {
 		.destroy = iommufd_hw_pagetable_destroy,
 	},
+#ifdef CONFIG_IOMMUFD_TEST
+	[IOMMUFD_OBJ_SELFTEST] = {
+		.destroy = iommufd_selftest_destroy,
+	},
+#endif
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 8e6a8cc8b20ad1..3fd39e0201f542 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -48,7 +48,11 @@
 
 #include "io_pagetable.h"
 
+#ifndef CONFIG_IOMMUFD_TEST
 #define TEMP_MEMORY_LIMIT 65536
+#else
+#define TEMP_MEMORY_LIMIT iommufd_test_memory_limit
+#endif
 #define BATCH_BACKUP_SIZE 32
 
 /*
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
new file mode 100644
index 00000000000000..a665719b493ec5
--- /dev/null
+++ b/drivers/iommu/iommufd/selftest.c
@@ -0,0 +1,495 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * Kernel side components to support tools/testing/selftests/iommu
+ */
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+#include "iommufd_test.h"
+
+size_t iommufd_test_memory_limit = 65536;
+
+enum {
+	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
+
+	/*
+	 * Like a real page table, alignment requires the low bits of the
+	 * address to be zero. The xarray also requires the high bit to be zero,
+	 * so the pfns are stored shifted. The upper bits are used for metadata.
+	 */
+	MOCK_PFN_MASK = ULONG_MAX / MOCK_IO_PAGE_SIZE,
+
+	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
+	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+};
+
+struct mock_iommu_domain {
+	struct iommu_domain domain;
+	struct xarray pfns;
+};
+
+enum selftest_obj_type {
+	TYPE_ACCESS,
+	TYPE_IDEV,
+};
+
+struct selftest_obj {
+	struct iommufd_object obj;
+	enum selftest_obj_type type;
+
+	union {
+		struct {
+			struct iommufd_ioas *ioas;
+			unsigned long iova;
+			size_t length;
+		} access;
+		struct {
+			struct iommufd_hw_pagetable *hwpt;
+			struct iommufd_ctx *ictx;
+		} idev;
+	};
+};
+
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+{
+	struct mock_iommu_domain *mock;
+
+	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
+		return NULL;
+
+	mock = kzalloc(sizeof(*mock), GFP_KERNEL);
+	if (!mock)
+		return NULL;
+	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
+	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
+	xa_init(&mock->pfns);
+	return &mock->domain;
+}
+
+static void mock_domain_free(struct iommu_domain *domain)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+
+	WARN_ON(!xa_empty(&mock->pfns));
+	kfree(mock);
+}
+
+static int mock_domain_map_pages(struct iommu_domain *domain,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t pgsize, size_t pgcount, int prot,
+				 gfp_t gfp, size_t *mapped)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = MOCK_PFN_START_IOVA;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
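+	/*
+	 * The first IO page of each mapping is tagged with MOCK_PFN_START_IOVA
+	 * and the last with MOCK_PFN_LAST_IOVA so that unmap_pages can check
+	 * that unmaps begin and end on boundaries that were passed to map_pages.
+	 */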
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			void *old;
+
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				flags = MOCK_PFN_LAST_IOVA;
+			old = xa_store(&mock->pfns, iova / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value((paddr / MOCK_IO_PAGE_SIZE) | flags),
+				       GFP_KERNEL);
+			if (xa_is_err(old))
+				return xa_err(old);
+			WARN_ON(old);
+			iova += MOCK_IO_PAGE_SIZE;
+			paddr += MOCK_IO_PAGE_SIZE;
+			*mapped += MOCK_IO_PAGE_SIZE;
+			flags = 0;
+		}
+	}
+	return 0;
+}
+
+static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
+				      unsigned long iova, size_t pgsize,
+				      size_t pgcount,
+				      struct iommu_iotlb_gather *iotlb_gather)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	bool first = true;
+	size_t ret = 0;
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+			WARN_ON(!ent);
+			/*
+			 * iommufd generates unmaps that must be a strict
+			 * superset of the maps performed, so every starting
+			 * IOVA should have been an IOVA passed to map.
+			 *
+			 * The first IOVA must be present and must have been
+			 * the first IOVA of an earlier map_pages call.
+			 */
+			if (first) {
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_START_IOVA));
+				first = false;
+			}
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_LAST_IOVA));
+
+			iova += MOCK_IO_PAGE_SIZE;
+			ret += MOCK_IO_PAGE_SIZE;
+		}
+	}
+	return ret;
+}
+
+static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
+					    dma_addr_t iova)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	WARN_ON(!ent);
+	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
+}
+
+static const struct iommu_ops mock_ops = {
+	.owner = THIS_MODULE,
+	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
+	.domain_alloc = mock_domain_alloc,
+	.default_domain_ops =
+		&(struct iommu_domain_ops){
+			.free = mock_domain_free,
+			.map_pages = mock_domain_map_pages,
+			.unmap_pages = mock_domain_unmap_pages,
+			.iova_to_phys = mock_domain_iova_to_phys,
+		},
+};
+
+static inline struct iommufd_hw_pagetable *
+get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id,
+		 struct mock_iommu_domain **mock)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, mockpt_id,
+				 IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+	hwpt = container_of(obj, struct iommufd_hw_pagetable, obj);
+	if (hwpt->domain->ops != mock_ops.default_domain_ops) {
+		iommufd_put_object(&hwpt->obj);
+		return ERR_PTR(-EINVAL);
+	}
+	*mock = container_of(hwpt->domain, struct mock_iommu_domain, domain);
+	return hwpt;
+}
+
+/* Create an hw_pagetable with the mock domain so we can test the domain ops */
+static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
+				    struct iommu_test_cmd *cmd)
+{
+	struct bus_type mock_bus = { .iommu_ops = &mock_ops };
+	struct device mock_dev = { .bus = &mock_bus };
+	struct iommufd_hw_pagetable *hwpt;
+	struct selftest_obj *sobj;
+	int rc;
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
+	sobj->idev.ictx = ucmd->ictx;
+	sobj->type = TYPE_IDEV;
+
+	hwpt = iommufd_hw_pagetable_from_id(ucmd->ictx, cmd->id, &mock_dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_sobj;
+	}
+	if (WARN_ON(refcount_read(&hwpt->obj.users) != 2)) {
+		rc = -EINVAL;
+		goto out_hwpt;
+	}
+	sobj->idev.hwpt = hwpt;
+
+	/* Creating a real iommufd_device is too hard, fake one */
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		goto out_hwpt;
+
+	/* Convert auto domain to user created */
+	list_del_init(&hwpt->auto_domains_item);
+	cmd->id = hwpt->obj.id;
+	cmd->mock_domain.device_id = sobj->obj.id;
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_hwpt:
+	iommufd_hw_pagetable_put(ucmd->ictx, hwpt);
+out_sobj:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+	return rc;
+}
+
+/* Add an additional reserved IOVA to the IOAS */
+static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd,
+				     unsigned int mockpt_id,
+				     unsigned long start, size_t length)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, mockpt_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	down_write(&ioas->iopt.iova_rwsem);
+	rc = iopt_reserve_iova(&ioas->iopt, start, start + length - 1, NULL);
+	up_write(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Check that every pfn under each iova matches the pfn under a user VA */
+static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd,
+				    unsigned int mockpt_id, unsigned long iova,
+				    size_t length, void __user *uptr)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc;
+
+	if (iova % MOCK_IO_PAGE_SIZE || length % MOCK_IO_PAGE_SIZE ||
+	    (uintptr_t)uptr % MOCK_IO_PAGE_SIZE)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	for (; length; length -= MOCK_IO_PAGE_SIZE) {
+		struct page *pages[1];
+		unsigned long pfn;
+		long npages;
+		void *ent;
+
+		npages = get_user_pages_fast((uintptr_t)uptr & PAGE_MASK, 1, 0,
+					     pages);
+		if (npages < 0) {
+			rc = npages;
+			goto out_put;
+		}
+		if (WARN_ON(npages != 1)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		pfn = page_to_pfn(pages[0]);
+		put_page(pages[0]);
+
+		ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+		if (!ent ||
+		    (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE !=
+			    pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {
+			rc = -EINVAL;
+			goto out_put;
+		}
+		iova += MOCK_IO_PAGE_SIZE;
+		uptr += MOCK_IO_PAGE_SIZE;
+	}
+	rc = 0;
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+/* Check that the page ref count matches, to look for missing pin/unpins */
+static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
+				      void __user *uptr, size_t length,
+				      unsigned int refs)
+{
+	if (length % PAGE_SIZE || (uintptr_t)uptr % PAGE_SIZE)
+		return -EINVAL;
+
+	for (; length; length -= PAGE_SIZE) {
+		struct page *pages[1];
+		long npages;
+
+		npages = get_user_pages_fast((uintptr_t)uptr, 1, 0, pages);
+		if (npages < 0)
+			return npages;
+		if (WARN_ON(npages != 1))
+			return -EFAULT;
+		if (!PageCompound(pages[0])) {
+			unsigned int count;
+
+			count = page_ref_count(pages[0]);
+			if (count / GUP_PIN_COUNTING_BIAS != refs) {
+				put_page(pages[0]);
+				return -EIO;
+			}
+		}
+		put_page(pages[0]);
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+/* Check that the pages in a page array match the pages in the user VA */
+static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
+				    size_t npages)
+{
+	for (; npages; npages--) {
+		struct page *tmp_pages[1];
+		long rc;
+
+		rc = get_user_pages_fast((uintptr_t)uptr, 1, 0, tmp_pages);
+		if (rc < 0)
+			return rc;
+		if (WARN_ON(rc != 1))
+			return -EFAULT;
+		put_page(tmp_pages[0]);
+		if (tmp_pages[0] != *pages)
+			return -EBADE;
+		pages++;
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+/* Test iopt_access_pages() by checking it returns the correct pages */
+static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
+				     unsigned int ioas_id, unsigned long iova,
+				     size_t length, void __user *uptr,
+				     u32 flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_obj *sobj;
+	struct page **pages;
+	size_t npages;
+	int rc;
+
+	if (flags & ~MOCK_FLAGS_ACCESS_WRITE)
+		return -EOPNOTSUPP;
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj))
+		return PTR_ERR(sobj);
+	sobj->type = TYPE_ACCESS;
+
+	npages = (ALIGN(iova + length, PAGE_SIZE) -
+		  ALIGN_DOWN(iova, PAGE_SIZE)) /
+		 PAGE_SIZE;
+	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages) {
+		rc = -ENOMEM;
+		goto out_abort;
+	}
+
+	sobj->access.ioas = iommufd_get_ioas(ucmd, ioas_id);
+	if (IS_ERR(sobj->access.ioas)) {
+		rc = PTR_ERR(sobj->access.ioas);
+		goto out_free;
+	}
+
+	sobj->access.iova = iova;
+	sobj->access.length = length;
+	rc = iopt_access_pages(&sobj->access.ioas->iopt, iova, length, pages,
+			       flags & MOCK_FLAGS_ACCESS_WRITE);
+	if (rc)
+		goto out_put;
+
+	rc = iommufd_test_check_pages(
+		uptr - (iova - ALIGN_DOWN(iova, PAGE_SIZE)), pages, npages);
+	if (rc)
+		goto out_unaccess;
+
+	cmd->access_pages.out_access_id = sobj->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_unaccess;
+
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	iommufd_put_object_keep_user(&sobj->access.ioas->obj);
+	kvfree(pages);
+	return 0;
+out_unaccess:
+	iopt_unaccess_pages(&sobj->access.ioas->iopt, iova, length);
+out_put:
+	iommufd_put_object(&sobj->access.ioas->obj);
+out_free:
+	kvfree(pages);
+out_abort:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+	return rc;
+}
+
+void iommufd_selftest_destroy(struct iommufd_object *obj)
+{
+	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
+
+	switch (sobj->type) {
+	case TYPE_IDEV:
+		iopt_table_remove_domain(&sobj->idev.hwpt->ioas->iopt,
+					 sobj->idev.hwpt->domain);
+		iommufd_hw_pagetable_put(sobj->idev.ictx, sobj->idev.hwpt);
+		break;
+	case TYPE_ACCESS:
+		iopt_unaccess_pages(&sobj->access.ioas->iopt,
+				    sobj->access.iova, sobj->access.length);
+		refcount_dec(&sobj->access.ioas->obj.users);
+		break;
+	}
+}
+
+int iommufd_test(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+
+	switch (cmd->op) {
+	case IOMMU_TEST_OP_ADD_RESERVED:
+		return iommufd_test_add_reserved(ucmd, cmd->id,
+						 cmd->add_reserved.start,
+						 cmd->add_reserved.length);
+	case IOMMU_TEST_OP_MOCK_DOMAIN:
+		return iommufd_test_mock_domain(ucmd, cmd);
+	case IOMMU_TEST_OP_MD_CHECK_MAP:
+		return iommufd_test_md_check_pa(
+			ucmd, cmd->id, cmd->check_map.iova,
+			cmd->check_map.length,
+			u64_to_user_ptr(cmd->check_map.uptr));
+	case IOMMU_TEST_OP_MD_CHECK_REFS:
+		return iommufd_test_md_check_refs(
+			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
+			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_ACCESS_PAGES:
+		return iommufd_test_access_pages(
+			ucmd, cmd->id, cmd->access_pages.iova,
+			cmd->access_pages.length,
+			u64_to_user_ptr(cmd->access_pages.uptr),
+			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
+		iommufd_test_memory_limit = cmd->memory_limit.limit;
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d08fe4cfe81152..5533a3b2e8af51 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -21,6 +21,7 @@ TARGETS += ftrace
 TARGETS += futex
 TARGETS += gpio
 TARGETS += intel_pstate
+TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
 TARGETS += kcmp
diff --git a/tools/testing/selftests/iommu/.gitignore b/tools/testing/selftests/iommu/.gitignore
new file mode 100644
index 00000000000000..c6bd07e7ff59b3
--- /dev/null
+++ b/tools/testing/selftests/iommu/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/iommufd
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
new file mode 100644
index 00000000000000..7bc38b3beaeb20
--- /dev/null
+++ b/tools/testing/selftests/iommu/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+CFLAGS += -D_GNU_SOURCE
+
+TEST_GEN_PROGS :=
+TEST_GEN_PROGS += iommufd
+
+include ../lib.mk
diff --git a/tools/testing/selftests/iommu/config b/tools/testing/selftests/iommu/config
new file mode 100644
index 00000000000000..6c4f901d6fed3c
--- /dev/null
+++ b/tools/testing/selftests/iommu/config
@@ -0,0 +1,2 @@
+CONFIG_IOMMUFD
+CONFIG_IOMMUFD_TEST
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
new file mode 100644
index 00000000000000..5c47d706ed9449
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -0,0 +1,1225 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/fcntl.h>
+#include <sys/ioctl.h>
+#include <assert.h>
+#include <stddef.h>
+
+#include "../kselftest_harness.h"
+
+#define __EXPORTED_HEADERS__
+#include <linux/iommufd.h>
+#include <linux/vfio.h>
+#include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+
+static void *buffer;
+
+static unsigned long PAGE_SIZE;
+static unsigned long HUGEPAGE_SIZE;
+static unsigned long BUFFER_SIZE;
+
+#define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
+
+static unsigned long get_huge_page_size(void)
+{
+	char buf[80];
+	int ret;
+	int fd;
+
+	fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
+		  O_RDONLY);
+	if (fd < 0)
+		return 2 * 1024 * 1024;
+
+	ret = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0 || ret == sizeof(buf))
+		return 2 * 1024 * 1024;
+	buf[ret] = 0;
+	return strtoul(buf, NULL, 10);
+}
+
+static __attribute__((constructor)) void setup_sizes(void)
+{
+	int rc;
+
+	PAGE_SIZE = sysconf(_SC_PAGE_SIZE);
+	HUGEPAGE_SIZE = get_huge_page_size();
+
+	BUFFER_SIZE = PAGE_SIZE * 16;
+	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
+	assert(!rc && buffer && (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+}
+
+/*
+ * Have the kernel check the refcount on pages. I don't know why a freshly
+ * mmap'd anon non-compound page starts out with a ref of 3
+ */
+#define check_refs(_ptr, _length, _refs)                                       \
+	({                                                                     \
+		struct iommu_test_cmd test_cmd = {                             \
+			.size = sizeof(test_cmd),                              \
+			.op = IOMMU_TEST_OP_MD_CHECK_REFS,                     \
+			.check_refs = { .length = _length,                     \
+					.uptr = (uintptr_t)(_ptr),             \
+					.refs = _refs },                       \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),  \
+				&test_cmd));                                   \
+	})
+
+/* Hack to make assertions more readable */
+#define _IOMMU_TEST_CMD(x) IOMMU_TEST_CMD
+
+#define EXPECT_ERRNO(expected_errno, cmd)                                      \
+	({                                                                     \
+		ASSERT_EQ(-1, cmd);                                            \
+		EXPECT_EQ(expected_errno, errno);                              \
+	})
+
+FIXTURE(iommufd) {
+	int fd;
+};
+
+FIXTURE_SETUP(iommufd) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(iommufd) {
+	ASSERT_EQ(0, close(self->fd));
+}
+
+TEST_F(iommufd, simple_close)
+{
+}
+
+TEST_F(iommufd, cmd_fail)
+{
+	struct iommu_destroy cmd = { .size = sizeof(cmd), .id = 0 };
+
+	/* object id is invalid */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Bad pointer */
+	EXPECT_ERRNO(EFAULT, ioctl(self->fd, IOMMU_DESTROY, NULL));
+	/* Unknown ioctl */
+	EXPECT_ERRNO(ENOTTY,
+		     ioctl(self->fd, _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE - 1),
+			   &cmd));
+}
+
+TEST_F(iommufd, cmd_ex_fail)
+{
+	struct {
+		struct iommu_destroy cmd;
+		__u64 future;
+	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
+
+	/* object id is invalid and command is longer */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* future area is non-zero */
+	cmd.future = 1;
+	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Original command "works" */
+	cmd.cmd.size = sizeof(cmd.cmd);
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Short command fails */
+	cmd.cmd.size = sizeof(cmd.cmd) - 1;
+	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+}
+
+FIXTURE(iommufd_ioas) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint64_t base_iova;
+};
+
+FIXTURE_VARIANT(iommufd_ioas) {
+	unsigned int mock_domains;
+	unsigned int memory_limit;
+};
+
+FIXTURE_SETUP(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = variant->memory_limit},
+	};
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	unsigned int i;
+
+	if (!variant->memory_limit)
+		memlimit_cmd.memory_limit.limit = 65536;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	self->ioas_id = alloc_cmd.out_ioas_id;
+
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		struct iommu_test_cmd test_cmd = {
+			.size = sizeof(test_cmd),
+			.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+			.id = self->ioas_id,
+		};
+
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_id = test_cmd.id;
+		self->base_iova = MOCK_APERTURE_START;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = 65536},
+	};
+
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+	ASSERT_EQ(0, close(self->fd));
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	check_refs(buffer, BUFFER_SIZE, 0);
+	ASSERT_EQ(0, close(self->fd));
+}
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, no_domain) {
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain) {
+	.mock_domains = 1,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain) {
+	.mock_domains = 2,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_limit) {
+	.mock_domains = 1,
+	.memory_limit = 16,
+};
+
+TEST_F(iommufd_ioas, ioas_auto_destroy)
+{
+}
+
+TEST_F(iommufd_ioas, ioas_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+
+	if (self->domain_id) {
+		/* IOAS cannot be freed while a domain is on it */
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	} else {
+		/* Can allocate and manually free an IOAS table */
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, ioas_area_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+		.iova = self->base_iova,
+	};
+
+	/* Adding an area does not change ability to destroy */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	if (self->domain_id)
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	else
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
+TEST_F(iommufd_ioas, ioas_area_auto_destroy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Can allocate and automatically free an IOAS table with many areas */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Unmap fails if nothing is mapped */
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = i * PAGE_SIZE;
+		EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+	}
+
+	/* Unmap works */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Split fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	unmap_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	unmap_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Over map fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 3;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* unmap all works */
+	unmap_cmd.iova = 0;
+	unmap_cmd.length = UINT64_MAX;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+}
+
+TEST_F(iommufd_ioas, area_auto_iova)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	uint64_t iovas[10];
+	int i;
+
+	/* Simple 4k pages */
+	for (i = 0; i != 10; i++) {
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Kernel automatically aligns IOVAs properly */
+	if (self->domain_id)
+		map_cmd.user_va = (uintptr_t)buffer;
+	else
+		map_cmd.user_va = 1UL << 31;
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Avoids a reserved region */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+		EXPECT_EQ(false,
+			  map_cmd.iova > test_cmd.add_reserved.start &&
+				  map_cmd.iova <
+					  test_cmd.add_reserved.start +
+						  test_cmd.add_reserved.length);
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, copy_area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+
+	map_cmd.iova = self->base_iova;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* Copy inside a single IOAS */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = self->base_iova + PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+
+	/* Copy between IOAS's */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = 0;
+	copy_cmd.dst_ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+}
+
+TEST_F(iommufd_ioas, iova_ranges)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE },
+	};
+	struct iommu_ioas_iova_ranges *cmd = (void *)buffer;
+
+	*cmd = (struct iommu_ioas_iova_ranges){
+		.size = BUFFER_SIZE,
+		.ioas_id = self->ioas_id,
+	};
+
+	/* Range can be read */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	if (!self->domain_id) {
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[0].last);
+	} else {
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd);
+	EXPECT_ERRNO(EMSGSIZE,
+		     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].last);
+
+	/* 2 ranges */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	cmd->size = BUFFER_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	if (!self->domain_id) {
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+		EXPECT_EQ(PAGE_SIZE * 2, cmd->out_valid_iovas[1].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[1].last);
+	} else {
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd) + sizeof(cmd->out_valid_iovas[0]);
+	if (!self->domain_id) {
+		EXPECT_ERRNO(EMSGSIZE,
+			     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+	} else {
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
+}
+
+TEST_F(iommufd_ioas, access)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.id = self->ioas_id,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_test_cmd check_map_cmd = {
+		.size = sizeof(check_map_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_MAP,
+		.check_map = { .iova = MOCK_APERTURE_START,
+			       .length = BUFFER_SIZE,
+			       .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+	uint32_t id;
+
+	/* Single map/unmap */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+
+	/* Double user */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+
+	/* Add/remove a domain with a user */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	check_map_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),
+			&check_map_cmd));
+	destroy_cmd.id = mock_cmd.mock_domain.device_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
+FIXTURE(iommufd_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint32_t domain_ids[2];
+	int mmap_flags;
+	size_t mmap_buf_size;
+};
+
+FIXTURE_VARIANT(iommufd_mock_domain) {
+	unsigned int mock_domains;
+	bool hugepages;
+};
+
+FIXTURE_SETUP(iommufd_mock_domain)
+{
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	unsigned int i;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	self->ioas_id = alloc_cmd.out_ioas_id;
+
+	ASSERT_GE(ARRAY_SIZE(self->domain_ids), variant->mock_domains);
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		test_cmd.id = self->ioas_id;
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_ids[i] = test_cmd.id;
+	}
+	self->domain_id = self->domain_ids[0];
+
+	self->mmap_flags = MAP_SHARED | MAP_ANONYMOUS;
+	self->mmap_buf_size = PAGE_SIZE * 8;
+	if (variant->hugepages) {
+		/*
+		 * MAP_POPULATE will cause the kernel to fail mmap if THPs are
+		 * not available.
+		 */
+		self->mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
+		self->mmap_buf_size = HUGEPAGE_SIZE * 2;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_mock_domain) {
+	ASSERT_EQ(0, close(self->fd));
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	check_refs(buffer, BUFFER_SIZE, 0);
+	ASSERT_EQ(0, close(self->fd));
+}
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain) {
+	.mock_domains = 1,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains) {
+	.mock_domains = 2,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_hugepage) {
+	.mock_domains = 1,
+	.hugepages = true,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains_hugepage) {
+	.mock_domains = 2,
+	.hugepages = true,
+};
+
+/* Have the kernel check that the user pages made it to the iommu_domain */
+#define check_mock_iova(_ptr, _iova, _length)                                  \
+	({                                                                     \
+		struct iommu_test_cmd check_map_cmd = {                        \
+			.size = sizeof(check_map_cmd),                         \
+			.op = IOMMU_TEST_OP_MD_CHECK_MAP,                      \
+			.id = self->domain_id,                                 \
+			.check_map = { .iova = _iova,                          \
+				       .length = _length,                      \
+				       .uptr = (uintptr_t)(_ptr) },            \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),   \
+				&check_map_cmd));                              \
+		if (self->domain_ids[1]) {                                     \
+			check_map_cmd.id = self->domain_ids[1];                \
+			ASSERT_EQ(0,                                           \
+				  ioctl(self->fd,                              \
+					_IOMMU_TEST_CMD(                       \
+						IOMMU_TEST_OP_MD_CHECK_MAP),   \
+					&check_map_cmd));                      \
+		}                                                              \
+	})
+
+TEST_F(iommufd_mock_domain, basic)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t buf_size = self->mmap_buf_size;
+	uint8_t *buf;
+
+	/* Simple one page map */
+	map_cmd.user_va = (uintptr_t)buffer;
+	map_cmd.length = PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	check_mock_iova(buffer, map_cmd.iova, map_cmd.length);
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+
+	/* EFAULT half way through mapping */
+	ASSERT_EQ(0, munmap(buf + buf_size / 2, buf_size / 2));
+	map_cmd.user_va = (uintptr_t)buf;
+	map_cmd.length = buf_size;
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* EFAULT on first page */
+	ASSERT_EQ(0, munmap(buf, buf_size / 2));
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t test_step =
+		variant->hugepages ? (self->mmap_buf_size / 16) : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big
+	 * region, with a coarser step for the hugepage case as it takes so
+	 * long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns_copy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_test_cmd add_mock_pt = {
+		.size = sizeof(add_mock_pt),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+	};
+	size_t test_step =
+		variant->hugepages ? self->mmap_buf_size / 16 : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big
+	 * region, with a coarser step for the hugepage case as it takes so
+	 * long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			unsigned int old_id;
+
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+
+			/* Add and destroy a domain while the area exists */
+			add_mock_pt.id = self->ioas_id;
+			ASSERT_EQ(0, ioctl(self->fd,
+					   _IOMMU_TEST_CMD(
+						   IOMMU_TEST_OP_MOCK_DOMAIN),
+					   &add_mock_pt));
+			old_id = self->domain_ids[1];
+			self->domain_ids[1] = add_mock_pt.id;
+
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			destroy_cmd.id = add_mock_pt.mock_domain.device_id;
+			ASSERT_EQ(0,
+				  ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+			destroy_cmd.id = add_mock_pt.id;
+			ASSERT_EQ(0,
+				  ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+			self->domain_ids[1] = old_id;
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, user_copy)
+{
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.id = self->ioas_id,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_iova = MOCK_APERTURE_START,
+		.dst_iova = MOCK_APERTURE_START,
+		.length = BUFFER_SIZE,
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+
+	/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	map_cmd.ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	access_cmd.id = alloc_cmd.out_ioas_id;
+
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	copy_cmd.src_ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+	check_mock_iova(buffer, map_cmd.iova, BUFFER_SIZE);
+
+	destroy_cmd.id = access_cmd.access_pages.out_access_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
+FIXTURE(vfio_compat_nodev) {
+	int fd;
+};
+
+FIXTURE_SETUP(vfio_compat_nodev) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(vfio_compat_nodev) {
+	ASSERT_EQ(0, close(self->fd));
+}
+
+TEST_F(vfio_compat_nodev, simple_ioctls)
+{
+	ASSERT_EQ(VFIO_API_VERSION, ioctl(self->fd, VFIO_GET_API_VERSION));
+	ASSERT_EQ(1, ioctl(self->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, unmap_cmd)
+{
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+	};
+
+	unmap_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.argsz = sizeof(unmap_cmd);
+	unmap_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.flags = 0;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+}
+
+TEST_F(vfio_compat_nodev, map_cmd)
+{
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+		.vaddr = (__u64)buffer,
+	};
+
+	map_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	map_cmd.argsz = sizeof(map_cmd);
+	map_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	/* Requires a domain to be attached */
+	map_cmd.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+}
+
+TEST_F(vfio_compat_nodev, info_cmd)
+{
+	struct vfio_iommu_type1_info info_cmd = {};
+
+	/* Invalid argsz */
+	info_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+
+	info_cmd.argsz = sizeof(info_cmd);
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+}
+
+TEST_F(vfio_compat_nodev, set_iommu_cmd)
+{
+	/* Requires a domain to be attached */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, vfio_ioas)
+{
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_GET,
+	};
+
+	/* ENODEV if there is no compat ioas */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Invalid id for set */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_SET;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Valid id for set */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	vfio_ioas_cmd.ioas_id = alloc_cmd.out_ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Same id comes back from get */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(alloc_cmd.out_ioas_id, vfio_ioas_cmd.ioas_id);
+
+	/* Clear works */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_CLEAR;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+}
+
+FIXTURE(vfio_compat_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+};
+
+FIXTURE_SETUP(vfio_compat_mock_domain) {
+	struct iommu_ioas_alloc alloc_cmd = {
+		.size = sizeof(alloc_cmd),
+	};
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_SET,
+	};
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	/* Create what VFIO would consider a group */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOC, &alloc_cmd));
+	ASSERT_NE(0, alloc_cmd.out_ioas_id);
+	self->ioas_id = alloc_cmd.out_ioas_id;
+	test_cmd.id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &test_cmd));
+	EXPECT_NE(0, test_cmd.id);
+
+	/* Attach it to the vfio compat */
+	vfio_ioas_cmd.ioas_id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+FIXTURE_TEARDOWN(vfio_compat_mock_domain) {
+	ASSERT_EQ(0, close(self->fd));
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	check_refs(buffer, BUFFER_SIZE, 0);
+	ASSERT_EQ(0, close(self->fd));
+}
+
+TEST_F(vfio_compat_mock_domain, simple_close)
+{
+}
+
+/*
+ * Execute an ioctl command stored in buffer and check that the kernel does not
+ * write past its declared size (is_filled() verifies the 0xAA canary).
+ */
+static bool is_filled(const void *buf, uint8_t c, size_t len)
+{
+	const uint8_t *cbuf = buf;
+
+	for (; len; cbuf++, len--)
+		if (*cbuf != c)
+			return false;
+	return true;
+}
+
+#define ioctl_check_buf(fd, cmd)                                               \
+	({                                                                     \
+		size_t _cmd_len = *(__u32 *)buffer;                            \
+									       \
+		memset(buffer + _cmd_len, 0xAA, BUFFER_SIZE - _cmd_len);       \
+		ASSERT_EQ(0, ioctl(fd, cmd, buffer));                          \
+		ASSERT_EQ(true, is_filled(buffer + _cmd_len, 0xAA,             \
+					  BUFFER_SIZE - _cmd_len));            \
+	})
+
+static void check_vfio_info_cap_chain(struct __test_metadata *_metadata,
+				      struct vfio_iommu_type1_info *info_cmd)
+{
+	const struct vfio_info_cap_header *cap;
+
+	ASSERT_GE(info_cmd->argsz, info_cmd->cap_offset + sizeof(*cap));
+	cap = buffer + info_cmd->cap_offset;
+	while (true) {
+		size_t cap_size;
+
+		if (cap->next)
+			cap_size = (buffer + cap->next) - (void *)cap;
+		else
+			cap_size = (buffer + info_cmd->argsz) - (void *)cap;
+
+		switch (cap->id) {
+		case VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE: {
+			struct vfio_iommu_type1_info_cap_iova_range *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(1, data->nr_iovas);
+			EXPECT_EQ(MOCK_APERTURE_START,
+				  data->iova_ranges[0].start);
+			EXPECT_EQ(MOCK_APERTURE_LAST,
+				  data->iova_ranges[0].end);
+			break;
+		}
+		case VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL: {
+			struct vfio_iommu_type1_info_dma_avail *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(sizeof(*data), cap_size);
+			break;
+		}
+		default:
+			ASSERT_EQ(false, true);
+			break;
+		}
+		if (!cap->next)
+			break;
+
+		ASSERT_GE(info_cmd->argsz, cap->next + sizeof(*cap));
+		ASSERT_GE(buffer + cap->next, (void *)cap);
+		cap = buffer + cap->next;
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, get_info)
+{
+	struct vfio_iommu_type1_info *info_cmd = buffer;
+	unsigned int i;
+	size_t caplen;
+
+	/* Pre-cap ABI */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = offsetof(struct vfio_iommu_type1_info, cap_offset),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+
+	/* Read the cap chain size */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = sizeof(*info_cmd),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+	ASSERT_EQ(0, info_cmd->cap_offset);
+	ASSERT_LT(sizeof(*info_cmd), info_cmd->argsz);
+
+	/* Read the caps, kernel should never create a corrupted caps */
+	caplen = info_cmd->argsz;
+	for (i = sizeof(*info_cmd); i < caplen; i++) {
+		*info_cmd = (struct vfio_iommu_type1_info){
+			.argsz = i,
+		};
+		ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+		ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+			  info_cmd->flags);
+		if (!info_cmd->cap_offset)
+			continue;
+		check_vfio_info_cap_chain(_metadata, info_cmd);
+	}
+}
+
+/* FIXME use fault injection to test memory failure paths */
+/* FIXME test VFIO_IOMMU_MAP_DMA */
+/* FIXME test VFIO_IOMMU_UNMAP_DMA */
+/* FIXME test 2k iova alignment */
+
+TEST_HARNESS_MAIN
-- 
2.35.1


* Re: [PATCH RFC 03/12] iommufd: File descriptor, context, kconfig and makefiles
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-22 14:18     ` Niklas Schnelle
  -1 siblings, 0 replies; 244+ messages in thread
From: Niklas Schnelle @ 2022-03-22 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, iommu,
	Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> This is the basic infrastructure of a new miscdevice to hold the iommufd
> IOCTL API.
> 
> It provides:
>  - A miscdevice to create file descriptors to run the IOCTL interface over
> 
>  - A table based ioctl dispatch and centralized extendable pre-validation
>    step
> 
>  - An xarray mapping user ID's to kernel objects. The design has multiple
>    inter-related objects held within in a single IOMMUFD fd
> 
>  - A simple usage count to build a graph of object relations and protect
>    against hostile userspace racing ioctls

For me at this point it is hard to grok what this "graph of object
relations" is about. Maybe an example would help? I'm assuming this is
about e.g. DEVICE -depends-on-> HW_PAGETABLE -depends-on-> IOAS, i.e.
the arrows in the picture of PATCH 02? Or is it the other way around
and IOAS -depends-on-> HW_PAGETABLE, because it's about which object
references which? From the rest of the patch I understand that this
mostly establishes the order of destruction. So is the HW_PAGETABLE
destroyed before the IOAS because a HW_PAGETABLE must never reference
an invalid/destroyed IOAS, or is it the other way around because the
IOAS holds a reference to the HW_PAGETABLEs in it? I'd guess the former
but I'm a bit confused still.
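
To make my reading concrete, here is a minimal pseudo-C sketch of how I
currently picture the refcounting (the IOAS/HW_PAGETABLE layouts below
are purely my assumption, since those objects only appear later in the
series):

	/* assumed model: each object takes a "users" ref on the object it points at */
	struct iommufd_ioas {
		struct iommufd_object obj;	/* users stays elevated while any hwpt uses it */
	};

	struct iommufd_hw_pagetable {
		struct iommufd_object obj;	/* users stays elevated while any device uses it */
		struct iommufd_ioas *ioas;	/* paired with refcount_inc(&ioas->obj.users) */
	};

	/*
	 * Teardown then has to run DEVICE -> HW_PAGETABLE -> IOAS: the object
	 * holding the reference is destroyed first, which drops the users
	 * count of the object it references back toward 1.
	 */

If that sketch matches your intent, then my guess above (HW_PAGETABLE
destroyed before the IOAS) follows directly.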

> 
> The only IOCTL provided in this patch is the generic 'destroy any object
> by handle' operation.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  MAINTAINERS                                   |  10 +
>  drivers/iommu/Kconfig                         |   1 +
>  drivers/iommu/Makefile                        |   2 +-
>  drivers/iommu/iommufd/Kconfig                 |  13 +
>  drivers/iommu/iommufd/Makefile                |   5 +
>  drivers/iommu/iommufd/iommufd_private.h       |  95 ++++++
>  drivers/iommu/iommufd/main.c                  | 305 ++++++++++++++++++
>  include/uapi/linux/iommufd.h                  |  55 ++++
>  9 files changed, 486 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/iommu/iommufd/Kconfig
>  create mode 100644 drivers/iommu/iommufd/Makefile
>  create mode 100644 drivers/iommu/iommufd/iommufd_private.h
>  create mode 100644 drivers/iommu/iommufd/main.c
>  create mode 100644 include/uapi/linux/iommufd.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index e6fce2cbd99ed4..4a041dfc61fe95 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -105,6 +105,7 @@ Code  Seq#    Include File                                           Comments
>  '8'   all                                                            SNP8023 advanced NIC card
>                                                                       <mailto:mcr@solidum.com>
>  ';'   64-7F  linux/vfio.h
> +';'   80-FF  linux/iommufd.h
>  '='   00-3f  uapi/linux/ptp_clock.h                                  <mailto:richardcochran@gmail.com>
>  '@'   00-0F  linux/radeonfb.h                                        conflict!
>  '@'   00-0F  drivers/video/aty/aty128fb.c                            conflict!
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1ba1e4af2cbc80..23a9c631051ee8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10038,6 +10038,16 @@ L:	linux-mips@vger.kernel.org
>  S:	Maintained
>  F:	drivers/net/ethernet/sgi/ioc3-eth.c
>  
> +IOMMU FD
> +M:	Jason Gunthorpe <jgg@nvidia.com>
> +M:	Kevin Tian <kevin.tian@intel.com>
> +L:	iommu@lists.linux-foundation.org
> +S:	Maintained
> +F:	Documentation/userspace-api/iommufd.rst
> +F:	drivers/iommu/iommufd/
> +F:	include/uapi/linux/iommufd.h
> +F:	include/linux/iommufd.h
> +
>  IOMAP FILESYSTEM LIBRARY
>  M:	Christoph Hellwig <hch@infradead.org>
>  M:	Darrick J. Wong <djwong@kernel.org>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 3eb68fa1b8cc02..754d2a9ff64623 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -177,6 +177,7 @@ config MSM_IOMMU
>  
>  source "drivers/iommu/amd/Kconfig"
>  source "drivers/iommu/intel/Kconfig"
> +source "drivers/iommu/iommufd/Kconfig"
>  
>  config IRQ_REMAP
>  	bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index bc7f730edbb0be..6b38d12692b213 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> -obj-y += amd/ intel/ arm/
> +obj-y += amd/ intel/ arm/ iommufd/
>  obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> new file mode 100644
> index 00000000000000..fddd453bb0e764
> --- /dev/null
> +++ b/drivers/iommu/iommufd/Kconfig
> @@ -0,0 +1,13 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config IOMMUFD
> +	tristate "IOMMU Userspace API"
> +	select INTERVAL_TREE
> +	select IOMMU_API
> +	default n
> +	help
> +	  Provides /dev/iommu the user API to control the IOMMU subsystem as
> +	  it relates to managing IO page tables that point at user space memory.
> +
> +	  This would commonly be used in combination with VFIO.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> new file mode 100644
> index 00000000000000..a07a8cffe937c6
> --- /dev/null
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +iommufd-y := \
> +	main.o
> +
> +obj-$(CONFIG_IOMMUFD) += iommufd.o
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> new file mode 100644
> index 00000000000000..2d0bba3965be1a
> --- /dev/null
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -0,0 +1,95 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __IOMMUFD_PRIVATE_H
> +#define __IOMMUFD_PRIVATE_H
> +
> +#include <linux/rwsem.h>
> +#include <linux/xarray.h>
> +#include <linux/refcount.h>
> +#include <linux/uaccess.h>
> +
> +struct iommufd_ctx {
> +	struct file *filp;
> +	struct xarray objects;
> +};
> +
> +struct iommufd_ctx *iommufd_fget(int fd);
> +
> +struct iommufd_ucmd {
> +	struct iommufd_ctx *ictx;
> +	void __user *ubuffer;
> +	u32 user_size;
> +	void *cmd;
> +};
> +
> +/* Copy the response in ucmd->cmd back to userspace. */
> +static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
> +				       size_t cmd_len)
> +{
> +	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
> +			 min_t(size_t, ucmd->user_size, cmd_len)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +/*
> + * The objects for an acyclic graph through the users refcount. This enum must
> + * be sorted by type depth first so that destruction completes lower objects and
> + * releases the users refcount before reaching higher objects in the graph.
> + */

A bit like my first comment, I think this would benefit from an example
of what lower/higher mean in this case. I believe a lower object must
only reference/depend on higher objects, correct?
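
For example, once the later patches in the series add the other object
types, I would expect this enum to end up reading roughly like the
following (the exact DEVICE/IOAS/HW_PAGETABLE names here are just my
guess from skimming the rest of the series):

	enum iommufd_object_type {
		IOMMUFD_OBJ_NONE,
		IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
		IOMMUFD_OBJ_DEVICE,		/* destroyed first, drops its ref on a hw_pagetable */
		IOMMUFD_OBJ_HW_PAGETABLE,	/* drops its ref on an ioas */
		IOMMUFD_OBJ_IOAS,		/* only referenced, destroyed last */
		IOMMUFD_OBJ_MAX,
	};

with "lower" meaning earlier in the enum, i.e. deeper in the dependency
graph. Spelling that out in the comment would help readers like me.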

> +enum iommufd_object_type {
> +	IOMMUFD_OBJ_NONE,
> +	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
> +	IOMMUFD_OBJ_MAX,
> +};
> +
> +/* Base struct for all objects with a userspace ID handle. */
> +struct iommufd_object {
> +	struct rw_semaphore destroy_rwsem;
> +	refcount_t users;
> +	enum iommufd_object_type type;
> +	unsigned int id;
> +};
> +
> +static inline bool iommufd_lock_obj(struct iommufd_object *obj)
> +{
> +	if (!down_read_trylock(&obj->destroy_rwsem))
> +		return false;
> +	if (!refcount_inc_not_zero(&obj->users)) {
> +		up_read(&obj->destroy_rwsem);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
> +					  enum iommufd_object_type type);
> +static inline void iommufd_put_object(struct iommufd_object *obj)
> +{
> +	refcount_dec(&obj->users);
> +	up_read(&obj->destroy_rwsem);
> +}
> +static inline void iommufd_put_object_keep_user(struct iommufd_object *obj)
> +{
> +	up_read(&obj->destroy_rwsem);
> +}
> +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj);
> +void iommufd_object_finalize(struct iommufd_ctx *ictx,
> +			     struct iommufd_object *obj);
> +bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
> +				 struct iommufd_object *obj);
> +struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
> +					     size_t size,
> +					     enum iommufd_object_type type);
> +
> +#define iommufd_object_alloc(ictx, ptr, type)                                  \
> +	container_of(_iommufd_object_alloc(                                    \
> +			     ictx,                                             \
> +			     sizeof(*(ptr)) + BUILD_BUG_ON_ZERO(               \
> +						      offsetof(typeof(*(ptr)), \
> +							       obj) != 0),     \
> +			     type),                                            \
> +		     typeof(*(ptr)), obj)
> +
> +#endif
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> new file mode 100644
> index 00000000000000..ae8db2f663004f
> --- /dev/null
> +++ b/drivers/iommu/iommufd/main.c
> @@ -0,0 +1,305 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2021 Intel Corporation
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + *
> + * iommfd provides control over the IOMMU HW objects created by IOMMU kernel
      ^^^^^^ iommufd
> + * drivers. IOMMU HW objects revolve around IO page tables that map incoming DMA
> + * addresses (IOVA) to CPU addresses.
> + *
> + * The API is divided into a general portion that is intended to work with any
> + * kernel IOMMU driver, and a device specific portion that  is intended to be
> + * used with a userspace HW driver paired with the specific kernel driver. This
> + * mechanism allows all the unique functionalities in individual IOMMUs to be
> + * exposed to userspace control.
> + */
> +#define pr_fmt(fmt) "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/bug.h>
> +#include <uapi/linux/iommufd.h>
> +
> +#include "iommufd_private.h"
> +
> +struct iommufd_object_ops {
> +	void (*destroy)(struct iommufd_object *obj);
> +};
> +static struct iommufd_object_ops iommufd_object_ops[];
> +
> +struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
> +					     size_t size,
> +					     enum iommufd_object_type type)
> +{
> +	struct iommufd_object *obj;
> +	int rc;
> +
> +	obj = kzalloc(size, GFP_KERNEL);
> +	if (!obj)
> +		return ERR_PTR(-ENOMEM);
> +	obj->type = type;
> +	init_rwsem(&obj->destroy_rwsem);
> +	refcount_set(&obj->users, 1);
> +
> +	/*
> +	 * Reserve an ID in the xarray but do not publish the pointer yet since
> +	 * the caller hasn't initialized it yet. Once the pointer is published
> +	 * in the xarray and visible to other threads we can't reliably destroy
> +	 * it anymore, so the caller must complete all errorable operations
> +	 * before calling iommufd_object_finalize().
> +	 */
> +	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
> +		      xa_limit_32b, GFP_KERNEL);
> +	if (rc)
> +		goto out_free;
> +	return obj;
> +out_free:
> +	kfree(obj);
> +	return ERR_PTR(rc);
> +}
> +
> +/*
> + * Allow concurrent access to the object. This should only be done once the
> + * system call that created the object is guaranteed to succeed.
> + */
> +void iommufd_object_finalize(struct iommufd_ctx *ictx,
> +			     struct iommufd_object *obj)

Not sure about the name here. Finalize kind of sounds like
destruction/going away. That might just be me though. Given that there is
an "..abort.." function, maybe "commit" or "complete" would be a better fit.

> +{
> +	void *old;
> +
> +	old = xa_store(&ictx->objects, obj->id, obj, GFP_KERNEL);
> +	/* obj->id was returned from xa_alloc() so the xa_store() cannot fail */
> +	WARN_ON(old);
> +}
> +
> +/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */
> +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj)
> +{
> +	void *old;
> +
> +	old = xa_erase(&ictx->objects, obj->id);
> +	WARN_ON(old);
> +	kfree(obj);
> +}
> +
> +struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
> +					  enum iommufd_object_type type)
> +{
> +	struct iommufd_object *obj;
> +
> +	xa_lock(&ictx->objects);
> +	obj = xa_load(&ictx->objects, id);
> +	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
> +	    !iommufd_lock_obj(obj))

Looking at the code it seems iommufd_lock_obj() locks against
destroy_rwsem and increases the reference count but there is also an
iommufd_put_object_keep_user() which only unlocks the destroy_rwsem. I
think I personally would benefit from a comment/commit message
explaining the lifecycle.

There is the below comment in iommufd_object_destroy_user() but I think
it would be better placed near the destroy_rwsem declaration and to
also explain the interaction between the destroy_rwsem and the
reference count.

> +		obj = ERR_PTR(-ENOENT);
> +	xa_unlock(&ictx->objects);
> +	return obj;
> +}
> +
> +/*
> + * The caller holds a users refcount and wants to destroy the object. Returns
> + * true if the object was destroyed. In all cases the caller no longer has a
> + * reference on obj.
> + */
> +bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
> +				 struct iommufd_object *obj)
> +{
> +	/*
> +	 * The purpose of the destroy_rwsem is to ensure deterministic
> +	 * destruction of objects used by external drivers and destroyed by this
> +	 * function. Any temporary increment of the refcount must hold the read
> +	 * side of this, such as during ioctl execution.
> +	 */
> +	down_write(&obj->destroy_rwsem);
> +	xa_lock(&ictx->objects);
> +	refcount_dec(&obj->users);
> +	if (!refcount_dec_if_one(&obj->users)) {
> +		xa_unlock(&ictx->objects);
> +		up_write(&obj->destroy_rwsem);
> +		return false;
> +	}
> +	__xa_erase(&ictx->objects, obj->id);
> +	xa_unlock(&ictx->objects);
> +
> +	iommufd_object_ops[obj->type].destroy(obj);
> +	up_write(&obj->destroy_rwsem);
> +	kfree(obj);
> +	return true;
> +}
> +
> +static int iommufd_destroy(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_destroy *cmd = ucmd->cmd;
> +	struct iommufd_object *obj;
> +
> +	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
> +	if (IS_ERR(obj))
> +		return PTR_ERR(obj);
> +	iommufd_put_object_keep_user(obj);
> +	if (!iommufd_object_destroy_user(ucmd->ictx, obj))
> +		return -EBUSY;
> +	return 0;
> +}
> +
> +static int iommufd_fops_open(struct inode *inode, struct file *filp)
> +{
> +	struct iommufd_ctx *ictx;
> +
> +	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL);
> +	if (!ictx)
> +		return -ENOMEM;
> +
> +	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1);
> +	ictx->filp = filp;
> +	filp->private_data = ictx;
> +	return 0;
> +}
> +
> +static int iommufd_fops_release(struct inode *inode, struct file *filp)
> +{
> +	struct iommufd_ctx *ictx = filp->private_data;
> +	struct iommufd_object *obj;
> +	unsigned long index = 0;
> +	int cur = 0;
> +
> +	/* Destroy the graph from depth first */
> +	while (cur < IOMMUFD_OBJ_MAX) {
> +		xa_for_each(&ictx->objects, index, obj) {
> +			if (obj->type != cur)
> +				continue;
> +			xa_erase(&ictx->objects, index);
> +			if (WARN_ON(!refcount_dec_and_test(&obj->users)))

I think if this warning ever triggers it would be good to have a
comment here explaining what it means. As I understand it this would mean
that the graph isn't acyclic and the types aren't in the correct order,
i.e. we attempted to destroy a higher order object before a lower one?

> +				continue;
> +			iommufd_object_ops[obj->type].destroy(obj);
> +			kfree(obj);
> +		}
> +		cur++;
> +	}
> +	WARN_ON(!xa_empty(&ictx->objects));
> +	kfree(ictx);
> +	return 0;
> +}
> +
> 
---8<---



^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-22 14:28     ` Niklas Schnelle
  -1 siblings, 0 replies; 244+ messages in thread
From: Niklas Schnelle @ 2022-03-22 14:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> Following the pattern of io_uring, perf, skb, and bpf iommfd will use
                                                iommufd ----^
> user->locked_vm for accounting pinned pages. Ensure the value is included
> in the struct and export free_uid() as iommufd is modular.
> 
> user->locked_vm is the correct accounting to use for ulimit because it is
> per-user, and the ulimit is not supposed to be per-process. Other
> places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> mm->locked_vm for accounting pinned pages, but this is only per-process
> and inconsistent with the majority of the kernel.

Since this will replace parts of vfio this difference seems
significant. Can you explain this a bit more?

I'm also a bit confused how io_uring handles this. When I stumbled over
the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
accounting") and from that commit description I seem to rember that
io_uring also accounts in mm->locked_vm too? In fact I stumbled over
that because the wrong accounting in io_uring exhausted the limit applied to
vfio (I was using a QEMU utilizing io_uring itself).

> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  include/linux/sched/user.h | 2 +-
>  kernel/user.c              | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
> index 00ed419dd46413..c47dae71dad3c8 100644
> --- a/include/linux/sched/user.h
> +++ b/include/linux/sched/user.h
> @@ -24,7 +24,7 @@ struct user_struct {
>  	kuid_t uid;
>  
>  #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
> -    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
> +    defined(CONFIG_NET) || defined(CONFIG_IO_URING) || IS_ENABLED(CONFIG_IOMMUFD)
>  	atomic_long_t locked_vm;
>  #endif
>  #ifdef CONFIG_WATCH_QUEUE
> diff --git a/kernel/user.c b/kernel/user.c
> index e2cf8c22b539a7..d667debeafd609 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
>  	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
>  		free_user(up, flags);
>  }
> +EXPORT_SYMBOL_GPL(free_uid);
>  
>  struct user_struct *alloc_uid(kuid_t uid)
>  {



^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 03/12] iommufd: File descriptor, context, kconfig and makefiles
  2022-03-22 14:18     ` Niklas Schnelle
@ 2022-03-22 14:50       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-22 14:50 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, Mar 22, 2022 at 03:18:51PM +0100, Niklas Schnelle wrote:
> On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > This is the basic infrastructure of a new miscdevice to hold the iommufd
> > IOCTL API.
> > 
> > It provides:
> >  - A miscdevice to create file descriptors to run the IOCTL interface over
> > 
> >  - A table based ioctl dispatch and centralized extendable pre-validation
> >    step
> > 
> >  - An xarray mapping user ID's to kernel objects. The design has multiple
> >    inter-related objects held within in a single IOMMUFD fd
> > 
> >  - A simple usage count to build a graph of object relations and protect
> >    against hostile userspace racing ioctls
> 
> For me at this point it seems hard to grok what this "graph of object
> relations" is about. Maybe an example would help? I'm assuming this is
> about e.g. the DEVICE -depends-on-> HW_PAGETABLE -depends-on-> IOAS  so
> the arrows in the picture of PATCH 02? 

Yes, it is basically standard reference relations, think
'kref'. Object A references B because A has a pointer to B in its
struct.

Therefore B cannot be destroyed before A without creating a dangling
reference.
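
For instance, a minimal sketch of what that pointer looks like (hypothetical
struct layout, following the objects later patches in this series add, not
something defined in this patch):

struct iommufd_hw_pagetable {
	struct iommufd_object obj;
	struct iommufd_ioas *ioas;	/* A -> B pointer, so A holds a users refcount on B */
	/* ... */
};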

> Or is it the other way around
> and the IOAS -depends-on-> HW_PAGETABLE because it's about which
> references which? From the rest of the patch I seem to understand that
> this mostly establishes the order of destruction. So is HW_PAGETABLE
> destroyed before the IOAS because a HW_PAGETABLE must never reference
> an invalid/destroyed IOAS or is it the other way around because the IOAS
> has a reference to the HW_PAGETABLES in it? I'd guess the former but
> I'm a bit confused still.

Yes HW_PAGETABLE is first because it is responsible to remove the
iommu_domain from the IOAS and the IOAS cannot be destroyed with
iommu_domains in it.
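
To make the lower/higher ordering concrete, a minimal sketch of how the
enum could look once the later patches add the objects discussed above
(the DEVICE/HW_PAGETABLE/IOAS names are assumptions based on this thread,
not something this patch defines):

enum iommufd_object_type {
	IOMMUFD_OBJ_NONE,
	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
	/*
	 * Destruction walks from low values to high. A DEVICE holds a
	 * users refcount on its HW_PAGETABLE and a HW_PAGETABLE holds one
	 * on its IOAS, so each type is only destroyed after everything
	 * that references it is already gone.
	 */
	IOMMUFD_OBJ_DEVICE,
	IOMMUFD_OBJ_HW_PAGETABLE,
	IOMMUFD_OBJ_IOAS,
	IOMMUFD_OBJ_MAX,
};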

> > +/*
> > + * The objects for an acyclic graph through the users refcount. This enum must
> > + * be sorted by type depth first so that destruction completes lower objects and
> > + * releases the users refcount before reaching higher objects in the graph.
> > + */
> 
> A bit like my first comment I think this would benefit from an example
> of what lower/higher mean in this case. I believe a lower object must only
> reference/depend on higher objects, correct?

Maybe it is too confusing - I was debating using a try and fail
approach instead which achieves the same thing with a little different
complexity. It seems we may need to do this anyhow for nesting..

Would look like this:

	/* Destroy the graph from depth first */
	while (!xa_empty(&ictx->objects)) {
		unsigned int destroyed = 0;
		unsigned long index = 0;

		xa_for_each (&ictx->objects, index, obj) {
			/*
			 * Since we are in release elevated users must come from
			 * other objects holding the users. We will eventually
			 * destroy the object that holds this one and the next
			 * pass will progress it.
			 */
			if (!refcount_dec_if_one(&obj->users))
				continue;
			destroyed++;
			xa_erase(&ictx->objects, index);
			iommufd_object_ops[obj->type].destroy(obj);
			kfree(obj);
		}
		/* Bug related to users refcount */
		if (WARN_ON(!destroyed))
			break;
	}
	kfree(ictx);

> > +struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
> > +					  enum iommufd_object_type type)
> > +{
> > +	struct iommufd_object *obj;
> > +
> > +	xa_lock(&ictx->objects);
> > +	obj = xa_load(&ictx->objects, id);
> > +	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
> > +	    !iommufd_lock_obj(obj))
> 
> Looking at the code it seems iommufd_lock_obj() locks against
> destroy_rwsem and increases the reference count but there is also an
> iommufd_put_object_keep_user() which only unlocks the destroy_rwsem. I
> think I personally would benefit from a comment/commit message
> explaining the lifecycle.
> 
> There is the below comment in iommufd_object_destroy_user() but I think
> it would be better placed near the destroy_rwsem declaration and to
> also explain the interaction between the destroy_rwsem and the
> reference count.

I do prefer it near the destroy because that is the only place that
actually requires the property it gives. The code outside this layer
shouldn't know about this at all beyond following some rules about
iommufd_put_object_keep_user(). Let's add a comment there instead:

/**
 * iommufd_put_object_keep_user() - Release part of the refcount on obj
 * @obj - Object to release
 *
 * Objects have two protections to ensure that userspace has a consistent
 * experience with destruction. Normally objects are locked so that destroy will
 * block while there are concurrent users, and wait for the object to be
 * unlocked.
 *
 * However, destroy can also be blocked by holding users reference counts on the
 * objects, in that case destroy will immediately return EBUSY and will not wait
 * for reference counts to go to zero.
 *
 * This function releases the destroy lock and destroy will return EBUSY.
 *
 * It should be used in places where the users will be held beyond a single
 * system call.
 */
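
To illustrate, a minimal sketch of the two call patterns this gives us,
based only on the functions in this patch (the surrounding ioctl details
are illustrative, not real code):

	/* Short term use inside an ioctl: destroy blocks until the put */
	obj = iommufd_get_object(ucmd->ictx, id, IOMMUFD_OBJ_ANY);
	if (IS_ERR(obj))
		return PTR_ERR(obj);
	/* ... use the object ... */
	iommufd_put_object(obj);

	/*
	 * Long term hold, eg beyond a single system call: keep the users
	 * refcount so a concurrent destroy returns EBUSY instead of
	 * blocking. The users refcount is dropped later, eg at teardown.
	 */
	obj = iommufd_get_object(ucmd->ictx, id, IOMMUFD_OBJ_ANY);
	if (IS_ERR(obj))
		return PTR_ERR(obj);
	iommufd_put_object_keep_user(obj);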

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 14:28     ` Niklas Schnelle
@ 2022-03-22 14:57       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-22 14:57 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > Following the pattern of io_uring, perf, skb, and bpf iommfd will use
>                                                 iommufd ----^
> > user->locked_vm for accounting pinned pages. Ensure the value is included
> > in the struct and export free_uid() as iommufd is modular.
> > 
> > user->locked_vm is the correct accounting to use for ulimit because it is
> > per-user, and the ulimit is not supposed to be per-process. Other
> > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > mm->locked_vm for accounting pinned pages, but this is only per-process
> > and inconsistent with the majority of the kernel.
> 
> Since this will replace parts of vfio this difference seems
> significant. Can you explain this a bit more?

I'm not sure what to say more, this is the correct way to account for
this. It is natural to see it is right because the ulimit is supposed
to be global to the user, not effectively reset every time the user
creates a new process.

So checking the ulimit against a per-process variable in the mm_struct
doesn't make too much sense.
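
Roughly, the difference is just which counter the rlimit check is applied
to. Simplified pseudo-kernel code, not the exact logic of either driver:

	unsigned long lock_limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	/* vfio type1 style: per-process, starts from zero in every new mm */
	if (current->mm->locked_vm + npages > lock_limit)
		return -ENOMEM;

	/* iommufd (and io_uring/perf/bpf) style: per-user, shared by all of
	 * the user's processes
	 */
	if (atomic_long_read(&user->locked_vm) + npages > lock_limit)
		return -ENOMEM;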

> I'm also a bit confused how io_uring handles this. When I stumbled over
> the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
> accounting") and from that commit description I seem to rember that
> io_uring also accounts in mm->locked_vm too? 

locked/pinned_pages in the mm is kind of a debugging counter, it
indicates how many pins the user obtained through this mm. AFAICT its
only correct use is to report through proc. Things are supposed to
update it, but there is no reason to limit it as the user limit
supersedes it.

The commit you pointed at is fixing that io_uring corrupted the value.

Since VFIO checks locked/pinned_pages against the ulimit, it would blow
up when the value was wrong.

> In fact I stumbled over that because the wrong accounting in
> io_uring exhausted the limit applied to vfio (I was using a QEMU utilizing
> io_uring itself).

I'm pretty interested in this as well, do you have anything you can
share?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 14:57       ` Jason Gunthorpe via iommu
@ 2022-03-22 15:29         ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-22 15:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Niklas Schnelle, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, 22 Mar 2022 11:57:41 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:  
> > > 
> > > user->locked_vm is the correct accounting to use for ulimit because it is
> > > per-user, and the ulimit is not supposed to be per-process. Other
> > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > > mm->locked_vm for accounting pinned pages, but this is only per-process
> > > and inconsistent with the majority of the kernel.  
> > 
> > Since this will replace parts of vfio this difference seems
> > significant. Can you explain this a bit more?  
> 
> I'm not sure what to say more, this is the correct way to account for
> this. It is natural to see it is right because the ulimit is supposed
> to be global to the user, not effectively reset every time the user
> creates a new process.
> 
> So checking the ulimit against a per-process variable in the mm_struct
> doesn't make too much sense.

I'm still picking my way through the series, but the later compat
interface doesn't mention this difference as an outstanding issue.
Doesn't this difference need to be accounted in how libvirt manages VM
resource limits?  AIUI libvirt uses some form of prlimit(2) to set
process locked memory limits.  A compat interface might be an
interesting feature, but does it only provide ioctl compatibility and
not resource ulimit compatibility?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 15:29         ` Alex Williamson
@ 2022-03-22 16:15           ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-22 16:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Niklas Schnelle, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:

> I'm still picking my way through the series, but the later compat
> interface doesn't mention this difference as an outstanding issue.
> Doesn't this difference need to be accounted in how libvirt manages VM
> resource limits?  

AFAICT, no, but it should be checked.

> AIUI libvirt uses some form of prlimit(2) to set process locked
> memory limits.

Yes, and ulimit does work fully. prlimit adjusts the value:

int do_prlimit(struct task_struct *tsk, unsigned int resource,
		struct rlimit *new_rlim, struct rlimit *old_rlim)
{
	rlim = tsk->signal->rlim + resource;
[..]
		if (new_rlim)
			*rlim = *new_rlim;

Which vfio reads back here:

drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
drivers/vfio/vfio_iommu_type1.c:        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

And iommufd does the same read back:

	lock_limit =
		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	npages = pages->npinned - pages->last_npinned;
	do {
		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
		new_pages = cur_pages + npages;
		if (new_pages > lock_limit)
			return -ENOMEM;
	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
				     new_pages) != cur_pages);

So it does work essentially the same.
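
For reference, a userspace sketch of the kind of adjustment a management
stack can make (illustrative only - this is not libvirt's actual code, and
the helper name is made up):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

static int raise_memlock(pid_t qemu_pid, rlim_t bytes)
{
	struct rlimit new_rlim = { .rlim_cur = bytes, .rlim_max = bytes };

	/*
	 * prlimit(2) updates tsk->signal->rlim, which both vfio and iommufd
	 * read back through rlimit()/task_rlimit() as shown above.
	 */
	if (prlimit(qemu_pid, RLIMIT_MEMLOCK, &new_rlim, NULL)) {
		perror("prlimit");
		return -1;
	}
	return 0;
}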

The difference is more subtle: io_uring/etc puts the charge in the user
so it is additive with things like io_uring and additively spans all
the user's processes.

However vfio is accounting only per-process and only for itself - no
other subsystem uses locked as the charge variable for DMA pins.

The user visible difference will be that a limit X that worked with
VFIO may start to fail after a kernel upgrade as the charge accounting
is now cross user and additive with things like iommufd.

This whole area is a bit peculiar (eg mlock itself works differently),
IMHO, but with most of the places doing pins voting to use
user->locked_vm as the charge it seems the right path in today's
kernel.

Certainly having qemu concurrently using three different subsystems
(vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for
RLIMIT_MEMLOCK differently cannot be sane or correct.

I plan to fix RDMA like this as well so at least we can have
consistency within qemu.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 14:57       ` Jason Gunthorpe via iommu
@ 2022-03-22 16:31         ` Niklas Schnelle
  -1 siblings, 0 replies; 244+ messages in thread
From: Niklas Schnelle @ 2022-03-22 16:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, 2022-03-22 at 11:57 -0300, Jason Gunthorpe wrote:
> On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > > Following the pattern of io_uring, perf, skb, and bpf iommfd will use
> >                                                 iommufd ----^
> > > user->locked_vm for accounting pinned pages. Ensure the value is included
> > > in the struct and export free_uid() as iommufd is modular.
> > > 
> > > user->locked_vm is the correct accounting to use for ulimit because it is
> > > per-user, and the ulimit is not supposed to be per-process. Other
> > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > > mm->locked_vm for accounting pinned pages, but this is only per-process
> > > and inconsistent with the majority of the kernel.
> > 
> > Since this will replace parts of vfio this difference seems
> > significant. Can you explain this a bit more?
> 
> I'm not sure what to say more, this is the correct way to account for
> > this. It is natural to see it is right because the ulimit is supposed
> to be global to the user, not effectively reset every time the user
> creates a new process.
> 
> So checking the ulimit against a per-process variable in the mm_struct
> doesn't make too much sense.

Yes, I agree that logically this makes more sense. I was kind of aiming
in the same direction as Alex, i.e. it's a conscious decision to do it
right, and we need to know where this may lead to differences and how to
handle them.

> 
> > I'm also a bit confused how io_uring handles this. When I stumbled over
> > the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
> > accounting") and from that commit description I seem to rember that
> > io_uring also accounts in mm->locked_vm too? 
> 
> locked/pinned_pages in the mm is kind of a debugging counter, it
> indicates how many pins the user obtained through this mm. AFAICT its
> only correct use is to report through proc. Things are supposed to
> update it, but there is no reason to limit it as the user limit
> supersedes it.
> 
> The commit you pointed at is fixing that io_uring corrupted the value.
> 
> Since VFIO checks locked/pinned_pages against the ulimit, it would blow
> up when the value was wrong.
> 
> > In fact I stumbled over that because the wrong accounting in
> > io_uring exhausted the limit applied to vfio (I was using a QEMU
> > utilizing io_uring itself).
> 
> I'm pretty interested in this as well, do you have anything you can
> share?

This was the issue reported in the following BZ.

https://bugzilla.kernel.org/show_bug.cgi?id=209025

I stumbled over the same problem on my x86 box and also on s390. I
don't remember exactly what limit this ran into but I suspect it had
something to do with the libvirt resource limits Alex mentioned.
Meaning io_uring had an accounting bug and then vfio / QEMU couldn't
pin memory. I think that libvirt limit is set to allow pinning all of
guest memory plus a bit so the io_uring misaccounting easily tipped it
over.

> 
> Thanks,
> Jason



^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 16:31         ` Niklas Schnelle
@ 2022-03-22 16:41           ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-22 16:41 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, iommu,
	Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

On Tue, Mar 22, 2022 at 05:31:26PM +0100, Niklas Schnelle wrote:

> > > In fact I stumbled over that because the wrong accounting in
> > > io_uring exhausted the limit applied to vfio (I was using a QEMU
> > > utilizing io_uring itself).
> > 
> > I'm pretty interested in this as well, do you have anything you can
> > share?
> 
> This was the issue reported in the following BZ.

Sorry, I was talking about the io_uring usage in qemu :)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-22 22:15     ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-22 22:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, 18 Mar 2022 14:27:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> +/*
> + * The area takes a slice of the pages from start_bytes to start_byte + length
> + */
> +static struct iopt_area *
> +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		unsigned long iova, unsigned long start_byte,
> +		unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> +	if (!area)
> +		return ERR_PTR(-ENOMEM);
> +
> +	area->iopt = iopt;
> +	area->iommu_prot = iommu_prot;
> +	area->page_offset = start_byte % PAGE_SIZE;
> +	area->pages_node.start = start_byte / PAGE_SIZE;
> +	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
> +		return ERR_PTR(-EOVERFLOW);
> +	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
> +	if (WARN_ON(area->pages_node.last >= pages->npages))
> +		return ERR_PTR(-EOVERFLOW);

@area leaked in the above two error cases.

> +
> +	down_write(&iopt->iova_rwsem);
> +	if (flags & IOPT_ALLOC_IOVA) {
> +		rc = iopt_alloc_iova(iopt, &iova,
> +				     (uintptr_t)pages->uptr + start_byte,
> +				     length);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	if (check_add_overflow(iova, length - 1, &area->node.last)) {
> +		rc = -EOVERFLOW;
> +		goto out_unlock;
> +	}
> +
> +	if (!(flags & IOPT_ALLOC_IOVA)) {
> +		if ((iova & (iopt->iova_alignment - 1)) ||
> +		    (length & (iopt->iova_alignment - 1)) || !length) {
> +			rc = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* No reserved IOVA intersects the range */
> +		if (interval_tree_iter_first(&iopt->reserved_iova_itree, iova,
> +					     area->node.last)) {
> +			rc = -ENOENT;
> +			goto out_unlock;
> +		}
> +
> +		/* Check that there is not already a mapping in the range */
> +		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
> +			rc = -EADDRINUSE;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The area is inserted with a NULL pages indicating it is not fully
> +	 * initialized yet.
> +	 */
> +	area->node.start = iova;
> +	interval_tree_insert(&area->node, &area->iopt->area_itree);
> +	up_write(&iopt->iova_rwsem);
> +	return area;
> +
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	kfree(area);
> +	return ERR_PTR(rc);
> +}
...
> +/**
> + * iopt_access_pages() - Return a list of pages under the iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to access
> + * @out_pages: Output page list
> + * @write: True if access is for writing
> + *
> + * Reads @npages starting at iova and returns the struct page * pointers. These
> + * can be kmap'd by the caller for CPU access.
> + *
> + * The caller must perform iopt_unaccess_pages() when done to balance this.
> + *
> + * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
> + * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
> + * touch memory outside the requested iova slice.
> + *
> + * FIXME: callers that need a DMA mapping via a sgl should create another
> + * interface to build the SGL efficiently
> + */
> +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> +		      unsigned long length, struct page **out_pages, bool write)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +	if (check_add_overflow(iova, length - 1, &last_iova))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		unsigned long last_index;
> +		unsigned long index;
> +
> +		/* Need contiguous areas in the access */
> +		if (iopt_area_iova(area) < cur_iova || !area->pages) {
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)?

I can't see how we'd require in-kernel page users to know the iopt_area
alignment from userspace, so I think this needs to skip the first
iteration.  Thanks,

Alex

> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +
> +		index = iopt_area_iova_to_index(area, cur_iova);
> +		last_index = iopt_area_iova_to_index(area, last);
> +		rc = iopt_pages_add_user(area->pages, index, last_index,
> +					 out_pages, write);
> +		if (rc)
> +			goto out_remove;
> +		if (last == last_iova)
> +			break;
> +		/*
> +		 * Can't cross areas that are not aligned to the system page
> +		 * size with this API.
> +		 */
> +		if (cur_iova % PAGE_SIZE) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +		cur_iova = last + 1;
> +		out_pages += last_index - index;
> +		atomic_inc(&area->num_users);
> +	}
> +
> +	up_read(&iopt->iova_rwsem);
> +	return 0;
> +
> +out_remove:
> +	if (cur_iova != iova)
> +		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
> +	up_read(&iopt->iova_rwsem);
> +	return rc;
> +}


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-23 15:37     ` Niklas Schnelle
  -1 siblings, 0 replies; 244+ messages in thread
From: Niklas Schnelle @ 2022-03-23 15:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, iommu,
	Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> The top of the data structure provides an IO Address Space (IOAS) that is
> similar to a VFIO container. The IOAS allows map/unmap of memory into
> ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO
> mdevs) can be attached to the IOAS to access the PFNs that those IOVA
> areas cover.
> 
> The IO Address Space (IOAS) datastructure is composed of:
>  - struct io_pagetable holding the IOVA map
>  - struct iopt_areas representing populated portions of IOVA
>  - struct iopt_pages representing the storage of PFNs
>  - struct iommu_domain representing the IO page table in the system IOMMU
>  - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
>    mdevs)
>  - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
>    users
> 
> This patch introduces the lowest part of the datastructure - the movement
> of PFNs in a tiered storage scheme:
>  1) iopt_pages::pinned_pfns xarray
>  2) An iommu_domain
>  3) The origin of the PFNs, i.e. the userspace pointer
> 
> PFN have to be copied between all combinations of tiers, depending on the
> configuration.
> 
> The interface is an iterator called a 'pfn_reader' which determines which
> tier each PFN is stored and loads it into a list of PFNs held in a struct
> pfn_batch.
> 
> Each step of the iterator will fill up the pfn_batch, then the caller can
> use the pfn_batch to send the PFNs to the required destination. Repeating
> this loop will read all the PFNs in an IOVA range.
> 
> The pfn_reader and pfn_batch also keep track of the pinned page accounting.
> 
> While PFNs are always stored and accessed as full PAGE_SIZE units the
> iommu_domain tier can store with a sub-page offset/length to support
> IOMMUs with a smaller IOPTE size than PAGE_SIZE.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/iommufd/Makefile          |   3 +-
>  drivers/iommu/iommufd/io_pagetable.h    | 101 ++++
>  drivers/iommu/iommufd/iommufd_private.h |  20 +
>  drivers/iommu/iommufd/pages.c           | 723 ++++++++++++++++++++++++
>  4 files changed, 846 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>  create mode 100644 drivers/iommu/iommufd/pages.c
> 
> 
---8<---
> +
> +/*
> + * This holds a pinned page list for multiple areas of IO address space. The
> + * pages always originate from a linear chunk of userspace VA. Multiple
> + * io_pagetable's, through their iopt_area's, can share a single iopt_pages
> + * which avoids multi-pinning and double accounting of page consumption.
> + *
> + * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
> + * the start of the uptr and extend to npages. pages are pinned dynamically
> + * according to the intervals in the users_itree and domains_itree, npages
> + * records the current number of pages pinned.

This sounds wrong or at least badly named. If npages records the
current number of pages pinned then what does npinned record?

> + */
> +struct iopt_pages {
> +	struct kref kref;
> +	struct mutex mutex;
> +	size_t npages;
> +	size_t npinned;
> +	size_t last_npinned;
> +	struct task_struct *source_task;
> +	struct mm_struct *source_mm;
> +	struct user_struct *source_user;
> +	void __user *uptr;
> +	bool writable:1;
> +	bool has_cap_ipc_lock:1;
> +
> +	struct xarray pinned_pfns;
> +	/* Of iopt_pages_user::node */
> +	struct rb_root_cached users_itree;
> +	/* Of iopt_area::pages_node */
> +	struct rb_root_cached domains_itree;
> +};
> +
---8<---


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages
  2022-03-23 15:37     ` Niklas Schnelle
@ 2022-03-23 16:09       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-23 16:09 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, Mar 23, 2022 at 04:37:50PM +0100, Niklas Schnelle wrote:
> > +/*
> > + * This holds a pinned page list for multiple areas of IO address space. The
> > + * pages always originate from a linear chunk of userspace VA. Multiple
> > + * io_pagetable's, through their iopt_area's, can share a single iopt_pages
> > + * which avoids multi-pinning and double accounting of page consumption.
> > + *
> > + * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
> > + * the start of the uptr and extend to npages. pages are pinned dynamically
> > + * according to the intervals in the users_itree and domains_itree, npages
> > + * records the current number of pages pinned.
> 
> This sounds wrong or at least badly named. If npages records the
> current number of pages pinned then what does npinned record?

Oh, it is a typo, the 2nd npages should be npinned, thanks
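
i.e. the tail of that comment is intended to read:

 * according to the intervals in the users_itree and domains_itree,
 * npinned records the current number of pages pinned.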

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-23 18:10     ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-23 18:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, 18 Mar 2022 14:27:35 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> +/**
> + * iommufd_bind_pci_device - Bind a physical device to an iommu fd
> + * @fd: iommufd file descriptor.
> + * @pdev: Pointer to a physical PCI device struct
> + * @id: Output ID number to return to userspace for this device
> + *
> + * A successful bind establishes an ownership over the device and returns
> + * struct iommufd_device pointer, otherwise returns error pointer.
> + *
> + * A driver using this API must set driver_managed_dma and must not touch
> + * the device until this routine succeeds and establishes ownership.
> + *
> + * Binding a PCI device places the entire RID under iommufd control.
> + *
> + * The caller must undo this with iommufd_unbind_device()
> + */
> +struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
> +					       u32 *id)
> +{
> +	struct iommufd_device *idev;
> +	struct iommufd_ctx *ictx;
> +	struct iommu_group *group;
> +	int rc;
> +
> +	ictx = iommufd_fget(fd);
> +	if (!ictx)
> +		return ERR_PTR(-EINVAL);
> +
> +	group = iommu_group_get(&pdev->dev);
> +	if (!group) {
> +		rc = -ENODEV;
> +		goto out_file_put;
> +	}
> +
> +	/*
> +	 * FIXME: Use a device-centric iommu api and this won't work with
> +	 * multi-device groups
> +	 */
> +	rc = iommu_group_claim_dma_owner(group, ictx->filp);
> +	if (rc)
> +		goto out_group_put;
> +
> +	idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
> +	if (IS_ERR(idev)) {
> +		rc = PTR_ERR(idev);
> +		goto out_release_owner;
> +	}
> +	idev->ictx = ictx;
> +	idev->dev = &pdev->dev;
> +	/* The calling driver is a user until iommufd_unbind_device() */
> +	refcount_inc(&idev->obj.users);
> +	/* group refcount moves into iommufd_device */
> +	idev->group = group;
> +
> +	/*
> +	 * If the caller fails after this success it must call
> +	 * iommufd_unbind_device() which is safe since we hold this refcount.
> +	 * This also means the device is a leaf in the graph and no other object
> +	 * can take a reference on it.
> +	 */
> +	iommufd_object_finalize(ictx, &idev->obj);
> +	*id = idev->obj.id;
> +	return idev;
> +
> +out_release_owner:
> +	iommu_group_release_dma_owner(group);
> +out_group_put:
> +	iommu_group_put(group);
> +out_file_put:
> +	fput(ictx->filp);
> +	return ERR_PTR(rc);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_bind_pci_device);

I'm stumped why this needs to be PCI specific.  Anything beyond the RID
comment?  Please enlighten.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
  2022-03-23 18:10     ` Alex Williamson
@ 2022-03-23 18:15       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-23 18:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, Mar 23, 2022 at 12:10:01PM -0600, Alex Williamson wrote:
> > +EXPORT_SYMBOL_GPL(iommufd_bind_pci_device);
> 
> I'm stumped why this needs to be PCI specific.  Anything beyond the RID
> comment?  Please enlighten.  Thanks,

The way it turned out in the end, it is not for a good reason any
more. I'll change it.
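
Purely to illustrate the direction (not a committed interface - the
naming and signature may well end up different), a device-generic
variant could look like:

	struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
						   u32 *id);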

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-22 22:15     ` Alex Williamson
@ 2022-03-23 18:15       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-23 18:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, Mar 22, 2022 at 04:15:44PM -0600, Alex Williamson wrote:

> > +static struct iopt_area *
> > +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
> > +		unsigned long iova, unsigned long start_byte,
> > +		unsigned long length, int iommu_prot, unsigned int flags)
> > +{
> > +	struct iopt_area *area;
> > +	int rc;
> > +
> > +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +	if (!area)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	area->iopt = iopt;
> > +	area->iommu_prot = iommu_prot;
> > +	area->page_offset = start_byte % PAGE_SIZE;
> > +	area->pages_node.start = start_byte / PAGE_SIZE;
> > +	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
> > +		return ERR_PTR(-EOVERFLOW);
> > +	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
> > +	if (WARN_ON(area->pages_node.last >= pages->npages))
> > +		return ERR_PTR(-EOVERFLOW);
> 
> @area leaked in the above two error cases.

Yes, thanks
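
One minimal way to plug it, for illustration only (a sketch; the actual
respin may structure the error paths differently):

	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last)) {
		rc = -EOVERFLOW;
		goto out_free;
	}
	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
	if (WARN_ON(area->pages_node.last >= pages->npages)) {
		rc = -EOVERFLOW;
		goto out_free;
	}
	...
out_free:
	kfree(area);
	return ERR_PTR(rc);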

> > +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> > +		      unsigned long length, struct page **out_pages, bool write)
> > +{
> > +	unsigned long cur_iova = iova;
> > +	unsigned long last_iova;
> > +	struct iopt_area *area;
> > +	int rc;
> > +
> > +	if (!length)
> > +		return -EINVAL;
> > +	if (check_add_overflow(iova, length - 1, &last_iova))
> > +		return -EOVERFLOW;
> > +
> > +	down_read(&iopt->iova_rwsem);
> > +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> > +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> > +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> > +		unsigned long last_index;
> > +		unsigned long index;
> > +
> > +		/* Need contiguous areas in the access */
> > +		if (iopt_area_iova(area) < cur_iova || !area->pages) {
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)?

Oh, good eye.

That is a typo: < should be >:

		if (iopt_area_iova(area) > cur_iova || !area->pages) {

There are three boundary conditions here to worry about:
 - interval trees return any nodes that intersect the queried range
   so the first found node can start after iova

 - There can be a gap in the intervals

 - The final area can end short of last_iova

The last one is botched too and needs this:
        for ... { ...
	}
+	if (cur_iova != last_iova)
+		goto out_remove;

The test suite isn't covering the boundary cases here yet; I added a
FIXME for this for now.
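
For illustration, the walk with both corrections folded in might look
like the below (a sketch built from the code quoted above, not the
eventual patch; the trailing-gap errno and the elided pinning step are
placeholders):

	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
	     area = iopt_area_iter_next(area, iova, last_iova)) {
		unsigned long last = min(last_iova, iopt_area_last_iova(area));

		/*
		 * The first area may start before iova, but no area may
		 * start past the point already covered - that is a gap.
		 */
		if (iopt_area_iova(area) > cur_iova || !area->pages) {
			rc = -EINVAL;
			goto out_remove;
		}

		/* ... pin the pages covering cur_iova..last ... */

		if (last == last_iova)
			goto out_done;
		cur_iova = last + 1;
	}
	/* Ran out of areas before reaching last_iova - a trailing gap */
	rc = -EINVAL;
	goto out_remove;
out_done:
	up_read(&iopt->iova_rwsem);
	return 0;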

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-23 19:10     ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-23 19:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, 18 Mar 2022 14:27:33 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> +static int conv_iommu_prot(u32 map_flags)
> +{
> +	int iommu_prot;
> +
> +	/*
> +	 * We provide no manual cache coherency ioctls to userspace and most
> +	 * architectures make the CPU ops for cache flushing privileged.
> +	 * Therefore we require the underlying IOMMU to support CPU coherent
> +	 * operation.
> +	 */
> +	iommu_prot = IOMMU_CACHE;

Where is this requirement enforced?  AIUI we'd need to test
IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
intel_iommu_map() simply drop the flag when not supported by HW.

This also seems like an issue relative to vfio compatibility that I
don't see mentioned in that patch.  Thanks,

Alex

> +	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> +		iommu_prot |= IOMMU_WRITE;
> +	if (map_flags & IOMMU_IOAS_MAP_READABLE)
> +		iommu_prot |= IOMMU_READ;
> +	return iommu_prot;
> +}



^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-23 19:10     ` Alex Williamson
@ 2022-03-23 19:34       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-23 19:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:33 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > +static int conv_iommu_prot(u32 map_flags)
> > +{
> > +	int iommu_prot;
> > +
> > +	/*
> > +	 * We provide no manual cache coherency ioctls to userspace and most
> > +	 * architectures make the CPU ops for cache flushing privileged.
> > +	 * Therefore we require the underlying IOMMU to support CPU coherent
> > +	 * operation.
> > +	 */
> > +	iommu_prot = IOMMU_CACHE;
> 
> Where is this requirement enforced?  AIUI we'd need to test
> IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> intel_iommu_map() simply drop the flag when not supported by HW.

You are right, the correct thing to do is to fail device binding/attach
entirely if IOMMU_CAP_CACHE_COHERENCY is not there; however, we can't do
that because Intel abuses the meaning of IOMMU_CAP_CACHE_COHERENCY to
mean their special no-snoop behavior is supported.

I want Intel to split out their special no-snoop from IOMMU_CACHE and
IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
all iommu drivers. Once this is done, vfio and iommufd should both
always set IOMMU_CACHE and refuse to work without
IOMMU_CAP_CACHE_COHERENCY (unless someone knows of an !IOMMU_CACHE
arch that does in fact work today with vfio, somehow, but I don't..).

I added a fixme about this.
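
For reference, the eventual enforcement would be a check along these
lines at bind/attach time (illustrative only; exactly where it lands
depends on the no-snoop split above):

	if (!iommu_capable(dev->bus, IOMMU_CAP_CACHE_COHERENCY))
		return -EINVAL;	/* IOMMU cannot force coherent DMA */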

> This also seems like an issue relative to vfio compatibility that I
> don't see mentioned in that patch.  Thanks,

Yes, it was missed in the notes for vfio compat that Intel no-snoop is
not working currently, I fixed it.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-23 19:34       ` Jason Gunthorpe via iommu
@ 2022-03-23 20:04         ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-23 20:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, 23 Mar 2022 16:34:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > On Fri, 18 Mar 2022 14:27:33 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > +static int conv_iommu_prot(u32 map_flags)
> > > +{
> > > +	int iommu_prot;
> > > +
> > > +	/*
> > > +	 * We provide no manual cache coherency ioctls to userspace and most
> > > +	 * architectures make the CPU ops for cache flushing privileged.
> > > +	 * Therefore we require the underlying IOMMU to support CPU coherent
> > > +	 * operation.
> > > +	 */
> > > +	iommu_prot = IOMMU_CACHE;  
> > 
> > Where is this requirement enforced?  AIUI we'd need to test
> > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > intel_iommu_map() simply drop the flag when not supported by HW.  
> 
> You are right, the correct thing to do is to fail device
> binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> however we can't do that because Intel abuses the meaning of
> IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> supported.
> 
> I want Intel to split out their special no-snoop from IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> all iommu drivers. Once this is done vfio and iommufd should both
> always set IOMMU_CACHE and refuse to work without
> IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> arch that does in fact work today with vfio, somehow, but I don't..)

IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
lack of snoop-control support, causing us to have mixed coherent and
non-coherent domains.  I don't recall whether, if you go back far enough
in VT-d history, the primary IOMMU might have lacked this support.  So I
think there are systems we care about with IOMMUs that can't enforce
DMA coherency.

As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
are coherent regardless of mapping protection flags?  What's the point
of IOMMU_CACHE at that point?

> I added a fixme about this.
> 
> > This also seems like an issue relative to vfio compatibility that I
> > don't see mentioned in that patch.  Thanks,  
> 
> Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> not working currently, I fixed it.

Right, I see it in the comments relative to extensions, but missed in
the commit log.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-23 20:04         ` Alex Williamson
@ 2022-03-23 20:34           ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-23 20:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	Kevin Tian, Daniel Jordan, iommu, Joao Martins, David Gibson

On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 16:34:39 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > +static int conv_iommu_prot(u32 map_flags)
> > > > +{
> > > > +	int iommu_prot;
> > > > +
> > > > +	/*
> > > > +	 * We provide no manual cache coherency ioctls to userspace and most
> > > > +	 * architectures make the CPU ops for cache flushing privileged.
> > > > +	 * Therefore we require the underlying IOMMU to support CPU coherent
> > > > +	 * operation.
> > > > +	 */
> > > > +	iommu_prot = IOMMU_CACHE;  
> > > 
> > > Where is this requirement enforced?  AIUI we'd need to test
> > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > intel_iommu_map() simply drop the flag when not supported by HW.  
> > 
> > You are right, the correct thing to do is to fail device
> > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > however we can't do that because Intel abuses the meaning of
> > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> > supported.
> > 
> > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> > all iommu drivers. Once this is done vfio and iommufd should both
> > always set IOMMU_CACHE and refuse to work without
> > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> > arch that does in fact work today with vfio, somehow, but I don't..)
> 
> IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> lack of snoop-control support, causing us to have mixed coherent and
> non-coherent domains.  I don't recall if you go back far enough in VT-d
> history if the primary IOMMU might have lacked this support.  So I
> think there are systems we care about with IOMMUs that can't enforce
> DMA coherency.
> 
> As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
> mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
> are coherent regardless of mapping protection flags?  What's the point
> of IOMMU_CACHE at that point?

IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
change.

It only means normal DMAs issued in a normal way are coherent with the
CPU and do not require special cache flushing instructions, i.e. DMA
issued by a kernel driver using the DMA API.

It does not mean that non-coherent DMA does not exist, or that
platform or device specific ways to trigger non-coherence are blocked.

Stated another way, any platform that wires dev_is_dma_coherent() to
true, like all x86 does, must support IOMMU_CACHE and report
IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
supports. The platform obviously declares it supports this in order to
support the in-kernel DMA API.

Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
blocking' should be created to cover Intel's special need. From what I
know it is only implemented in the Intel driver and apparently only
for some IOMMUs connected to IGD.
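
As a minimal sketch of the dev_is_dma_coherent() invariant stated
above (illustrative only, not something proposed in this series; it
assumes the existing iommu_capable() helper and needs
<linux/dma-map-ops.h> for dev_is_dma_coherent()):

	/*
	 * Sketch only: a DMA-coherent device implies the iommu driver
	 * is expected to honour IOMMU_CACHE and to report
	 * IOMMU_CAP_CACHE_COHERENCY for the domains it can attach to.
	 */
	static void check_dma_coherency_invariant(struct device *dev)
	{
		if (dev_is_dma_coherent(dev))
			WARN_ON(!iommu_capable(dev->bus,
					       IOMMU_CAP_CACHE_COHERENCY));
	}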

> > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > not working currently, I fixed it.
> 
> Right, I see it in the comments relative to extensions, but missed in
> the commit log.  Thanks,

Oh good, I remembered it was someplace..

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-23 22:51     ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-23 22:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, 18 Mar 2022 14:27:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into io_pagetable operations. Doing so allows the use of
> iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
> SET_CONTAINER using an iommufd instead of a container fd is a followup
> series.
> 
> Internally the compatibility API uses a normal IOAS object that, like
> vfio, is automatically allocated when the first device is
> attached.
> 
> Userspace can also query or set this IOAS object directly using the
> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
> features while still using the VFIO style map/unmap ioctls.
> 
> While this is enough to operate qemu, it is still a bit of a WIP with a
> few gaps to be resolved:
> 
>  - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
>    split areas. The old mode can be implemented with a new operation to
>    split an iopt_area into two without disturbing the iopt_pages or the
>    domains, then unmapping a whole area as normal.
> 
>  - Resource limits rely on memory cgroups to bound what userspace can do
>    instead of the module parameter dma_entry_limit.
> 
>  - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>    require some additional work to properly expose PFN lifecycle between
>    VFIO and iommufd
> 
>  - Various components of the mdev API are not completed yet
>
>  - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>    implemented.
> 
>  - The 'dirty tracking' is not implemented
> 
>  - A full audit for pedantic compatibility details (eg errnos, etc) has
>    not yet been done
> 
>  - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>    framework. My hope is that SPAPR will be moved into the iommu_domain
>    framework as a special HW specific type and would expect power to
>    support the generic interface through a normal iommu_domain.

My overall question here would be whether we can actually achieve a
compatibility interface that has sufficient feature transparency that we
can dump vfio code in favor of this interface, or will there be enough
niche use cases that we need to keep type1 and vfio containers around
through a deprecation process?

The locked memory differences for one seem like something that libvirt
wouldn't want hidden, and we have questions regarding support for vaddr
hijacking and different ideas about how to implement dirty page tracking,
not to mention the missing features that are currently well used, like
p2p mappings, coherency tracking, mdev, etc.

It seems like quite an endeavor to fill all these gaps, while at the
same time QEMU will be working to move to use iommufd directly in order
to gain all the new features.

Where do we focus attention?  Is symlinking device files our proposal
to userspace and is that something achievable, or do we want to use
this compatibility interface as a means to test the interface and
allow userspace to make use of it for transition, if their use cases
allow it, perhaps eventually performing the symlink after deprecation
and eventual removal of the vfio container and type1 code?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-23 20:34           ` Jason Gunthorpe
@ 2022-03-23 22:54             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-23 22:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:

> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it supports this in order to
> support the in-kernel DMA API.

That gives me a nice simple idea:

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 3c6b95ad026829..8366884df4a030 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -8,6 +8,7 @@
 #include <linux/pci.h>
 #include <linux/irqdomain.h>
 #include <linux/dma-iommu.h>
+#include <linux/dma-map-ops.h>
 
 #include "iommufd_private.h"
 
@@ -61,6 +62,10 @@ struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
 	struct iommu_group *group;
 	int rc;
 
+	/* iommufd always uses IOMMU_CACHE */
+	if (!dev_is_dma_coherent(&pdev->dev))
+		return ERR_PTR(-EINVAL);
+
 	ictx = iommufd_fget(fd);
 	if (!ictx)
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84bbc..3d6df1ffbf93e6 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
 	 * We provide no manual cache coherency ioctls to userspace and most
 	 * architectures make the CPU ops for cache flushing privileged.
 	 * Therefore we require the underlying IOMMU to support CPU coherent
-	 * operation.
+	 * operation. Support for IOMMU_CACHE is enforced by the
+	 * dev_is_dma_coherent() test during bind.
 	 */
 	iommu_prot = IOMMU_CACHE;
 	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)

Looking at it I would say all the places that test
IOMMU_CAP_CACHE_COHERENCY can be replaced with dev_is_dma_coherent()
except for the one call in VFIO that is supporting the Intel no-snoop
behavior.

Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
abusing IOMMU_CACHE.
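
A rough sketch of what a map path could look like with those proposed
names (the wrapper function here is invented for illustration;
IOMMU_CAP_ENFORCE_CACHE_COHERENCY and IOMMU_ENFORCE_CACHE are only the
spellings suggested above, not an existing API):

	static int map_with_enforced_coherency(struct iommu_domain *domain,
					       struct bus_type *bus,
					       unsigned long iova,
					       phys_addr_t paddr, size_t size)
	{
		int prot = IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE;

		/*
		 * Only the Intel no-snoop enforcement path would set the
		 * new flag; IOMMU_CACHE support itself is gated by
		 * dev_is_dma_coherent() as in the diff above.
		 */
		if (iommu_capable(bus, IOMMU_CAP_ENFORCE_CACHE_COHERENCY))
			prot |= IOMMU_ENFORCE_CACHE;

		return iommu_map(domain, iova, paddr, size, prot);
	}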

Jason

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-23 22:51     ` Alex Williamson
@ 2022-03-24  0:33       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-24  0:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:

> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?

Other than SPAPR, I think we can.

> The locked memory differences for one seem like something that
> libvirt wouldn't want hidden

I'm first interested in understanding how this change becomes
a real problem in practice that requires libvirt to do something
different for vfio or iommufd. We can discuss that in the other thread.

If this is the make-or-break point then I think we can deal with it
either by going back to what vfio does now or perhaps some other
friendly compat approach.

> and we have questions regarding support for vaddr hijacking

I'm not sure what vaddr hijacking is? Do you mean
VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
implement it in a functionally compatible way without the deadlock
problem. I estimate this as a small project.

> and different ideas how to implement dirty page tracking, 

I don't think this is a compatibility issue. No kernel today triggers qemu to
use this feature as no kernel supports live migration. No existing
qemu will trigger this feature with new kernels that support live
migration v2. Therefore we can adjust qemu's dirty tracking at the
same time we enable migration v2 in qemu.

With Joao's work we are close to having a solid RFC to come with
something that can be fully implemented.

Hopefully we can agree to this soon enough that qemu can come with a
full package of migration v2 support including the dirty tracking
solution.

> not to mention the missing features that are currently well used,
> like p2p mappings, coherency tracking, mdev, etc.

I consider these all mandatory things, they won't be left out.

The reason they are not in the RFC is mostly because supporting them
requires work outside just this iommufd area, and I'd like this series
to remain self-contained.

I've already got a draft to add DMABUF support to VFIO PCI, which
nicely solves the follow_pfn security problem; we want to do this for
another reason already. I'm waiting for some testing feedback before
posting it. I need some help from Daniel to implement the DMABUF revoke
semantics he and I have been talking about. In the worst case we can
copy the follow_pfn approach.

Intel no-snoop is simple enough, just needs some Intel cleanup parts.

mdev will come along with the final VFIO integration, all the really
hard parts are done already. The VFIO integration is a medium sized
task overall.

So, I'm not ready to give up yet :)

> Where do we focus attention?  Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code?  Thanks,

symlinking device files is definitely just a suggested way to expedite
testing.

Things like qemu that are learning to use iommufd-only features should
learn to directly open iommufd instead of vfio container to activate
those features.

Looking long down the road I don't think we want to have type 1 and
iommufd code forever. So, I would like to make an option to compile
out vfio container support entirely and have that option arrange for
iommufd to provide the container device node itself.

I think we can get there pretty quickly, or at least I haven't got
anything that is scaring me a lot (beyond SPAPR of course).

For the dpdk/etcs of the world I think we are already there.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 16:15           ` Jason Gunthorpe via iommu
@ 2022-03-24  2:11             ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  2:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm, Niklas Schnelle,
	Jason Wang, Cornelia Huck, Daniel Jordan, iommu,
	Michael S. Tsirkin, Martins, Joao, David Gibson

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 23, 2022 12:15 AM
> 
> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?
> 
> AFAICT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
> 		struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
> 	rlim = tsk->signal->rlim + resource;
> [..]
> 		if (new_rlim)
> 			*rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit =
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:        unsigned long limit =
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
> 	lock_limit =
> 		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> PAGE_SHIFT;
> 	npages = pages->npinned - pages->last_npinned;
> 	do {
> 		cur_pages = atomic_long_read(&pages->source_user-
> >locked_vm);
> 		new_pages = cur_pages + npages;
> 		if (new_pages > lock_limit)
> 			return -ENOMEM;
> 	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> cur_pages,
> 				     new_pages) != cur_pages);
> 
> So it does work essentially the same.
> 
> The difference is more subtle, iouring/etc puts the charge in the user
> so it is additive with things like iouring and additively spans all
> the users processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.
> 
> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.
> 
> Certainly having qemu concurrently using three different subsystems
> (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.
> 
> I plan to fix RDMA like this as well so at least we can have
> consistency within qemu.
> 

I have an impression that iommufd and vfio type1 must use
the same accounting scheme, given that the management stack
has no insight into which one qemu actually uses and thus
cannot adapt to the subtle difference between them. In this
regard either we start fixing vfio type1 to use user->locked_vm
now or have iommufd follow vfio type1 for upward compatibility
and then change them together at a later point.

I prefer the former, as IMHO I don't know when that later
point would come w/o certain kernel changes actually breaking the
userspace policy built on a wrong accounting scheme...
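
For illustration only (this is not code from either series, and it
leaves out the rlimit checks both sides do), the two charge targets
being compared differ roughly like this:

	/*
	 * vfio type1 today: the charge lands on the pinning process's
	 * mm, so the limit is per-process and private to vfio.
	 */
	static void charge_per_process(struct mm_struct *mm, long npages)
	{
		mmap_write_lock(mm);
		mm->locked_vm += npages;
		mmap_write_unlock(mm);
	}

	/*
	 * iommufd (like io_uring, and rdma after the planned fix): the
	 * charge lands on the user_struct, so pins from all of the
	 * user's processes and subsystems add up against one limit.
	 */
	static void charge_per_user(struct user_struct *user, long npages)
	{
		atomic_long_add(npages, &user->locked_vm);
	}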

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  2:11             ` Tian, Kevin
@ 2022-03-24  2:27               ` Jason Wang
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Wang @ 2022-03-24  2:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm, Niklas Schnelle,
	Cornelia Huck, iommu, Daniel Jordan, Alex Williamson,
	Michael S. Tsirkin, Jason Gunthorpe, Martins, Joao, David Gibson

On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, March 23, 2022 12:15 AM
> >
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> >
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?
> >
> > AFAICT, no, but it should be checked.
> >
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.
> >
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> >
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >               struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> >       rlim = tsk->signal->rlim + resource;
> > [..]
> >               if (new_rlim)
> >                       *rlim = *new_rlim;
> >
> > Which vfio reads back here:
> >
> > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit =
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit =
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >
> > And iommufd does the same read back:
> >
> >       lock_limit =
> >               task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > PAGE_SHIFT;
> >       npages = pages->npinned - pages->last_npinned;
> >       do {
> >               cur_pages = atomic_long_read(&pages->source_user-
> > >locked_vm);
> >               new_pages = cur_pages + npages;
> >               if (new_pages > lock_limit)
> >                       return -ENOMEM;
> >       } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > cur_pages,
> >                                    new_pages) != cur_pages);
> >
> > So it does work essentially the same.
> >
> > The difference is more subtle, iouring/etc puts the charge in the user
> > so it is additive with things like iouring and additively spans all
> > the users processes.
> >
> > However vfio is accounting only per-process and only for itself - no
> > other subsystem uses locked as the charge variable for DMA pins.
> >
> > The user visible difference will be that a limit X that worked with
> > VFIO may start to fail after a kernel upgrade as the charge accounting
> > is now cross user and additive with things like iommufd.
> >
> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> >
> > Certainly having qemu concurrently using three different subsystems
> > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > RLIMIT_MEMLOCK differently cannot be sane or correct.
> >
> > I plan to fix RDMA like this as well so at least we can have
> > consistency within qemu.
> >
>
> I have an impression that iommufd and vfio type1 must use
> the same accounting scheme given the management stack
> has no insight into qemu on which one is actually used thus
> cannot adapt to the subtle difference in between. in this
> regard either we start fixing vfio type1 to use user->locked_vm
> now or have iommufd follow vfio type1 for upward compatibility
> and then change them together at a later point.
>
> I prefer to the former as IMHO I don't know when will be a later
> point w/o certain kernel changes to actually break the userspace
> policy built on a wrong accounting scheme...

I wonder if the kernel is the right place to do this. We have new uAPI,
so the management layer can know the difference in accounting in
advance via

-device vfio-pci,iommufd=on

?

>
> Thanks
> Kevin
>


^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  2:27               ` Jason Wang
@ 2022-03-24  2:42                 ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  2:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, March 24, 2022 10:28 AM
> 
> On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, March 23, 2022 12:15 AM
> > >
> > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > >
> > > > I'm still picking my way through the series, but the later compat
> > > > interface doesn't mention this difference as an outstanding issue.
> > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > resource limits?
> > >
> > > AFAICT, no, but it should be checked.
> > >
> > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > memory limits.
> > >
> > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > >
> > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > >               struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > {
> > >       rlim = tsk->signal->rlim + resource;
> > > [..]
> > >               if (new_rlim)
> > >                       *rlim = *new_rlim;
> > >
> > > Which vfio reads back here:
> > >
> > > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit =
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit =
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >
> > > And iommufd does the same read back:
> > >
> > >       lock_limit =
> > >               task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > PAGE_SHIFT;
> > >       npages = pages->npinned - pages->last_npinned;
> > >       do {
> > >               cur_pages = atomic_long_read(&pages->source_user-
> > > >locked_vm);
> > >               new_pages = cur_pages + npages;
> > >               if (new_pages > lock_limit)
> > >                       return -ENOMEM;
> > >       } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > cur_pages,
> > >                                    new_pages) != cur_pages);
> > >
> > > So it does work essentially the same.
> > >
> > > The difference is more subtle, iouring/etc puts the charge in the user
> > > so it is additive with things like iouring and additively spans all
> > > the users processes.
> > >
> > > However vfio is accounting only per-process and only for itself - no
> > > other subsystem uses locked as the charge variable for DMA pins.
> > >
> > > The user visible difference will be that a limit X that worked with
> > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > is now cross user and additive with things like iommufd.
> > >
> > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > IMHO, but with most of the places doing pins voting to use
> > > user->locked_vm as the charge it seems the right path in today's
> > > kernel.
> > >
> > > Certainly having qemu concurrently using three different subsystems
> > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > >
> > > I plan to fix RDMA like this as well so at least we can have
> > > consistency within qemu.
> > >
> >
> > I have an impression that iommufd and vfio type1 must use
> > the same accounting scheme given the management stack
> > has no insight into qemu on which one is actually used thus
> > cannot adapt to the subtle difference in between. in this
> > regard either we start fixing vfio type1 to use user->locked_vm
> > now or have iommufd follow vfio type1 for upward compatibility
> > and then change them together at a later point.
> >
> > I prefer to the former as IMHO I don't know when will be a later
> > point w/o certain kernel changes to actually break the userspace
> > policy built on a wrong accounting scheme...
> 
> I wonder if the kernel is the right place to do this. We have new uAPI

I didn't get this. This thread is about VFIO using a wrong accounting
scheme, and the discussion is about the impact on userspace of fixing
it. I didn't see where the "right place" question comes in.

> so management layer can know the difference of the accounting in
> advance by
> 
> -device vfio-pci,iommufd=on
> 

I suppose iommufd will be used once Qemu supports it, as long as
the open compatibility issues that Jason/Alex discussed in another
thread are well addressed. It does not necessarily have to be a
control knob exposed to the caller.
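
To make the accounting difference quoted above concrete, here is a
minimal sketch of the two charging schemes. This is simplified
pseudo-kernel code, not the actual type1 or iommufd implementation;
error unwind, CAP_IPC_LOCK handling and the real helper functions are
all omitted:

	/* vfio type1 today (simplified): the charge lives in the pinning
	 * process's mm, private to that process and invisible to any other
	 * subsystem doing FOLL_LONGTERM pins */
	static int type1_style_charge(struct mm_struct *mm, unsigned long npages)
	{
		unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		int rc = 0;

		mmap_write_lock(mm);
		if (mm->locked_vm + npages > limit)
			rc = -ENOMEM;
		else
			mm->locked_vm += npages;
		mmap_write_unlock(mm);
		return rc;
	}

	/* iommufd (and io_uring, and rdma after the planned fix): the charge
	 * lives in the user_struct, so it is additive across subsystems and
	 * across all processes of that user */
	static int user_style_charge(struct user_struct *user, unsigned long npages)
	{
		unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		unsigned long cur, new;

		do {
			cur = atomic_long_read(&user->locked_vm);
			new = cur + npages;
			if (new > limit)
				return -ENOMEM;
		} while (atomic_long_cmpxchg(&user->locked_vm, cur, new) != cur);
		return 0;
	}

The check against RLIMIT_MEMLOCK is the same in both; what changes is
which counter the charge lands in, and therefore which other charges
it sums with.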

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  2:42                 ` Tian, Kevin
@ 2022-03-24  2:57                   ` Jason Wang
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Wang @ 2022-03-24  2:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, March 24, 2022 10:28 AM
> >
> > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > >
> > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > >
> > > > > I'm still picking my way through the series, but the later compat
> > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > > resource limits?
> > > >
> > > > AFACIT, no, but it should be checked.
> > > >
> > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > memory limits.
> > > >
> > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > >
> > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > >               struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > {
> > > >       rlim = tsk->signal->rlim + resource;
> > > > [..]
> > > >               if (new_rlim)
> > > >                       *rlim = *new_rlim;
> > > >
> > > > Which vfio reads back here:
> > > >
> > > > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit =
> > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit =
> > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > >
> > > > And iommufd does the same read back:
> > > >
> > > >       lock_limit =
> > > >               task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > PAGE_SHIFT;
> > > >       npages = pages->npinned - pages->last_npinned;
> > > >       do {
> > > >               cur_pages = atomic_long_read(&pages->source_user-
> > > > >locked_vm);
> > > >               new_pages = cur_pages + npages;
> > > >               if (new_pages > lock_limit)
> > > >                       return -ENOMEM;
> > > >       } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > cur_pages,
> > > >                                    new_pages) != cur_pages);
> > > >
> > > > So it does work essentially the same.
> > > >
> > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > so it is additive with things like iouring and additively spans all
> > > > the users processes.
> > > >
> > > > However vfio is accounting only per-process and only for itself - no
> > > > other subsystem uses locked as the charge variable for DMA pins.
> > > >
> > > > The user visible difference will be that a limit X that worked with
> > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > is now cross user and additive with things like iommufd.
> > > >
> > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > IMHO, but with most of the places doing pins voting to use
> > > > user->locked_vm as the charge it seems the right path in today's
> > > > kernel.
> > > >
> > > > Ceratinly having qemu concurrently using three different subsystems
> > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > >
> > > > I plan to fix RDMA like this as well so at least we can have
> > > > consistency within qemu.
> > > >
> > >
> > > I have an impression that iommufd and vfio type1 must use
> > > the same accounting scheme given the management stack
> > > has no insight into qemu on which one is actually used thus
> > > cannot adapt to the subtle difference in between. in this
> > > regard either we start fixing vfio type1 to use user->locked_vm
> > > now or have iommufd follow vfio type1 for upward compatibility
> > > and then change them together at a later point.
> > >
> > > I prefer to the former as IMHO I don't know when will be a later
> > > point w/o certain kernel changes to actually break the userspace
> > > policy built on a wrong accounting scheme...
> >
> > I wonder if the kernel is the right place to do this. We have new uAPI
>
> I didn't get this. This thread is about that VFIO uses a wrong accounting
> scheme and then the discussion is about the impact of fixing it to the
> userspace.

It's probably too late to fix the kernel considering it may break the userspace.

> I didn't see the question on the right place part.

I meant it would be easier to let userspace know the difference than
to try to fix or work around it in the kernel.
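
As a purely illustrative example of why that matters (all numbers made
up), compare how one RLIMIT_MEMLOCK value behaves under the two
schemes:

	guest RAM pinned via vfio/iommufd :  16 GiB
	io_uring registered buffers       :   2 GiB
	RLIMIT_MEMLOCK                    :  17 GiB

	type1 accounting : 16 GiB (mm->locked_vm) and 2 GiB (user->locked_vm)
	                   are checked independently -> both fit under 17 GiB
	per-user scheme  : 16 GiB + 2 GiB = 18 GiB in user->locked_vm
	                   -> exceeds 17 GiB, the pin fails with -ENOMEM

So the same limit value can stop working after the accounting change,
which is exactly the difference the management layer would have to
know about in advance.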

>
> > so management layer can know the difference of the accounting in
> > advance by
> >
> > -device vfio-pci,iommufd=on
> >
>
> I suppose iommufd will be used once Qemu supports it, as long as
> the compatibility opens that Jason/Alex discussed in another thread
> are well addressed. It is not necessarily to be a control knob exposed
> to the caller.

Doing this has a lot of implications: it means iommufd needs to
inherit all the userspace-noticeable behaviour, as well as the
"bugs", of VFIO.

We know it's easier to point out a difference than to guarantee there
is none.

Thanks

>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-23 18:15       ` Jason Gunthorpe via iommu
@ 2022-03-24  3:09         ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  3:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 2:16 AM
> 
> On Tue, Mar 22, 2022 at 04:15:44PM -0600, Alex Williamson wrote:
> 
> > > +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> > > +		      unsigned long length, struct page **out_pages, bool write)
> > > +{
> > > +	unsigned long cur_iova = iova;
> > > +	unsigned long last_iova;
> > > +	struct iopt_area *area;
> > > +	int rc;
> > > +
> > > +	if (!length)
> > > +		return -EINVAL;
> > > +	if (check_add_overflow(iova, length - 1, &last_iova))
> > > +		return -EOVERFLOW;
> > > +
> > > +	down_read(&iopt->iova_rwsem);
> > > +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> > > +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> > > +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> > > +		unsigned long last_index;
> > > +		unsigned long index;
> > > +
> > > +		/* Need contiguous areas in the access */
> > > +		if (iopt_area_iova(area) < cur_iova || !area->pages) {
> >                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Should this be (cur_iova != iova && iopt_area_iova(area) < cur_iova)?
> 
> Oh good eye
> 
> That is a typo < should be >:
> 
> 		if (iopt_area_iova(area) > cur_iova || !area->pages) {
> 
> There are three boundary conditions here to worry about
>  - interval trees return any nodes that intersect the queried range
>    so the first found node can start after iova
> 
>  - There can be a gap in the intervals
> 
>  - The final area can end short of last_iova
> 
> The last one is botched too and needs this:
>         for ... { ...
> 	}
> +	if (cur_iova != last_iova)
> +		goto out_remove;
> 
> The test suite isn't covering the boundary cases here yet, I added a
> FIXME for this for now.
> 
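
Putting the flipped comparison and the missing post-loop check
together, the loop would look roughly like the sketch below. This is
only pieced together from the quoted RFC code; how the pages are added
and how cur_iova is advanced is elided, and the error code for the new
check is assumed:

	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
	     area = iopt_area_iter_next(area, iova, last_iova)) {
		/*
		 * The interval tree returns any area intersecting the range,
		 * so the first area may start after iova, and there may be
		 * gaps between areas; both show up as an area starting beyond
		 * cur_iova.  Note '>' here, not '<' as in the RFC.
		 */
		if (iopt_area_iova(area) > cur_iova || !area->pages) {
			rc = -EINVAL;
			goto out_remove;
		}
		/* ... add this area's pages, advance cur_iova ... */
	}
	/* the final area may still end short of the requested range */
	if (cur_iova != last_iova) {
		rc = -EINVAL;
		goto out_remove;
	}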

Another nit about the code below:

+		/*
+		 * Can't cross areas that are not aligned to the system page
+		 * size with this API.
+		 */
+		if (cur_iova % PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}

Currently this check is done after iopt_pages_add_user() but before
cur_iova is adjusted, which implies the last add_user() will not be
reverted if the check fails.

I suppose this should be checked at the start of the loop.
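
i.e. something along these lines (a sketch of the suggested ordering
only, not tested):

	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
	     area = iopt_area_iter_next(area, iova, last_iova)) {
		/*
		 * Reject a misaligned cur_iova before any
		 * iopt_pages_add_user() call for this area, so a failure here
		 * never leaves behind a user reference that the unwind path
		 * will not drop.
		 */
		if (cur_iova % PAGE_SIZE) {
			rc = -EINVAL;
			goto out_remove;
		}
		/* ... contiguity check, iopt_pages_add_user(), advance cur_iova ... */
	}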

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  2:57                   ` Jason Wang
@ 2022-03-24  3:15                     ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  3:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, March 24, 2022 10:57 AM
> 
> On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, March 24, 2022 10:28 AM
> > >
> > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > > >
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > >
> > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > >
> > > > > > I'm still picking my way through the series, but the later compat
> > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > Doesn't this difference need to be accounted in how libvirt manages
> VM
> > > > > > resource limits?
> > > > >
> > > > > AFACIT, no, but it should be checked.
> > > > >
> > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > memory limits.
> > > > >
> > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > >
> > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > >               struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > {
> > > > >       rlim = tsk->signal->rlim + resource;
> > > > > [..]
> > > > >               if (new_rlim)
> > > > >                       *rlim = *new_rlim;
> > > > >
> > > > > Which vfio reads back here:
> > > > >
> > > > > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit =
> > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit =
> > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > >
> > > > > And iommufd does the same read back:
> > > > >
> > > > >       lock_limit =
> > > > >               task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > > PAGE_SHIFT;
> > > > >       npages = pages->npinned - pages->last_npinned;
> > > > >       do {
> > > > >               cur_pages = atomic_long_read(&pages->source_user-
> > > > > >locked_vm);
> > > > >               new_pages = cur_pages + npages;
> > > > >               if (new_pages > lock_limit)
> > > > >                       return -ENOMEM;
> > > > >       } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > cur_pages,
> > > > >                                    new_pages) != cur_pages);
> > > > >
> > > > > So it does work essentially the same.
> > > > >
> > > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > > so it is additive with things like iouring and additively spans all
> > > > > the users processes.
> > > > >
> > > > > However vfio is accounting only per-process and only for itself - no
> > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > >
> > > > > The user visible difference will be that a limit X that worked with
> > > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > > is now cross user and additive with things like iommufd.
> > > > >
> > > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > > IMHO, but with most of the places doing pins voting to use
> > > > > user->locked_vm as the charge it seems the right path in today's
> > > > > kernel.
> > > > >
> > > > > Ceratinly having qemu concurrently using three different subsystems
> > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > >
> > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > consistency within qemu.
> > > > >
> > > >
> > > > I have an impression that iommufd and vfio type1 must use
> > > > the same accounting scheme given the management stack
> > > > has no insight into qemu on which one is actually used thus
> > > > cannot adapt to the subtle difference in between. in this
> > > > regard either we start fixing vfio type1 to use user->locked_vm
> > > > now or have iommufd follow vfio type1 for upward compatibility
> > > > and then change them together at a later point.
> > > >
> > > > I prefer to the former as IMHO I don't know when will be a later
> > > > point w/o certain kernel changes to actually break the userspace
> > > > policy built on a wrong accounting scheme...
> > >
> > > I wonder if the kernel is the right place to do this. We have new uAPI
> >
> > I didn't get this. This thread is about that VFIO uses a wrong accounting
> > scheme and then the discussion is about the impact of fixing it to the
> > userspace.
> 
> It's probably too late to fix the kernel considering it may break the userspace.
> 
> > I didn't see the question on the right place part.
> 
> I meant it would be easier to let userspace know the difference than
> trying to fix or workaround in the kernel.

Jason already plans to fix RDMA in the same way, which will also lead
to user-visible impact when Qemu uses both VFIO and RDMA. Why is VFIO
so special that it gets left behind to carry the legacy misdesign?

> 
> >
> > > so management layer can know the difference of the accounting in
> > > advance by
> > >
> > > -device vfio-pci,iommufd=on
> > >
> >
> > I suppose iommufd will be used once Qemu supports it, as long as
> > the compatibility opens that Jason/Alex discussed in another thread
> > are well addressed. It is not necessarily to be a control knob exposed
> > to the caller.
> 
> It has a lot of implications if we do this, it means iommufd needs to
> inherit all the userspace noticeable behaviour as well as the "bugs"
> of VFIO.
> 
> We know it's easier to find the difference than saying no difference.
> 

In the end vfio type1 will be replaced by iommufd compat layer. With
that goal in mind iommufd has to inherit type1 behaviors.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  3:15                     ` Tian, Kevin
@ 2022-03-24  3:50                       ` Jason Wang
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Wang @ 2022-03-24  3:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, March 24, 2022 10:57 AM
> >
> > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, March 24, 2022 10:28 AM
> > > >
> > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@intel.com>
> > wrote:
> > > > >
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > > >
> > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > I'm still picking my way through the series, but the later compat
> > > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > > Doesn't this difference need to be accounted in how libvirt manages
> > VM
> > > > > > > resource limits?
> > > > > >
> > > > > > AFACIT, no, but it should be checked.
> > > > > >
> > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > > memory limits.
> > > > > >
> > > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > > >
> > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > > >               struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > > {
> > > > > >       rlim = tsk->signal->rlim + resource;
> > > > > > [..]
> > > > > >               if (new_rlim)
> > > > > >                       *rlim = *new_rlim;
> > > > > >
> > > > > > Which vfio reads back here:
> > > > > >
> > > > > > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit =
> > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit =
> > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >
> > > > > > And iommufd does the same read back:
> > > > > >
> > > > > >       lock_limit =
> > > > > >               task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > > > PAGE_SHIFT;
> > > > > >       npages = pages->npinned - pages->last_npinned;
> > > > > >       do {
> > > > > >               cur_pages = atomic_long_read(&pages->source_user-
> > > > > > >locked_vm);
> > > > > >               new_pages = cur_pages + npages;
> > > > > >               if (new_pages > lock_limit)
> > > > > >                       return -ENOMEM;
> > > > > >       } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > > cur_pages,
> > > > > >                                    new_pages) != cur_pages);
> > > > > >
> > > > > > So it does work essentially the same.
> > > > > >
> > > > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > > > so it is additive with things like iouring and additively spans all
> > > > > > the users processes.
> > > > > >
> > > > > > However vfio is accounting only per-process and only for itself - no
> > > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > > >
> > > > > > The user visible difference will be that a limit X that worked with
> > > > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > > > is now cross user and additive with things like iommufd.
> > > > > >
> > > > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > > > IMHO, but with most of the places doing pins voting to use
> > > > > > user->locked_vm as the charge it seems the right path in today's
> > > > > > kernel.
> > > > > >
> > > > > > Ceratinly having qemu concurrently using three different subsystems
> > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > > >
> > > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > > consistency within qemu.
> > > > > >
> > > > >
> > > > > I have an impression that iommufd and vfio type1 must use
> > > > > the same accounting scheme given the management stack
> > > > > has no insight into qemu on which one is actually used thus
> > > > > cannot adapt to the subtle difference in between. in this
> > > > > regard either we start fixing vfio type1 to use user->locked_vm
> > > > > now or have iommufd follow vfio type1 for upward compatibility
> > > > > and then change them together at a later point.
> > > > >
> > > > > I prefer to the former as IMHO I don't know when will be a later
> > > > > point w/o certain kernel changes to actually break the userspace
> > > > > policy built on a wrong accounting scheme...
> > > >
> > > > I wonder if the kernel is the right place to do this. We have new uAPI
> > >
> > > I didn't get this. This thread is about that VFIO uses a wrong accounting
> > > scheme and then the discussion is about the impact of fixing it to the
> > > userspace.
> >
> > It's probably too late to fix the kernel considering it may break the userspace.
> >
> > > I didn't see the question on the right place part.
> >
> > I meant it would be easier to let userspace know the difference than
> > trying to fix or workaround in the kernel.
>
> Jason already plans to fix RDMA which will also leads to user-aware
> impact when Qemu uses both VFIO and RDMA. Why is VFIO so special
> and left behind to carry the legacy misdesign?

It's simply because we don't want to break existing userspace. [1]

>
> >
> > >
> > > > so management layer can know the difference of the accounting in
> > > > advance by
> > > >
> > > > -device vfio-pci,iommufd=on
> > > >
> > >
> > > I suppose iommufd will be used once Qemu supports it, as long as
> > > the compatibility opens that Jason/Alex discussed in another thread
> > > are well addressed. It is not necessarily to be a control knob exposed
> > > to the caller.
> >
> > It has a lot of implications if we do this, it means iommufd needs to
> > inherit all the userspace noticeable behaviour as well as the "bugs"
> > of VFIO.
> >
> > We know it's easier to find the difference than saying no difference.
> >
>
> In the end vfio type1 will be replaced by iommufd compat layer. With
> that goal in mind iommufd has to inherit type1 behaviors.

So the compatibility should be provided by the compat layer instead of
the core iommufd.

And I wonder what happens if iommufd is used by other subsystems like
vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do
we need a new per-subsystem uAPI to expose this capability? And if so,
why can't VFIO have such an API, so that we don't even need the compat
layer at all?

Thanks

[1] https://lkml.org/lkml/2012/12/23/75

>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  3:50                       ` Jason Wang
@ 2022-03-24  4:29                         ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  4:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, March 24, 2022 11:51 AM
> 
> > >
> >
> > In the end vfio type1 will be replaced by iommufd compat layer. With
> > that goal in mind iommufd has to inherit type1 behaviors.
> 
> So the compatibility should be provided by the compat layer instead of
> the core iommufd.
> 
> And I wonder what happens if iommufd is used by other subsystems like
> vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do
> we need a per subsystem new uAPI to expose this capability? If yes,
> why can't VFIO have such an API then we don't even need the compat
> layer at all?
> 

No, the compat layer is just for vfio. Other subsystems, including
vdpa, are expected to use the iommufd uAPI directly, apart from having
their own bind_iommufd and attach_ioas uAPIs to build the connection
between the device and iommufd/ioas.
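
For reference, the expected flow for a subsystem using iommufd
directly would be roughly the sketch below. The device path and the
ioctl names are illustrative placeholders only (this uAPI is not
final), the argument structs are elided and there is no error
handling:

	int iommufd = open("/dev/iommu", O_RDWR);
	int devfd = open("/dev/vfio/devices/vfioX", O_RDWR);	/* or a vdpa device */

	/* generic iommufd step: create an IO address space (ioas) */
	ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_args);

	/* subsystem-specific steps: each subsystem provides its own
	 * "bind to iommufd" and "attach to ioas" entry points */
	ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind_args);
	ioctl(devfd, VFIO_DEVICE_ATTACH_IOAS, &attach_args);

	/* DMA mappings are then managed through the generic iommufd ioctls,
	 * not through a subsystem-specific interface */
	ioctl(iommufd, IOMMU_IOAS_MAP, &map_args);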

And having a compat layer for vfio is just for the transition period. Yi has
demonstrated how vfio can follow what other subsystems are
expected to do here:

https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6
(specifically commits related to "cover-letter: Adapting vfio-pci to iommufd")

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-23 20:34           ` Jason Gunthorpe
@ 2022-03-24  6:46             ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  6:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 4:34 AM
> 
> On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> > On Wed, 23 Mar 2022 16:34:39 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > > +static int conv_iommu_prot(u32 map_flags)
> > > > > +{
> > > > > +	int iommu_prot;
> > > > > +
> > > > > +	/*
> > > > > +	 * We provide no manual cache coherency ioctls to userspace
> and most
> > > > > +	 * architectures make the CPU ops for cache flushing
> privileged.
> > > > > +	 * Therefore we require the underlying IOMMU to support
> CPU coherent
> > > > > +	 * operation.
> > > > > +	 */
> > > > > +	iommu_prot = IOMMU_CACHE;
> > > >
> > > > Where is this requirement enforced?  AIUI we'd need to test
> > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > > intel_iommu_map() simply drop the flag when not supported by HW.
> > >
> > > You are right, the correct thing to do is to fail device
> > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > > however we can't do that because Intel abuses the meaning of
> > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop
> behavior is
> > > supported.
> > >
> > > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > > IOMMU_CAP_CACHE_COHERENCY so these things have a consisent
> meaning in
> > > all iommu drivers. Once this is done vfio and iommufd should both
> > > always set IOMMU_CACHE and refuse to work without
> > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of
> an !IOMMU_CACHE
> > > arch that does in fact work today with vfio, somehow, but I don't..)
> >
> > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> > lack of snoop-control support, causing us to have mixed coherent and
> > non-coherent domains.  I don't recall if you go back far enough in VT-d
> > history if the primary IOMMU might have lacked this support.  So I
> > think there are systems we care about with IOMMUs that can't enforce
> > DMA coherency.
> >
> > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and
> all
> > mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all
> mappings
> > are coherent regardless of mapping protection flags?  What's the point
> > of IOMMU_CACHE at that point?
> 
> IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
> change.

One nit (as I explained in the previous v1 discussion): it is not that Intel
abuses this capability, since it was originally introduced for Intel's
force-snoop requirement. It is just that when its meaning was later changed
to match what you describe below, Intel's original use was not caught and
fixed properly. 😊

> 
> It only means normal DMAs issued in a normal way are coherent with the
> CPU and do not require special cache flushing instructions. ie DMA
> issued by a kernel driver using the DMA API.
> 
> It does not mean that non-coherence DMA does not exist, or that
> platform or device specific ways to trigger non-coherence are blocked.
> 
> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it support this in order to
> support the in-kernel DMA API.

This is a good explanation of IOMMU_CACHE. Following that, the intel-iommu
driver should always report this capability and do nothing with
IOMMU_CACHE since coherency is already guaranteed by the arch. Actually
this is exactly what the AMD iommu driver does today.

Then we'll introduce a new cap/prot specifically for enforcing snoop,
as you suggested below.

> 
> Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
> blocking' should be created to cover Intel's special need. From what I
> know it is only implemented in the Intel driver and apparently only
> for some IOMMUs connected to IGD.
> 
> > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > > not working currently, I fixed it.
> >
> > Right, I see it in the comments relative to extensions, but missed in
> > the commit log.  Thanks,
> 
> Oh good, I remembered it was someplace..
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-23 22:54             ` Jason Gunthorpe via iommu
@ 2022-03-24  7:25               ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-24  7:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	Daniel Jordan, iommu, Martins, Joao, David Gibson

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 6:55 AM
> 
> On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:
> 
> > Stated another way, any platform that wires dev_is_dma_coherent() to
> > true, like all x86 does, must support IOMMU_CACHE and report
> > IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> > supports. The platform obviously declares it support this in order to
> > support the in-kernel DMA API.
> 
> That gives me a nice simple idea:
> 
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index 3c6b95ad026829..8366884df4a030 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -8,6 +8,7 @@
>  #include <linux/pci.h>
>  #include <linux/irqdomain.h>
>  #include <linux/dma-iommu.h>
> +#include <linux/dma-map-ops.h>
> 
>  #include "iommufd_private.h"
> 
> @@ -61,6 +62,10 @@ struct iommufd_device
> *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
>  	struct iommu_group *group;
>  	int rc;
> 
> +	/* iommufd always uses IOMMU_CACHE */
> +	if (!dev_is_dma_coherent(&pdev->dev))
> +		return ERR_PTR(-EINVAL);
> +
>  	ictx = iommufd_fget(fd);
>  	if (!ictx)
>  		return ERR_PTR(-EINVAL);
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 48149988c84bbc..3d6df1ffbf93e6 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
>  	 * We provide no manual cache coherency ioctls to userspace and
> most
>  	 * architectures make the CPU ops for cache flushing privileged.
>  	 * Therefore we require the underlying IOMMU to support CPU
> coherent
> -	 * operation.
> +	 * operation. Support for IOMMU_CACHE is enforced by the
> +	 * dev_is_dma_coherent() test during bind.
>  	 */
>  	iommu_prot = IOMMU_CACHE;
>  	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> 
> Looking at it I would say all the places that test
> IOMMU_CAP_CACHE_COHERENCY can be replaced with
> dev_is_dma_coherent()
> except for the one call in VFIO that is supporting the Intel no-snoop
> behavior.
> 
> Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
> IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
> IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
> abusing IOMMU_CACHE.
> 

Based on that here is a quick tweak of the force-snoop part (not compiled).

Several notes:

- IOMMU_CAP_CACHE_COHERENCY is kept as it's checked in vfio's
  group attach interface. Removing it may require a group_is_dma_coherent();

- vdpa is not changed, as force-snoop is only needed for integrated GPUs today,
  which are not managed by vdpa. But adding the snoop support is easy if necessary;

- vfio type1 reports the force-snoop fact to KVM via VFIO_DMA_CC_IOMMU. For
  iommufd the compat layer may leverage that interface, but for non-compat
  usage more thought is required on how it can be reused, or whether a new
  interface is needed between iommufd and KVM. Per earlier discussions
  Paolo prefers to reuse it.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5b196cf..06cca04 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5110,7 +5110,8 @@ static int intel_iommu_map(struct iommu_domain *domain,
 		prot |= DMA_PTE_READ;
 	if (iommu_prot & IOMMU_WRITE)
 		prot |= DMA_PTE_WRITE;
-	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
+	/* nothing to do for IOMMU_CACHE */
+	if ((iommu_prot & IOMMU_SNOOP) && dmar_domain->iommu_snooping)
 		prot |= DMA_PTE_SNP;
 
 	max_addr = iova + size;
@@ -5236,6 +5237,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 static bool intel_iommu_capable(enum iommu_cap cap)
 {
 	if (cap == IOMMU_CAP_CACHE_COHERENCY)
+		return true;
+	if (cap == IOMMU_CAP_FORCE_SNOOP)
 		return domain_update_iommu_snooping(NULL);
 	if (cap == IOMMU_CAP_INTR_REMAP)
 		return irq_remapping_enabled == 1;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9..abc4cfe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2270,6 +2270,9 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
 		domain->prot |= IOMMU_CACHE;
 
+	if (iommu_capable(bus, IOMMU_CAP_FORCE_SNOOP))
+		domain->prot |= IOMMU_SNOOP;
+
 	/*
 	 * Try to match an existing compatible domain.  We don't want to
 	 * preclude an IOMMU driver supporting multiple bus_types and being
@@ -2611,14 +2614,14 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	kfree(iommu);
 }
 
-static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
+static int vfio_domains_have_iommu_snoop(struct vfio_iommu *iommu)
 {
 	struct vfio_domain *domain;
 	int ret = 1;
 
 	mutex_lock(&iommu->lock);
 	list_for_each_entry(domain, &iommu->domain_list, next) {
-		if (!(domain->prot & IOMMU_CACHE)) {
+		if (!(domain->prot & IOMMU_SNOOP)) {
 			ret = 0;
 			break;
 		}
@@ -2641,7 +2644,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
 			return 0;
-		return vfio_domains_have_iommu_cache(iommu);
+		return vfio_domains_have_iommu_snoop(iommu);
 	default:
 		return 0;
 	}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index de0c57a..45184d7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -21,6 +21,8 @@
 #define IOMMU_CACHE	(1 << 2) /* DMA cache coherency */
 #define IOMMU_NOEXEC	(1 << 3)
 #define IOMMU_MMIO	(1 << 4) /* e.g. things like MSI doorbells */
+#define IOMMU_SNOOP	(1 << 5) /* force DMA to snoop */
+
 /*
  * Where the bus hardware includes a privilege level as part of its access type
  * markings, and certain devices are capable of issuing transactions marked as
@@ -106,6 +108,8 @@ enum iommu_cap {
 					   transactions */
 	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
 	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
+	IOMMU_CAP_FORCE_SNOOP,		/* IOMMU forces all transactions to
+					   snoop cache */
 };
 
 /* These are the possible reserved region types */

Thanks
Kevin


^ permalink raw reply related	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-24  0:33       ` Jason Gunthorpe via iommu
@ 2022-03-24  8:13         ` Eric Auger
  -1 siblings, 0 replies; 244+ messages in thread
From: Eric Auger @ 2022-03-24  8:13 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	Kevin Tian, Daniel Jordan, iommu, Joao Martins, David Gibson

Hi,

On 3/24/22 1:33 AM, Jason Gunthorpe wrote:
> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>
>> My overall question here would be whether we can actually achieve a
>> compatibility interface that has sufficient feature transparency that we
>> can dump vfio code in favor of this interface, or will there be enough
>> niche use cases that we need to keep type1 and vfio containers around
>> through a deprecation process?
> Other than SPAPR, I think we can.
>
>> The locked memory differences for one seem like something that
>> libvirt wouldn't want hidden
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
>
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
>
>> and we have questions regarding support for vaddr hijacking
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
>
>> and different ideas how to implement dirty page tracking, 
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.
>
> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
>
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
>
>> not to mention the missing features that are currently well used,
>> like p2p mappings, coherency tracking, mdev, etc.
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. Need some help from Daniel make the DMABUF revoke semantic
> him and I have been talking about. In the worst case can copy the
> follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
>
> So, I'm not ready to give up yet :)
>
>> Where do we focus attention?  Is symlinking device files our proposal
>> to userspace and is that something achievable, or do we want to use
>> this compatibility interface as a means to test the interface and
>> allow userspace to make use of it for transition, if their use cases
>> allow it, perhaps eventually performing the symlink after deprecation
>> and eventual removal of the vfio container and type1 code?  Thanks,
> symlinking device files is definitely just a suggested way to expedite
> testing.
>
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.
>
> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever. So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
I am currently working on migrating the QEMU VFIO device onto the new
API because, after our discussions, the compat mode cannot be used
to implement nesting anyway. I hope I will be able to present
something next week.

Thanks

Eric
>
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.
>
> Jason
>


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24  3:50                       ` Jason Wang
@ 2022-03-24 11:46                         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-24 11:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:

> It's simply because we don't want to break existing userspace. [1]

I'm still waiting to hear what exactly breaks in real systems.

As I explained this is not a significant change, but it could break
something in a few special scenarios.

Also, the one place where we do accept ABI breaks is security, and ulimit is a
security mechanism that isn't working right. So we clearly need to
understand *exactly* what real thing breaks - if anything.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-24  3:09         ` Tian, Kevin
@ 2022-03-24 12:46           ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-24 12:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Thu, Mar 24, 2022 at 03:09:46AM +0000, Tian, Kevin wrote:
> +		/*
> +		 * Can't cross areas that are not aligned to the system page
> +		 * size with this API.
> +		 */
> +		if (cur_iova % PAGE_SIZE) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> 
> Currently it's done after iopt_pages_add_user() but before cur_iova 
> is adjusted, which implies the last add_user() will not be reverted in
> case of failed check here.

Oh, yes that's right too..

The above check is even wrong as written; it didn't get fixed when
page_offset was done.

So more like this:

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 1c08ae9b848fcf..9505f119df982e 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -23,7 +23,7 @@ static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
 	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
 		WARN_ON(iova < iopt_area_iova(area) ||
 			iova > iopt_area_last_iova(area));
-	return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
+	return (iova - (iopt_area_iova(area) - area->page_offset)) / PAGE_SIZE;
 }
 
 static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
@@ -436,31 +436,45 @@ int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		unsigned long index;
 
 		/* Need contiguous areas in the access */
-		if (iopt_area_iova(area) < cur_iova || !area->pages) {
+		if (iopt_area_iova(area) > cur_iova || !area->pages) {
 			rc = -EINVAL;
 			goto out_remove;
 		}
 
 		index = iopt_area_iova_to_index(area, cur_iova);
 		last_index = iopt_area_iova_to_index(area, last);
+
+		/*
+		 * The API can only return aligned pages, so the starting point
+		 * must be at a page boundary.
+		 */
+		if ((cur_iova - (iopt_area_iova(area) - area->page_offset)) %
+		    PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		/*
+		 * and an interior ending point must be at a page boundary
+		 */
+		if (last != last_iova &&
+		    (iopt_area_last_iova(area) - cur_iova + 1) % PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
 		rc = iopt_pages_add_user(area->pages, index, last_index,
 					 out_pages, write);
 		if (rc)
 			goto out_remove;
 		if (last == last_iova)
 			break;
-		/*
-		 * Can't cross areas that are not aligned to the system page
-		 * size with this API.
-		 */
-		if (cur_iova % PAGE_SIZE) {
-			rc = -EINVAL;
-			goto out_remove;
-		}
 		cur_iova = last + 1;
 		out_pages += last_index - index;
 		atomic_inc(&area->num_users);
 	}
+	if (cur_iova != last_iova)
+		goto out_remove;
 
 	up_read(&iopt->iova_rwsem);
 	return 0;
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed9449..a46e0f0ae82553 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1221,5 +1221,6 @@ TEST_F(vfio_compat_mock_domain, get_info)
 /* FIXME test VFIO_IOMMU_MAP_DMA */
 /* FIXME test VFIO_IOMMU_UNMAP_DMA */
 /* FIXME test 2k iova alignment */
+/* FIXME cover boundary cases for iopt_access_pages()  */
 
 TEST_HARNESS_MAIN

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-24  7:25               ` Tian, Kevin
@ 2022-03-24 13:46                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-24 13:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Alex Williamson, Martins, Joao,
	David Gibson

On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:

> Based on that here is a quick tweak of the force-snoop part (not compiled).

I liked your previous idea better: IOMMU_CAP_CACHE_COHERENCY
started out OK but got weird. So let's fix it back to the way it was.

How about this:

https://github.com/jgunthorpe/linux/commits/intel_no_snoop

b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
5263947f9d5f36 vfio: Require that device support DMA cache coherence
eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()

If you like it could you take it from here?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-22 16:15           ` Jason Gunthorpe via iommu
@ 2022-03-24 20:40             ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-24 20:40 UTC (permalink / raw)
  To: Jason Gunthorpe via iommu
  Cc: Jason Gunthorpe, Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Niklas Schnelle, Jason Wang, Cornelia Huck, Kevin Tian,
	Daniel Jordan, Michael S. Tsirkin, Joao Martins, David Gibson,
	Daniel P. Berrangé

On Tue, 22 Mar 2022 13:15:21 -0300
Jason Gunthorpe via iommu <iommu@lists.linux-foundation.org> wrote:

> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?    
> 
> AFACIT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.  
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
> 		struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
> 	rlim = tsk->signal->rlim + resource;
> [..]
> 		if (new_rlim)
> 			*rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
> 	lock_limit =
> 		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 	npages = pages->npinned - pages->last_npinned;
> 	do {
> 		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> 		new_pages = cur_pages + npages;
> 		if (new_pages > lock_limit)
> 			return -ENOMEM;
> 	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> 				     new_pages) != cur_pages);
> 
> So it does work essentially the same.

Well, except for the part about vfio updating mm->locked_vm and iommufd
updating user->locked_vm, a per-process counter versus a per-user
counter.  prlimit specifically sets process resource limits, which get
reflected in task_rlimit.

For example, let's say a user has two 4GB VMs and they're hot-adding
vfio devices to each of them, so libvirt needs to dynamically modify
the locked memory limit for each VM.  AIUI, libvirt would look at the
VM size and call prlimit to set that value.  If libvirt does this to
both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
pages to mm->locked_vm, so this works well.  In the iommufd loop above,
we're comparing a per-task/process limit to a per-user counter.  So I'm
a bit lost how both VMs can pin their pages here.
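
A minimal userspace model of those two schemes for this scenario, with the
limit and counters as plain numbers (purely illustrative, not kernel code;
the try_pin() helper is made up for the example):

#include <stdio.h>
#include <stdbool.h>

#define GiB (1024ULL * 1024 * 1024)

/* Same shape as the rlimit checks quoted above, applied to a given counter */
static bool try_pin(const char *who, unsigned long long bytes,
		    unsigned long long rlimit, unsigned long long *counter)
{
	if (*counter + bytes > rlimit) {
		printf("%s: pin fails (-ENOMEM)\n", who);
		return false;
	}
	*counter += bytes;
	printf("%s: pin ok, counter now %llu GiB\n", who, *counter / GiB);
	return true;
}

int main(void)
{
	unsigned long long rlimit = 4 * GiB;		/* prlimit set per VM process */
	unsigned long long vm1_mm = 0, vm2_mm = 0;	/* per-process mm->locked_vm style */
	unsigned long long user_locked = 0;		/* shared per-user locked_vm style */

	/* vfio type1 style: each VM charges only its own mm counter - both pass */
	try_pin("VM1 (mm)", 4 * GiB, rlimit, &vm1_mm);
	try_pin("VM2 (mm)", 4 * GiB, rlimit, &vm2_mm);

	/* iommufd style: both VMs charge the same user counter - the second fails */
	try_pin("VM1 (user)", 4 * GiB, rlimit, &user_locked);
	try_pin("VM2 (user)", 4 * GiB, rlimit, &user_locked);
	return 0;
}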

Am I missing some assumption about how libvirt uses prlimit or
sandboxes users?

> The difference is more subtle, iouring/etc puts the charge in the user
> so it is additive with things like iouring and additively spans all
> the users processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.

And that's exactly the concern.
 
> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.

The philosophy of whether it's ultimately a better choice for the
kernel aside, if userspace breaks because we're accounting in a
per-user pool rather than a per-process pool, then our compatibility
layer ain't so transparent.

> Ceratinly having qemu concurrently using three different subsystems
> (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.

I think everyone would agree with that, but it also seems there are
real differences between task_rlimits and per-user vs per-process
accounting buckets and I'm confused how that's not a blocker for trying
to implement transparent compatibility.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-24  0:33       ` Jason Gunthorpe via iommu
@ 2022-03-24 22:04         ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-24 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, 23 Mar 2022 21:33:42 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> 
> > My overall question here would be whether we can actually achieve a
> > compatibility interface that has sufficient feature transparency that we
> > can dump vfio code in favor of this interface, or will there be enough
> > niche use cases that we need to keep type1 and vfio containers around
> > through a deprecation process?  
> 
> Other than SPAPR, I think we can.

Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
for POWER support?

> > The locked memory differences for one seem like something that
> > libvirt wouldn't want hidden  
> 
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
> 
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
> 
> > and we have questions regarding support for vaddr hijacking  
> 
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
> 
> > and different ideas how to implement dirty page tracking,   
> 
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.

I guess I was assuming that enabling v2 migration in QEMU was dependent
on the existing type1 dirty tracking because it's the only means we
have to tell QEMU that all memory is perpetually dirty when we have a
DMA device.  Is that not correct?  If we don't intend to carry type1
dirty tracking into iommufd compatibility and we need it for this
purpose, then our window for being able to rip it out entirely closes
when QEMU gains v2 migration support.

> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
> 
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
> 
> > not to mention the missing features that are currently well used,
> > like p2p mappings, coherency tracking, mdev, etc.  
> 
> I consider these all mandatory things, they won't be left out.
> 
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
> 
> I've already got a draft to add DMABUF support to VFIO PCI, which
> nicely solves the follow_pfn security problem; we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. I need some help from Daniel to make the DMABUF revoke
> semantic that he and I have been talking about work. In the worst case
> we can copy the follow_pfn approach.
> 
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
> 
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
> 
> So, I'm not ready to give up yet :)

Ok, that's a more promising outlook than I was inferring from the long
list of missing features.

> > Where do we focus attention?  Is symlinking device files our proposal
> > to userspace and is that something achievable, or do we want to use
> > this compatibility interface as a means to test the interface and
> > allow userspace to make use of it for transition, if their use cases
> > allow it, perhaps eventually performing the symlink after deprecation
> > and eventual removal of the vfio container and type1 code?  Thanks,  
> 
> symlinking device files is definitely just a suggested way to expedite
> testing.
> 
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.

Which is kind of the basis for my question: QEMU is racing for native
support (Eric and Yi are already working on this), so some of these
compatibility interfaces might only have short-term usefulness.

> Looking far down the road, I don't think we want to have type 1 and
> iommufd code forever.

Agreed.

> So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
> 
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me a lot (beyond SPAPR of course)
> 
> For the dpdk/etcs of the world I think we are already there.

That's essentially what I'm trying to reconcile, we're racing both
to round out the compatibility interface to fully support QEMU, while
also updating QEMU to use iommufd directly so it won't need that full
support.  It's a confusing message.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24 20:40             ` Alex Williamson
@ 2022-03-24 22:27               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-24 22:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe via iommu, Jean-Philippe Brucker,
	Chaitanya Kulkarni, kvm, Niklas Schnelle, Jason Wang,
	Cornelia Huck, Kevin Tian, Daniel Jordan, Michael S. Tsirkin,
	Joao Martins, David Gibson, Daniel P. Berrangé

On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote:
> On Tue, 22 Mar 2022 13:15:21 -0300
> Jason Gunthorpe via iommu <iommu@lists.linux-foundation.org> wrote:
> 
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > 
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?    
> > 
> > AFAICT, no, but it should be checked.
> > 
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.  
> > 
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> > 
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > 		struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> > 	rlim = tsk->signal->rlim + resource;
> > [..]
> > 		if (new_rlim)
> > 			*rlim = *new_rlim;
> > 
> > Which vfio reads back here:
> > 
> > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > 
> > And iommufd does the same read back:
> > 
> > 	lock_limit =
> > 		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > 	npages = pages->npinned - pages->last_npinned;
> > 	do {
> > 		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > 		new_pages = cur_pages + npages;
> > 		if (new_pages > lock_limit)
> > 			return -ENOMEM;
> > 	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> > 				     new_pages) != cur_pages);
> > 
> > So it does work essentially the same.
> 
> Well, except for the part about vfio updating mm->locked_vm and iommufd
> updating user->locked_vm, a per-process counter versus a per-user
> counter.  prlimit specifically sets process resource limits, which get
> reflected in task_rlimit.

Indeed, but that is not how the majority of other things seem to
operate it.

> For example, let's say a user has two 4GB VMs and they're hot-adding
> vfio devices to each of them, so libvirt needs to dynamically modify
> the locked memory limit for each VM.  AIUI, libvirt would look at the
> VM size and call prlimit to set that value.  If libvirt does this to
> both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
> pages to mm->locked_vm, so this works well.  In the iommufd loop above,
> we're comparing a per-task/process limit to a per-user counter.  So I'm
> a bit lost how both VMs can pin their pages here.

I don't know anything about libvirt - it seems strange to use a
security-ish feature like ulimit but not securely isolate processes
with real users.

But if it really does this then it really does this.

So at the very least VFIO container has to keep working this way.

The next question is whether we want iommufd's own device node to work this
way and try to change libvirt somehow. It seems libvirt will have to
deal with this at some point anyway, as io_uring will trigger the same problem.
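
To make the accounting difference concrete, here is a minimal sketch of the
two charging models being debated. This is not code from the series; the
charge_*() helpers are invented for illustration. vfio type1 charges pinned
pages to the process via mm->locked_vm, while this RFC charges them to the
user via user_struct::locked_vm, so two 4GB VMs run by the same user draw
from one shared pool and the second VM's pins can fail even though each
task's RLIMIT_MEMLOCK allows 4GB:

#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/sched/user.h>

/* Per-process model (what vfio type1 does): charge mm->locked_vm. */
static int charge_mm_locked_vm(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	int rc = 0;

	mmap_write_lock(mm);
	if (mm->locked_vm + npages > limit)
		rc = -ENOMEM;		/* only this process's pins count */
	else
		mm->locked_vm += npages;
	mmap_write_unlock(mm);
	return rc;
}

/* Per-user model (what this RFC does): charge user_struct::locked_vm. */
static int charge_user_locked_vm(struct user_struct *user,
				 unsigned long npages, unsigned long lock_limit)
{
	unsigned long cur_pages, new_pages;

	do {
		cur_pages = atomic_long_read(&user->locked_vm);
		new_pages = cur_pages + npages;
		if (new_pages > lock_limit)
			return -ENOMEM;	/* every process of this user shares the pool */
	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages, new_pages) !=
		 cur_pages);
	return 0;
}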

> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> 
> The philosophy of whether it's ultimately a better choice for the
> kernel aside, if userspace breaks because we're accounting in a
> per-user pool rather than a per-process pool, then our compatibility
> layer ain't so transparent.

Sure, if it doesn't work it doesn't work. Let's be sure and clearly
document what the compatibility issue is, and then we have to keep it
per-process.

And the same reasoning likely means I can't change RDMA either as qemu
will break just as well when qemu uses rdma mode.

Which is pretty sucky, but it is what it is..

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24 22:27               ` Jason Gunthorpe via iommu
@ 2022-03-24 22:41                 ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-24 22:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni,
	Daniel P. Berrangé,
	kvm, Niklas Schnelle, Jason Wang, Cornelia Huck, Kevin Tian,
	Daniel Jordan, Jason Gunthorpe via iommu, Michael S. Tsirkin,
	Joao Martins, David Gibson

On Thu, 24 Mar 2022 19:27:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote:
> > On Tue, 22 Mar 2022 13:15:21 -0300
> > Jason Gunthorpe via iommu <iommu@lists.linux-foundation.org> wrote:
> >   
> > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > >   
> > > > I'm still picking my way through the series, but the later compat
> > > > interface doesn't mention this difference as an outstanding issue.
> > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > resource limits?      
> > > 
> > > AFAICT, no, but it should be checked.
> > >   
> > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > memory limits.    
> > > 
> > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > 
> > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > 		struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > {
> > > 	rlim = tsk->signal->rlim + resource;
> > > [..]
> > > 		if (new_rlim)
> > > 			*rlim = *new_rlim;
> > > 
> > > Which vfio reads back here:
> > > 
> > > drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > drivers/vfio/vfio_iommu_type1.c:        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > 
> > > And iommufd does the same read back:
> > > 
> > > 	lock_limit =
> > > 		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > 	npages = pages->npinned - pages->last_npinned;
> > > 	do {
> > > 		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > 		new_pages = cur_pages + npages;
> > > 		if (new_pages > lock_limit)
> > > 			return -ENOMEM;
> > > 	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> > > 				     new_pages) != cur_pages);
> > > 
> > > So it does work essentially the same.  
> > 
> > Well, except for the part about vfio updating mm->locked_vm and iommufd
> > updating user->locked_vm, a per-process counter versus a per-user
> > counter.  prlimit specifically sets process resource limits, which get
> > reflected in task_rlimit.  
> 
> Indeed, but that is not how the majority of other things seem to
> operate it.
> 
> > For example, let's say a user has two 4GB VMs and they're hot-adding
> > vfio devices to each of them, so libvirt needs to dynamically modify
> > the locked memory limit for each VM.  AIUI, libvirt would look at the
> > VM size and call prlimit to set that value.  If libvirt does this to
> > both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
> > pages to mm->locked_vm, so this works well.  In the iommufd loop above,
> > we're comparing a per-task/process limit to a per-user counter.  So I'm
> > a bit lost how both VMs can pin their pages here.  
> 
> I don't know anything about libvirt - it seems strange to use a
> security-ish feature like ulimit but not securely isolate processes
> with real users.
> 
> But if it really does this then it really does this.
> 
> So at the very least VFIO container has to keep working this way.
> 
> The next question is whether we want iommufd's own device node to work this
> way and try to change libvirt somehow. It seems libvirt will have to
> deal with this at some point anyway, as io_uring will trigger the same problem.
> 
> > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > IMHO, but with most of the places doing pins voting to use
> > > user->locked_vm as the charge it seems the right path in today's
> > > kernel.  
> > 
> > The philosophy of whether it's ultimately a better choice for the
> > kernel aside, if userspace breaks because we're accounting in a
> > per-user pool rather than a per-process pool, then our compatibility
> > layer ain't so transparent.  
> 
> Sure, if it doesn't work it doesn't work. Let's be sure and clearly
> document what the compatibility issue is, and then we have to keep it
> per-process.
> 
> And the same reasoning likely means I can't change RDMA either as qemu
> will break just as well when qemu uses rdma mode.
> 
> Which is pretty sucky, but it is what it is..

I added Daniel Berrangé to the cc list for my previous reply, hopefully
he can comment whether libvirt has the sort of user security model you
allude to above that maybe makes this a non-issue for this use case.
Unfortunately it's extremely difficult to prove that there are no such
use cases out there even if libvirt is ok.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-24 22:04         ` Alex Williamson
@ 2022-03-24 23:11           ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-24 23:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > 
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?  
> > 
> > Other than SPAPR, I think we can.
> 
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

Certainly initially - I have no ability to do better than that.

I'm hoping someone from IBM will be willing to work on this in the
long run and we can do better.

> > I don't think this is compatibility. No kernel today triggers qemu to
> > use this feature as no kernel supports live migration. No existing
> > qemu will trigger this feature with new kernels that support live
> > migration v2. Therefore we can adjust qemu's dirty tracking at the
> > same time we enable migration v2 in qemu.
> 
> I guess I was assuming that enabling v2 migration in QEMU was dependent
> on the existing type1 dirty tracking because it's the only means we
> have to tell QEMU that all memory is perpetually dirty when we have a
> DMA device.  Is that not correct?

I haven't looked closely at this part in qemu, but IMHO, if qemu sees
that it has VFIO migration support but does not have any DMA dirty
tracking capability it should not do precopy flows.

If there is a bug here we should certainly fix it before progressing
the v2 patches. I'll ask Yishai & Co to take a look.

> > Intel no-snoop is simple enough, just needs some Intel cleanup parts.

Patches for this exist now
 
> > mdev will come along with the final VFIO integration, all the really
> > hard parts are done already. The VFIO integration is a medium sized
> > task overall.
> > 
> > So, I'm not ready to give up yet :)
> 
> Ok, that's a more promising outlook than I was inferring from the long
> list of missing features.

Yeah, it is just long, but they are not scary things, just priorities
and patch planning.

> > I think we can get there pretty quickly, or at least I haven't got
> > anything that is scaring me a lot (beyond SPAPR of course)
> > 
> > For the dpdk/etcs of the world I think we are already there.
> 
> That's essentially what I'm trying to reconcile, we're racing both
> to round out the compatibility interface to fully support QEMU, while
> also updating QEMU to use iommufd directly so it won't need that full
> support.  It's a confusing message.  Thanks,

The long term purpose of compatibility is to provide a config option
to allow type 1 to be turned off and continue to support old user
space (eg in containers) that is running old qemu/dpdk/spdk/etc.

This shows that we have a plan/path to allow a distro to support only
one iommu interface in their kernel should they choose without having
to sacrifice uABI compatibility.
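
As a rough illustration of that end state, a sketch of the idea only and not
from this series (CONFIG_VFIO_CONTAINER, iommufd_fops_open() and
iommufd_vfio_ioctl() are made-up names here; the miscdevice registration
mirrors how vfio exposes /dev/vfio/vfio today): iommufd could register the
same container device node and answer the type1 ioctls through its compat
layer when the legacy container is compiled out:

#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/module.h>

/* Hypothetical handlers provided by iommufd core and its vfio-compat code */
extern int iommufd_fops_open(struct inode *inode, struct file *filp);
extern long iommufd_vfio_ioctl(struct file *filp, unsigned int cmd,
			       unsigned long arg);

static const struct file_operations iommufd_vfio_compat_fops = {
	.owner = THIS_MODULE,
	.open = iommufd_fops_open,		/* creates an iommufd context */
	.unlocked_ioctl = iommufd_vfio_ioctl,	/* serves the type1 ioctls */
};

static struct miscdevice iommufd_vfio_compat_dev = {
	.minor = VFIO_MINOR,			/* same /dev/vfio/vfio node */
	.name = "vfio",
	.nodename = "vfio/vfio",
	.mode = 0666,
	.fops = &iommufd_vfio_compat_fops,
};

static int __init iommufd_vfio_compat_init(void)
{
	/* Only stand in for the container when type1 is configured out. */
	if (IS_ENABLED(CONFIG_VFIO_CONTAINER))	/* hypothetical Kconfig symbol */
		return 0;
	return misc_register(&iommufd_vfio_compat_dev);
}
module_init(iommufd_vfio_compat_init);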

As for racing, my intention is to leave the compat interface alone for
a while - the more urgent things on my personal list are the RFC
for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
for iommufd support.

Eric and Yi are focusing on userspace page tables and qemu updates.

Joao is working on implementing iommu driver dirty tracking.

Lu and Jacob are working on getting PASID support infrastructure
together.

There is a lot going on!

A question to consider is what would you consider the minimum bar for
merging?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-24 13:46                 ` Jason Gunthorpe
@ 2022-03-25  2:15                   ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-25  2:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 24, 2022 9:46 PM
> 
> On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled).
> 
> I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So let's fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of
> IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its
> original meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY
> with dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 

This looks good to me, except that the 2nd patch (eab4b381) should be
the last one, otherwise it affects bisectability. And in that case the subject
would simply be about removing the capability instead of restoring it...

Let me find a box to verify it.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-24 23:11           ` Jason Gunthorpe via iommu
@ 2022-03-25  3:10             ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-25  3:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 25, 2022 7:12 AM
> 
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> > That's essentially what I'm trying to reconcile, we're racing both
> > to round out the compatibility interface to fully support QEMU, while
> > also updating QEMU to use iommufd directly so it won't need that full
> > support.  It's a confusing message.  Thanks,
> 
> The long term purpose of compatibility is to provide a config option
> to allow type 1 to be turned off and continue to support old user
> space (eg in containers) that is running old qemu/dpdk/spdk/etc.
> 
> This shows that we have a plan/path to allow a distro to support only
> one iommu interface in their kernel should they choose without having
> to sacrifice uABI compatibility.
> 
> As for racing, my intention is to leave the compat interface alone for
> a while - the more urgent things on my personal list are the RFC
> for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
> for iommufd support.
> 
> Eric and Yi are focusing on userspace page tables and qemu updates.
> 
> Joao is working on implementing iommu driver dirty tracking.
> 
> Lu and Jacob are working on getting PASID support infrastructure
> together.
> 
> There is a lot going on!
> 
> A question to consider is what would you consider the minimum bar for
> merging?
> 

My two cents. 😊

IMHO, making the compat work a task that runs in parallel with the other
work listed above is the most efficient approach to move forward. In concept
they are not mutually dependent, since they use different sets of uAPIs (vfio
compat vs. iommufd native). Otherwise, considering the list of TODOs,
the compat work will become a single big task gating everything else.

If agreed, this suggests we may want to prioritize Yi's vfio device uAPI [1]
for integrating vfio with iommufd to get this series merged. IIRC there are
fewer opens remaining from the v1 discussion compared to the list for the
compat interface. Of course it needs the QEMU change to use iommufd
directly to be ready, but that is necessary to unblock other tasks anyway.

[1] https://github.com/luxis1999/iommufd/commit/2d9278d4ecad7953b3787c98cdb650764af8a1a1

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-24 23:11           ` Jason Gunthorpe via iommu
@ 2022-03-25 11:24             ` Joao Martins
  -1 siblings, 0 replies; 244+ messages in thread
From: Joao Martins @ 2022-03-25 11:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Kevin Tian, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On 3/24/22 23:11, Jason Gunthorpe wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>> I don't think this is compatibility. No kernel today triggers qemu to
>>> use this feature as no kernel supports live migration. No existing
>>> qemu will trigger this feature with new kernels that support live
>>> migration v2. Therefore we can adjust qemu's dirty tracking at the
>>> same time we enable migration v2 in qemu.
>>
>> I guess I was assuming that enabling v2 migration in QEMU was dependent
>> on the existing type1 dirty tracking because it's the only means we
>> have to tell QEMU that all memory is perpetually dirty when we have a
>> DMA device.  Is that not correct?
> 
> I haven't looked closely at this part in qemu, but IMHO, if qemu sees
> that it has VFIO migration support but does not have any DMA dirty
> tracking capability it should not do precopy flows.
> 
> If there is a bug here we should certainly fix it before progressing
> the v2 patches. I'll ask Yishai & Co to take a look.

I think that's already the case.

With regard to VFIO IOMMU type1, the kernel always exports a migration
capability and the page sizes it supports. In the VMM, if that matches the
page size QEMU is using (PAGE_SIZE on x86), QEMU determines it will /use/ the
vfio container ioctls. Which, well, I guess is always the case if the ioctl is
there, considering we dirty every page.

In QEMU, the start and stop of dirty tracking is actually unconditional (it
attempts them without checking if the capability is there), although for
syncing the dirty bits from vfio into QEMU's private tracking it does check
that dirty page tracking is supported before even trying the sync via the
ioctl. /Most importantly/, prior to all of this starting/stopping/syncing of
dirty tracking, QEMU adds a live migration blocker if either the device
doesn't support migration or the VFIO container doesn't support it (so
migration won't even start). So as far as my understanding goes, the VMM
knows how to deal with the lack of the dirty container ioctls.
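
For reference, a minimal sketch of the probe a VMM can do before relying on
type1 dirty tracking (error handling trimmed; container_supports_dirty_tracking()
is an invented name, while the ioctl, flag, and struct names are taken from the
type1 uAPI in <linux/vfio.h> and should be checked against the headers). If the
capability is missing, or it cannot track at the VMM's page size, the VMM should
add a migration blocker instead:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool container_supports_dirty_tracking(int container_fd, __u64 vmm_pgsize)
{
	struct vfio_iommu_type1_info *info;
	struct vfio_info_cap_header *hdr;
	bool supported = false;
	__u32 argsz = sizeof(*info);

	/* First call learns the full size including the capability chain. */
	info = calloc(1, argsz);
	info->argsz = argsz;
	ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);

	argsz = info->argsz;
	info = realloc(info, argsz);
	memset(info, 0, argsz);
	info->argsz = argsz;
	ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);

	if (!(info->flags & VFIO_IOMMU_INFO_CAPS) || !info->cap_offset)
		goto out;

	/* Walk the capability chain looking for the migration (dirty) cap. */
	for (hdr = (void *)((char *)info + info->cap_offset); ;
	     hdr = (void *)((char *)info + hdr->next)) {
		if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION) {
			struct vfio_iommu_type1_info_cap_migration *mig =
				(void *)hdr;

			/* Only usable if it can track at the VMM's page size. */
			supported = !!(mig->pgsize_bitmap & vmm_pgsize);
			break;
		}
		if (!hdr->next)
			break;
	}
out:
	free(info);
	return supported;
}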

TBH, I am not overly concerned with dirty page tracking in vfio-compat layer --
I have been doing both in tandem (old and new). We mainly need to decide what do
we wanna maintain in the compat layer. I can drop that IOMMU support code I have
from vfio-compat or we do the 'perpectual dirtying' that current does, or not
support the dirty ioctls in vfio-compat at all. Maybe the latter makes more sense,
as that might mimmic more accurately what hardware supports, and deprive VMMs from
even starting migration. The second looks useful for testing, but doing dirty of all
DMA-mapped memory seems to be too much in a real world migration scenario :(
specially as the guest size increases.

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-25 13:34     ` zhangfei.gao
  -1 siblings, 0 replies; 244+ messages in thread
From: zhangfei.gao @ 2022-03-25 13:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

Hi, Jason

On 2022/3/19 1:27 AM, Jason Gunthorpe via iommu wrote:
> This is the remainder of the IOAS data structure. Provide an object called
> an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
> along with a list of iommu_domains that mirror the IOVA to PFN map.
>
> At the top this is a simple interval tree of iopt_areas indicating the map
> of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
> on the attached domains there is a minimum alignment for areas (which may
> be smaller than PAGE_SIZE) and an interval tree of reserved IOVA that
> can't be mapped.
>
> The concept of a 'user' refers to something like a VFIO mdev that is
> accessing the IOVA and using a 'struct page *' for CPU based access.
>
> Externally an API is provided that matches the requirements of the IOCTL
> interface for map/unmap and domain attachment.
>
> The API provides a 'copy' primitive to establish a new IOVA map in a
> different IOAS from an existing mapping.
>
> This is designed to support a pre-registration flow where userspace would
> set up a dummy IOAS with no domains, map in memory and then establish a
> user to pin all PFNs into the xarray.
>
> Copy can then be used to create new IOVA mappings in a different IOAS,
> with iommu_domains attached. Upon copy the PFNs will be read out of the
> xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
> overheads.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommufd/Makefile          |   1 +
>   drivers/iommu/iommufd/io_pagetable.c    | 890 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  35 +
>   3 files changed, 926 insertions(+)
>   create mode 100644 drivers/iommu/iommufd/io_pagetable.c
>
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index 05a0e91e30afad..b66a8c47ff55ec 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   iommufd-y := \
> +	io_pagetable.o \
>   	main.o \
>   	pages.o
>   
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> new file mode 100644
> index 00000000000000..f9f3b06946bfb9
> --- /dev/null
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -0,0 +1,890 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
> + *
> + * The io_pagetable is the top of datastructure that maps IOVA's to PFNs. The
> + * PFNs can be placed into an iommu_domain, or returned to the caller as a page
> + * list for access by an in-kernel user.
> + *
> + * The datastructure uses the iopt_pages to optimize the storage of the PFNs
> + * between the domains and xarray.
> + */
> +#include <linux/lockdep.h>
> +#include <linux/iommu.h>
> +#include <linux/sched/mm.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/errno.h>
> +
> +#include "io_pagetable.h"
> +
> +static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
> +					     unsigned long iova)
> +{
> +	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> +		WARN_ON(iova < iopt_area_iova(area) ||
> +			iova > iopt_area_last_iova(area));
> +	return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
> +}
> +
> +static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
> +					      unsigned long iova,
> +					      unsigned long last_iova)
> +{
> +	struct iopt_area *area;
> +
> +	area = iopt_area_iter_first(iopt, iova, last_iova);
> +	if (!area || !area->pages || iopt_area_iova(area) != iova ||
> +	    iopt_area_last_iova(area) != last_iova)
> +		return NULL;
> +	return area;
> +}
> +
> +static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
> +				    unsigned long length,
> +				    unsigned long iova_alignment,
> +				    unsigned long page_offset)
> +{
> +	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
> +		return false;
> +
> +	span->start_hole =
> +		ALIGN(span->start_hole, iova_alignment) | page_offset;
> +	if (span->start_hole > span->last_hole ||
> +	    span->last_hole - span->start_hole < length - 1)
> +		return false;
> +	return true;
> +}
> +
> +/*
> + * Automatically find a block of IOVA that is not being used and not reserved.
> + * Does not return a 0 IOVA even if it is valid.
> + */
> +static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
> +			   unsigned long uptr, unsigned long length)
> +{
> +	struct interval_tree_span_iter reserved_span;
> +	unsigned long page_offset = uptr % PAGE_SIZE;
> +	struct interval_tree_span_iter area_span;
> +	unsigned long iova_alignment;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	/* Protect roundup_pow_of_two() from overflow */
> +	if (length == 0 || length >= ULONG_MAX / 2)
> +		return -EOVERFLOW;
> +
> +	/*
> +	 * Keep alignment present in the uptr when building the IOVA, this
> +	 * increases the chance we can map a THP.
> +	 */
> +	if (!uptr)
> +		iova_alignment = roundup_pow_of_two(length);
> +	else
> +		iova_alignment =
> +			min_t(unsigned long, roundup_pow_of_two(length),
> +			      1UL << __ffs64(uptr));
> +
> +	if (iova_alignment < iopt->iova_alignment)
> +		return -EINVAL;
> +	for (interval_tree_span_iter_first(&area_span, &iopt->area_itree,
> +					   PAGE_SIZE, ULONG_MAX - PAGE_SIZE);
> +	     !interval_tree_span_iter_done(&area_span);
> +	     interval_tree_span_iter_next(&area_span)) {
> +		if (!__alloc_iova_check_hole(&area_span, length, iova_alignment,
> +					     page_offset))
> +			continue;
> +
> +		for (interval_tree_span_iter_first(
> +			     &reserved_span, &iopt->reserved_iova_itree,
> +			     area_span.start_hole, area_span.last_hole);
> +		     !interval_tree_span_iter_done(&reserved_span);
> +		     interval_tree_span_iter_next(&reserved_span)) {
> +			if (!__alloc_iova_check_hole(&reserved_span, length,
> +						     iova_alignment,
> +						     page_offset))
> +				continue;
> +
> +			*iova = reserved_span.start_hole;
> +			return 0;
> +		}
> +	}
> +	return -ENOSPC;
> +}
> +
> +/*
> + * The area takes a slice of the pages from start_bytes to start_byte + length
> + */
> +static struct iopt_area *
> +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		unsigned long iova, unsigned long start_byte,
> +		unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> +	if (!area)
> +		return ERR_PTR(-ENOMEM);
> +
> +	area->iopt = iopt;
> +	area->iommu_prot = iommu_prot;
> +	area->page_offset = start_byte % PAGE_SIZE;
> +	area->pages_node.start = start_byte / PAGE_SIZE;
> +	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
> +		return ERR_PTR(-EOVERFLOW);
> +	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
> +	if (WARN_ON(area->pages_node.last >= pages->npages))
> +		return ERR_PTR(-EOVERFLOW);
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (flags & IOPT_ALLOC_IOVA) {
> +		rc = iopt_alloc_iova(iopt, &iova,
> +				     (uintptr_t)pages->uptr + start_byte,
> +				     length);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	if (check_add_overflow(iova, length - 1, &area->node.last)) {
> +		rc = -EOVERFLOW;
> +		goto out_unlock;
> +	}
> +
> +	if (!(flags & IOPT_ALLOC_IOVA)) {
> +		if ((iova & (iopt->iova_alignment - 1)) ||
> +		    (length & (iopt->iova_alignment - 1)) || !length) {
> +			rc = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* No reserved IOVA intersects the range */
> +		if (interval_tree_iter_first(&iopt->reserved_iova_itree, iova,
> +					     area->node.last)) {
> +			rc = -ENOENT;
> +			goto out_unlock;
> +		}
> +
> +		/* Check that there is not already a mapping in the range */
> +		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
> +			rc = -EADDRINUSE;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The area is inserted with a NULL pages indicating it is not fully
> +	 * initialized yet.
> +	 */
> +	area->node.start = iova;
> +	interval_tree_insert(&area->node, &area->iopt->area_itree);
> +	up_write(&iopt->iova_rwsem);
> +	return area;
> +
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	kfree(area);
> +	return ERR_PTR(rc);
> +}
> +
> +static void iopt_abort_area(struct iopt_area *area)
> +{
> +	down_write(&area->iopt->iova_rwsem);
> +	interval_tree_remove(&area->node, &area->iopt->area_itree);
> +	up_write(&area->iopt->iova_rwsem);
> +	kfree(area);
> +}
> +
> +static int iopt_finalize_area(struct iopt_area *area, struct iopt_pages *pages)
> +{
> +	int rc;
> +
> +	down_read(&area->iopt->domains_rwsem);
> +	rc = iopt_area_fill_domains(area, pages);
> +	if (!rc) {
> +		/*
> +		 * area->pages must be set inside the domains_rwsem to ensure
> +		 * any newly added domains will get filled. Moves the reference
> +		 * in from the caller
> +		 */
> +		down_write(&area->iopt->iova_rwsem);
> +		area->pages = pages;
> +		up_write(&area->iopt->iova_rwsem);
> +	}
> +	up_read(&area->iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		   unsigned long *dst_iova, unsigned long start_bytes,
> +		   unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
> +		return -EPERM;
> +
> +	area = iopt_alloc_area(iopt, pages, *dst_iova, start_bytes, length,
> +			       iommu_prot, flags);
> +	if (IS_ERR(area))
> +		return PTR_ERR(area);
> +	*dst_iova = iopt_area_iova(area);
> +
> +	rc = iopt_finalize_area(area, pages);
> +	if (rc) {
> +		iopt_abort_area(area);
> +		return rc;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * iopt_map_user_pages() - Map a user VA to an iova in the io page table
> + * @iopt: io_pagetable to act on
> + * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
> + *        the chosen iova on output. Otherwise is the iova to map to on input
> + * @uptr: User VA to map
> + * @length: Number of bytes to map
> + * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
> + * @flags: IOPT_ALLOC_IOVA or zero
> + *
> + * iova, uptr, and length must be aligned to iova_alignment. For domain backed
> + * page tables this will pin the pages and load them into the domain at iova.
> + * For non-domain page tables this will only setup a lazy reference and the
> + * caller must use iopt_access_pages() to touch them.
> + *
> + * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
> + * destroyed.
> + */
> +int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
> +			void __user *uptr, unsigned long length, int iommu_prot,
> +			unsigned int flags)
> +{
> +	struct iopt_pages *pages;
> +	int rc;
> +
> +	pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
> +	if (IS_ERR(pages))
> +		return PTR_ERR(pages);
> +
> +	rc = iopt_map_pages(iopt, pages, iova, uptr - pages->uptr, length,
> +			    iommu_prot, flags);
> +	if (rc) {
> +		iopt_put_pages(pages);
> +		return rc;
> +	}
> +	return 0;
> +}
> +
> +struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
> +				  unsigned long *start_byte,
> +				  unsigned long length)
> +{
> +	unsigned long iova_end;
> +	struct iopt_pages *pages;
> +	struct iopt_area *area;
> +
> +	if (check_add_overflow(iova, length - 1, &iova_end))
> +		return ERR_PTR(-EOVERFLOW);
> +
> +	down_read(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);
> +	if (!area) {
> +		up_read(&iopt->iova_rwsem);
> +		return ERR_PTR(-ENOENT);
> +	}
> +	pages = area->pages;
> +	*start_byte = area->page_offset + iopt_area_index(area) * PAGE_SIZE;
> +	kref_get(&pages->kref);
> +	up_read(&iopt->iova_rwsem);
> +
> +	return pages;
> +}
> +
> +static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
> +			     struct iopt_pages *pages)
> +{
> +	/* Drivers have to unpin on notification. */
> +	if (WARN_ON(atomic_read(&area->num_users)))
> +		return -EBUSY;
> +
> +	iopt_area_unfill_domains(area, pages);
> +	WARN_ON(atomic_read(&area->num_users));
> +	iopt_abort_area(area);
> +	iopt_put_pages(pages);
> +	return 0;
> +}
> +
> +/**
> + * iopt_unmap_iova() - Remove a range of iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting iova to unmap
> + * @length: Number of bytes to unmap
> + *
> + * The requested range must exactly match an existing range.
> + * Splitting/truncating IOVA mappings is not allowed.
> + */
> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> +		    unsigned long length)
> +{
> +	struct iopt_pages *pages;
> +	struct iopt_area *area;
> +	unsigned long iova_end;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +
> +	if (check_add_overflow(iova, length - 1, &iova_end))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);
> +	if (!area) {
> +		up_write(&iopt->iova_rwsem);
> +		up_read(&iopt->domains_rwsem);
> +		return -ENOENT;
> +	}
> +	pages = area->pages;
> +	area->pages = NULL;
> +	up_write(&iopt->iova_rwsem);
> +
> +	rc = __iopt_unmap_iova(iopt, area, pages);
> +	up_read(&iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +int iopt_unmap_all(struct io_pagetable *iopt)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	down_read(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +	while ((area = iopt_area_iter_first(iopt, 0, ULONG_MAX))) {
> +		struct iopt_pages *pages;
> +
> +		/* Userspace should not race unmap all and map */
> +		if (!area->pages) {
> +			rc = -EBUSY;
> +			goto out_unlock_iova;
> +		}
> +		pages = area->pages;
> +		area->pages = NULL;
> +		up_write(&iopt->iova_rwsem);
> +
> +		rc = __iopt_unmap_iova(iopt, area, pages);
> +		if (rc)
> +			goto out_unlock_domains;
> +
> +		down_write(&iopt->iova_rwsem);
> +	}
> +	rc = 0;
> +
> +out_unlock_iova:
> +	up_write(&iopt->iova_rwsem);
> +out_unlock_domains:
> +	up_read(&iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +/**
> + * iopt_access_pages() - Return a list of pages under the iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to access
> + * @out_pages: Output page list
> + * @write: True if access is for writing
> + *
> + * Reads @npages starting at iova and returns the struct page * pointers. These
> + * can be kmap'd by the caller for CPU access.
> + *
> + * The caller must perform iopt_unaccess_pages() when done to balance this.
> + *
> + * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
> + * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
> + * touch memory outside the requested iova slice.
> + *
> + * FIXME: callers that need a DMA mapping via a sgl should create another
> + * interface to build the SGL efficiently
> + */
> +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> +		      unsigned long length, struct page **out_pages, bool write)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +	if (check_add_overflow(iova, length - 1, &last_iova))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		unsigned long last_index;
> +		unsigned long index;
> +
> +		/* Need contiguous areas in the access */
> +		if (iopt_area_iova(area) < cur_iova || !area->pages) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +
> +		index = iopt_area_iova_to_index(area, cur_iova);
> +		last_index = iopt_area_iova_to_index(area, last);
> +		rc = iopt_pages_add_user(area->pages, index, last_index,
> +					 out_pages, write);
> +		if (rc)
> +			goto out_remove;
> +		if (last == last_iova)
> +			break;
> +		/*
> +		 * Can't cross areas that are not aligned to the system page
> +		 * size with this API.
> +		 */
> +		if (cur_iova % PAGE_SIZE) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +		cur_iova = last + 1;
> +		out_pages += last_index - index;
> +		atomic_inc(&area->num_users);
> +	}
> +
> +	up_read(&iopt->iova_rwsem);
> +	return 0;
> +
> +out_remove:
> +	if (cur_iova != iova)
> +		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
> +	up_read(&iopt->iova_rwsem);
> +	return rc;
> +}
> +
> +/**
> + * iopt_unaccess_pages() - Undo iopt_access_pages
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to stop accessing
> + *
> + * Undo iopt_access_pages(). The caller must stop accessing the pages before
> + * calling this. The iova/length must exactly match the one provided to
> + * access_pages.
> + */
> +void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> +			 size_t length)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +
> +	if (WARN_ON(!length) ||
> +	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
> +		return;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		int num_users;
> +
> +		iopt_pages_remove_user(area->pages,
> +				       iopt_area_iova_to_index(area, cur_iova),
> +				       iopt_area_iova_to_index(area, last));
> +		if (last == last_iova)
> +			break;
> +		cur_iova = last + 1;
> +		num_users = atomic_dec_return(&area->num_users);
> +		WARN_ON(num_users < 0);
> +	}
> +	up_read(&iopt->iova_rwsem);
> +}
> +
> +struct iopt_reserved_iova {
> +	struct interval_tree_node node;
> +	void *owner;
> +};
> +
> +int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
> +		      unsigned long last, void *owner)
> +{
> +	struct iopt_reserved_iova *reserved;
> +
> +	lockdep_assert_held_write(&iopt->iova_rwsem);
> +
> +	if (iopt_area_iter_first(iopt, start, last))
> +		return -EADDRINUSE;
> +
> +	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL);
> +	if (!reserved)
> +		return -ENOMEM;
> +	reserved->node.start = start;
> +	reserved->node.last = last;
> +	reserved->owner = owner;
> +	interval_tree_insert(&reserved->node, &iopt->reserved_iova_itree);
> +	return 0;
> +}
> +
> +void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
> +{
> +
> +	struct interval_tree_node *node;
> +
> +	for (node = interval_tree_iter_first(&iopt->reserved_iova_itree, 0,
> +					     ULONG_MAX);
> +	     node;) {
> +		struct iopt_reserved_iova *reserved =
> +			container_of(node, struct iopt_reserved_iova, node);
> +
> +		node = interval_tree_iter_next(node, 0, ULONG_MAX);
> +
> +		if (reserved->owner == owner) {
> +			interval_tree_remove(&reserved->node,
> +					     &iopt->reserved_iova_itree);
> +			kfree(reserved);
> +		}
> +	}
> +}
> +
> +int iopt_init_table(struct io_pagetable *iopt)
> +{
> +	init_rwsem(&iopt->iova_rwsem);
> +	init_rwsem(&iopt->domains_rwsem);
> +	iopt->area_itree = RB_ROOT_CACHED;
> +	iopt->reserved_iova_itree = RB_ROOT_CACHED;
> +	xa_init(&iopt->domains);
> +
> +	/*
> +	 * iopt's start as SW tables that can use the entire size_t IOVA space
> +	 * due to the use of size_t in the APIs. They have no alignment
> +	 * restriction.
> +	 */
> +	iopt->iova_alignment = 1;
> +
> +	return 0;
> +}
> +
> +void iopt_destroy_table(struct io_pagetable *iopt)
> +{
> +	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> +		iopt_remove_reserved_iova(iopt, NULL);
> +	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_iova_itree.rb_root));
> +	WARN_ON(!xa_empty(&iopt->domains));
> +	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
> +}
> +
> +/**
> + * iopt_unfill_domain() - Unfill a domain with PFNs
> + * @iopt: io_pagetable to act on
> + * @domain: domain to unfill
> + *
> + * This is used when removing a domain from the iopt. Every area in the iopt
> + * will be unmapped from the domain. The domain must already be removed from the
> + * domains xarray.
> + */
> +static void iopt_unfill_domain(struct io_pagetable *iopt,
> +			       struct iommu_domain *domain)
> +{
> +	struct iopt_area *area;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +	lockdep_assert_held_write(&iopt->domains_rwsem);
> +
> +	/*
> +	 * Some other domain is holding all the pfns still, rapidly unmap this
> +	 * domain.
> +	 */
> +	if (iopt->next_domain_id != 0) {
> +		/* Pick an arbitrary remaining domain to act as storage */
> +		struct iommu_domain *storage_domain =
> +			xa_load(&iopt->domains, 0);
> +
> +		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +			struct iopt_pages *pages = area->pages;
> +
> +			if (WARN_ON(!pages))
> +				continue;
> +
> +			mutex_lock(&pages->mutex);
> +			if (area->storage_domain != domain) {
> +				mutex_unlock(&pages->mutex);
> +				continue;
> +			}
> +			area->storage_domain = storage_domain;
> +			mutex_unlock(&pages->mutex);
> +		}
> +
> +
> +		iopt_unmap_domain(iopt, domain);
> +		return;
> +	}
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (WARN_ON(!pages))
> +			continue;
> +
> +		mutex_lock(&pages->mutex);
> +		interval_tree_remove(&area->pages_node,
> +				     &area->pages->domains_itree);
> +		WARN_ON(area->storage_domain != domain);
> +		area->storage_domain = NULL;
> +		iopt_area_unfill_domain(area, pages, domain);
> +		mutex_unlock(&pages->mutex);
> +	}
> +}
> +
> +/**
> + * iopt_fill_domain() - Fill a domain with PFNs
> + * @iopt: io_pagetable to act on
> + * @domain: domain to fill
> + *
> + * Fill the domain with PFNs from every area in the iopt. On failure the domain
> + * is left unchanged.
> + */
> +static int iopt_fill_domain(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain)
> +{
> +	struct iopt_area *end_area;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +	lockdep_assert_held_write(&iopt->domains_rwsem);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (WARN_ON(!pages))
> +			continue;
> +
> +		mutex_lock(&pages->mutex);
> +		rc = iopt_area_fill_domain(area, domain);
> +		if (rc) {
> +			mutex_unlock(&pages->mutex);
> +			goto out_unfill;
> +		}
> +		if (!area->storage_domain) {
> +			WARN_ON(iopt->next_domain_id != 0);
> +			area->storage_domain = domain;
> +			interval_tree_insert(&area->pages_node,
> +					     &pages->domains_itree);
> +		}
> +		mutex_unlock(&pages->mutex);
> +	}
> +	return 0;
> +
> +out_unfill:
> +	end_area = area;
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (area == end_area)
> +			break;
> +		if (WARN_ON(!pages))
> +			continue;
> +		mutex_lock(&pages->mutex);
> +		if (iopt->next_domain_id == 0) {
> +			interval_tree_remove(&area->pages_node,
> +					     &pages->domains_itree);
> +			area->storage_domain = NULL;
> +		}
> +		iopt_area_unfill_domain(area, pages, domain);
> +		mutex_unlock(&pages->mutex);
> +	}
> +	return rc;
> +}
> +
> +/* All existing area's conform to an increased page size */
> +static int iopt_check_iova_alignment(struct io_pagetable *iopt,
> +				     unsigned long new_iova_alignment)
> +{
> +	struct iopt_area *area;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
> +		if ((iopt_area_iova(area) % new_iova_alignment) ||
> +		    (iopt_area_length(area) % new_iova_alignment))
> +			return -EADDRINUSE;
> +	return 0;
> +}
> +
> +int iopt_table_add_domain(struct io_pagetable *iopt,
> +			  struct iommu_domain *domain)
> +{
> +	const struct iommu_domain_geometry *geometry = &domain->geometry;
> +	struct iommu_domain *iter_domain;
> +	unsigned int new_iova_alignment;
> +	unsigned long index;
> +	int rc;
> +
> +	down_write(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +
> +	xa_for_each (&iopt->domains, index, iter_domain) {
> +		if (WARN_ON(iter_domain == domain)) {
> +			rc = -EEXIST;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The io page size drives the iova_alignment. Internally the iopt_pages
> +	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
> +	 * objects into the iommu_domain.
> +	 *
> +	 * A iommu_domain must always be able to accept PAGE_SIZE to be
> +	 * compatible as we can't guarantee higher contiguity.
> +	 */
> +	new_iova_alignment =
> +		max_t(unsigned long, 1UL << __ffs(domain->pgsize_bitmap),
> +		      iopt->iova_alignment);
> +	if (new_iova_alignment > PAGE_SIZE) {
> +		rc = -EINVAL;
> +		goto out_unlock;
> +	}
> +	if (new_iova_alignment != iopt->iova_alignment) {
> +		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	/* No area exists that is outside the allowed domain aperture */
> +	if (geometry->aperture_start != 0) {
> +		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
> +				       domain);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +	if (geometry->aperture_end != ULONG_MAX) {
> +		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
> +				       ULONG_MAX, domain);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +
> +	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
> +	if (rc)
> +		goto out_reserved;
> +
> +	rc = iopt_fill_domain(iopt, domain);
> +	if (rc)
> +		goto out_release;
> +
> +	iopt->iova_alignment = new_iova_alignment;
> +	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
> +	iopt->next_domain_id++;
I don't understand this part.

Do we get the domain back with xa_load(&iopt->domains, iopt->next_domain_id - 1)?
If so, how do we retrieve a domain once next_domain_id++ has advanced?
For example, after calling iopt_table_add_domain() 3 times with 3 domains,
how do we know which next_domain_id is the correct one?

Thanks

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-25 13:34     ` zhangfei.gao
@ 2022-03-25 17:19       ` Jason Gunthorpe
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-25 17:19 UTC (permalink / raw)
  To: zhangfei.gao
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, Alex Williamson, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	Kevin Tian, Daniel Jordan, iommu, Joao Martins, David Gibson

On Fri, Mar 25, 2022 at 09:34:08PM +0800, zhangfei.gao@foxmail.com wrote:

> > +	iopt->iova_alignment = new_iova_alignment;
> > +	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
> > +	iopt->next_domain_id++;
> I don't understand this part.
> 
> Do we get the domain back with xa_load(&iopt->domains, iopt->next_domain_id - 1)?
> If so, how do we retrieve a domain once next_domain_id++ has advanced?
> For example, after calling iopt_table_add_domain() 3 times with 3 domains,
> how do we know which next_domain_id is the correct one?

There is no "correct one"; this is just a simple list of domains. The
algorithms either need to pick any single domain or iterate over every
domain.

Basically this bit of code is building a vector with the operations
'push_back', 'front', 'erase' and 'for each'.
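
To make that concrete, here is a minimal editor's sketch of that idiom using
the plain xarray API. It only assumes the iopt->domains xarray and the
next_domain_id counter visible in the patch above; it is illustrative, not
the actual helpers, and error handling is omitted (the patch itself reserves
the slot with xa_reserve() before storing):

#include <linux/xarray.h>

static void domains_as_vector(struct io_pagetable *iopt,
			      struct iommu_domain *domain)
{
	struct iommu_domain *front, *iter;
	unsigned long index;
	unsigned int count = 0;

	/* 'push_back': store at the next free slot, then advance the counter */
	xa_store(&iopt->domains, iopt->next_domain_id++, domain, GFP_KERNEL);

	/* 'front': pick an arbitrary element; slot 0 is always populated */
	front = xa_load(&iopt->domains, 0);

	/* 'for each': visit every attached domain */
	xa_for_each(&iopt->domains, index, iter)
		count++;

	/* 'erase': drop the last element again */
	xa_erase(&iopt->domains, --iopt->next_domain_id);

	/* front/count would be consumed by the real algorithms */
}

Which index a given domain lands at is never meaningful to callers: the
algorithms in the patch either take any one entry (e.g. xa_load(&iopt->domains, 0)
in iopt_unfill_domain()) or walk all of them with xa_for_each().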

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-24 13:46                 ` Jason Gunthorpe
@ 2022-03-27  2:32                   ` Tian, Kevin
  0 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-27  2:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu, Xu, Terrence

> From: Tian, Kevin
> Sent: Friday, March 25, 2022 10:16 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, March 24, 2022 9:46 PM
> >
> > On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> >
> > > Based on that here is a quick tweak of the force-snoop part (not compiled).
> >
> > I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> > started out OK but got weird. So lets fix it back to the way it was.
> >
> > How about this:
> >
> > https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> >
> > b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> > 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> > eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
> > 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
> >
> > If you like it could you take it from here?
> >
> 
> this looks good to me except that the 2nd patch (eab4b381) should be
> the last one otherwise it affects bisect. and in that case the subject
> would be simply about removing the capability instead of restoring...
> 
> let me find a box to verify it.
> 

My colleague (Terrence) has the environment and helped verify it.

He will give his tested-by after you send out the formal series.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-27  2:32                   ` Tian, Kevin
@ 2022-03-27 14:28                     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-27 14:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu, Xu, Terrence

On Sun, Mar 27, 2022 at 02:32:23AM +0000, Tian, Kevin wrote:

> > this looks good to me except that the 2nd patch (eab4b381) should be
> > the last one otherwise it affects bisect. and in that case the subject
> > would be simply about removing the capability instead of
> > restoring...

Oh, because VFIO won't send IOMMU_CACHE in this case? Hmm. OK

> > let me find a box to verify it.
> 
> My colleague (Terrence) has the environment and helped verify it.
> 
> He will give his tested-by after you send out the formal series.

Okay, I can send it after the merge window

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-24 11:46                         ` Jason Gunthorpe via iommu
@ 2022-03-28  1:53                           ` Jason Wang
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Wang @ 2022-03-28  1:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu, Sean Mooney

On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
>
> > It's simply because we don't want to break existing userspace. [1]
>
> I'm still waiting to hear what exactly breaks in real systems.
>
> As I explained this is not a significant change, but it could break
> something in a few special scenarios.
>
> Also the one place we do have ABI breaks is security, and ulimit is a
> security mechanism that isn't working right. So we do clearly need to
> understand *exactly* what real thing breaks - if anything.
>
> Jason
>

To tell the truth, I don't know. I remember that OpenStack may do some
accounting, so I am adding Sean for more comments. But we really can't imagine
OpenStack is the only userspace that may use this.

To me, it looks easier to not answer this question by letting
userspace know about the change.

Thanks


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-28  1:53                           ` Jason Wang
@ 2022-03-28 12:22                             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-28 12:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu, Sean Mooney

On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote:
> To me, it looks more easier to not answer this question by letting
> userspace know about the change,

That is not backwards compatible, so I don't think it helps unless we
say if you open /dev/vfio/vfio you get old behavior and if you open
/dev/iommu you get new...
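
Purely to illustrate that split - the /dev/iommu node and its semantics
are only the proposal under discussion, so treat this as a hypothetical
userspace sketch, not an API promise:

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	/* Prefer the proposed iommufd node; fall back to the legacy
	 * VFIO container, which keeps today's behavior. */
	int fd = open("/dev/iommu", O_RDWR);

	if (fd >= 0) {
		printf("using /dev/iommu (new accounting semantics)\n");
		return 0;
	}

	fd = open("/dev/vfio/vfio", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	printf("using legacy /dev/vfio/vfio (old behavior)\n");
	return 0;
}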

Nor does it answer if I can fix RDMA or not :\

So we really do need to know what exactly is the situation here.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-28  1:53                           ` Jason Wang
@ 2022-03-28 13:14                             ` Sean Mooney
  -1 siblings, 0 replies; 244+ messages in thread
From: Sean Mooney @ 2022-03-28 13:14 UTC (permalink / raw)
  To: Jason Wang, Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote:
> On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
> > 
> > > It's simply because we don't want to break existing userspace. [1]
> > 
> > I'm still waiting to hear what exactly breaks in real systems.
> > 
> > As I explained this is not a significant change, but it could break
> > something in a few special scenarios.
> > 
> > Also the one place we do have ABI breaks is security, and ulimit is a
> > security mechanism that isn't working right. So we do clearly need to
> > understand *exactly* what real thing breaks - if anything.
> > 
> > Jason
> > 
> 
> To tell the truth, I don't know. I remember that Openstack may do some
> accounting so adding Sean for more comments. But we really can't image
> openstack is the only userspace that may use this.
Sorry, there is a lot of context to this discussion; I have tried to read back the
thread but I may have missed part of it.

tl;dr: OpenStack does not currently track locked/pinned memory per user or per VM because we have
no idea when libvirt will request it or how much is needed per device. When ulimits are configured
today for Nova/OpenStack it is done at the qemu user level, outside of OpenStack, in our installer tooling.
E.g. in TripleO the ulimits would be set on the nova_libvirt container to constrain all VMs spawned,
not per VM/process.

Full response below
-------------------

OpenStack's history with locked/pinned/unswappable memory is a bit complicated.
We currently only request locked memory explicitly in 2 cases directly
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/driver.py#L5769-L5784
when the administrator configures the VM flavor to request AMD's SEV feature or configures the flavor for realtime scheduling priority.
I say explicitly because libvirt invented an implicit request for locked/pinned pages for SR-IOV VFs and a number of other cases
which we were not aware of. This only became apparent when we went to add vDPA support to OpenStack and libvirt
did not make that implicit request, so we had to fall back to requesting realtime instances as a workaround.

Nova/OpenStack does have the ability to generate the libvirt XML element that configures hard and soft limits
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/config.py#L2559-L2590
however it is only ever used in our test code
https://github.com/openstack/nova/search?q=LibvirtConfigGuestMemoryTune

The description of hard_limit in the libvirt docs strongly discourages its use, with a small caveat for locked memory
https://libvirt.org/formatdomain.html#memory-tuning

   hard_limit
   
       The optional hard_limit element is the maximum memory the guest can use. The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users
   of QEMU and KVM are strongly advised not to set this limit as domain may get killed by the kernel if the guess is too low, and determining the memory
   needed for a process to run is an undecidable problem; that said, if you already set locked in memory backing because your workload demands it, you'll
   have to take into account the specifics of your deployment and figure out a value for hard_limit that is large enough to support the memory
   requirements of your guest, but small enough to protect your host against a malicious guest locking all memory.
   
We could not figure out how to automatically compute a hard_limit in Nova that would work for everyone, and we felt exposing this to our
users/operators was a bit of a cop-out when they likely can't calculate it properly either. As a result we can't actually account for these limits today when
scheduling workloads to a host. I'm not sure this would change even if you exposed new user-space APIs unless we
had a way to inspect each VF to know how much locked memory that VF would need to lock; the same goes for vDPA devices,
mdevs etc. Cloud systems don't normally have quotas on "locked" memory used transitively via passthrough devices, so even if we had this info
it's not immediately apparent how we would consume it without altering our existing quotas. OpenStack is a self-service cloud platform
where end users can upload their own workload images, so it's basically impossible for the operator of the cloud to know how much memory to set the hard limit
to without setting it overly large in most cases. From a management application point of view we currently have no insight into how
memory will be pinned in the kernel, or when libvirt will invent additional requests for pinned/locked memory or how large they are.

Instead of going down that route, operators are encouraged to use ulimit to set a global limit on the amount of memory the nova/qemu user can use.
While Nova/OpenStack supports multi-tenancy, we do not expose that multi-tenancy to the underlying hypervisor hosts. The agents are typically
deployed as the nova user, which is a member of the libvirt and qemu groups, and the VMs that are created for our tenants are all created under the qemu
user/group as a result. So the qemu user's global ulimit on realtime systems would need to be set "to protect your host against a malicious guest
locking all memory", but we do not do this on a per-VM or per-process basis.

To avoid memory starvation we generally recommend using hugepages whenever you are locking memory, as we at least track those per NUMA node and
have the memory tracking in place to know that they are not oversubscribable; i.e. they can't be swapped, so they are effectively the same as locked memory
from a user-space point of view. Using hugepage memory as a workaround whenever we need to account for memory locking is not ideal, but most of our
users that need SR-IOV or vDPA are telcos, so they are already using hugepages and CPU pinning in most cases and it kind of works.

Since we don't currently support per-instance hard_limits and don't plan to introduce them in the future, whether this is tracked per process (VM) or per
user (qemu) is not going to break OpenStack today. It may complicate any future use of the memtune element in libvirt, but we do not currently have
customers/users asking us to expose this, and as a cloud solution this super-low-level customisation is not really something we
want to expose in our API anyway.

regards
sean

> 
> To me, it looks more easier to not answer this question by letting
> userspace know about the change,
> 
> Thanks
> 


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-28 13:14                             ` Sean Mooney
@ 2022-03-28 14:27                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-28 14:27 UTC (permalink / raw)
  To: Sean Mooney
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm, Niklas Schnelle,
	Jason Wang, Cornelia Huck, Tian, Kevin, iommu, Daniel Jordan,
	Alex Williamson, Michael S. Tsirkin, Martins, Joao, David Gibson

On Mon, Mar 28, 2022 at 02:14:26PM +0100, Sean Mooney wrote:
> On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote:
> > On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > 
> > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
> > > 
> > > > It's simply because we don't want to break existing userspace. [1]
> > > 
> > > I'm still waiting to hear what exactly breaks in real systems.
> > > 
> > > As I explained this is not a significant change, but it could break
> > > something in a few special scenarios.
> > > 
> > > Also the one place we do have ABI breaks is security, and ulimit is a
> > > security mechanism that isn't working right. So we do clearly need to
> > > understand *exactly* what real thing breaks - if anything.
> > > 
> > > Jason
> > > 
> > 
> > To tell the truth, I don't know. I remember that Openstack may do some
> > accounting so adding Sean for more comments. But we really can't image
> > openstack is the only userspace that may use this.
> sorry there is a lot of context to this discussion i have tried to read back the
> thread but i may have missed part of it.

Thanks Sean, this is quite interesting, though I'm not sure it
entirely gets at the question.

> tl;dr openstack does not currently track locked/pinned memory per
> use or per vm because we have no idea when libvirt will request it
> or how much is needed per device. when ulimits are configured today
> for nova/openstack its done at teh qemu user level outside of
> openstack in our installer tooling.  e.g. in tripleo the ulimits
> woudl be set on the nova_libvirt contaienr to constrain all vms
> spawned not per vm/process.

So, today, you expect the ulimit to be machine-wide: if your
machine has 1 TB of memory you'd set the ulimit at 0.9 TB, and you'd
like the stack underneath to limit memory pinning to 0.9 TB globally for all
qemus?

To be clear it doesn't work that way today at all, you might as well
just not bother setting ulimit to anything less than unlimited at the
openstack layer.

>    hard_limit
>    
>        The optional hard_limit element is the maximum memory the
>    guest can use. The units for this value are kibibytes
>    (i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly
>    advised not to set this limit as domain may get killed by the
>    kernel if the guess is too low, and determining the memory needed
>    for a process to run is an undecidable problem; that said, if you
>    already set locked in memory backing because your workload
>    demands it, you'll have to take into account the specifics of
>    your deployment and figure out a value for hard_limit that is
>    large enough to support the memory requirements of your guest,
>    but small enough to protect your host against a malicious guest
>    locking all memory.

And hard_limit is the ulimit that Alex was talking about?

So now we switched from talking about global per-user things to
per-qemu-instance things?

> we coudl not figure out how to automatically comptue a hard_limit in
> nova that would work for everyone and we felt exposign this to our
> users/operators was bit of a cop out when they likely cant caluate
> that properly either.

Not surprising..

> As a result we cant actully account for them
> today when schduilign workloads to a host. Im not sure this woudl
> chagne even if you exposed new user space apis unless we  had a way
> to inspect each VF to know how much locked memory that VF woudl need
> to lock?

We are not talking about a new uAPI we are talking about changing the
meaning of the existing ulimit. You can see it in your message above,
at the openstack level you were talking about global limits and then
in the libvirt level you are talking about per-qemu limits.

In the kernel both of these are being used by the same control and one
of the users is wrong.

The kernel consensus is that the ulimit is per-user and is used by all
kernel entities consistently

Currently vfio is different and uses it per-process and effectively
has its own private bucket.

When you talk about VDPA you start to see the problems here because
VDPA uses a different accounting from VFIO. If you run VFIO and VDPA
together then you should need 2x the ulimit, but today you only need
1x because they don't share accounting buckets.

This also means the ulimit doesn't actually work the way it is
supposed to.
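
To make the difference concrete, roughly (these are hypothetical
helpers, not the actual vfio/vdpa code, which carries more
bookkeeping):

#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/sched/user.h>

/* Per-process style, as vfio type1 does today: each mm gets its own
 * RLIMIT_MEMLOCK sized bucket. */
static int charge_mm(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	int ret = 0;

	mmap_write_lock(mm);
	if (mm->locked_vm + npages > limit)
		ret = -ENOMEM;
	else
		mm->locked_vm += npages;
	mmap_write_unlock(mm);
	return ret;
}

/* Per-user style, as the vdpa/io_uring family does: every charge for
 * the same user lands in one shared bucket, which is why running VFIO
 * and VDPA together should logically need 2x the ulimit. */
static int charge_user(struct user_struct *user, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (atomic_long_add_return(npages, &user->locked_vm) > limit) {
		atomic_long_sub(npages, &user->locked_vm);
		return -EPERM;
	}
	return 0;
}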

The question is how to fix it, if we do fix it, how much cares that
things work differently.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-24 13:46                 ` Jason Gunthorpe
@ 2022-03-28 17:17                   ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-28 17:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Thu, 24 Mar 2022 10:46:22 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled).  
> 
> I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So lets fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence

I have some issues with the argument here:

  This will block device/platform/iommu combinations that do not
  support cache coherent DMA - but these never worked anyhow as VFIO
  did not expose any interface to perform the required cache
  maintenance operations.

VFIO never intended to provide such operations, it only tried to make
the coherence of the device visible to userspace such that it can
perform operations via other means, for example via KVM.  The "never
worked" statement here seems false.

Commit b11c19a4b34c2a also appears to be a behavioral change.  AIUI
vfio_domain.enforce_cache_coherency would only be set on Intel VT-d
where snoop-control is supported, this translates to KVM emulating
coherency instructions everywhere except VT-d w/ snoop-control.
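
For reference, the x86 KVM side this feeds is essentially a per-VM
counter; simplified from memory rather than quoted verbatim:

void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
{
	atomic_inc(&kvm->arch.noncoherent_dma_count);
}

void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
{
	atomic_dec(&kvm->arch.noncoherent_dma_count);
}

bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
{
	return atomic_read(&kvm->arch.noncoherent_dma_count);
}

so whether wbinvd gets emulated for a guest hinges on whether VFIO
decides to call the register function for its domains.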

My understanding of AMD-Vi is that no-snoop TLPs are always coherent, so
this would trigger unnecessary wbinvd emulation on those platforms.  I
don't know if other archs need similar, but it seems we're changing
polarity wrt no-snoop TLPs from "everyone is coherent except this case
on Intel" to "everyone is non-coherent except this opposite case on
Intel".  Thanks,

Alex

> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 
> Thanks,
> Jason
> 


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-28 17:17                   ` Alex Williamson
@ 2022-03-28 18:57                     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-28 18:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Mon, Mar 28, 2022 at 11:17:23AM -0600, Alex Williamson wrote:
> On Thu, 24 Mar 2022 10:46:22 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:
> > 
> > > Based on that here is a quick tweak of the force-snoop part (not compiled).  
> > 
> > I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
> > started out OK but got weird. So lets fix it back to the way it was.
> > 
> > How about this:
> > 
> > https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> > 
> > b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
> > 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> 
> I have some issues with the argument here:
> 
>   This will block device/platform/iommu combinations that do not
>   support cache coherent DMA - but these never worked anyhow as VFIO
>   did not expose any interface to perform the required cache
>   maintenance operations.
> 
> VFIO never intended to provide such operations, it only tried to make
> the coherence of the device visible to userspace such that it can
> perform operations via other means, for example via KVM.  The "never
> worked" statement here seems false.

VFIO is generic. I expect if DPDK connects to VFIO then it will work
properly. That is definitely not the case today when
dev_is_dma_coherent() is false. This is what the paragraph is talking
about.

Remember, x86 wires dev_is_dma_coherent() to true, so this above
remark is not related to anything about x86.

We don't have a way in VFIO to negotiate that 'vfio can only be used
with kvm' so I hope no cases like that really do exist :( Do you know
of any?

> Commit b11c19a4b34c2a also appears to be a behavioral change.  AIUI
> vfio_domain.enforce_cache_coherency would only be set on Intel VT-d
> where snoop-control is supported, this translates to KVM emulating
> coherency instructions everywhere except VT-d w/ snoop-control.

It seems so.

> My understanding of AMD-Vi is that no-snoop TLPs are always coherent, so
> this would trigger unnecessary wbinvd emulation on those platforms.  

I looked in the AMD manual and it looks like it works the same as Intel,
with a dedicated IOPTE bit:

  #define IOMMU_PTE_FC (1ULL << 60)

https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf Pg 79:

 FC: Force Coherent. Software uses the FC bit in the PTE to indicate
 the source of the upstream coherent attribute state for an
 untranslated DMA transaction. 1 = the IOMMU sets the coherent attribute
 state in the upstream request. 0 = the IOMMU passes on the coherent
 attribute state from the originating request. Device internal
 address/page table translations are considered "untranslated accesses"
 by IOMMU. The FC state is returned in the ATS response to the device
 endpoint via the state of the (N)oSnoop bit.

So, currently AMD and Intel have exactly the same HW feature with a
different kAPI..

I would say it is wrong that AMD creates kernel owned domains for the
DMA-API to use that do not support snoop.

> don't know if other archs need similar, but it seems we're changing
> polarity wrt no-snoop TLPs from "everyone is coherent except this case
> on Intel" to "everyone is non-coherent except this opposite case on
> Intel".

Yes. We should not assume no-snoop blocking is a HW feature without
explicit knowledge that it is.

From a kAPI compat perspective IOMMU_CAP_CACHE_COHERENCY
only has two impacts:
 - Only on x86 arch it controls kvm_arch_register_noncoherent_dma()
 - It triggers IOMMU_CACHE

If we look at the list of places where IOMMU_CAP_CACHE_COHERENCY is set:

 drivers/iommu/intel/iommu.c
   Must have IOMMU_CACHE set/clear to control no-snoop blocking

 drivers/iommu/amd/iommu.c
   Always sets its no-snoop block, inconsistent with Intel

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
 drivers/iommu/arm/arm-smmu/arm-smmu.c
 drivers/iommu/arm/arm-smmu/qcom_iommu.c
   Must have IOMMU_CACHE set, ARM arch has no
   kvm_arch_register_noncoherent_dma()

   From what I could tell in the manuals and the prior discussion
   SMMU doesn't block no-snoop.

   i.e. ARM lies about IOMMU_CAP_CACHE_COHERENCY because it needs
   IOMMU_CACHE set to work.

 drivers/iommu/fsl_pamu_domain.c
 drivers/iommu/s390-iommu.c
   Ignore IOMMU_CACHE, arch has no kvm_arch_register_noncoherent_dma()

   No idea if the HW blocks no-snoop or not, but it doesn't matter.

So other than AMD, it is OK to change the sense, and it makes it clearer
for future driver authors what they are expected to do with this.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-28 18:57                     ` Jason Gunthorpe via iommu
@ 2022-03-28 19:47                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-28 19:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jean-Philippe Brucker, Tian, Kevin, kvm, Michael S. Tsirkin,
	Jason Wang, Cornelia Huck, Niklas Schnelle, Chaitanya Kulkarni,
	Daniel Jordan, iommu, Martins, Joao, David Gibson

On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:

> So, currently AMD and Intel have exactly the same HW feature with a
> different kAPI..

I fixed it like below and made the ordering changes Kevin pointed
to. Will send next week after the merge window:

527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()

'79c can be avoided, we'd just drive IOMMU_CACHE off of
dev_is_dma_coherent() - but if we do that I'd like to properly
document the arch/iommu/platform/kvm combination that is using this..
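
As a rough sketch (hypothetical helper, not a patch from the series;
the actual AMD-side change is below), that alternative would amount to
something like this when building the mapping prot:

#include <linux/dma-map-ops.h>
#include <linux/iommu.h>

static int device_mapping_prot(struct device *dev)
{
	int prot = IOMMU_READ | IOMMU_WRITE;

	/* Only ask for a cacheable IOPTE when the device's DMA is
	 * already coherent with the CPU caches. */
	if (dev_is_dma_coherent(dev))
		prot |= IOMMU_CACHE;
	return prot;
}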

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
 	return 0;
 }
 
+static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
+{
+	/* IOMMU_PTE_FC is always set */
+	return true;
+}
+
 const struct iommu_ops amd_iommu_ops = {
 	.capable = amd_iommu_capable,
 	.domain_alloc = amd_iommu_domain_alloc,
@@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
 		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
 		.iotlb_sync	= amd_iommu_iotlb_sync,
 		.free		= amd_iommu_domain_free,
+		.enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
 	}
 };
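
Presumably the VFIO side ('5cbc above) then just probes the new op from
'ebc - roughly like this, though that hunk isn't shown here:

static bool domain_enforces_coherency(struct iommu_domain *domain)
{
	return domain->ops->enforce_cache_coherency &&
	       domain->ops->enforce_cache_coherency(domain);
}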

Thanks,
Jason

^ permalink raw reply related	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-28 19:47                       ` Jason Gunthorpe
@ 2022-03-28 21:26                         ` Alex Williamson
  -1 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-28 21:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin, kvm, Michael S. Tsirkin,
	Jason Wang, Cornelia Huck, Niklas Schnelle, Chaitanya Kulkarni,
	Daniel Jordan, iommu, Martins, Joao, David Gibson

On Mon, 28 Mar 2022 16:47:49 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:
> 
> > So, currently AMD and Intel have exactly the same HW feature with a
> > different kAPI..  
> 
> I fixed it like below and made the ordering changes Kevin pointed
> to. Will send next week after the merge window:
> 
> 527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
> 5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
> ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
> 79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
> 02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
> 
> '79c can be avoided, we'd just drive IOMMU_CACHE off of
> dev_is_dma_coherent() - but if we do that I'd like to properly
> document the arch/iommu/platform/kvm combination that is using this..

We can try to enforce dev_is_dma_coherent(), as you note it's not going
to affect any x86 users.  arm64 is the only obviously relevant arch that
defines ARCH_HAS_SYNC_DMA_FOR_{DEVICE,CPU} but the device.dma_coherent
setting comes from ACPI/OF firmware, so someone from ARM land will need
to shout if this is an issue.  I think we'd need to back off and go
with documentation if a broken use case shows up.  Thanks,

Alex

 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
>  	return 0;
>  }
>  
> +static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> +	/* IOMMU_PTE_FC is always set */
> +	return true;
> +}
> +
>  const struct iommu_ops amd_iommu_ops = {
>  	.capable = amd_iommu_capable,
>  	.domain_alloc = amd_iommu_domain_alloc,
> @@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
>  		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>  		.iotlb_sync	= amd_iommu_iotlb_sync,
>  		.free		= amd_iommu_domain_free,
> +		.enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
>  	}
>  };
> 
> Thanks,
> Jason
> 


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
@ 2022-03-28 21:26                         ` Alex Williamson
  0 siblings, 0 replies; 244+ messages in thread
From: Alex Williamson @ 2022-03-28 21:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Mon, 28 Mar 2022 16:47:49 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 28, 2022 at 03:57:53PM -0300, Jason Gunthorpe wrote:
> 
> > So, currently AMD and Intel have exactly the same HW feature with a
> > different kAPI..  
> 
> I fixed it like below and made the ordering changes Kevin pointed
> to. Will send next week after the merge window:
> 
> 527e438a974a06 iommu: Delete IOMMU_CAP_CACHE_COHERENCY
> 5cbc8603ffdf20 vfio: Move the Intel no-snoop control off of IOMMU_CACHE
> ebc961f93d1af3 iommu: Introduce the domain op enforce_cache_coherency()
> 79c52a2bb1e60b vfio: Require that devices support DMA cache coherence
> 02168f961b6a75 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
> 
> '79c can be avoided, we'd just drive IOMMU_CACHE off of
> dev_is_dma_coherent() - but if we do that I'd like to properly
> document the arch/iommu/platform/kvm combination that is using this..

We can try to enforce dev_is_dma_coherent(), as you note it's not going
to affect any x86 users.  arm64 is the only obviously relevant arch that
defines ARCH_HAS_SYNC_DMA_FOR_{DEVICE,CPU} but the device.dma_coherent
setting comes from ACPI/OF firmware, so someone from ARM land will need
to shout if this is an issue.  I think we'd need to back off and go
with documentation if a broken use case shows up.  Thanks,

Alex

 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 3c0ac3c34a7f9a..f144eb9fea8e31 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2269,6 +2269,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
>  	return 0;
>  }
>  
> +static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> +	/* IOMMU_PTE_FC is always set */
> +	return true;
> +}
> +
>  const struct iommu_ops amd_iommu_ops = {
>  	.capable = amd_iommu_capable,
>  	.domain_alloc = amd_iommu_domain_alloc,
> @@ -2291,6 +2297,7 @@ const struct iommu_ops amd_iommu_ops = {
>  		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>  		.iotlb_sync	= amd_iommu_iotlb_sync,
>  		.free		= amd_iommu_domain_free,
> +		.enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
>  	}
>  };
> 
> Thanks,
> Jason
> 


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-28 12:22                             ` Jason Gunthorpe via iommu
@ 2022-03-29  4:59                               ` Jason Wang
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Wang @ 2022-03-29  4:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu, Sean Mooney

On Mon, Mar 28, 2022 at 8:23 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote:
> > To me, it looks more easier to not answer this question by letting
> > userspace know about the change,
>
> That is not backwards compatible, so I don't think it helps unless we
> say if you open /dev/vfio/vfio you get old behavior and if you open
> /dev/iommu you get new...

Actually, this is one way to go. Trying to behave exactly like type1
might not be easy.

>
> Nor does it answer if I can fix RDMA or not :\
>

vDPA has backend feature negotiation, so userspace can actually tell
vDPA to opt in to the new accounting approach. I'm not sure RDMA can do
the same.

Thanks

> So we really do need to know what exactly is the situation here.
>
> Jason
>



* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-23 22:51     ` Alex Williamson
@ 2022-03-29  9:17       ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-03-29  9:17 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

On 2022/3/24 06:51, Alex Williamson wrote:
> On Fri, 18 Mar 2022 14:27:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
>> mapping them into io_pagetable operations. Doing so allows the use of
>> iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
>> SET_CONTAINER using a iommufd instead of a container fd is a followup
>> series.
>>
>> Internally the compatibility API uses a normal IOAS object that, like
>> vfio, is automatically allocated when the first device is
>> attached.
>>
>> Userspace can also query or set this IOAS object directly using the
>> IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
>> features while still using the VFIO style map/unmap ioctls.
>>
>> While this is enough to operate qemu, it is still a bit of a WIP with a
>> few gaps to be resolved:
>>
>>   - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
>>     split areas. The old mode can be implemented with a new operation to
>>     split an iopt_area into two without disturbing the iopt_pages or the
>>     domains, then unmapping a whole area as normal.
>>
>>   - Resource limits rely on memory cgroups to bound what userspace can do
>>     instead of the module parameter dma_entry_limit.
>>
>>   - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
>>     require some additional work to properly expose PFN lifecycle between
>>     VFIO and iommufd
>>
>>   - Various components of the mdev API are not completed yet
>>
>>   - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
>>     implemented.
>>
>>   - The 'dirty tracking' is not implemented
>>
>>   - A full audit for pedantic compatibility details (eg errnos, etc) has
>>     not yet been done
>>
>>   - powerpc SPAPR is left out, as it is not connected to the iommu_domain
>>     framework. My hope is that SPAPR will be moved into the iommu_domain
>>     framework as a special HW specific type and would expect power to
>>     support the generic interface through a normal iommu_domain.
> 
> My overall question here would be whether we can actually achieve a
> compatibility interface that has sufficient feature transparency that we
> can dump vfio code in favor of this interface, or will there be enough
> niche use cases that we need to keep type1 and vfio containers around
> through a deprecation process?
> 
> The locked memory differences for one seem like something that libvirt
> wouldn't want hidden and we have questions regarding support for vaddr
> hijacking and different ideas how to implement dirty page tracking, not
> to mention the missing features that are currently well used, like p2p
> mappings, coherency tracking, mdev, etc.
>
> It seems like quite an endeavor to fill all these gaps, while at the
> same time QEMU will be working to move to use iommufd directly in order
> to gain all the new features.

Hi Alex,

Jason hasn't included the vfio changes for adapting to iommufd, but they
are in this branch
(https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6). Eric and
I are working on adding iommufd support to QEMU as well. For the new QEMU
to run on old kernels, QEMU needs to support both the legacy
group/container interface and the new device/iommufd interface. We have
some draft code in that direction
(https://github.com/luxis1999/qemu/commits/qemu-for-5.17-rc4-vm); it works
for both the legacy group/container and the device/iommufd paths. It's
just for reference so far; Eric and I will sync further on it.
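
A minimal sketch of that runtime selection, assuming the /dev/iommu node
name discussed earlier in the thread (the paths and fallback policy here
are assumptions, not the draft QEMU code):

#include <fcntl.h>
#include <stdbool.h>

/* Prefer the new device/iommufd interface; fall back to the legacy
 * VFIO container so the same binary still runs on older kernels.
 */
static int open_iommu_backend(bool *using_iommufd)
{
	int fd = open("/dev/iommu", O_RDWR);

	if (fd >= 0) {
		*using_iommufd = true;
		return fd;
	}
	*using_iommufd = false;
	return open("/dev/vfio/vfio", O_RDWR);
}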

> Where do we focus attention?  Is symlinking device files our proposal
> to userspace and is that something achievable, or do we want to use
> this compatibility interface as a means to test the interface and
> allow userspace to make use of it for transition, if their use cases
> allow it, perhaps eventually performing the symlink after deprecation
> and eventual removal of the vfio container and type1 code?  Thanks,

I'm sure it is possible that one day the group/container interface will be
removed from the kernel, perhaps once SPAPR is supported by iommufd. But
what about QEMU: should QEMU keep backward compatibility forever, or may
QEMU one day also remove the group/container path and hence become unable
to work on old kernels?

-- 
Regards,
Yi Liu


* Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-03-29  4:59                               ` Jason Wang
@ 2022-03-29 11:46                                 ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-03-29 11:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tian, Kevin, Alex Williamson, Niklas Schnelle, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, iommu, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu, Sean Mooney

On Tue, Mar 29, 2022 at 12:59:52PM +0800, Jason Wang wrote:

> vDPA has a backend feature negotiation, then actually, userspace can
> tell vDPA to go with the new accounting approach. Not sure RDMA can do
> the same.

A security feature userspace can ask to turn off is not really a
security feature.

Jason


* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-30 13:35     ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-03-30 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

Hi Jason,

On 2022/3/19 01:27, Jason Gunthorpe wrote:
> Connect the IOAS to its IOCTL interface. This exposes most of the
> functionality in the io_pagetable to userspace.
> 
> This is intended to be the core of the generic interface that IOMMUFD will
> provide. Every IOMMU driver should be able to implement an iommu_domain
> that is compatible with this generic mechanism.
> 
> It is also designed to be easy to use for simple non virtual machine
> monitor users, like DPDK:
>   - Universal simple support for all IOMMUs (no PPC special path)
>   - An IOVA allocator that considers the aperture and the reserved ranges
>   - io_pagetable allows any number of iommu_domains to be connected to the
>     IOAS
> 
> Along with room in the design to add non-generic features to cater to
> specific HW functionality.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommufd/Makefile          |   1 +
>   drivers/iommu/iommufd/ioas.c            | 248 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  27 +++
>   drivers/iommu/iommufd/main.c            |  17 ++
>   include/uapi/linux/iommufd.h            | 132 +++++++++++++
>   5 files changed, 425 insertions(+)
>   create mode 100644 drivers/iommu/iommufd/ioas.c
> 
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index b66a8c47ff55ec..2b4f36f1b72f9d 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   iommufd-y := \
>   	io_pagetable.o \
> +	ioas.o \
>   	main.o \
>   	pages.o
>   
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> new file mode 100644
> index 00000000000000..c530b2ba74b06b
> --- /dev/null
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -0,0 +1,248 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#include <linux/interval_tree.h>
> +#include <linux/iommufd.h>
> +#include <linux/iommu.h>
> +#include <uapi/linux/iommufd.h>
> +
> +#include "io_pagetable.h"
> +
> +void iommufd_ioas_destroy(struct iommufd_object *obj)
> +{
> +	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
> +	int rc;
> +
> +	rc = iopt_unmap_all(&ioas->iopt);
> +	WARN_ON(rc);
> +	iopt_destroy_table(&ioas->iopt);
> +}
> +
> +struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
> +{
> +	struct iommufd_ioas *ioas;
> +	int rc;
> +
> +	ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
> +	if (IS_ERR(ioas))
> +		return ioas;
> +
> +	rc = iopt_init_table(&ioas->iopt);
> +	if (rc)
> +		goto out_abort;
> +	return ioas;
> +
> +out_abort:
> +	iommufd_object_abort(ictx, &ioas->obj);
> +	return ERR_PTR(rc);
> +}
> +
> +int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_alloc *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	int rc;
> +
> +	if (cmd->flags)
> +		return -EOPNOTSUPP;
> +
> +	ioas = iommufd_ioas_alloc(ucmd->ictx);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	cmd->out_ioas_id = ioas->obj.id;
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +	if (rc)
> +		goto out_table;
> +	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
> +	return 0;
> +
> +out_table:
> +	iommufd_ioas_destroy(&ioas->obj);
> +	return rc;
> +}
> +
> +int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
> +	struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	struct interval_tree_span_iter span;
> +	u32 max_iovas;
> +	int rc;
> +
> +	if (cmd->__reserved)
> +		return -EOPNOTSUPP;
> +
> +	max_iovas = cmd->size - sizeof(*cmd);
> +	if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
> +		return -EINVAL;
> +	max_iovas /= sizeof(cmd->out_valid_iovas[0]);
> +
> +	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	down_read(&ioas->iopt.iova_rwsem);
> +	cmd->out_num_iovas = 0;
> +	for (interval_tree_span_iter_first(
> +		     &span, &ioas->iopt.reserved_iova_itree, 0, ULONG_MAX);
> +	     !interval_tree_span_iter_done(&span);
> +	     interval_tree_span_iter_next(&span)) {
> +		if (!span.is_hole)
> +			continue;
> +		if (cmd->out_num_iovas < max_iovas) {
> +			rc = put_user((u64)span.start_hole,
> +				      &uptr->out_valid_iovas[cmd->out_num_iovas]
> +					       .start);
> +			if (rc)
> +				goto out_put;
> +			rc = put_user(
> +				(u64)span.last_hole,
> +				&uptr->out_valid_iovas[cmd->out_num_iovas].last);
> +			if (rc)
> +				goto out_put;
> +		}
> +		cmd->out_num_iovas++;
> +	}
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +	if (rc)
> +		goto out_put;
> +	if (cmd->out_num_iovas > max_iovas)
> +		rc = -EMSGSIZE;
> +out_put:
> +	up_read(&ioas->iopt.iova_rwsem);
> +	iommufd_put_object(&ioas->obj);
> +	return rc;
> +}
> +
> +static int conv_iommu_prot(u32 map_flags)
> +{
> +	int iommu_prot;
> +
> +	/*
> +	 * We provide no manual cache coherency ioctls to userspace and most
> +	 * architectures make the CPU ops for cache flushing privileged.
> +	 * Therefore we require the underlying IOMMU to support CPU coherent
> +	 * operation.
> +	 */
> +	iommu_prot = IOMMU_CACHE;
> +	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> +		iommu_prot |= IOMMU_WRITE;
> +	if (map_flags & IOMMU_IOAS_MAP_READABLE)
> +		iommu_prot |= IOMMU_READ;
> +	return iommu_prot;
> +}
> +
> +int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_map *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	unsigned int flags = 0;
> +	unsigned long iova;
> +	int rc;
> +
> +	if ((cmd->flags &
> +	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
> +	       IOMMU_IOAS_MAP_READABLE)) ||
> +	    cmd->__reserved)
> +		return -EOPNOTSUPP;
> +	if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
> +		return -EOVERFLOW;
> +
> +	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
> +		flags = IOPT_ALLOC_IOVA;
> +	iova = cmd->iova;
> +	rc = iopt_map_user_pages(&ioas->iopt, &iova,
> +				 u64_to_user_ptr(cmd->user_va), cmd->length,
> +				 conv_iommu_prot(cmd->flags), flags);
> +	if (rc)
> +		goto out_put;
> +
> +	cmd->iova = iova;
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +out_put:
> +	iommufd_put_object(&ioas->obj);
> +	return rc;
> +}
> +
> +int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_copy *cmd = ucmd->cmd;
> +	struct iommufd_ioas *src_ioas;
> +	struct iommufd_ioas *dst_ioas;
> +	struct iopt_pages *pages;
> +	unsigned int flags = 0;
> +	unsigned long iova;
> +	unsigned long start_byte;
> +	int rc;
> +
> +	if ((cmd->flags &
> +	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
> +	       IOMMU_IOAS_MAP_READABLE)))
> +		return -EOPNOTSUPP;
> +	if (cmd->length >= ULONG_MAX)
> +		return -EOVERFLOW;
> +
> +	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
> +	if (IS_ERR(src_ioas))
> +		return PTR_ERR(src_ioas);
> +	/* FIXME: copy is not limited to an exact match anymore */
> +	pages = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, &start_byte,
> +			       cmd->length);
> +	iommufd_put_object(&src_ioas->obj);
> +	if (IS_ERR(pages))
> +		return PTR_ERR(pages);
> +
> +	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
> +	if (IS_ERR(dst_ioas)) {
> +		iopt_put_pages(pages);
> +		return PTR_ERR(dst_ioas);
> +	}
> +
> +	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
> +		flags = IOPT_ALLOC_IOVA;
> +	iova = cmd->dst_iova;
> +	rc = iopt_map_pages(&dst_ioas->iopt, pages, &iova, start_byte,
> +			    cmd->length, conv_iommu_prot(cmd->flags), flags);
> +	if (rc) {
> +		iopt_put_pages(pages);
> +		goto out_put_dst;
> +	}
> +
> +	cmd->dst_iova = iova;
> +	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
> +out_put_dst:
> +	iommufd_put_object(&dst_ioas->obj);
> +	return rc;
> +}
> +
> +int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_ioas_unmap *cmd = ucmd->cmd;
> +	struct iommufd_ioas *ioas;
> +	int rc;
> +
> +	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
> +	if (IS_ERR(ioas))
> +		return PTR_ERR(ioas);
> +
> +	if (cmd->iova == 0 && cmd->length == U64_MAX) {
> +		rc = iopt_unmap_all(&ioas->iopt);
> +	} else {
> +		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
> +			rc = -EOVERFLOW;
> +			goto out_put;
> +		}
> +		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
> +	}
> +
> +out_put:
> +	iommufd_put_object(&ioas->obj);
> +	return rc;
> +}
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index bcf08e61bc87e9..d24c9dac5a82a9 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
>   enum iommufd_object_type {
>   	IOMMUFD_OBJ_NONE,
>   	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
> +	IOMMUFD_OBJ_IOAS,
>   	IOMMUFD_OBJ_MAX,
>   };
>   
> @@ -147,4 +148,30 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
>   			     type),                                            \
>   		     typeof(*(ptr)), obj)
>   
> +/*
> + * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
> + * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
> + * mapping is copied into all of the associated domains and made available to
> + * in-kernel users.
> + */
> +struct iommufd_ioas {
> +	struct iommufd_object obj;
> +	struct io_pagetable iopt;
> +};
> +
> +static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
> +						    u32 id)
> +{
> +	return container_of(iommufd_get_object(ucmd->ictx, id,
> +					       IOMMUFD_OBJ_IOAS),
> +			    struct iommufd_ioas, obj);
> +}
> +
> +struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
> +int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
> +void iommufd_ioas_destroy(struct iommufd_object *obj);
> +int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
> +int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
>   #endif
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index ae8db2f663004f..e506f493b54cfe 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -184,6 +184,10 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
>   }
>   
>   union ucmd_buffer {
> +	struct iommu_ioas_alloc alloc;
> +	struct iommu_ioas_iova_ranges iova_ranges;
> +	struct iommu_ioas_map map;
> +	struct iommu_ioas_unmap unmap;
>   	struct iommu_destroy destroy;
>   };
>   
> @@ -205,6 +209,16 @@ struct iommufd_ioctl_op {
>   	}
>   static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>   	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
> +	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
> +		 struct iommu_ioas_alloc, out_ioas_id),
> +	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
> +		 src_iova),
> +	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
> +		 struct iommu_ioas_iova_ranges, __reserved),
> +	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
> +		 __reserved),
> +	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
> +		 length),
>   };
>   
>   static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
> @@ -270,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
>   }
>   
>   static struct iommufd_object_ops iommufd_object_ops[] = {
> +	[IOMMUFD_OBJ_IOAS] = {
> +		.destroy = iommufd_ioas_destroy,
> +	},
>   };
>   
>   static struct miscdevice iommu_misc_dev = {
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 2f7f76ec6db4cb..ba7b17ec3002e3 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -37,6 +37,11 @@
>   enum {
>   	IOMMUFD_CMD_BASE = 0x80,
>   	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
> +	IOMMUFD_CMD_IOAS_ALLOC,
> +	IOMMUFD_CMD_IOAS_IOVA_RANGES,
> +	IOMMUFD_CMD_IOAS_MAP,
> +	IOMMUFD_CMD_IOAS_COPY,
> +	IOMMUFD_CMD_IOAS_UNMAP,
>   };
>   
>   /**
> @@ -52,4 +57,131 @@ struct iommu_destroy {
>   };
>   #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
>   
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @out_num_iovas: Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> + *                   of out_num_iovas or the length implied by size.
> + * @out_valid_iovas.start: First IOVA in the allowed range
> + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> + * not allowed. out_num_iovas will be set to the total number of iovas
> + * and the out_valid_iovas[] will be filled in as space permits.
> + * size should include the allocated flex array.
> + */
> +struct iommu_ioas_iova_ranges {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 out_num_iovas;
> +	__u32 __reserved;
> +	struct iommu_valid_iovas {
> +		__aligned_u64 start;
> +		__aligned_u64 last;
> +	} out_valid_iovas[];
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> +
> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
> + *                             IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> +	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> +	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> +	IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *        then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
> + * mapping will be established at iova, otherwise a suitable location will be
> + * automatically selected and returned in iova.
> + */
> +struct iommu_ioas_map {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 ioas_id;
> +	__u32 __reserved;
> +	__aligned_u64 user_va;
> +	__aligned_u64 length;
> +	__aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from

So the dst and src ioas_ids are allocated via the same iommufd,
right? Just out of curiosity, do you think it is possible that
the src/dst ioas_ids could come from different iommufds? In that
case we may need to add src/dst iommufd fields. It's not needed
today; I just want to see if there is any blocker in the kernel
to supporting such a copy. :-)

> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *            set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + */
> +struct iommu_ioas_copy {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dst_ioas_id;
> +	__u32 src_ioas_id;
> +	__aligned_u64 length;
> +	__aligned_u64 dst_iova;
> +	__aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> +
> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap
> + *
> + * Unmap an IOVA range. The iova/length must exactly match a range
> + * used with IOMMU_IOAS_PAGETABLE_MAP, or be the values 0 & U64_MAX.
> + * In the latter case all IOVAs will be unmaped.
> + */
> +struct iommu_ioas_unmap {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
>   #endif

-- 
Regards,
Yi Liu
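
For readers following the uapi quoted above, a minimal userspace sketch
of the ALLOC + MAP flow (the iommufd fd is assumed to be an already open
/dev/iommu; error handling is condensed and illustrative):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Map a user buffer into a freshly allocated IOAS and let the kernel
 * pick the IOVA (IOMMU_IOAS_MAP_FIXED_IOVA left clear).
 */
static int example_map_buffer(int iommufd, void *buf, size_t len,
			      __u64 *iova_out)
{
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_map map = { .size = sizeof(map) };

	if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc))
		return -1;

	map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
	map.ioas_id = alloc.out_ioas_id;
	map.user_va = (uintptr_t)buf;
	map.length = len;
	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		return -1;

	*iova_out = map.iova;	/* kernel-selected IOVA */
	return 0;
}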


* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-03-31  4:36     ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-03-31  4:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu


On Fri, Mar 18, 2022 at 02:27:33PM -0300, Jason Gunthorpe wrote:
> Connect the IOAS to its IOCTL interface. This exposes most of the
> functionality in the io_pagetable to userspace.
> 
> This is intended to be the core of the generic interface that IOMMUFD will
> provide. Every IOMMU driver should be able to implement an iommu_domain
> that is compatible with this generic mechanism.
> 
> It is also designed to be easy to use for simple non virtual machine
> monitor users, like DPDK:
>  - Universal simple support for all IOMMUs (no PPC special path)
>  - An IOVA allocator that considers the aperture and the reserved ranges
>  - io_pagetable allows any number of iommu_domains to be connected to the
>    IOAS
> 
> Along with room in the design to add non-generic features to cater to
> specific HW functionality.


[snip]
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @out_num_iovas: Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> + *                   of out_num_iovas or the length implied by size.
> + * @out_valid_iovas.start: First IOVA in the allowed range
> + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> + * not allowed. out_num_iovas will be set to the total number of iovas
> + * and the out_valid_iovas[] will be filled in as space permits.
> + * size should include the allocated flex array.
> + */
> +struct iommu_ioas_iova_ranges {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 out_num_iovas;
> +	__u32 __reserved;
> +	struct iommu_valid_iovas {
> +		__aligned_u64 start;
> +		__aligned_u64 last;
> +	} out_valid_iovas[];
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)

Is the information returned by this valid for the lifetime of the IOAS,
or can it change?  If it can change, what events can change it?

If it *can't* change, then how do we have enough information to
determine this at ALLOC time, since we don't necessarily know which
(if any) hardware IOMMU will be attached to it?

> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
> + *                             IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> +	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> +	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> +	IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *        then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
> + * mapping will be established at iova, otherwise a suitable location will be
> + * automatically selected and returned in iova.
> + */
> +struct iommu_ioas_map {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 ioas_id;
> +	__u32 __reserved;
> +	__aligned_u64 user_va;
> +	__aligned_u64 length;
> +	__aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from
> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *            set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + */
> +struct iommu_ioas_copy {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dst_ioas_id;
> +	__u32 src_ioas_id;
> +	__aligned_u64 length;
> +	__aligned_u64 dst_iova;
> +	__aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)

Since it can only copy a single mapping, what's the benefit of this
over just repeating an IOAS_MAP in the new IOAS?

> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> > + * @size: sizeof(struct iommu_ioas_unmap)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap
> + *
> + * Unmap an IOVA range. The iova/length must exactly match a range
> > + * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
> > + * In the latter case all IOVAs will be unmapped.
> + */
> +struct iommu_ioas_unmap {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
>  #endif

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-31  4:36     ` David Gibson
@ 2022-03-31  5:41       ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-03-31  5:41 UTC (permalink / raw)
  To: David Gibson, Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Thursday, March 31, 2022 12:36 PM
> > +
> > +/**
> > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > + * @ioas_id: IOAS ID to read ranges from
> > + * @out_num_iovas: Output total number of ranges in the IOAS
> > + * @__reserved: Must be 0
> > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the
> smaller
> > + *                   of out_num_iovas or the length implied by size.
> > + * @out_valid_iovas.start: First IOVA in the allowed range
> > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > + *
> > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these
> ranges is
> > + * not allowed. out_num_iovas will be set to the total number of iovas
> > + * and the out_valid_iovas[] will be filled in as space permits.
> > + * size should include the allocated flex array.
> > + */
> > +struct iommu_ioas_iova_ranges {
> > +	__u32 size;
> > +	__u32 ioas_id;
> > +	__u32 out_num_iovas;
> > +	__u32 __reserved;
> > +	struct iommu_valid_iovas {
> > +		__aligned_u64 start;
> > +		__aligned_u64 last;
> > +	} out_valid_iovas[];
> > +};
> > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE,
> IOMMUFD_CMD_IOAS_IOVA_RANGES)
> 
> Is the information returned by this valid for the lifeime of the IOAS,
> or can it change?  If it can change, what events can change it?
> 

It can change when a new device is attached to an ioas.

You can look at iopt_table_enforce_group_resv_regions() in patch 7,
which is called by iommufd_device_attach() in patch 10. That function
first checks whether the new reserved ranges from the attached device
are already in use and, if there is no conflict, adds them to the list
of reserved ranges of this ioas.

Userspace can call this ioctl to retrieve updated IOVA range info after
attaching a device.
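
For illustration, a userspace sketch of that re-query could look roughly
like below. The two-pass sizing of the flex array is an assumption based
on the kernel-doc quoted above, and error handling is omitted:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Re-read the allowed IOVA ranges, e.g. after attaching a device */
static struct iommu_ioas_iova_ranges *query_iova_ranges(int iommufd,
							 uint32_t ioas_id)
{
	struct iommu_ioas_iova_ranges hdr = {
		.size = sizeof(hdr),
		.ioas_id = ioas_id,
	};
	struct iommu_ioas_iova_ranges *cmd;
	size_t sz;

	/* First call: out_num_iovas reports the total number of ranges */
	ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, &hdr);

	/* Second call: size now includes room for the whole flex array */
	sz = sizeof(*cmd) + hdr.out_num_iovas * sizeof(cmd->out_valid_iovas[0]);
	cmd = calloc(1, sz);
	cmd->size = sz;
	cmd->ioas_id = ioas_id;
	ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, cmd);
	return cmd;
}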

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-31  4:36     ` David Gibson
@ 2022-03-31 12:58       ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-31 12:58 UTC (permalink / raw)
  To: David Gibson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins

On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:

> > +/**
> > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > + * @ioas_id: IOAS ID to read ranges from
> > + * @out_num_iovas: Output total number of ranges in the IOAS
> > + * @__reserved: Must be 0
> > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > + *                   of out_num_iovas or the length implied by size.
> > + * @out_valid_iovas.start: First IOVA in the allowed range
> > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > + *
> > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > + * not allowed. out_num_iovas will be set to the total number of iovas
> > + * and the out_valid_iovas[] will be filled in as space permits.
> > + * size should include the allocated flex array.
> > + */
> > +struct iommu_ioas_iova_ranges {
> > +	__u32 size;
> > +	__u32 ioas_id;
> > +	__u32 out_num_iovas;
> > +	__u32 __reserved;
> > +	struct iommu_valid_iovas {
> > +		__aligned_u64 start;
> > +		__aligned_u64 last;
> > +	} out_valid_iovas[];
> > +};
> > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> 
> Is the information returned by this valid for the lifetime of the IOAS,
> or can it change?  If it can change, what events can change it?
>
> If it *can't* change, then how do we have enough information to
> determine this at ALLOC time, since we don't necessarily know which
> (if any) hardware IOMMU will be attached to it.

It is a good point worth documenting. It can change. Particularly
after any device attachment.

I added this:

 * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
 * is not allowed. out_num_iovas will be set to the total number of iovas and
 * the out_valid_iovas[] will be filled in as space permits. size should include
 * the allocated flex array.
 *
 * The allowed ranges are dependent on the HW path the DMA operation takes, and
 * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
 * full range, and each attached device will narrow the ranges based on that
 * device's HW restrictions.
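
As a usage sketch of that rule, userspace might validate a candidate
window against the returned array before attempting a FIXED_IOVA map.
The helper below is illustrative only (not part of this series) and
assumes the buffer was sized to hold all out_num_iovas entries:

#include <stdbool.h>
#include <stdint.h>
#include <linux/iommufd.h>

/* True if [iova, iova + length) fits entirely inside one of the ranges
 * returned by IOMMU_IOAS_IOVA_RANGES */
static bool iova_window_allowed(const struct iommu_ioas_iova_ranges *r,
				uint64_t iova, uint64_t length)
{
	uint64_t last = iova + length - 1;
	uint32_t i;

	for (i = 0; i != r->out_num_iovas; i++)
		if (iova >= r->out_valid_iovas[i].start &&
		    last <= r->out_valid_iovas[i].last)
			return true;
	return false;
}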


> > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> 
> Since it can only copy a single mapping, what's the benefit of this
> over just repeating an IOAS_MAP in the new IOAS?

It causes the underlying pin accounting to be shared and can avoid
calling GUP entirely.
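
To make that concrete, a sketch of the intended pattern is below: map the
buffer once into a source IOAS, then copy that exact iova/length into the
destination so the pages are pinned and accounted only once. Only the two
ioctls and their structs come from the patch; the function name and error
handling are assumptions:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int map_then_copy(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
			 void *buf, uint64_t len)
{
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = src_ioas,
		.user_va = (uintptr_t)buf,
		.length = len,
	};
	struct iommu_ioas_copy copy = {
		.size = sizeof(copy),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.dst_ioas_id = dst_ioas,
		.src_ioas_id = src_ioas,
		.length = len,
	};

	/* Pins and accounts the pages once, the kernel picks the IOVA */
	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		return -1;

	/* Re-uses the pins of the source mapping instead of GUP again */
	copy.src_iova = map.iova;
	return ioctl(iommufd, IOMMU_IOAS_COPY, &copy);
}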

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-30 13:35     ` Yi Liu
@ 2022-03-31 12:59       ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-03-31 12:59 UTC (permalink / raw)
  To: Yi Liu
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:

> > +/**
> > + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> > + * @size: sizeof(struct iommu_ioas_copy)
> > + * @flags: Combination of enum iommufd_ioas_map_flags
> > + * @dst_ioas_id: IOAS ID to change the mapping of
> > + * @src_ioas_id: IOAS ID to copy from
> 
> so the dst and src ioas_id are allocated via the same iommufd,
> right? Just out of curiosity, do you think it is possible that
> the src/dst ioas_ids are from different iommufds? In that case
> we may need to add src/dst iommufd fields. It's not needed today, just to
> see if there is any blocker in the kernel to supporting such a copy. :-)

Yes, all IDs in all ioctls are within the scope of a single iommufd.

There should be no reason for a single application to open multiple
iommufds.
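
(As a sketch of that model: one open of /dev/iommu, the node name used
elsewhere in this thread, and every object ID lives in that single fd's
ID space. Error handling omitted; the function name is illustrative.)

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static void two_ioas_one_iommufd(void)
{
	int fd = open("/dev/iommu", O_RDWR);
	struct iommu_ioas_alloc a = { .size = sizeof(a) };
	struct iommu_ioas_alloc b = { .size = sizeof(b) };

	ioctl(fd, IOMMU_IOAS_ALLOC, &a);
	ioctl(fd, IOMMU_IOAS_ALLOC, &b);

	/* a.out_ioas_id and b.out_ioas_id are only meaningful on this fd,
	 * e.g. as src_ioas_id/dst_ioas_id of IOMMU_IOAS_COPY */
}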

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-31 12:59       ` Jason Gunthorpe
@ 2022-04-01 13:30         ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-01 13:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu



On 2022/3/31 20:59, Jason Gunthorpe wrote:
> On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:
> 
>>> +/**
>>> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
>>> + * @size: sizeof(struct iommu_ioas_copy)
>>> + * @flags: Combination of enum iommufd_ioas_map_flags
>>> + * @dst_ioas_id: IOAS ID to change the mapping of
>>> + * @src_ioas_id: IOAS ID to copy from
>>
>> so the dst and src ioas_id are allocated via the same iommufd,
>> right? Just out of curiosity, do you think it is possible that
>> the src/dst ioas_ids are from different iommufds? In that case
>> we may need to add src/dst iommufd fields. It's not needed today, just to
>> see if there is any blocker in the kernel to supporting such a copy. :-)
> 
> Yes, all IDs in all ioctls are within the scope of a single iommufd.
> 
> There should be no reason for a single application to open multiple
> iommufds.

then should this be documented?

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 00/12] IOMMUFD Generic interface
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-04-12 20:13   ` Eric Auger
  -1 siblings, 0 replies; 244+ messages in thread
From: Eric Auger @ 2022-04-12 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

Hi,

On 3/18/22 6:27 PM, Jason Gunthorpe wrote:
> iommufd is the user API to control the IOMMU subsystem as it relates to
> managing IO page tables that point at user space memory.
>
> It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
> container) which is the VFIO specific interface for a similar idea.
>
> We see a broad need for extended features, some being highly IOMMU device
> specific:
>  - Binding iommu_domain's to PASID/SSID
>  - Userspace page tables, for ARM, x86 and S390
>  - Kernel bypass'd invalidation of user page tables
>  - Re-use of the KVM page table in the IOMMU
>  - Dirty page tracking in the IOMMU
>  - Runtime Increase/Decrease of IOPTE size
>  - PRI support with faults resolved in userspace

This series does not have any concept of group fds anymore and the API
is device oriented.
I have a question wrt pci bus reset capability.

8b27ee60bfd6 ("vfio-pci: PCI hot reset interface")
introduced VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and VFIO_DEVICE_PCI_HOT_RESET

Maybe we can reuse VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to retrieve the
devices and iommu groups that need to be checked and involved in the bus
reset. If I understand correctly we now need to make sure the devices are
handled in the same security context (bound to the same iommufd).

However, VFIO_DEVICE_PCI_HOT_RESET operates on a collection of group fds.

How do you see the porting of this functionality onto /dev/iommu?

Thanks

Eric




>
> As well as a need to access these features beyond just VFIO, VDPA for
> instance, but other classes of accelerator HW are touching on these areas
> now too.
>
> The v1 series proposed re-using the VFIO type 1 data structure, however it
> was suggested that if we are doing this big update then we should also
> come with a data structure that solves the limitations that VFIO type1
> has. Notably this addresses:
>
>  - Multiple IOAS/'containers' and multiple domains inside a single FD
>
>  - Single-pin operation no matter how many domains and containers use
>    a page
>
>  - A fine grained locking scheme supporting user managed concurrency for
>    multi-threaded map/unmap
>
>  - A pre-registration mechanism to optimize vIOMMU use cases by
>    pre-pinning pages
>
>  - Extended ioctl API that can manage these new objects and exposes
>    domains directly to user space
>
>  - domains are sharable between subsystems, eg VFIO and VDPA
>
> The bulk of this code is a new data structure design to track how the
> IOVAs are mapped to PFNs.
>
> iommufd intends to be general and consumable by any driver that wants to
> DMA to userspace. From a driver perspective it can largely be dropped in
> in-place of iommu_attach_device() and provides a uniform full feature set
> to all consumers.
>
> As this is a larger project this series is the first step. This series
> provides the iommfd "generic interface" which is designed to be suitable
> for applications like DPDK and VMM flows that are not optimized to
> specific HW scenarios. It is close to being a drop in replacement for the
> existing VFIO type 1.
>
> This is part two of three for an initial sequence:
>  - Move IOMMU Group security into the iommu layer
>    https://lore.kernel.org/linux-iommu/20220218005521.172832-1-baolu.lu@linux.intel.com/
>  * Generic IOMMUFD implementation
>  - VFIO ability to consume IOMMUFD
>    An early exploration of this is available here:
>     https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6
>
> Various parts of the above extended features are in WIP stages currently
> to define how their IOCTL interface should work.
>
> At this point, using the draft VFIO series, unmodified qemu has been
> tested to operate using iommufd on x86 and ARM systems.
>
> Several people have contributed directly to this work: Eric Auger, Kevin
> Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have participated in the
> discussions that lead here, and provided ideas. Thanks to all!
>
> This is on github: https://github.com/jgunthorpe/linux/commits/iommufd
>
> # S390 in-kernel page table walker
> Cc: Niklas Schnelle <schnelle@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> # AMD Dirty page tracking
> Cc: Joao Martins <joao.m.martins@oracle.com>
> # ARM SMMU Dirty page tracking
> Cc: Keqian Zhu <zhukeqian1@huawei.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> # ARM SMMU nesting
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> # Map/unmap performance
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> # VDPA
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> # Power
> Cc: David Gibson <david@gibson.dropbear.id.au>
> # vfio
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: kvm@vger.kernel.org
> # iommu
> Cc: iommu@lists.linux-foundation.org
> # Collaborators
> Cc: "Chaitanya Kulkarni" <chaitanyak@nvidia.com>
> Cc: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>
> Jason Gunthorpe (11):
>   interval-tree: Add a utility to iterate over spans in an interval tree
>   iommufd: File descriptor, context, kconfig and makefiles
>   kernel/user: Allow user::locked_vm to be usable for iommufd
>   iommufd: PFN handling for iopt_pages
>   iommufd: Algorithms for PFN storage
>   iommufd: Data structure to provide IOVA to PFN mapping
>   iommufd: IOCTLs for the io_pagetable
>   iommufd: Add a HW pagetable object
>   iommufd: Add kAPI toward external drivers
>   iommufd: vfio container FD ioctl compatibility
>   iommufd: Add a selftest
>
> Kevin Tian (1):
>   iommufd: Overview documentation
>
>  Documentation/userspace-api/index.rst         |    1 +
>  .../userspace-api/ioctl/ioctl-number.rst      |    1 +
>  Documentation/userspace-api/iommufd.rst       |  224 +++
>  MAINTAINERS                                   |   10 +
>  drivers/iommu/Kconfig                         |    1 +
>  drivers/iommu/Makefile                        |    2 +-
>  drivers/iommu/iommufd/Kconfig                 |   22 +
>  drivers/iommu/iommufd/Makefile                |   13 +
>  drivers/iommu/iommufd/device.c                |  274 ++++
>  drivers/iommu/iommufd/hw_pagetable.c          |  142 ++
>  drivers/iommu/iommufd/io_pagetable.c          |  890 +++++++++++
>  drivers/iommu/iommufd/io_pagetable.h          |  170 +++
>  drivers/iommu/iommufd/ioas.c                  |  252 ++++
>  drivers/iommu/iommufd/iommufd_private.h       |  231 +++
>  drivers/iommu/iommufd/iommufd_test.h          |   65 +
>  drivers/iommu/iommufd/main.c                  |  346 +++++
>  drivers/iommu/iommufd/pages.c                 | 1321 +++++++++++++++++
>  drivers/iommu/iommufd/selftest.c              |  495 ++++++
>  drivers/iommu/iommufd/vfio_compat.c           |  401 +++++
>  include/linux/interval_tree.h                 |   41 +
>  include/linux/iommufd.h                       |   50 +
>  include/linux/sched/user.h                    |    2 +-
>  include/uapi/linux/iommufd.h                  |  223 +++
>  kernel/user.c                                 |    1 +
>  lib/interval_tree.c                           |   98 ++
>  tools/testing/selftests/Makefile              |    1 +
>  tools/testing/selftests/iommu/.gitignore      |    2 +
>  tools/testing/selftests/iommu/Makefile        |   11 +
>  tools/testing/selftests/iommu/config          |    2 +
>  tools/testing/selftests/iommu/iommufd.c       | 1225 +++++++++++++++
>  30 files changed, 6515 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/userspace-api/iommufd.rst
>  create mode 100644 drivers/iommu/iommufd/Kconfig
>  create mode 100644 drivers/iommu/iommufd/Makefile
>  create mode 100644 drivers/iommu/iommufd/device.c
>  create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.c
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>  create mode 100644 drivers/iommu/iommufd/ioas.c
>  create mode 100644 drivers/iommu/iommufd/iommufd_private.h
>  create mode 100644 drivers/iommu/iommufd/iommufd_test.h
>  create mode 100644 drivers/iommu/iommufd/main.c
>  create mode 100644 drivers/iommu/iommufd/pages.c
>  create mode 100644 drivers/iommu/iommufd/selftest.c
>  create mode 100644 drivers/iommu/iommufd/vfio_compat.c
>  create mode 100644 include/linux/iommufd.h
>  create mode 100644 include/uapi/linux/iommufd.h
>  create mode 100644 tools/testing/selftests/iommu/.gitignore
>  create mode 100644 tools/testing/selftests/iommu/Makefile
>  create mode 100644 tools/testing/selftests/iommu/config
>  create mode 100644 tools/testing/selftests/iommu/iommufd.c
>
>
> base-commit: d1c716ed82a6bf4c35ba7be3741b9362e84cd722


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 00/12] IOMMUFD Generic interface
  2022-04-12 20:13   ` Eric Auger
@ 2022-04-12 20:22     ` Jason Gunthorpe
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-12 20:22 UTC (permalink / raw)
  To: Eric Auger
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

On Tue, Apr 12, 2022 at 10:13:32PM +0200, Eric Auger wrote:
> Hi,
> 
> On 3/18/22 6:27 PM, Jason Gunthorpe wrote:
> > iommufd is the user API to control the IOMMU subsystem as it relates to
> > managing IO page tables that point at user space memory.
> >
> > It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
> > container) which is the VFIO specific interface for a similar idea.
> >
> > We see a broad need for extended features, some being highly IOMMU device
> > specific:
> >  - Binding iommu_domain's to PASID/SSID
> >  - Userspace page tables, for ARM, x86 and S390
> >  - Kernel bypass'd invalidation of user page tables
> >  - Re-use of the KVM page table in the IOMMU
> >  - Dirty page tracking in the IOMMU
> >  - Runtime Increase/Decrease of IOPTE size
> >  - PRI support with faults resolved in userspace
> 
> This series does not have any concept of group fds anymore and the API
> is device oriented.
> I have a question wrt pci bus reset capability.
> 
> 8b27ee60bfd6 ("vfio-pci: PCI hot reset interface")
> introduced VFIO_DEVICE_PCI_GET_HOT_RESET_INFO and VFIO_DEVICE_PCI_HOT_RESET
> 
> Maybe we can reuse VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to retrieve the devices and iommu groups that need to be checked and involved in the bus reset. If I understand correctly we now need to make sure the devices are handled in the same security context (bound to the same iommufd)
> 
> however VFIO_DEVICE_PCI_HOT_RESET operate on a collection of group fds.
> 
> How do you see the porting of this functionality onto /dev/iommu?

I already made a patch that converts VFIO_DEVICE_PCI_HOT_RESET to work
on a generic notion of a file and the underlying infrastructure to
allow it to accept either a device or group fd.

Same for the similar issue in KVM.

It is part of three VFIO series I will be posting. First is up here:

https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-vfio_mdev_no_group_jgg@nvidia.com/

Overall the strategy is to contain the vfio_group as an internal detail
of vfio.ko; external interfaces use either a struct vfio_device * or a
struct file *.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 00/12] IOMMUFD Generic interface
  2022-04-12 20:22     ` Jason Gunthorpe
@ 2022-04-12 20:50       ` Eric Auger
  -1 siblings, 0 replies; 244+ messages in thread
From: Eric Auger @ 2022-04-12 20:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

Hi Jason,

On 4/12/22 10:22 PM, Jason Gunthorpe wrote:
> On Tue, Apr 12, 2022 at 10:13:32PM +0200, Eric Auger wrote:
>> Hi,
>>
>> On 3/18/22 6:27 PM, Jason Gunthorpe wrote:
>>> iommufd is the user API to control the IOMMU subsystem as it relates to
>>> managing IO page tables that point at user space memory.
>>>
>>> It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
>>> container) which is the VFIO specific interface for a similar idea.
>>>
>>> We see a broad need for extended features, some being highly IOMMU device
>>> specific:
>>>  - Binding iommu_domain's to PASID/SSID
>>>  - Userspace page tables, for ARM, x86 and S390
>>>  - Kernel bypass'd invalidation of user page tables
>>>  - Re-use of the KVM page table in the IOMMU
>>>  - Dirty page tracking in the IOMMU
>>>  - Runtime Increase/Decrease of IOPTE size
>>>  - PRI support with faults resolved in userspace
>> This series does not have any concept of group fds anymore and the API
>> is device oriented.
>> I have a question wrt pci bus reset capability.
>>
>> 8b27ee60bfd6 ("vfio-pci: PCI hot reset interface")
>> introduced VFIO_DEVICE_PCI_GET_HOT_RESET_INFO and VFIO_DEVICE_PCI_HOT_RESET
>>
>> Maybe we can reuse VFIO_DEVICE_GET_PCI_HOT_RESET_INFO to retrieve the devices and iommu groups that need to be checked and involved in the bus reset. If I understand correctly we now need to make sure the devices are handled in the same security context (bound to the same iommufd)
>>
>> however VFIO_DEVICE_PCI_HOT_RESET operate on a collection of group fds.
>>
>> How do you see the porting of this functionality onto /dev/iommu?
> I already made a patch that converts VFIO_DEVICE_PCI_HOT_RESET to work
> on a generic notion of a file and the underlying infrastructure to
> allow it to accept either a device or group fd.
>
> Same for the similar issue in KVM.
>
> It is part of three VFIO series I will be posting. First is up here:
>
> https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-vfio_mdev_no_group_jgg@nvidia.com/
>
> Overall the strategy is to contain the vfio_group as an internal detail
> of vfio.ko and external interfaces use either a struct vfio_device *
> or a struct file *
Thank you for the quick reply. Yi and I will look at this series. I
guess we won't support the bus reset functionality in our first QEMU
porting onto /dev/iommu until that code stabilizes.

Eric
>
> Jason
>


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-04-13 14:02     ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-13 14:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

Hi Jason,

On 2022/3/19 01:27, Jason Gunthorpe wrote:
> This is the remainder of the IOAS data structure. Provide an object called
> an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
> along with a list of iommu_domains that mirror the IOVA to PFN map.
> 
> At the top this is a simple interval tree of iopt_areas indicating the map
> of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
> on the attached domains there is a minimum alignment for areas (which may
> be smaller than PAGE_SIZE) and an interval tree of reserved IOVA that
> can't be mapped.
> 
> The concept of a 'user' refers to something like a VFIO mdev that is
> accessing the IOVA and using a 'struct page *' for CPU based access.
> 
> Externally an API is provided that matches the requirements of the IOCTL
> interface for map/unmap and domain attachment.
> 
> The API provides a 'copy' primitive to establish a new IOVA map in a
> different IOAS from an existing mapping.
> 
> This is designed to support a pre-registration flow where userspace would
> set up a dummy IOAS with no domains, map in memory and then establish a
> user to pin all PFNs into the xarray.
> 
> Copy can then be used to create new IOVA mappings in a different IOAS,
> with iommu_domains attached. Upon copy the PFNs will be read out of the
> xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
> overheads.
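
For orientation while reading the code below, a condensed sketch of the
objects the commit message describes. The field names are taken from this
patch, but treat the listing as illustrative rather than the literal
definitions:

    /* Illustrative sketch only; see iommufd_private.h and io_pagetable.h
     * in this series for the real definitions. */
    struct io_pagetable {                           /* one per IOAS */
            struct rw_semaphore   domains_rwsem;
            struct xarray         domains;          /* attached iommu_domains */
            struct rw_semaphore   iova_rwsem;
            struct rb_root_cached area_itree;       /* interval tree: IOVA -> iopt_area */
            struct rb_root_cached reserved_iova_itree; /* IOVA that cannot be mapped */
            unsigned long         iova_alignment;
    };

    struct iopt_area {                              /* one mapped IOVA range */
            struct interval_tree_node node;         /* the area's IOVA span */
            struct interval_tree_node pages_node;   /* its slice of the iopt_pages */
            struct io_pagetable   *iopt;
            struct iopt_pages     *pages;           /* pinned PFN storage */
            struct iommu_domain   *storage_domain;
            atomic_t              num_users;        /* in-kernel 'users' of the PFNs */
            int                   iommu_prot;
            unsigned int          page_offset;
    };
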
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommufd/Makefile          |   1 +
>   drivers/iommu/iommufd/io_pagetable.c    | 890 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  35 +
>   3 files changed, 926 insertions(+)
>   create mode 100644 drivers/iommu/iommufd/io_pagetable.c
> 
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index 05a0e91e30afad..b66a8c47ff55ec 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   iommufd-y := \
> +	io_pagetable.o \
>   	main.o \
>   	pages.o
>   
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> new file mode 100644
> index 00000000000000..f9f3b06946bfb9
> --- /dev/null
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -0,0 +1,890 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
> + *
> + * The io_pagetable is the top of the data structure that maps IOVAs to PFNs. The
> + * PFNs can be placed into an iommu_domain, or returned to the caller as a page
> + * list for access by an in-kernel user.
> + *
> + * The data structure uses the iopt_pages to optimize the storage of the PFNs
> + * between the domains and xarray.
> + */
> +#include <linux/lockdep.h>
> +#include <linux/iommu.h>
> +#include <linux/sched/mm.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/errno.h>
> +
> +#include "io_pagetable.h"
> +
> +static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
> +					     unsigned long iova)
> +{
> +	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> +		WARN_ON(iova < iopt_area_iova(area) ||
> +			iova > iopt_area_last_iova(area));
> +	return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
> +}
> +
> +static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
> +					      unsigned long iova,
> +					      unsigned long last_iova)
> +{
> +	struct iopt_area *area;
> +
> +	area = iopt_area_iter_first(iopt, iova, last_iova);
> +	if (!area || !area->pages || iopt_area_iova(area) != iova ||
> +	    iopt_area_last_iova(area) != last_iova)
> +		return NULL;
> +	return area;
> +}
> +
> +static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
> +				    unsigned long length,
> +				    unsigned long iova_alignment,
> +				    unsigned long page_offset)
> +{
> +	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
> +		return false;
> +
> +	span->start_hole =
> +		ALIGN(span->start_hole, iova_alignment) | page_offset;
> +	if (span->start_hole > span->last_hole ||
> +	    span->last_hole - span->start_hole < length - 1)
> +		return false;
> +	return true;
> +}
> +
> +/*
> + * Automatically find a block of IOVA that is not being used and not reserved.
> + * Does not return a 0 IOVA even if it is valid.
> + */
> +static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
> +			   unsigned long uptr, unsigned long length)
> +{
> +	struct interval_tree_span_iter reserved_span;
> +	unsigned long page_offset = uptr % PAGE_SIZE;
> +	struct interval_tree_span_iter area_span;
> +	unsigned long iova_alignment;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	/* Protect roundup_pow_of_two() from overflow */
> +	if (length == 0 || length >= ULONG_MAX / 2)
> +		return -EOVERFLOW;
> +
> +	/*
> +	 * Keep alignment present in the uptr when building the IOVA, this
> +	 * increases the chance we can map a THP.
> +	 */
> +	if (!uptr)
> +		iova_alignment = roundup_pow_of_two(length);
> +	else
> +		iova_alignment =
> +			min_t(unsigned long, roundup_pow_of_two(length),
> +			      1UL << __ffs64(uptr));
> +
> +	if (iova_alignment < iopt->iova_alignment)
> +		return -EINVAL;
> +	for (interval_tree_span_iter_first(&area_span, &iopt->area_itree,
> +					   PAGE_SIZE, ULONG_MAX - PAGE_SIZE);
> +	     !interval_tree_span_iter_done(&area_span);
> +	     interval_tree_span_iter_next(&area_span)) {
> +		if (!__alloc_iova_check_hole(&area_span, length, iova_alignment,
> +					     page_offset))
> +			continue;
> +
> +		for (interval_tree_span_iter_first(
> +			     &reserved_span, &iopt->reserved_iova_itree,
> +			     area_span.start_hole, area_span.last_hole);
> +		     !interval_tree_span_iter_done(&reserved_span);
> +		     interval_tree_span_iter_next(&reserved_span)) {
> +			if (!__alloc_iova_check_hole(&reserved_span, length,
> +						     iova_alignment,
> +						     page_offset))
> +				continue;
> +
> +			*iova = reserved_span.start_hole;
> +			return 0;
> +		}
> +	}
> +	return -ENOSPC;
> +}
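
As a concrete example of the alignment logic above (numbers chosen purely for
illustration): with uptr = 0x7f2a4c601000 and length = 0x10000, page_offset is
0, roundup_pow_of_two(length) is 0x10000 and 1UL << __ffs64(uptr) is 0x1000,
so iova_alignment becomes 0x1000 and the allocator hands back a 4KiB-aligned
IOVA. If the same buffer instead started at 0x7f2a4c600000, 1UL << __ffs64()
would be 0x200000, the alignment would be capped at 0x10000 by the
roundup_pow_of_two(length) term, and the chosen IOVA would preserve 64KiB
alignment, improving the odds that a THP-backed region can be mapped with
larger IOPTEs.
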
> +
> +/*
> + * The area takes a slice of the pages from start_byte to start_byte + length
> + */
> +static struct iopt_area *
> +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		unsigned long iova, unsigned long start_byte,
> +		unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> +	if (!area)
> +		return ERR_PTR(-ENOMEM);
> +
> +	area->iopt = iopt;
> +	area->iommu_prot = iommu_prot;
> +	area->page_offset = start_byte % PAGE_SIZE;
> +	area->pages_node.start = start_byte / PAGE_SIZE;
> +	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
> +		return ERR_PTR(-EOVERFLOW);
> +	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
> +	if (WARN_ON(area->pages_node.last >= pages->npages))
> +		return ERR_PTR(-EOVERFLOW);
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (flags & IOPT_ALLOC_IOVA) {
> +		rc = iopt_alloc_iova(iopt, &iova,
> +				     (uintptr_t)pages->uptr + start_byte,
> +				     length);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	if (check_add_overflow(iova, length - 1, &area->node.last)) {
> +		rc = -EOVERFLOW;
> +		goto out_unlock;
> +	}
> +
> +	if (!(flags & IOPT_ALLOC_IOVA)) {
> +		if ((iova & (iopt->iova_alignment - 1)) ||
> +		    (length & (iopt->iova_alignment - 1)) || !length) {
> +			rc = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* No reserved IOVA intersects the range */
> +		if (interval_tree_iter_first(&iopt->reserved_iova_itree, iova,
> +					     area->node.last)) {
> +			rc = -ENOENT;
> +			goto out_unlock;
> +		}
> +
> +		/* Check that there is not already a mapping in the range */
> +		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
> +			rc = -EADDRINUSE;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The area is inserted with a NULL pages indicating it is not fully
> +	 * initialized yet.
> +	 */
> +	area->node.start = iova;
> +	interval_tree_insert(&area->node, &area->iopt->area_itree);
> +	up_write(&iopt->iova_rwsem);
> +	return area;
> +
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	kfree(area);
> +	return ERR_PTR(rc);
> +}
> +
> +static void iopt_abort_area(struct iopt_area *area)
> +{
> +	down_write(&area->iopt->iova_rwsem);
> +	interval_tree_remove(&area->node, &area->iopt->area_itree);
> +	up_write(&area->iopt->iova_rwsem);
> +	kfree(area);
> +}
> +
> +static int iopt_finalize_area(struct iopt_area *area, struct iopt_pages *pages)
> +{
> +	int rc;
> +
> +	down_read(&area->iopt->domains_rwsem);
> +	rc = iopt_area_fill_domains(area, pages);
> +	if (!rc) {
> +		/*
> +		 * area->pages must be set inside the domains_rwsem to ensure
> +		 * any newly added domains will get filled. Moves the reference
> +		 * in from the caller
> +		 */
> +		down_write(&area->iopt->iova_rwsem);
> +		area->pages = pages;
> +		up_write(&area->iopt->iova_rwsem);
> +	}
> +	up_read(&area->iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		   unsigned long *dst_iova, unsigned long start_bytes,
> +		   unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
> +		return -EPERM;
> +
> +	area = iopt_alloc_area(iopt, pages, *dst_iova, start_bytes, length,
> +			       iommu_prot, flags);
> +	if (IS_ERR(area))
> +		return PTR_ERR(area);
> +	*dst_iova = iopt_area_iova(area);
> +
> +	rc = iopt_finalize_area(area, pages);
> +	if (rc) {
> +		iopt_abort_area(area);
> +		return rc;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * iopt_map_user_pages() - Map a user VA to an iova in the io page table
> + * @iopt: io_pagetable to act on
> + * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
> + *        the chosen iova on output. Otherwise is the iova to map to on input
> + * @uptr: User VA to map
> + * @length: Number of bytes to map
> + * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
> + * @flags: IOPT_ALLOC_IOVA or zero
> + *
> + * iova, uptr, and length must be aligned to iova_alignment. For domain backed
> + * page tables this will pin the pages and load them into the domain at iova.
> + * For non-domain page tables this will only setup a lazy reference and the
> + * caller must use iopt_access_pages() to touch them.
> + *
> + * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
> + * destroyed.
> + */
> +int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
> +			void __user *uptr, unsigned long length, int iommu_prot,
> +			unsigned int flags)
> +{
> +	struct iopt_pages *pages;
> +	int rc;
> +
> +	pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
> +	if (IS_ERR(pages))
> +		return PTR_ERR(pages);
> +
> +	rc = iopt_map_pages(iopt, pages, iova, uptr - pages->uptr, length,
> +			    iommu_prot, flags);
> +	if (rc) {
> +		iopt_put_pages(pages);
> +		return rc;
> +	}
> +	return 0;
> +}
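
To make the calling convention concrete, a minimal sketch of how an ioctl
handler might drive this with IOPT_ALLOC_IOVA (hypothetical caller, error
paths trimmed; only the functions and flags shown in this patch are assumed):

    /* Hypothetical caller, illustration only. */
    static int example_ioas_map(struct io_pagetable *iopt, void __user *uptr,
                                unsigned long length, unsigned long *out_iova)
    {
            unsigned long iova = 0; /* input ignored when IOPT_ALLOC_IOVA is set */
            int rc;

            rc = iopt_map_user_pages(iopt, &iova, uptr, length,
                                     IOMMU_READ | IOMMU_WRITE, IOPT_ALLOC_IOVA);
            if (rc)
                    return rc;

            *out_iova = iova;  /* the allocated IOVA comes back through iova */
            return 0;          /* iopt_unmap_iova(iopt, iova, length) undoes it */
    }
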
> +
> +struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
> +				  unsigned long *start_byte,
> +				  unsigned long length)
> +{
> +	unsigned long iova_end;
> +	struct iopt_pages *pages;
> +	struct iopt_area *area;
> +
> +	if (check_add_overflow(iova, length - 1, &iova_end))
> +		return ERR_PTR(-EOVERFLOW);
> +
> +	down_read(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);
> +	if (!area) {
> +		up_read(&iopt->iova_rwsem);
> +		return ERR_PTR(-ENOENT);
> +	}
> +	pages = area->pages;
> +	*start_byte = area->page_offset + iopt_area_index(area) * PAGE_SIZE;
> +	kref_get(&pages->kref);
> +	up_read(&iopt->iova_rwsem);
> +
> +	return pages;
> +}
> +
> +static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
> +			     struct iopt_pages *pages)
> +{
> +	/* Drivers have to unpin on notification. */
> +	if (WARN_ON(atomic_read(&area->num_users)))
> +		return -EBUSY;
> +
> +	iopt_area_unfill_domains(area, pages);
> +	WARN_ON(atomic_read(&area->num_users));
> +	iopt_abort_area(area);
> +	iopt_put_pages(pages);
> +	return 0;
> +}
> +
> +/**
> + * iopt_unmap_iova() - Remove a range of iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting iova to unmap
> + * @length: Number of bytes to unmap
> + *
> + * The requested range must exactly match an existing range.
> + * Splitting/truncating IOVA mappings is not allowed.
> + */
> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> +		    unsigned long length)
> +{
> +	struct iopt_pages *pages;
> +	struct iopt_area *area;
> +	unsigned long iova_end;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +
> +	if (check_add_overflow(iova, length - 1, &iova_end))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);

When testing vIOMMU with QEMU using iommufd, I hit a problem, as log #3
below shows. QEMU failed a map because an IOVA was still in use. After
debugging, I found that IOVA 0xfffff000 had been mapped but never
unmapped. Per log #2, QEMU did issue an unmap for a larger range
(0xff000000 - 0x100000000) that includes 0xfffff000, but
iopt_find_exact_area() doesn't find any area, so 0xfffff000 is never
unmapped. Is this the intended behavior? The same test passes with the
vfio iommu type1 driver. Any idea?

#1
qemu-system-x86_64: 218 vfio_dma_map(0x55b99d7b7d40, 0xfffff000, 0x1000, 
0x7f79ee81f000) = 0

#2
qemu-system-x86_64: 232 vfio_dma_unmap(0x55b99d7b7d40, 0xff000000, 
0x1000000) = 0 (No such file or directory)
qemu-system-x86_64: IOMMU_IOAS_UNMAP failed: No such file or directory
qemu-system-x86_64: 241 vfio_dma_unmap(0x55b99d7b7d40, 0xff000000, 
0x1000000) = -2 (No such file or directory)
                                vtd_address_space_unmap, notify iommu 
start: ff000000, end: 100000000 - 2

#3
qemu-system-x86_64: IOMMU_IOAS_MAP failed: Address already in use
qemu-system-x86_64: vfio_container_dma_map(0x55b99d7b7d40, 0xfffc0000, 
0x40000, 0x7f7968c00000) = -98 (Address already in use)
qemu: hardware error: vfio: DMA mapping failed, unable to continue

#4 Kernel debug log:

[ 1042.662165] iopt_unmap_iova 338 iova: ff000000, length: 1000000
[ 1042.662339] iopt_unmap_iova 345 iova: ff000000, length: 1000000
[ 1042.662505] iopt_unmap_iova 348 iova: ff000000, length: 1000000
[ 1042.662736] iopt_find_exact_area, iova: ff000000, last_iova: ffffffff
[ 1042.662909] iopt_unmap_iova 350 iova: ff000000, length: 1000000
[ 1042.663084] iommufd_ioas_unmap 253 iova: ff000000 length: 1000000, rc: -2
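
Putting the logs next to the exact-match check in iopt_find_exact_area()
above, the sequence appears to be roughly (addresses from the logs, behavior
inferred from this patch):

    map(iova=0xfffff000, len=0x1000)      -> creates area [0xfffff000, 0xffffffff]
    unmap(iova=0xff000000, len=0x1000000) -> looks for an area spanning exactly
                                             [0xff000000, 0xffffffff]; none exists,
                                             so -ENOENT (the rc: -2 in log #4) and
                                             the 0xfffff000 area stays mapped
    map(iova=0xfffc0000, len=0x40000)     -> [0xfffc0000, 0xffffffff] overlaps the
                                             leftover area, hence -EADDRINUSE
                                             (the -98 in log #3)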

> +	if (!area) {
> +		up_write(&iopt->iova_rwsem);
> +		up_read(&iopt->domains_rwsem);
> +		return -ENOENT;
> +	}
> +	pages = area->pages;
> +	area->pages = NULL;
> +	up_write(&iopt->iova_rwsem);
> +
> +	rc = __iopt_unmap_iova(iopt, area, pages);
> +	up_read(&iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +int iopt_unmap_all(struct io_pagetable *iopt)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	down_read(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +	while ((area = iopt_area_iter_first(iopt, 0, ULONG_MAX))) {
> +		struct iopt_pages *pages;
> +
> +		/* Userspace should not race unmap all and map */
> +		if (!area->pages) {
> +			rc = -EBUSY;
> +			goto out_unlock_iova;
> +		}
> +		pages = area->pages;
> +		area->pages = NULL;
> +		up_write(&iopt->iova_rwsem);
> +
> +		rc = __iopt_unmap_iova(iopt, area, pages);
> +		if (rc)
> +			goto out_unlock_domains;
> +
> +		down_write(&iopt->iova_rwsem);
> +	}
> +	rc = 0;
> +
> +out_unlock_iova:
> +	up_write(&iopt->iova_rwsem);
> +out_unlock_domains:
> +	up_read(&iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +/**
> + * iopt_access_pages() - Return a list of pages under the iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to access
> + * @out_pages: Output page list
> + * @write: True if access is for writing
> + *
> + * Reads @length bytes starting at @iova and returns the struct page * pointers. These
> + * can be kmap'd by the caller for CPU access.
> + *
> + * The caller must perform iopt_unaccess_pages() when done to balance this.
> + *
> + * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
> + * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
> + * touch memory outside the requested iova slice.
> + *
> + * FIXME: callers that need a DMA mapping via a sgl should create another
> + * interface to build the SGL efficiently
> + */
> +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> +		      unsigned long length, struct page **out_pages, bool write)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +	if (check_add_overflow(iova, length - 1, &last_iova))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		unsigned long last_index;
> +		unsigned long index;
> +
> +		/* Need contiguous areas in the access */
> +		if (iopt_area_iova(area) < cur_iova || !area->pages) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +
> +		index = iopt_area_iova_to_index(area, cur_iova);
> +		last_index = iopt_area_iova_to_index(area, last);
> +		rc = iopt_pages_add_user(area->pages, index, last_index,
> +					 out_pages, write);
> +		if (rc)
> +			goto out_remove;
> +		if (last == last_iova)
> +			break;
> +		/*
> +		 * Can't cross areas that are not aligned to the system page
> +		 * size with this API.
> +		 */
> +		if (cur_iova % PAGE_SIZE) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +		cur_iova = last + 1;
> +		out_pages += last_index - index;
> +		atomic_inc(&area->num_users);
> +	}
> +
> +	up_read(&iopt->iova_rwsem);
> +	return 0;
> +
> +out_remove:
> +	if (cur_iova != iova)
> +		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
> +	up_read(&iopt->iova_rwsem);
> +	return rc;
> +}
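
For the in-kernel 'user' case the commit message mentions, a minimal sketch of
the access/unaccess pairing; the helper below is hypothetical and only the two
iopt_* calls above are taken from the patch:

    /* Hypothetical in-kernel user, illustration only; kept to a single page
     * so one out_pages entry is enough. */
    static int example_read_iova(struct io_pagetable *iopt, unsigned long iova,
                                 void *to, size_t len)
    {
            struct page *page;
            void *va;
            int rc;

            if (!len || len > PAGE_SIZE - (iova % PAGE_SIZE))
                    return -EINVAL;

            rc = iopt_access_pages(iopt, iova, len, &page, false /* read */);
            if (rc)
                    return rc;

            va = kmap_local_page(page);
            memcpy(to, va + (iova % PAGE_SIZE), len);
            kunmap_local(va);

            iopt_unaccess_pages(iopt, iova, len); /* must balance the access */
            return 0;
    }
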
> +
> +/**
> + * iopt_unaccess_pages() - Undo iopt_access_pages
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to stop accessing
> + *
> + * Undo a previous iopt_access_pages(). The caller must stop accessing the
> + * pages before calling this. The iova/length must exactly match the one
> + * provided to access_pages.
> + */
> +void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> +			 size_t length)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +
> +	if (WARN_ON(!length) ||
> +	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
> +		return;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		int num_users;
> +
> +		iopt_pages_remove_user(area->pages,
> +				       iopt_area_iova_to_index(area, cur_iova),
> +				       iopt_area_iova_to_index(area, last));
> +		if (last == last_iova)
> +			break;
> +		cur_iova = last + 1;
> +		num_users = atomic_dec_return(&area->num_users);
> +		WARN_ON(num_users < 0);
> +	}
> +	up_read(&iopt->iova_rwsem);
> +}
> +
> +struct iopt_reserved_iova {
> +	struct interval_tree_node node;
> +	void *owner;
> +};
> +
> +int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
> +		      unsigned long last, void *owner)
> +{
> +	struct iopt_reserved_iova *reserved;
> +
> +	lockdep_assert_held_write(&iopt->iova_rwsem);
> +
> +	if (iopt_area_iter_first(iopt, start, last))
> +		return -EADDRINUSE;
> +
> +	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL);
> +	if (!reserved)
> +		return -ENOMEM;
> +	reserved->node.start = start;
> +	reserved->node.last = last;
> +	reserved->owner = owner;
> +	interval_tree_insert(&reserved->node, &iopt->reserved_iova_itree);
> +	return 0;
> +}
> +
> +void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
> +{
> +
> +	struct interval_tree_node *node;
> +
> +	for (node = interval_tree_iter_first(&iopt->reserved_iova_itree, 0,
> +					     ULONG_MAX);
> +	     node;) {
> +		struct iopt_reserved_iova *reserved =
> +			container_of(node, struct iopt_reserved_iova, node);
> +
> +		node = interval_tree_iter_next(node, 0, ULONG_MAX);
> +
> +		if (reserved->owner == owner) {
> +			interval_tree_remove(&reserved->node,
> +					     &iopt->reserved_iova_itree);
> +			kfree(reserved);
> +		}
> +	}
> +}
> +
> +int iopt_init_table(struct io_pagetable *iopt)
> +{
> +	init_rwsem(&iopt->iova_rwsem);
> +	init_rwsem(&iopt->domains_rwsem);
> +	iopt->area_itree = RB_ROOT_CACHED;
> +	iopt->reserved_iova_itree = RB_ROOT_CACHED;
> +	xa_init(&iopt->domains);
> +
> +	/*
> +	 * iopt's start as SW tables that can use the entire size_t IOVA space
> +	 * due to the use of size_t in the APIs. They have no alignment
> +	 * restriction.
> +	 */
> +	iopt->iova_alignment = 1;
> +
> +	return 0;
> +}
> +
> +void iopt_destroy_table(struct io_pagetable *iopt)
> +{
> +	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> +		iopt_remove_reserved_iova(iopt, NULL);
> +	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_iova_itree.rb_root));
> +	WARN_ON(!xa_empty(&iopt->domains));
> +	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
> +}
> +
> +/**
> + * iopt_unfill_domain() - Unfill a domain with PFNs
> + * @iopt: io_pagetable to act on
> + * @domain: domain to unfill
> + *
> + * This is used when removing a domain from the iopt. Every area in the iopt
> + * will be unmapped from the domain. The domain must already be removed from the
> + * domains xarray.
> + */
> +static void iopt_unfill_domain(struct io_pagetable *iopt,
> +			       struct iommu_domain *domain)
> +{
> +	struct iopt_area *area;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +	lockdep_assert_held_write(&iopt->domains_rwsem);
> +
> +	/*
> +	 * Some other domain is holding all the pfns still, rapidly unmap this
> +	 * domain.
> +	 */
> +	if (iopt->next_domain_id != 0) {
> +		/* Pick an arbitrary remaining domain to act as storage */
> +		struct iommu_domain *storage_domain =
> +			xa_load(&iopt->domains, 0);
> +
> +		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +			struct iopt_pages *pages = area->pages;
> +
> +			if (WARN_ON(!pages))
> +				continue;
> +
> +			mutex_lock(&pages->mutex);
> +			if (area->storage_domain != domain) {
> +				mutex_unlock(&pages->mutex);
> +				continue;
> +			}
> +			area->storage_domain = storage_domain;
> +			mutex_unlock(&pages->mutex);
> +		}
> +
> +
> +		iopt_unmap_domain(iopt, domain);
> +		return;
> +	}
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (WARN_ON(!pages))
> +			continue;
> +
> +		mutex_lock(&pages->mutex);
> +		interval_tree_remove(&area->pages_node,
> +				     &area->pages->domains_itree);
> +		WARN_ON(area->storage_domain != domain);
> +		area->storage_domain = NULL;
> +		iopt_area_unfill_domain(area, pages, domain);
> +		mutex_unlock(&pages->mutex);
> +	}
> +}
> +
> +/**
> + * iopt_fill_domain() - Fill a domain with PFNs
> + * @iopt: io_pagetable to act on
> + * @domain: domain to fill
> + *
> + * Fill the domain with PFNs from every area in the iopt. On failure the domain
> + * is left unchanged.
> + */
> +static int iopt_fill_domain(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain)
> +{
> +	struct iopt_area *end_area;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +	lockdep_assert_held_write(&iopt->domains_rwsem);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (WARN_ON(!pages))
> +			continue;
> +
> +		mutex_lock(&pages->mutex);
> +		rc = iopt_area_fill_domain(area, domain);
> +		if (rc) {
> +			mutex_unlock(&pages->mutex);
> +			goto out_unfill;
> +		}
> +		if (!area->storage_domain) {
> +			WARN_ON(iopt->next_domain_id != 0);
> +			area->storage_domain = domain;
> +			interval_tree_insert(&area->pages_node,
> +					     &pages->domains_itree);
> +		}
> +		mutex_unlock(&pages->mutex);
> +	}
> +	return 0;
> +
> +out_unfill:
> +	end_area = area;
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (area == end_area)
> +			break;
> +		if (WARN_ON(!pages))
> +			continue;
> +		mutex_lock(&pages->mutex);
> +		if (iopt->next_domain_id == 0) {
> +			interval_tree_remove(&area->pages_node,
> +					     &pages->domains_itree);
> +			area->storage_domain = NULL;
> +		}
> +		iopt_area_unfill_domain(area, pages, domain);
> +		mutex_unlock(&pages->mutex);
> +	}
> +	return rc;
> +}
> +
> +/* All existing area's conform to an increased page size */
> +static int iopt_check_iova_alignment(struct io_pagetable *iopt,
> +				     unsigned long new_iova_alignment)
> +{
> +	struct iopt_area *area;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
> +		if ((iopt_area_iova(area) % new_iova_alignment) ||
> +		    (iopt_area_length(area) % new_iova_alignment))
> +			return -EADDRINUSE;
> +	return 0;
> +}
> +
> +int iopt_table_add_domain(struct io_pagetable *iopt,
> +			  struct iommu_domain *domain)
> +{
> +	const struct iommu_domain_geometry *geometry = &domain->geometry;
> +	struct iommu_domain *iter_domain;
> +	unsigned int new_iova_alignment;
> +	unsigned long index;
> +	int rc;
> +
> +	down_write(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +
> +	xa_for_each (&iopt->domains, index, iter_domain) {
> +		if (WARN_ON(iter_domain == domain)) {
> +			rc = -EEXIST;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The io page size drives the iova_alignment. Internally the iopt_pages
> +	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
> +	 * objects into the iommu_domain.
> +	 *
> +	 * A iommu_domain must always be able to accept PAGE_SIZE to be
> +	 * compatible as we can't guarantee higher contiguity.
> +	 */
> +	new_iova_alignment =
> +		max_t(unsigned long, 1UL << __ffs(domain->pgsize_bitmap),
> +		      iopt->iova_alignment);
> +	if (new_iova_alignment > PAGE_SIZE) {
> +		rc = -EINVAL;
> +		goto out_unlock;
> +	}
> +	if (new_iova_alignment != iopt->iova_alignment) {
> +		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	/* No area exists that is outside the allowed domain aperture */
> +	if (geometry->aperture_start != 0) {
> +		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
> +				       domain);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +	if (geometry->aperture_end != ULONG_MAX) {
> +		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
> +				       ULONG_MAX, domain);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +
> +	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
> +	if (rc)
> +		goto out_reserved;
> +
> +	rc = iopt_fill_domain(iopt, domain);
> +	if (rc)
> +		goto out_release;
> +
> +	iopt->iova_alignment = new_iova_alignment;
> +	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
> +	iopt->next_domain_id++;
> +	up_write(&iopt->iova_rwsem);
> +	up_write(&iopt->domains_rwsem);
> +	return 0;
> +out_release:
> +	xa_release(&iopt->domains, iopt->next_domain_id);
> +out_reserved:
> +	iopt_remove_reserved_iova(iopt, domain);
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	up_write(&iopt->domains_rwsem);
> +	return rc;
> +}
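
As a worked example of the alignment rule above: a domain whose pgsize_bitmap
has 4KiB as its smallest supported page size gives 1UL << __ffs(pgsize_bitmap)
= 4096, so new_iova_alignment stays at PAGE_SIZE on a 4KiB kernel and the
attach proceeds; a hypothetical domain that only supported 16KiB pages on the
same kernel would compute new_iova_alignment = 16384 > PAGE_SIZE and be
rejected with -EINVAL, matching the comment that a domain must always be able
to accept PAGE_SIZE mappings.
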
> +
> +void iopt_table_remove_domain(struct io_pagetable *iopt,
> +			      struct iommu_domain *domain)
> +{
> +	struct iommu_domain *iter_domain = NULL;
> +	unsigned long new_iova_alignment;
> +	unsigned long index;
> +
> +	down_write(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +
> +	xa_for_each (&iopt->domains, index, iter_domain)
> +		if (iter_domain == domain)
> +			break;
> +	if (WARN_ON(iter_domain != domain) || index >= iopt->next_domain_id)
> +		goto out_unlock;
> +
> +	/*
> +	 * Compress the xarray to keep it linear by swapping the entry to erase
> +	 * with the tail entry and shrinking the tail.
> +	 */
> +	iopt->next_domain_id--;
> +	iter_domain = xa_erase(&iopt->domains, iopt->next_domain_id);
> +	if (index != iopt->next_domain_id)
> +		xa_store(&iopt->domains, index, iter_domain, GFP_KERNEL);
> +
> +	iopt_unfill_domain(iopt, domain);
> +	iopt_remove_reserved_iova(iopt, domain);
> +
> +	/* Recalculate the iova alignment without the domain */
> +	new_iova_alignment = 1;
> +	xa_for_each (&iopt->domains, index, iter_domain)
> +		new_iova_alignment = max_t(unsigned long,
> +					   1UL << __ffs(domain->pgsize_bitmap),
> +					   new_iova_alignment);
> +	if (!WARN_ON(new_iova_alignment > iopt->iova_alignment))
> +		iopt->iova_alignment = new_iova_alignment;
> +
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	up_write(&iopt->domains_rwsem);
> +}
> +
> +/* Narrow the valid_iova_itree to include reserved ranges from a group. */
> +int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
> +					  struct iommu_group *group,
> +					  phys_addr_t *sw_msi_start)
> +{
> +	struct iommu_resv_region *resv;
> +	struct iommu_resv_region *tmp;
> +	LIST_HEAD(group_resv_regions);
> +	int rc;
> +
> +	down_write(&iopt->iova_rwsem);
> +	rc = iommu_get_group_resv_regions(group, &group_resv_regions);
> +	if (rc)
> +		goto out_unlock;
> +
> +	list_for_each_entry (resv, &group_resv_regions, list) {
> +		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
> +			continue;
> +
> +		/*
> +		 * The presence of any 'real' MSI regions should take precedence
> +		 * over the software-managed one if the IOMMU driver happens to
> +		 * advertise both types.
> +		 */
> +		if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
> +			*sw_msi_start = 0;
> +			sw_msi_start = NULL;
> +		}
> +		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
> +			*sw_msi_start = resv->start;
> +
> +		rc = iopt_reserve_iova(iopt, resv->start,
> +				       resv->length - 1 + resv->start, group);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +	rc = 0;
> +	goto out_free_resv;
> +
> +out_reserved:
> +	iopt_remove_reserved_iova(iopt, group);
> +out_free_resv:
> +	list_for_each_entry_safe (resv, tmp, &group_resv_regions, list)
> +		kfree(resv);
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	return rc;
> +}
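
As a concrete illustration of the MSI handling above (typical configurations,
not guaranteed for every driver): x86 IOMMUs generally advertise the
hard-wired interrupt window as an IOMMU_RESV_MSI region, so sw_msi_start is
cleared and no software MSI window needs to be carved out, whereas ARM SMMU
drivers usually advertise an IOMMU_RESV_SW_MSI region instead, and its start
address is what gets reported back to the caller here.
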
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 2f1301d39bba7c..bcf08e61bc87e9 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -9,6 +9,9 @@
>   #include <linux/refcount.h>
>   #include <linux/uaccess.h>
>   
> +struct iommu_domain;
> +struct iommu_group;
> +
>   /*
>    * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
>    * domains and permits sharing of PFNs between io_pagetable instances. This
> @@ -27,8 +30,40 @@ struct io_pagetable {
>   	struct rw_semaphore iova_rwsem;
>   	struct rb_root_cached area_itree;
>   	struct rb_root_cached reserved_iova_itree;
> +	unsigned long iova_alignment;
>   };
>   
> +int iopt_init_table(struct io_pagetable *iopt);
> +void iopt_destroy_table(struct io_pagetable *iopt);
> +struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
> +				  unsigned long *start_byte,
> +				  unsigned long length);
> +enum { IOPT_ALLOC_IOVA = 1 << 0 };
> +int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
> +			void __user *uptr, unsigned long length, int iommu_prot,
> +			unsigned int flags);
> +int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		   unsigned long *dst_iova, unsigned long start_byte,
> +		   unsigned long length, int iommu_prot, unsigned int flags);
> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> +		    unsigned long length);
> +int iopt_unmap_all(struct io_pagetable *iopt);
> +
> +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> +		      unsigned long npages, struct page **out_pages, bool write);
> +void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> +			 size_t npages);
> +int iopt_table_add_domain(struct io_pagetable *iopt,
> +			  struct iommu_domain *domain);
> +void iopt_table_remove_domain(struct io_pagetable *iopt,
> +			      struct iommu_domain *domain);
> +int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
> +					  struct iommu_group *group,
> +					  phys_addr_t *sw_msi_start);
> +int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
> +		      unsigned long last, void *owner);
> +void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
> +
>   struct iommufd_ctx {
>   	struct file *filp;
>   	struct xarray objects;

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
@ 2022-04-13 14:02     ` Yi Liu
  0 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-13 14:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins,
	David Gibson

Hi Jason,

On 2022/3/19 01:27, Jason Gunthorpe wrote:
> This is the remainder of the IOAS data structure. Provide an object called
> an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
> along with a list of iommu_domains that mirror the IOVA to PFN map.
> 
> At the top this is a simple interval tree of iopt_areas indicating the map
> of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
> on the attached domains there is a minimum alignment for areas (which may
> be smaller than PAGE_SIZE) and an interval tree of reserved IOVA that
> can't be mapped.
> 
> The concept of a 'user' refers to something like a VFIO mdev that is
> accessing the IOVA and using a 'struct page *' for CPU based access.
> 
> Externally an API is provided that matches the requirements of the IOCTL
> interface for map/unmap and domain attachment.
> 
> The API provides a 'copy' primitive to establish a new IOVA map in a
> different IOAS from an existing mapping.
> 
> This is designed to support a pre-registration flow where userspace would
> set up a dummy IOAS with no domains, map in memory and then establish a
> user to pin all PFNs into the xarray.
> 
> Copy can then be used to create new IOVA mappings in a different IOAS,
> with iommu_domains attached. Upon copy the PFNs will be read out of the
> xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
> overheads.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommufd/Makefile          |   1 +
>   drivers/iommu/iommufd/io_pagetable.c    | 890 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  35 +
>   3 files changed, 926 insertions(+)
>   create mode 100644 drivers/iommu/iommufd/io_pagetable.c
> 
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index 05a0e91e30afad..b66a8c47ff55ec 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   iommufd-y := \
> +	io_pagetable.o \
>   	main.o \
>   	pages.o
>   
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> new file mode 100644
> index 00000000000000..f9f3b06946bfb9
> --- /dev/null
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -0,0 +1,890 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
> + *
> + * The io_pagetable is the top of the data structure that maps IOVAs to PFNs. The
> + * PFNs can be placed into an iommu_domain, or returned to the caller as a page
> + * list for access by an in-kernel user.
> + *
> + * The data structure uses the iopt_pages to optimize the storage of the PFNs
> + * between the domains and xarray.
> + */
> +#include <linux/lockdep.h>
> +#include <linux/iommu.h>
> +#include <linux/sched/mm.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/errno.h>
> +
> +#include "io_pagetable.h"
> +
> +static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
> +					     unsigned long iova)
> +{
> +	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> +		WARN_ON(iova < iopt_area_iova(area) ||
> +			iova > iopt_area_last_iova(area));
> +	return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
> +}
> +
> +static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
> +					      unsigned long iova,
> +					      unsigned long last_iova)
> +{
> +	struct iopt_area *area;
> +
> +	area = iopt_area_iter_first(iopt, iova, last_iova);
> +	if (!area || !area->pages || iopt_area_iova(area) != iova ||
> +	    iopt_area_last_iova(area) != last_iova)
> +		return NULL;
> +	return area;
> +}
> +
> +static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
> +				    unsigned long length,
> +				    unsigned long iova_alignment,
> +				    unsigned long page_offset)
> +{
> +	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
> +		return false;
> +
> +	span->start_hole =
> +		ALIGN(span->start_hole, iova_alignment) | page_offset;
> +	if (span->start_hole > span->last_hole ||
> +	    span->last_hole - span->start_hole < length - 1)
> +		return false;
> +	return true;
> +}
> +
> +/*
> + * Automatically find a block of IOVA that is not being used and not reserved.
> + * Does not return a 0 IOVA even if it is valid.
> + */
> +static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
> +			   unsigned long uptr, unsigned long length)
> +{
> +	struct interval_tree_span_iter reserved_span;
> +	unsigned long page_offset = uptr % PAGE_SIZE;
> +	struct interval_tree_span_iter area_span;
> +	unsigned long iova_alignment;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	/* Protect roundup_pow_of_two() from overflow */
> +	if (length == 0 || length >= ULONG_MAX / 2)
> +		return -EOVERFLOW;
> +
> +	/*
> +	 * Keep alignment present in the uptr when building the IOVA, this
> +	 * increases the chance we can map a THP.
> +	 */
> +	if (!uptr)
> +		iova_alignment = roundup_pow_of_two(length);
> +	else
> +		iova_alignment =
> +			min_t(unsigned long, roundup_pow_of_two(length),
> +			      1UL << __ffs64(uptr));
> +
> +	if (iova_alignment < iopt->iova_alignment)
> +		return -EINVAL;
> +	for (interval_tree_span_iter_first(&area_span, &iopt->area_itree,
> +					   PAGE_SIZE, ULONG_MAX - PAGE_SIZE);
> +	     !interval_tree_span_iter_done(&area_span);
> +	     interval_tree_span_iter_next(&area_span)) {
> +		if (!__alloc_iova_check_hole(&area_span, length, iova_alignment,
> +					     page_offset))
> +			continue;
> +
> +		for (interval_tree_span_iter_first(
> +			     &reserved_span, &iopt->reserved_iova_itree,
> +			     area_span.start_hole, area_span.last_hole);
> +		     !interval_tree_span_iter_done(&reserved_span);
> +		     interval_tree_span_iter_next(&reserved_span)) {
> +			if (!__alloc_iova_check_hole(&reserved_span, length,
> +						     iova_alignment,
> +						     page_offset))
> +				continue;
> +
> +			*iova = reserved_span.start_hole;
> +			return 0;
> +		}
> +	}
> +	return -ENOSPC;
> +}
> +
> +/*
> + * The area takes a slice of the pages from start_byte to start_byte + length
> + */
> +static struct iopt_area *
> +iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		unsigned long iova, unsigned long start_byte,
> +		unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> +	if (!area)
> +		return ERR_PTR(-ENOMEM);
> +
> +	area->iopt = iopt;
> +	area->iommu_prot = iommu_prot;
> +	area->page_offset = start_byte % PAGE_SIZE;
> +	area->pages_node.start = start_byte / PAGE_SIZE;
> +	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
> +		return ERR_PTR(-EOVERFLOW);
> +	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
> +	if (WARN_ON(area->pages_node.last >= pages->npages))
> +		return ERR_PTR(-EOVERFLOW);
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (flags & IOPT_ALLOC_IOVA) {
> +		rc = iopt_alloc_iova(iopt, &iova,
> +				     (uintptr_t)pages->uptr + start_byte,
> +				     length);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	if (check_add_overflow(iova, length - 1, &area->node.last)) {
> +		rc = -EOVERFLOW;
> +		goto out_unlock;
> +	}
> +
> +	if (!(flags & IOPT_ALLOC_IOVA)) {
> +		if ((iova & (iopt->iova_alignment - 1)) ||
> +		    (length & (iopt->iova_alignment - 1)) || !length) {
> +			rc = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* No reserved IOVA intersects the range */
> +		if (interval_tree_iter_first(&iopt->reserved_iova_itree, iova,
> +					     area->node.last)) {
> +			rc = -ENOENT;
> +			goto out_unlock;
> +		}
> +
> +		/* Check that there is not already a mapping in the range */
> +		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
> +			rc = -EADDRINUSE;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The area is inserted with a NULL pages indicating it is not fully
> +	 * initialized yet.
> +	 */
> +	area->node.start = iova;
> +	interval_tree_insert(&area->node, &area->iopt->area_itree);
> +	up_write(&iopt->iova_rwsem);
> +	return area;
> +
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	kfree(area);
> +	return ERR_PTR(rc);
> +}
> +
> +static void iopt_abort_area(struct iopt_area *area)
> +{
> +	down_write(&area->iopt->iova_rwsem);
> +	interval_tree_remove(&area->node, &area->iopt->area_itree);
> +	up_write(&area->iopt->iova_rwsem);
> +	kfree(area);
> +}
> +
> +static int iopt_finalize_area(struct iopt_area *area, struct iopt_pages *pages)
> +{
> +	int rc;
> +
> +	down_read(&area->iopt->domains_rwsem);
> +	rc = iopt_area_fill_domains(area, pages);
> +	if (!rc) {
> +		/*
> +		 * area->pages must be set inside the domains_rwsem to ensure
> +		 * any newly added domains will get filled. Moves the reference
> +		 * in from the caller
> +		 */
> +		down_write(&area->iopt->iova_rwsem);
> +		area->pages = pages;
> +		up_write(&area->iopt->iova_rwsem);
> +	}
> +	up_read(&area->iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		   unsigned long *dst_iova, unsigned long start_bytes,
> +		   unsigned long length, int iommu_prot, unsigned int flags)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
> +		return -EPERM;
> +
> +	area = iopt_alloc_area(iopt, pages, *dst_iova, start_bytes, length,
> +			       iommu_prot, flags);
> +	if (IS_ERR(area))
> +		return PTR_ERR(area);
> +	*dst_iova = iopt_area_iova(area);
> +
> +	rc = iopt_finalize_area(area, pages);
> +	if (rc) {
> +		iopt_abort_area(area);
> +		return rc;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * iopt_map_user_pages() - Map a user VA to an iova in the io page table
> + * @iopt: io_pagetable to act on
> + * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
> + *        the chosen iova on output. Otherwise is the iova to map to on input
> + * @uptr: User VA to map
> + * @length: Number of bytes to map
> + * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
> + * @flags: IOPT_ALLOC_IOVA or zero
> + *
> + * iova, uptr, and length must be aligned to iova_alignment. For domain backed
> + * page tables this will pin the pages and load them into the domain at iova.
> + * For non-domain page tables this will only setup a lazy reference and the
> + * caller must use iopt_access_pages() to touch them.
> + *
> + * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
> + * destroyed.
> + */
> +int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
> +			void __user *uptr, unsigned long length, int iommu_prot,
> +			unsigned int flags)
> +{
> +	struct iopt_pages *pages;
> +	int rc;
> +
> +	pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
> +	if (IS_ERR(pages))
> +		return PTR_ERR(pages);
> +
> +	rc = iopt_map_pages(iopt, pages, iova, uptr - pages->uptr, length,
> +			    iommu_prot, flags);
> +	if (rc) {
> +		iopt_put_pages(pages);
> +		return rc;
> +	}
> +	return 0;
> +}
> +
> +struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
> +				  unsigned long *start_byte,
> +				  unsigned long length)
> +{
> +	unsigned long iova_end;
> +	struct iopt_pages *pages;
> +	struct iopt_area *area;
> +
> +	if (check_add_overflow(iova, length - 1, &iova_end))
> +		return ERR_PTR(-EOVERFLOW);
> +
> +	down_read(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);
> +	if (!area) {
> +		up_read(&iopt->iova_rwsem);
> +		return ERR_PTR(-ENOENT);
> +	}
> +	pages = area->pages;
> +	*start_byte = area->page_offset + iopt_area_index(area) * PAGE_SIZE;
> +	kref_get(&pages->kref);
> +	up_read(&iopt->iova_rwsem);
> +
> +	return pages;
> +}
> +
> +static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
> +			     struct iopt_pages *pages)
> +{
> +	/* Drivers have to unpin on notification. */
> +	if (WARN_ON(atomic_read(&area->num_users)))
> +		return -EBUSY;
> +
> +	iopt_area_unfill_domains(area, pages);
> +	WARN_ON(atomic_read(&area->num_users));
> +	iopt_abort_area(area);
> +	iopt_put_pages(pages);
> +	return 0;
> +}
> +
> +/**
> + * iopt_unmap_iova() - Remove a range of iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting iova to unmap
> + * @length: Number of bytes to unmap
> + *
> + * The requested range must exactly match an existing range.
> + * Splitting/truncating IOVA mappings is not allowed.
> + */
> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> +		    unsigned long length)
> +{
> +	struct iopt_pages *pages;
> +	struct iopt_area *area;
> +	unsigned long iova_end;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +
> +	if (check_add_overflow(iova, length - 1, &iova_end))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);

When testing vIOMMU with QEMU using iommufd, I hit a problem, as log #3
below shows. QEMU failed a map because an IOVA was still in use. After
debugging, I found that IOVA 0xfffff000 had been mapped but never
unmapped. Per log #2, QEMU did issue an unmap for a larger range
(0xff000000 - 0x100000000) that includes 0xfffff000, but
iopt_find_exact_area() doesn't find any area, so 0xfffff000 is never
unmapped. Is this the intended behavior? The same test passes with the
vfio iommu type1 driver. Any idea?

#1
qemu-system-x86_64: 218 vfio_dma_map(0x55b99d7b7d40, 0xfffff000, 0x1000, 
0x7f79ee81f000) = 0

#2
qemu-system-x86_64: 232 vfio_dma_unmap(0x55b99d7b7d40, 0xff000000, 
0x1000000) = 0 (No such file or directory)
qemu-system-x86_64: IOMMU_IOAS_UNMAP failed: No such file or directory
qemu-system-x86_64: 241 vfio_dma_unmap(0x55b99d7b7d40, 0xff000000, 
0x1000000) = -2 (No such file or directory)
                                vtd_address_space_unmap, notify iommu 
start: ff000000, end: 100000000 - 2

#3
qemu-system-x86_64: IOMMU_IOAS_MAP failed: Address already in use
qemu-system-x86_64: vfio_container_dma_map(0x55b99d7b7d40, 0xfffc0000, 
0x40000, 0x7f7968c00000) = -98 (Address already in use)
qemu: hardware error: vfio: DMA mapping failed, unable to continue

#4 Kernel debug log:

[ 1042.662165] iopt_unmap_iova 338 iova: ff000000, length: 1000000
[ 1042.662339] iopt_unmap_iova 345 iova: ff000000, length: 1000000
[ 1042.662505] iopt_unmap_iova 348 iova: ff000000, length: 1000000
[ 1042.662736] iopt_find_exact_area, iova: ff000000, last_iova: ffffffff
[ 1042.662909] iopt_unmap_iova 350 iova: ff000000, length: 1000000
[ 1042.663084] iommufd_ioas_unmap 253 iova: ff000000 length: 1000000, rc: -2

> +	if (!area) {
> +		up_write(&iopt->iova_rwsem);
> +		up_read(&iopt->domains_rwsem);
> +		return -ENOENT;
> +	}
> +	pages = area->pages;
> +	area->pages = NULL;
> +	up_write(&iopt->iova_rwsem);
> +
> +	rc = __iopt_unmap_iova(iopt, area, pages);
> +	up_read(&iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +int iopt_unmap_all(struct io_pagetable *iopt)
> +{
> +	struct iopt_area *area;
> +	int rc;
> +
> +	down_read(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +	while ((area = iopt_area_iter_first(iopt, 0, ULONG_MAX))) {
> +		struct iopt_pages *pages;
> +
> +		/* Userspace should not race unmap all and map */
> +		if (!area->pages) {
> +			rc = -EBUSY;
> +			goto out_unlock_iova;
> +		}
> +		pages = area->pages;
> +		area->pages = NULL;
> +		up_write(&iopt->iova_rwsem);
> +
> +		rc = __iopt_unmap_iova(iopt, area, pages);
> +		if (rc)
> +			goto out_unlock_domains;
> +
> +		down_write(&iopt->iova_rwsem);
> +	}
> +	rc = 0;
> +
> +out_unlock_iova:
> +	up_write(&iopt->iova_rwsem);
> +out_unlock_domains:
> +	up_read(&iopt->domains_rwsem);
> +	return rc;
> +}
> +
> +/**
> + * iopt_access_pages() - Return a list of pages under the iova
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to access
> + * @out_pages: Output page list
> + * @write: True if access is for writing
> + *
> + * Reads @length bytes starting at @iova and returns the struct page * pointers. These
> + * can be kmap'd by the caller for CPU access.
> + *
> + * The caller must perform iopt_unaccess_pages() when done to balance this.
> + *
> + * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
> + * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
> + * touch memory outside the requested iova slice.
> + *
> + * FIXME: callers that need a DMA mapping via a sgl should create another
> + * interface to build the SGL efficiently
> + */
> +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> +		      unsigned long length, struct page **out_pages, bool write)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	if (!length)
> +		return -EINVAL;
> +	if (check_add_overflow(iova, length - 1, &last_iova))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		unsigned long last_index;
> +		unsigned long index;
> +
> +		/* Need contiguous areas in the access */
> +		if (iopt_area_iova(area) < cur_iova || !area->pages) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +
> +		index = iopt_area_iova_to_index(area, cur_iova);
> +		last_index = iopt_area_iova_to_index(area, last);
> +		rc = iopt_pages_add_user(area->pages, index, last_index,
> +					 out_pages, write);
> +		if (rc)
> +			goto out_remove;
> +		if (last == last_iova)
> +			break;
> +		/*
> +		 * Can't cross areas that are not aligned to the system page
> +		 * size with this API.
> +		 */
> +		if (cur_iova % PAGE_SIZE) {
> +			rc = -EINVAL;
> +			goto out_remove;
> +		}
> +		cur_iova = last + 1;
> +		out_pages += last_index - index;
> +		atomic_inc(&area->num_users);
> +	}
> +
> +	up_read(&iopt->iova_rwsem);
> +	return 0;
> +
> +out_remove:
> +	if (cur_iova != iova)
> +		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
> +	up_read(&iopt->iova_rwsem);
> +	return rc;
> +}
> +
> +/**
> + * iopt_unaccess_pages() - Undo iopt_access_pages
> + * @iopt: io_pagetable to act on
> + * @iova: Starting IOVA
> + * @length: Number of bytes to stop accessing
> + *
> + * Undo a previous iopt_access_pages(). The caller must stop accessing the
> + * pages before calling this. The iova/length must exactly match the one
> + * provided to access_pages.
> + */
> +void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> +			 size_t length)
> +{
> +	unsigned long cur_iova = iova;
> +	unsigned long last_iova;
> +	struct iopt_area *area;
> +
> +	if (WARN_ON(!length) ||
> +	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
> +		return;
> +
> +	down_read(&iopt->iova_rwsem);
> +	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
> +	     area = iopt_area_iter_next(area, iova, last_iova)) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +		int num_users;
> +
> +		iopt_pages_remove_user(area->pages,
> +				       iopt_area_iova_to_index(area, cur_iova),
> +				       iopt_area_iova_to_index(area, last));
> +		if (last == last_iova)
> +			break;
> +		cur_iova = last + 1;
> +		num_users = atomic_dec_return(&area->num_users);
> +		WARN_ON(num_users < 0);
> +	}
> +	up_read(&iopt->iova_rwsem);
> +}
> +
> +struct iopt_reserved_iova {
> +	struct interval_tree_node node;
> +	void *owner;
> +};
> +
> +int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
> +		      unsigned long last, void *owner)
> +{
> +	struct iopt_reserved_iova *reserved;
> +
> +	lockdep_assert_held_write(&iopt->iova_rwsem);
> +
> +	if (iopt_area_iter_first(iopt, start, last))
> +		return -EADDRINUSE;
> +
> +	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL);
> +	if (!reserved)
> +		return -ENOMEM;
> +	reserved->node.start = start;
> +	reserved->node.last = last;
> +	reserved->owner = owner;
> +	interval_tree_insert(&reserved->node, &iopt->reserved_iova_itree);
> +	return 0;
> +}
> +
> +void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
> +{
> +
> +	struct interval_tree_node *node;
> +
> +	for (node = interval_tree_iter_first(&iopt->reserved_iova_itree, 0,
> +					     ULONG_MAX);
> +	     node;) {
> +		struct iopt_reserved_iova *reserved =
> +			container_of(node, struct iopt_reserved_iova, node);
> +
> +		node = interval_tree_iter_next(node, 0, ULONG_MAX);
> +
> +		if (reserved->owner == owner) {
> +			interval_tree_remove(&reserved->node,
> +					     &iopt->reserved_iova_itree);
> +			kfree(reserved);
> +		}
> +	}
> +}
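
The reservation API above is keyed by an opaque owner cookie, which is how
iopt_table_add_domain() and iopt_table_enforce_group_resv_regions() further
down tag their ranges. A minimal sketch of the pattern (not part of the patch;
the helper names are made up for illustration):

	/* Hypothetical caller: carve out an IOVA window, tagged by an owner cookie */
	static int example_reserve_window(struct io_pagetable *iopt, void *owner)
	{
		int rc;

		/* iopt_reserve_iova() requires iova_rwsem held for write */
		down_write(&iopt->iova_rwsem);
		rc = iopt_reserve_iova(iopt, 0, SZ_1M - 1, owner);
		up_write(&iopt->iova_rwsem);
		return rc;
	}

	static void example_unreserve_window(struct io_pagetable *iopt, void *owner)
	{
		down_write(&iopt->iova_rwsem);
		/* Drops every reservation that was tagged with this owner */
		iopt_remove_reserved_iova(iopt, owner);
		up_write(&iopt->iova_rwsem);
	}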
> +
> +int iopt_init_table(struct io_pagetable *iopt)
> +{
> +	init_rwsem(&iopt->iova_rwsem);
> +	init_rwsem(&iopt->domains_rwsem);
> +	iopt->area_itree = RB_ROOT_CACHED;
> +	iopt->reserved_iova_itree = RB_ROOT_CACHED;
> +	xa_init(&iopt->domains);
> +
> +	/*
> +	 * iopt's start as SW tables that can use the entire size_t IOVA space
> +	 * due to the use of size_t in the APIs. They have no alignment
> +	 * restriction.
> +	 */
> +	iopt->iova_alignment = 1;
> +
> +	return 0;
> +}
> +
> +void iopt_destroy_table(struct io_pagetable *iopt)
> +{
> +	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> +		iopt_remove_reserved_iova(iopt, NULL);
> +	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_iova_itree.rb_root));
> +	WARN_ON(!xa_empty(&iopt->domains));
> +	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
> +}
> +
> +/**
> + * iopt_unfill_domain() - Unfill a domain with PFNs
> + * @iopt: io_pagetable to act on
> + * @domain: domain to unfill
> + *
> + * This is used when removing a domain from the iopt. Every area in the iopt
> + * will be unmapped from the domain. The domain must already be removed from the
> + * domains xarray.
> + */
> +static void iopt_unfill_domain(struct io_pagetable *iopt,
> +			       struct iommu_domain *domain)
> +{
> +	struct iopt_area *area;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +	lockdep_assert_held_write(&iopt->domains_rwsem);
> +
> +	/*
> +	 * Some other domain is holding all the pfns still, rapidly unmap this
> +	 * domain.
> +	 */
> +	if (iopt->next_domain_id != 0) {
> +		/* Pick an arbitrary remaining domain to act as storage */
> +		struct iommu_domain *storage_domain =
> +			xa_load(&iopt->domains, 0);
> +
> +		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +			struct iopt_pages *pages = area->pages;
> +
> +			if (WARN_ON(!pages))
> +				continue;
> +
> +			mutex_lock(&pages->mutex);
> +			if (area->storage_domain != domain) {
> +				mutex_unlock(&pages->mutex);
> +				continue;
> +			}
> +			area->storage_domain = storage_domain;
> +			mutex_unlock(&pages->mutex);
> +		}
> +
> +		iopt_unmap_domain(iopt, domain);
> +		return;
> +	}
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (WARN_ON(!pages))
> +			continue;
> +
> +		mutex_lock(&pages->mutex);
> +		interval_tree_remove(&area->pages_node,
> +				     &area->pages->domains_itree);
> +		WARN_ON(area->storage_domain != domain);
> +		area->storage_domain = NULL;
> +		iopt_area_unfill_domain(area, pages, domain);
> +		mutex_unlock(&pages->mutex);
> +	}
> +}
> +
> +/**
> + * iopt_fill_domain() - Fill a domain with PFNs
> + * @iopt: io_pagetable to act on
> + * @domain: domain to fill
> + *
> + * Fill the domain with PFNs from every area in the iopt. On failure the domain
> + * is left unchanged.
> + */
> +static int iopt_fill_domain(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain)
> +{
> +	struct iopt_area *end_area;
> +	struct iopt_area *area;
> +	int rc;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +	lockdep_assert_held_write(&iopt->domains_rwsem);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (WARN_ON(!pages))
> +			continue;
> +
> +		mutex_lock(&pages->mutex);
> +		rc = iopt_area_fill_domain(area, domain);
> +		if (rc) {
> +			mutex_unlock(&pages->mutex);
> +			goto out_unfill;
> +		}
> +		if (!area->storage_domain) {
> +			WARN_ON(iopt->next_domain_id != 0);
> +			area->storage_domain = domain;
> +			interval_tree_insert(&area->pages_node,
> +					     &pages->domains_itree);
> +		}
> +		mutex_unlock(&pages->mutex);
> +	}
> +	return 0;
> +
> +out_unfill:
> +	end_area = area;
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		struct iopt_pages *pages = area->pages;
> +
> +		if (area == end_area)
> +			break;
> +		if (WARN_ON(!pages))
> +			continue;
> +		mutex_lock(&pages->mutex);
> +		if (iopt->next_domain_id == 0) {
> +			interval_tree_remove(&area->pages_node,
> +					     &pages->domains_itree);
> +			area->storage_domain = NULL;
> +		}
> +		iopt_area_unfill_domain(area, pages, domain);
> +		mutex_unlock(&pages->mutex);
> +	}
> +	return rc;
> +}
> +
> +/* All existing areas must conform to an increased page size */
> +static int iopt_check_iova_alignment(struct io_pagetable *iopt,
> +				     unsigned long new_iova_alignment)
> +{
> +	struct iopt_area *area;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
> +		if ((iopt_area_iova(area) % new_iova_alignment) ||
> +		    (iopt_area_length(area) % new_iova_alignment))
> +			return -EADDRINUSE;
> +	return 0;
> +}
> +
> +int iopt_table_add_domain(struct io_pagetable *iopt,
> +			  struct iommu_domain *domain)
> +{
> +	const struct iommu_domain_geometry *geometry = &domain->geometry;
> +	struct iommu_domain *iter_domain;
> +	unsigned int new_iova_alignment;
> +	unsigned long index;
> +	int rc;
> +
> +	down_write(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +
> +	xa_for_each (&iopt->domains, index, iter_domain) {
> +		if (WARN_ON(iter_domain == domain)) {
> +			rc = -EEXIST;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	/*
> +	 * The io page size drives the iova_alignment. Internally the iopt_pages
> +	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
> +	 * objects into the iommu_domain.
> +	 *
> +	 * An iommu_domain must always be able to accept PAGE_SIZE to be
> +	 * compatible as we can't guarantee higher contiguity.
> +	 */
> +	new_iova_alignment =
> +		max_t(unsigned long, 1UL << __ffs(domain->pgsize_bitmap),
> +		      iopt->iova_alignment);
> +	if (new_iova_alignment > PAGE_SIZE) {
> +		rc = -EINVAL;
> +		goto out_unlock;
> +	}
> +	if (new_iova_alignment != iopt->iova_alignment) {
> +		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
> +		if (rc)
> +			goto out_unlock;
> +	}
> +
> +	/* No area exists that is outside the allowed domain aperture */
> +	if (geometry->aperture_start != 0) {
> +		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
> +				       domain);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +	if (geometry->aperture_end != ULONG_MAX) {
> +		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
> +				       ULONG_MAX, domain);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +
> +	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
> +	if (rc)
> +		goto out_reserved;
> +
> +	rc = iopt_fill_domain(iopt, domain);
> +	if (rc)
> +		goto out_release;
> +
> +	iopt->iova_alignment = new_iova_alignment;
> +	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
> +	iopt->next_domain_id++;
> +	up_write(&iopt->iova_rwsem);
> +	up_write(&iopt->domains_rwsem);
> +	return 0;
> +out_release:
> +	xa_release(&iopt->domains, iopt->next_domain_id);
> +out_reserved:
> +	iopt_remove_reserved_iova(iopt, domain);
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	up_write(&iopt->domains_rwsem);
> +	return rc;
> +}
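
To make the alignment rule above concrete: __ffs() returns the lowest set bit,
so the smallest IO page size the domain supports becomes the alignment. A
standalone sketch, not part of the patch:

	/*
	 * e.g. pgsize_bitmap = SZ_4K | SZ_2M | SZ_1G gives __ffs() == 12, so the
	 * alignment is 4096 and the domain is accepted on a 4k PAGE_SIZE kernel.
	 * A domain whose smallest page is 16k would yield 16384 > PAGE_SIZE and
	 * be rejected with -EINVAL by iopt_table_add_domain() above.
	 */
	static unsigned long example_domain_alignment(struct iommu_domain *domain)
	{
		return 1UL << __ffs(domain->pgsize_bitmap);
	}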
> +
> +void iopt_table_remove_domain(struct io_pagetable *iopt,
> +			      struct iommu_domain *domain)
> +{
> +	struct iommu_domain *iter_domain = NULL;
> +	unsigned long new_iova_alignment;
> +	unsigned long index;
> +
> +	down_write(&iopt->domains_rwsem);
> +	down_write(&iopt->iova_rwsem);
> +
> +	xa_for_each (&iopt->domains, index, iter_domain)
> +		if (iter_domain == domain)
> +			break;
> +	if (WARN_ON(iter_domain != domain) || index >= iopt->next_domain_id)
> +		goto out_unlock;
> +
> +	/*
> +	 * Compress the xarray to keep it linear by swapping the entry to erase
> +	 * with the tail entry and shrinking the tail.
> +	 */
> +	iopt->next_domain_id--;
> +	iter_domain = xa_erase(&iopt->domains, iopt->next_domain_id);
> +	if (index != iopt->next_domain_id)
> +		xa_store(&iopt->domains, index, iter_domain, GFP_KERNEL);
> +
> +	iopt_unfill_domain(iopt, domain);
> +	iopt_remove_reserved_iova(iopt, domain);
> +
> +	/* Recalculate the iova alignment without the domain */
> +	new_iova_alignment = 1;
> +	xa_for_each (&iopt->domains, index, iter_domain)
> +		new_iova_alignment = max_t(unsigned long,
> +					   1UL << __ffs(iter_domain->pgsize_bitmap),
> +					   new_iova_alignment);
> +	if (!WARN_ON(new_iova_alignment > iopt->iova_alignment))
> +		iopt->iova_alignment = new_iova_alignment;
> +
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	up_write(&iopt->domains_rwsem);
> +}
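
The "swap with the tail entry and shrink" comment above is the usual O(1)
removal from a dense array, just expressed with xarray operations. The same
idea in plain C, for illustration only:

	/* Remove entries[idx] from a dense array of *count entries in O(1) */
	static void swap_remove(void **entries, unsigned int *count, unsigned int idx)
	{
		(*count)--;
		/* Overwrite the erased slot with the last entry, then drop the tail */
		if (idx != *count)
			entries[idx] = entries[*count];
		entries[*count] = NULL;
	}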
> +
> +/* Narrow the usable IOVA space by reserving the ranges a group cannot use */
> +int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
> +					  struct iommu_group *group,
> +					  phys_addr_t *sw_msi_start)
> +{
> +	struct iommu_resv_region *resv;
> +	struct iommu_resv_region *tmp;
> +	LIST_HEAD(group_resv_regions);
> +	int rc;
> +
> +	down_write(&iopt->iova_rwsem);
> +	rc = iommu_get_group_resv_regions(group, &group_resv_regions);
> +	if (rc)
> +		goto out_unlock;
> +
> +	list_for_each_entry (resv, &group_resv_regions, list) {
> +		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
> +			continue;
> +
> +		/*
> +		 * The presence of any 'real' MSI regions should take precedence
> +		 * over the software-managed one if the IOMMU driver happens to
> +		 * advertise both types.
> +		 */
> +		if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
> +			*sw_msi_start = 0;
> +			sw_msi_start = NULL;
> +		}
> +		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
> +			*sw_msi_start = resv->start;
> +
> +		rc = iopt_reserve_iova(iopt, resv->start,
> +				       resv->length - 1 + resv->start, group);
> +		if (rc)
> +			goto out_reserved;
> +	}
> +	rc = 0;
> +	goto out_free_resv;
> +
> +out_reserved:
> +	iopt_remove_reserved_iova(iopt, group);
> +out_free_resv:
> +	list_for_each_entry_safe (resv, tmp, &group_resv_regions, list)
> +		kfree(resv);
> +out_unlock:
> +	up_write(&iopt->iova_rwsem);
> +	return rc;
> +}
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 2f1301d39bba7c..bcf08e61bc87e9 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -9,6 +9,9 @@
>   #include <linux/refcount.h>
>   #include <linux/uaccess.h>
>   
> +struct iommu_domain;
> +struct iommu_group;
> +
>   /*
>    * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
>    * domains and permits sharing of PFNs between io_pagetable instances. This
> @@ -27,8 +30,40 @@ struct io_pagetable {
>   	struct rw_semaphore iova_rwsem;
>   	struct rb_root_cached area_itree;
>   	struct rb_root_cached reserved_iova_itree;
> +	unsigned long iova_alignment;
>   };
>   
> +int iopt_init_table(struct io_pagetable *iopt);
> +void iopt_destroy_table(struct io_pagetable *iopt);
> +struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
> +				  unsigned long *start_byte,
> +				  unsigned long length);
> +enum { IOPT_ALLOC_IOVA = 1 << 0 };
> +int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
> +			void __user *uptr, unsigned long length, int iommu_prot,
> +			unsigned int flags);
> +int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
> +		   unsigned long *dst_iova, unsigned long start_byte,
> +		   unsigned long length, int iommu_prot, unsigned int flags);
> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> +		    unsigned long length);
> +int iopt_unmap_all(struct io_pagetable *iopt);
> +
> +int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
> +		      unsigned long npages, struct page **out_pages, bool write);
> +void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> +			 size_t length);
> +int iopt_table_add_domain(struct io_pagetable *iopt,
> +			  struct iommu_domain *domain);
> +void iopt_table_remove_domain(struct io_pagetable *iopt,
> +			      struct iommu_domain *domain);
> +int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
> +					  struct iommu_group *group,
> +					  phys_addr_t *sw_msi_start);
> +int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
> +		      unsigned long last, void *owner);
> +void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
> +
>   struct iommufd_ctx {
>   	struct file *filp;
>   	struct xarray objects;

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-04-13 14:02     ` Yi Liu
@ 2022-04-13 14:36       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-04-13 14:36 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:
> > +/**
> > + * iopt_unmap_iova() - Remove a range of iova
> > + * @iopt: io_pagetable to act on
> > + * @iova: Starting iova to unmap
> > + * @length: Number of bytes to unmap
> > + *
> > + * The requested range must exactly match an existing range.
> > + * Splitting/truncating IOVA mappings is not allowed.
> > + */
> > +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> > +		    unsigned long length)
> > +{
> > +	struct iopt_pages *pages;
> > +	struct iopt_area *area;
> > +	unsigned long iova_end;
> > +	int rc;
> > +
> > +	if (!length)
> > +		return -EINVAL;
> > +
> > +	if (check_add_overflow(iova, length - 1, &iova_end))
> > +		return -EOVERFLOW;
> > +
> > +	down_read(&iopt->domains_rwsem);
> > +	down_write(&iopt->iova_rwsem);
> > +	area = iopt_find_exact_area(iopt, iova, iova_end);
> 
> when testing vIOMMU with Qemu using iommufd, I hit a problem as log #3
> shows. Qemu failed when trying to do map due to an IOVA still in use.
> After debugging, the 0xfffff000 IOVA is mapped but not unmapped. But per log
> #2, Qemu has issued unmap with a larger range (0xff000000 -
> 0x100000000) which includes the 0xfffff000. But iopt_find_exact_area()
> doesn't find any area. So 0xfffff000 is not unmapped. Is this correct? Same
> test passed with vfio iommu type1 driver. any idea?

There are a couple of good reasons why the iopt_unmap_iova() should
process any contiguous range of fully contained areas, so I would
consider this something worth fixing. Can you send a small patch and
test case and I'll fold it in?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread
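
For reference, the failing pattern discussed above boils down to the ioctl
sequence below. This is only a userspace sketch; the struct layout and ioctl
names follow the selftest later in this thread, and iommufd/ioas_id/buffer
are assumed to have been set up beforehand:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>	/* uAPI header added by this series */

	/* Map a single page high in the 32-bit space, then unmap a larger window */
	static int repro(int iommufd, uint32_t ioas_id, void *buffer)
	{
		struct iommu_ioas_map map = {
			.size = sizeof(map),
			.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
			.ioas_id = ioas_id,
			.user_va = (uintptr_t)buffer,
			.iova = 0xfffff000,
			.length = 0x1000,
		};
		struct iommu_ioas_unmap unmap = {
			.size = sizeof(unmap),
			.ioas_id = ioas_id,
			.iova = 0xff000000,
			.length = 0x100000000ULL - 0xff000000,
		};

		if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
			return -1;
		/*
		 * With the exact-match rule the window does not match any single
		 * area, the unmap fails, and the page at 0xfffff000 stays mapped.
		 */
		return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
	}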

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-04-13 14:36       ` Jason Gunthorpe via iommu
@ 2022-04-13 14:49         ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-13 14:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

On 2022/4/13 22:36, Jason Gunthorpe wrote:
> On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:
>>> +/**
>>> + * iopt_unmap_iova() - Remove a range of iova
>>> + * @iopt: io_pagetable to act on
>>> + * @iova: Starting iova to unmap
>>> + * @length: Number of bytes to unmap
>>> + *
>>> + * The requested range must exactly match an existing range.
>>> + * Splitting/truncating IOVA mappings is not allowed.
>>> + */
>>> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>>> +		    unsigned long length)
>>> +{
>>> +	struct iopt_pages *pages;
>>> +	struct iopt_area *area;
>>> +	unsigned long iova_end;
>>> +	int rc;
>>> +
>>> +	if (!length)
>>> +		return -EINVAL;
>>> +
>>> +	if (check_add_overflow(iova, length - 1, &iova_end))
>>> +		return -EOVERFLOW;
>>> +
>>> +	down_read(&iopt->domains_rwsem);
>>> +	down_write(&iopt->iova_rwsem);
>>> +	area = iopt_find_exact_area(iopt, iova, iova_end);
>>
>> when testing vIOMMU with Qemu using iommufd, I hit a problem as log #3
>> shows. Qemu failed when trying to do map due to an IOVA still in use.
>> After debugging, the 0xfffff000 IOVA is mapped but not unmapped. But per log
>> #2, Qemu has issued unmap with a larger range (0xff000000 -
>> 0x100000000) which includes the 0xfffff000. But iopt_find_exact_area()
>> doesn't find any area. So 0xfffff000 is not unmapped. Is this correct? Same
>> test passed with vfio iommu type1 driver. any idea?
> 
> There are a couple of good reasons why the iopt_unmap_iova() should
> process any contiguous range of fully contained areas, so I would
> consider this something worth fixing. Can you send a small patch and
> test case and I'll fold it in?

Sure. Just spotted it, so I haven't got a fix patch yet. I may work on
it tomorrow.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 00/12] IOMMUFD Generic interface
  2022-03-18 17:27 ` Jason Gunthorpe via iommu
@ 2022-04-14 10:56   ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-14 10:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

On 2022/3/19 01:27, Jason Gunthorpe wrote:
> iommufd is the user API to control the IOMMU subsystem as it relates to
> managing IO page tables that point at user space memory.
> 
> It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
> container) which is the VFIO specific interface for a similar idea.
> 
> We see a broad need for extended features, some being highly IOMMU device
> specific:
>   - Binding iommu_domain's to PASID/SSID
>   - Userspace page tables, for ARM, x86 and S390
>   - Kernel bypass'd invalidation of user page tables
>   - Re-use of the KVM page table in the IOMMU
>   - Dirty page tracking in the IOMMU
>   - Runtime Increase/Decrease of IOPTE size
>   - PRI support with faults resolved in userspace
> 
> As well as a need to access these features beyond just VFIO, VDPA for
> instance, but other classes of accelerator HW are touching on these areas
> now too.
> 
> The v1 series proposed re-using the VFIO type 1 data structure, however it
> was suggested that if we are doing this big update then we should also
> come with a data structure that solves the limitations that VFIO type1
> has. Notably this addresses:
> 
>   - Multiple IOAS/'containers' and multiple domains inside a single FD
> 
>   - Single-pin operation no matter how many domains and containers use
>     a page
> 
>   - A fine grained locking scheme supporting user managed concurrency for
>     multi-threaded map/unmap
> 
>   - A pre-registration mechanism to optimize vIOMMU use cases by
>     pre-pinning pages
> 
>   - Extended ioctl API that can manage these new objects and exposes
>     domains directly to user space
> 
>   - domains are sharable between subsystems, eg VFIO and VDPA
> 
> The bulk of this code is a new data structure design to track how the
> IOVAs are mapped to PFNs.
> 
> iommufd intends to be general and consumable by any driver that wants to
> DMA to userspace. From a driver perspective it can largely be dropped in
> in-place of iommu_attach_device() and provides a uniform full feature set
> to all consumers.
> 
> As this is a larger project this series is the first step. This series
> provides the iommfd "generic interface" which is designed to be suitable
> for applications like DPDK and VMM flows that are not optimized to
> specific HW scenarios. It is close to being a drop in replacement for the
> existing VFIO type 1.
> 
> This is part two of three for an initial sequence:
>   - Move IOMMU Group security into the iommu layer
>     https://lore.kernel.org/linux-iommu/20220218005521.172832-1-baolu.lu@linux.intel.com/
>   * Generic IOMMUFD implementation
>   - VFIO ability to consume IOMMUFD
>     An early exploration of this is available here:
>      https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6

Eric Auger and I have posted a QEMU RFC based on this branch.

https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-04-13 14:49         ` Yi Liu
@ 2022-04-17 14:56           ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-17 14:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

On 2022/4/13 22:49, Yi Liu wrote:
> On 2022/4/13 22:36, Jason Gunthorpe wrote:
>> On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:
>>>> +/**
>>>> + * iopt_unmap_iova() - Remove a range of iova
>>>> + * @iopt: io_pagetable to act on
>>>> + * @iova: Starting iova to unmap
>>>> + * @length: Number of bytes to unmap
>>>> + *
>>>> + * The requested range must exactly match an existing range.
>>>> + * Splitting/truncating IOVA mappings is not allowed.
>>>> + */
>>>> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>>>> +            unsigned long length)
>>>> +{
>>>> +    struct iopt_pages *pages;
>>>> +    struct iopt_area *area;
>>>> +    unsigned long iova_end;
>>>> +    int rc;
>>>> +
>>>> +    if (!length)
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (check_add_overflow(iova, length - 1, &iova_end))
>>>> +        return -EOVERFLOW;
>>>> +
>>>> +    down_read(&iopt->domains_rwsem);
>>>> +    down_write(&iopt->iova_rwsem);
>>>> +    area = iopt_find_exact_area(iopt, iova, iova_end);
>>>
>>> when testing vIOMMU with Qemu using iommufd, I hit a problem as log #3
>>> shows. Qemu failed when trying to do map due to an IOVA still in use.
>>> After debugging, the 0xfffff000 IOVA is mapped but not unmapped. But per 
>>> log
>>> #2, Qemu has issued unmap with a larger range (0xff000000 -
>>> 0x100000000) which includes the 0xfffff000. But iopt_find_exact_area()
>>> doesn't find any area. So 0xfffff000 is not unmapped. Is this correct? Same
>>> test passed with vfio iommu type1 driver. any idea?
>>
>> There are a couple of good reasons why the iopt_unmap_iova() should
>> process any contiguous range of fully contained areas, so I would
>> consider this something worth fixing. Can you send a small patch and
>> test case and I'll fold it in?
> 
> Sure. Just spotted it, so I haven't got a fix patch yet. I may work on
> it tomorrow.

Hi Jason,

Got the patch below for it. Also pushed it to the exploration branch:

https://github.com/luxis1999/iommufd/commit/d764f3288de0fd52c578684788a437701ec31b2d

 From 22a758c401a1c7f6656625013bb87204c9ea65fe Mon Sep 17 00:00:00 2001
From: Yi Liu <yi.l.liu@intel.com>
Date: Sun, 17 Apr 2022 07:39:03 -0700
Subject: [PATCH] iommufd/io_pagetable: Support unmap fully contained areas

Changes:
- return the unmapped bytes to the caller
- support unmapping fully contained contiguous areas
- add a test case in selftest

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
  drivers/iommu/iommufd/io_pagetable.c    | 90 ++++++++++++-------------
  drivers/iommu/iommufd/ioas.c            |  8 ++-
  drivers/iommu/iommufd/iommufd_private.h |  4 +-
  drivers/iommu/iommufd/vfio_compat.c     |  8 ++-
  include/uapi/linux/iommufd.h            |  2 +-
  tools/testing/selftests/iommu/iommufd.c | 40 +++++++++++
  6 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f9f3b06946bf..5142f797a812 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -315,61 +315,26 @@ static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
  	return 0;
  }

-/**
- * iopt_unmap_iova() - Remove a range of iova
- * @iopt: io_pagetable to act on
- * @iova: Starting iova to unmap
- * @length: Number of bytes to unmap
- *
- * The requested range must exactly match an existing range.
- * Splitting/truncating IOVA mappings is not allowed.
- */
-int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-		    unsigned long length)
-{
-	struct iopt_pages *pages;
-	struct iopt_area *area;
-	unsigned long iova_end;
-	int rc;
-
-	if (!length)
-		return -EINVAL;
-
-	if (check_add_overflow(iova, length - 1, &iova_end))
-		return -EOVERFLOW;
-
-	down_read(&iopt->domains_rwsem);
-	down_write(&iopt->iova_rwsem);
-	area = iopt_find_exact_area(iopt, iova, iova_end);
-	if (!area) {
-		up_write(&iopt->iova_rwsem);
-		up_read(&iopt->domains_rwsem);
-		return -ENOENT;
-	}
-	pages = area->pages;
-	area->pages = NULL;
-	up_write(&iopt->iova_rwsem);
-
-	rc = __iopt_unmap_iova(iopt, area, pages);
-	up_read(&iopt->domains_rwsem);
-	return rc;
-}
-
-int iopt_unmap_all(struct io_pagetable *iopt)
+static int __iopt_unmap_iova_range(struct io_pagetable *iopt,
+				   unsigned long start,
+				   unsigned long end,
+				   unsigned long *unmapped)
  {
  	struct iopt_area *area;
+	unsigned long unmapped_bytes = 0;
  	int rc;

  	down_read(&iopt->domains_rwsem);
  	down_write(&iopt->iova_rwsem);
-	while ((area = iopt_area_iter_first(iopt, 0, ULONG_MAX))) {
+	while ((area = iopt_area_iter_first(iopt, start, end))) {
  		struct iopt_pages *pages;

-		/* Userspace should not race unmap all and map */
-		if (!area->pages) {
-			rc = -EBUSY;
+		if (!area->pages || iopt_area_iova(area) < start ||
+		    iopt_area_last_iova(area) > end) {
+			rc = -ENOENT;
  			goto out_unlock_iova;
  		}
+
  		pages = area->pages;
  		area->pages = NULL;
  		up_write(&iopt->iova_rwsem);
@@ -378,6 +343,10 @@ int iopt_unmap_all(struct io_pagetable *iopt)
  		if (rc)
  			goto out_unlock_domains;

+		start = iopt_area_last_iova(area) + 1;
+		unmapped_bytes +=
+			iopt_area_last_iova(area) - iopt_area_iova(area) + 1;
+
  		down_write(&iopt->iova_rwsem);
  	}
  	rc = 0;
@@ -386,9 +355,40 @@ int iopt_unmap_all(struct io_pagetable *iopt)
  	up_write(&iopt->iova_rwsem);
  out_unlock_domains:
  	up_read(&iopt->domains_rwsem);
+	if (unmapped)
+		*unmapped = unmapped_bytes;
  	return rc;
  }

+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ * @unmapped: Return number of bytes unmapped
+ *
+ * The requested range must exactly match an existing range.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped)
+{
+	unsigned long iova_end;
+
+	if (!length)
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return -EOVERFLOW;
+
+	return __iopt_unmap_iova_range(iopt, iova, iova_end, unmapped);
+}
+
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped)
+{
+	return __iopt_unmap_iova_range(iopt, 0, ULONG_MAX, unmapped);
+}
+
  /**
   * iopt_access_pages() - Return a list of pages under the iova
   * @iopt: io_pagetable to act on
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84b..4e701d053ed6 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -14,7 +14,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
  	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
  	int rc;

-	rc = iopt_unmap_all(&ioas->iopt);
+	rc = iopt_unmap_all(&ioas->iopt, NULL);
  	WARN_ON(rc);
  	iopt_destroy_table(&ioas->iopt);
  	mutex_destroy(&ioas->mutex);
@@ -230,6 +230,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
  {
  	struct iommu_ioas_unmap *cmd = ucmd->cmd;
  	struct iommufd_ioas *ioas;
+	unsigned long unmapped;
  	int rc;

  	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
@@ -237,16 +238,17 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
  		return PTR_ERR(ioas);

  	if (cmd->iova == 0 && cmd->length == U64_MAX) {
-		rc = iopt_unmap_all(&ioas->iopt);
+		rc = iopt_unmap_all(&ioas->iopt, &unmapped);
  	} else {
  		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
  			rc = -EOVERFLOW;
  			goto out_put;
  		}
-		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, &unmapped);
  	}

  out_put:
  	iommufd_put_object(&ioas->obj);
+	cmd->length = unmapped;
  	return rc;
  }
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index f55654278ac4..382704f4d698 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -46,8 +46,8 @@ int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
  		   unsigned long *dst_iova, unsigned long start_byte,
  		   unsigned long length, int iommu_prot, unsigned int flags);
  int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-		    unsigned long length);
-int iopt_unmap_all(struct io_pagetable *iopt);
+		    unsigned long length, unsigned long *unmapped);
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);

  int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
  		      unsigned long npages, struct page **out_pages, bool write);
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index 5b196de00ff9..4539ff45efd9 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -133,6 +133,7 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
  	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
  	struct vfio_iommu_type1_dma_unmap unmap;
  	struct iommufd_ioas *ioas;
+	unsigned long unmapped;
  	int rc;

  	if (copy_from_user(&unmap, arg, minsz))
@@ -146,10 +147,13 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
  		return PTR_ERR(ioas);

  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
-		rc = iopt_unmap_all(&ioas->iopt);
+		rc = iopt_unmap_all(&ioas->iopt, &unmapped);
  	else
-		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size);
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova,
+				     unmap.size, &unmapped);
  	iommufd_put_object(&ioas->obj);
+	unmap.size = unmapped;
+
  	return rc;
  }

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2c0f5ced4173..8cbc6a083156 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -172,7 +172,7 @@ struct iommu_ioas_copy {
   * @size: sizeof(struct iommu_ioas_copy)
   * @ioas_id: IOAS ID to change the mapping of
   * @iova: IOVA to start the unmapping at
- * @length: Number of bytes to unmap
+ * @length: Number of bytes to unmap; on output, the number of bytes unmapped
   *
   * Unmap an IOVA range. The iova/length must exactly match a range
   * used with IOMMU_IOAS_PAGETABLE_MAP, or be the values 0 & U64_MAX.
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed94..42956acd2c04 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -357,6 +357,47 @@ TEST_F(iommufd_ioas, area)
  	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
  }

+TEST_F(iommufd_ioas, unmap_fully_contained_area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	for (i = 0; i != 4; i++) {
+		map_cmd.iova = self->base_iova + i * 16 * PAGE_SIZE;
+		map_cmd.length = 8 * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+
+	/* Unmap not fully contained area doesn't work */
+	unmap_cmd.iova = self->base_iova - 4 * PAGE_SIZE;
+	unmap_cmd.length = 8 * PAGE_SIZE;
+	ASSERT_EQ(ENOENT,
+		  ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	unmap_cmd.iova = self->base_iova + 3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE - 4 * PAGE_SIZE;
+	unmap_cmd.length = 8 * PAGE_SIZE;
+	ASSERT_EQ(ENOENT,
+		  ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Unmap fully contained areas works */
+	unmap_cmd.iova = self->base_iova - 4 * PAGE_SIZE;
+	unmap_cmd.length = 3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE + 4 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	ASSERT_EQ(32, unmap_cmd.length);
+}
+
  TEST_F(iommufd_ioas, area_auto_iova)
  {
  	struct iommu_test_cmd test_cmd = {
-- 
2.27.0

-- 
Regards,
Yi Liu

^ permalink raw reply related	[flat|nested] 244+ messages in thread
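
With the length write-back in the patch above, userspace can tell how much was
actually torn down. A minimal sketch of how a VMM might use it (assuming the
kernel copies the updated command back to userspace, as the selftest's check on
unmap_cmd.length expects):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	/* Unmap a window and report how many bytes were actually unmapped */
	static int unmap_window(int iommufd, uint32_t ioas_id, uint64_t iova,
				uint64_t length, uint64_t *unmapped)
	{
		struct iommu_ioas_unmap cmd = {
			.size = sizeof(cmd),
			.ioas_id = ioas_id,
			.iova = iova,
			.length = length,
		};

		if (ioctl(iommufd, IOMMU_IOAS_UNMAP, &cmd))
			return -1;
		/* The kernel rewrites .length with the number of bytes unmapped */
		*unmapped = cmd.length;
		return 0;
	}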

* Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping
  2022-04-17 14:56           ` Yi Liu
@ 2022-04-18 10:47             ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-04-18 10:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

Hi Jason,

On 2022/4/17 22:56, Yi Liu wrote:
> On 2022/4/13 22:49, Yi Liu wrote:
>> On 2022/4/13 22:36, Jason Gunthorpe wrote:
>>> On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:
>>>>> +/**
>>>>> + * iopt_unmap_iova() - Remove a range of iova
>>>>> + * @iopt: io_pagetable to act on
>>>>> + * @iova: Starting iova to unmap
>>>>> + * @length: Number of bytes to unmap
>>>>> + *
>>>>> + * The requested range must exactly match an existing range.
>>>>> + * Splitting/truncating IOVA mappings is not allowed.
>>>>> + */
>>>>> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>>>>> +            unsigned long length)
>>>>> +{
>>>>> +    struct iopt_pages *pages;
>>>>> +    struct iopt_area *area;
>>>>> +    unsigned long iova_end;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!length)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (check_add_overflow(iova, length - 1, &iova_end))
>>>>> +        return -EOVERFLOW;
>>>>> +
>>>>> +    down_read(&iopt->domains_rwsem);
>>>>> +    down_write(&iopt->iova_rwsem);
>>>>> +    area = iopt_find_exact_area(iopt, iova, iova_end);
>>>>
>>>> when testing vIOMMU with Qemu using iommufd, I hit a problem as log #3
>>>> shows. Qemu failed when trying to do map due to an IOVA still in use.
>>>> After debugging, the 0xfffff000 IOVA is mapped but not unmapped. But 
>>>> per log
>>>> #2, Qemu has issued unmap with a larger range (0xff000000 -
>>>> 0x100000000) which includes the 0xfffff000. But iopt_find_exact_area()
>>>> doesn't find any area. So 0xfffff000 is not unmapped. Is this correct? 
>>>> Same
>>>> test passed with vfio iommu type1 driver. any idea?
>>>
>>> There are a couple of good reasons why iopt_unmap_iova() should
>>> process any contiguous range of fully contained areas, so I would
>>> consider this something worth fixing. Can you send a small patch and
>>> test case and I'll fold it in?
>>
>> Sure. I just spotted it, so I haven't got a fix patch yet. I may work on
>> it tomorrow.
> 
> Hi Jason,
> 
> Got the below patch for it. I also pushed it to the exploration branch.
> 
> https://github.com/luxis1999/iommufd/commit/d764f3288de0fd52c578684788a437701ec31b2d 

0-day reported a use-without-initialization issue to me, so I updated the
patch. Please take the change from the commit below. Sorry for the noise.

https://github.com/luxis1999/iommufd/commit/10674417c235cb4a4caf2202fffb078611441da2
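A minimal sketch of the direction being discussed here - letting
iopt_unmap_iova() tear down every area that is fully contained in the
requested range instead of demanding an exact match. The helper names
(iopt_find_area_in_range(), __iopt_unmap_iova()) are assumptions for
illustration only, not the code in the commit above:

static int iopt_unmap_contained_areas(struct io_pagetable *iopt,
                                      unsigned long iova,
                                      unsigned long iova_end)
{
        struct iopt_area *area;
        int rc = -ENOENT;

        down_read(&iopt->domains_rwsem);
        down_write(&iopt->iova_rwsem);
        /* Tear down every area lying fully inside [iova, iova_end] */
        while ((area = iopt_find_area_in_range(iopt, iova, iova_end))) {
                if (iopt_area_iova(area) < iova ||
                    iopt_area_last_iova(area) > iova_end) {
                        /* Splitting/truncating an area is still refused */
                        rc = -EINVAL;
                        break;
                }
                rc = __iopt_unmap_iova(iopt, area, area->pages);
                if (rc)
                        break;
        }
        up_write(&iopt->iova_rwsem);
        up_read(&iopt->domains_rwsem);
        return rc;
}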

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-03-31 12:58       ` Jason Gunthorpe
@ 2022-04-28  5:58         ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-04-28  5:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> 
> > > +/**
> > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > + * @ioas_id: IOAS ID to read ranges from
> > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > + * @__reserved: Must be 0
> > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > + *                   of out_num_iovas or the length implied by size.
> > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > + *
> > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > + * size should include the allocated flex array.
> > > + */
> > > +struct iommu_ioas_iova_ranges {
> > > +	__u32 size;
> > > +	__u32 ioas_id;
> > > +	__u32 out_num_iovas;
> > > +	__u32 __reserved;
> > > +	struct iommu_valid_iovas {
> > > +		__aligned_u64 start;
> > > +		__aligned_u64 last;
> > > +	} out_valid_iovas[];
> > > +};
> > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > 
> > Is the information returned by this valid for the lifetime of the IOAS,
> > or can it change?  If it can change, what events can change it?
> >
> > If it *can't* change, then how do we have enough information to
> > determine this at ALLOC time, since we don't necessarily know which
> > (if any) hardware IOMMU will be attached to it.
> 
> It is a good point worth documenting. It can change. Particularly
> after any device attachment.

Right.. this is vital and needs to be front and centre in the
comments/docs here.  Really, I think an interface that *doesn't* have
magically changing status would be better (which is why I was
advocating that the user set the constraints, and the kernel supplied
or failed outright).  Still I recognize that has its own problems.

> I added this:
> 
>  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
>  * is not allowed. out_num_iovas will be set to the total number of iovas and
>  * the out_valid_iovas[] will be filled in as space permits. size should include
>  * the allocated flex array.
>  *
>  * The allowed ranges are dependent on the HW path the DMA operation takes, and
>  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
>  * full range, and each attached device will narrow the ranges based on that
>  * device's HW restrictions.

I think you need to be even more explicit about this: which exact
operations on the fd can invalidate exactly which items in the
information from this call?  Can it only ever be narrowed, or can it
be broadened with any operations?

> > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > 
> > Since it can only copy a single mapping, what's the benefit of this
> > over just repeating an IOAS_MAP in the new IOAS?
> 
> It causes the underlying pin accounting to be shared and can avoid
> calling GUP entirely.

If that's the only purpose, then that needs to be right here in the
comments too.  So is expected best practice to IOAS_MAP everything you
might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
mappings you actually end up wanting into the "real" IOASes for use?

Seems like it would be nicer for the interface to just figure it out
for you: I can see there being sufficient complications with that to
have this slightly awkward interface, but I think it needs a rationale
to accompany it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-04-28  5:58         ` David Gibson
@ 2022-04-28 14:22           ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-04-28 14:22 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > 
> > > > +/**
> > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > + * @ioas_id: IOAS ID to read ranges from
> > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > + * @__reserved: Must be 0
> > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > > + *                   of out_num_iovas or the length implied by size.
> > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > + *
> > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > + * size should include the allocated flex array.
> > > > + */
> > > > +struct iommu_ioas_iova_ranges {
> > > > +	__u32 size;
> > > > +	__u32 ioas_id;
> > > > +	__u32 out_num_iovas;
> > > > +	__u32 __reserved;
> > > > +	struct iommu_valid_iovas {
> > > > +		__aligned_u64 start;
> > > > +		__aligned_u64 last;
> > > > +	} out_valid_iovas[];
> > > > +};
> > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > 
> > > Is the information returned by this valid for the lifetime of the IOAS,
> > > or can it change?  If it can change, what events can change it?
> > >
> > > If it *can't* change, then how do we have enough information to
> > > determine this at ALLOC time, since we don't necessarily know which
> > > (if any) hardware IOMMU will be attached to it.
> > 
> > It is a good point worth documenting. It can change. Particularly
> > after any device attachment.
> 
> Right.. this is vital and needs to be front and centre in the
> comments/docs here.  Really, I think an interface that *doesn't* have
> magically changing status would be better (which is why I was
> advocating that the user set the constraints, and the kernel supplied
> or failed outright).  Still I recognize that has its own problems.

That is a neat idea; it could be a nice option, since it lets userspace
further customize the kernel allocator.

But I don't have a use case in mind. The simpler users I know about
want to attach their devices and then allocate valid IOVA; they don't
really have a notion of which IOVA regions they are willing to accept,
and don't necessarily do hotplug.

What might be interesting is to have some option to load in
machine-specific default ranges - ie the union of every group and every
iommu_domain. The idea being that after such a call, hotplug of a
device would be very likely to succeed.

Though I don't have a user in mind..

> > I added this:
> > 
> >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> >  * is not allowed. out_num_iovas will be set to the total number of iovas and
> >  * the out_valid_iovas[] will be filled in as space permits. size should include
> >  * the allocated flex array.
> >  *
> >  * The allowed ranges are dependent on the HW path the DMA operation takes, and
> >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> >  * full range, and each attached device will narrow the ranges based on that
> >  * device's HW restrictions.
> 
> I think you need to be even more explicit about this: which exact
> operations on the fd can invalidate exactly which items in the
> information from this call?  Can it only ever be narrowed, or can it
> be broadened with any operations?

I think "attach" is the phrase we are using for that operation - it is
not a specific IOCTL here because it happens on, say, the VFIO device FD.

Let's add "detatching a device can widen the ranges. Userspace should
query ranges after every attach/detatch to know what IOVAs are valid
for mapping."

> > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > 
> > > Since it can only copy a single mapping, what's the benefit of this
> > > over just repeating an IOAS_MAP in the new IOAS?
> > 
> > It causes the underlying pin accounting to be shared and can avoid
> > calling GUP entirely.
> 
> If that's the only purpose, then that needs to be right here in the
> comments too.  So is expected best practice to IOAS_MAP everything you
> might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> mappings you actually end up wanting into the "real" IOASes for use?

That is one possibility, yes. qemu seems to be using this to establish
a clone IOAS of an existing operational one, which is another usage
model.

I added this additionally:

 * This may be used to efficiently clone a subset of an IOAS to another, or as a
 * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
 * establishing equivalent new mappings, as internal resources are shared, and
 * the kernel will pin the user memory only once.
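A hedged sketch of the "scratch IOAS, then copy" pattern this describes:
map (and pin) the memory once into a pre-registration-style IOAS, then
populate the HW-backed IOAS with IOAS_COPY so the hot path never calls
GUP again. Field and flag names follow my reading of this series' uapi
(plus <stdint.h> and <sys/ioctl.h>), so treat them as assumptions:

static int map_once_then_copy(int iommufd, __u32 prereg_ioas, __u32 hw_ioas,
                              void *buf, __aligned_u64 len, __aligned_u64 iova)
{
        struct iommu_ioas_map map = {
                .size = sizeof(map),
                .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                .ioas_id = prereg_ioas,
                .user_va = (uintptr_t)buf,
                .length = len,
        };
        struct iommu_ioas_copy copy = {
                .size = sizeof(copy),
                .flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
                         IOMMU_IOAS_MAP_WRITEABLE,
                .dst_ioas_id = hw_ioas,
                .src_ioas_id = prereg_ioas,
                .length = len,
                .dst_iova = iova,
        };

        /* GUP and pin accounting happen only here, in the scratch IOAS */
        if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
                return -1;
        /* Kernel-chosen IOVA in the scratch IOAS becomes the copy source */
        copy.src_iova = map.iova;
        /* Populate the HW-backed IOAS by sharing the existing pin */
        return ioctl(iommufd, IOMMU_IOAS_COPY, &copy);
}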

> Seems like it would be nicer for the interface to just figure it out
> for you: I can see there being sufficient complications with that to
> have this slightly awkward interface, but I think it needs a rationale
> to accompany it.

It is more than just complications: the kernel has no way to accurately
know when a user pointer is an alias of an existing user pointer or is
something new, because the mm has become incoherent.

It is possible that uncoordinated modules in userspace could
experience data corruption if the wrong decision is made - mm
coherence with pinning is pretty weak in Linux. Since I dislike that
kind of unreliable magic, I made it explicit.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-03-24 22:04         ` Alex Williamson
@ 2022-04-28 14:53           ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-04-28 14:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > 
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?  
> > 
> > Other than SPAPR, I think we can.
> 
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

There are a few different levels to consider for dealing with PPC.
For a suitable long term interface for ppc hosts and guests dropping
this is fine: the ppc specific iommu model was basically an
ill-conceived idea from the beginning, because none of us had
sufficiently understood what things were general and what things were
iommu model/hw specific.

..mostly.  There are several points of divergence for the ppc iommu
model.

1) Limited IOVA windows.  This one turned out to not really be ppc
specific, and is (rightly) handled generically in the new interface.
No problem here.

2) Costly GUPs.  pseries (the most common ppc machine type) always
expects a (v)IOMMU.  That means that unlike the common x86 model of a
host with IOMMU, but guests with no-vIOMMU, guest initiated
maps/unmaps can be a hot path.  Accounting in that path can be
prohibitive (and on POWER8 in particular it prevented us from
optimizing that path the way we wanted).  We had two solutions for
that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
based on the IOVA window sizes.  That was improved in the v2 which
used the concept of preregistration.  IIUC iommufd can achieve the
same effect as preregistration using IOAS_COPY, so this one isn't
really a problem either.

3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
windows, which aren't contiguous with each other.  The base addresses
of each of these are fixed, but the size of each window, the pagesize
(i.e. granularity) of each window and the number of levels in the
IOMMU pagetable are runtime configurable.  Because it's true in the
hardware, it's also true of the vIOMMU interface defined by the IBM
hypervisor (and adopted by KVM as well).  So, guests can request
changes in how these windows are handled.  Typical Linux guests will
use the "low" window (IOVA 0..2GiB) dynamically, and the high window
(IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
can't count on that; the guest can use them however it wants.


(3) still needs a plan for how to fit it into the /dev/iommufd model.
This is a secondary reason that in the past I advocated for the user
requesting specific DMA windows which the kernel would accept or
refuse, rather than having a query function - it connects easily to
the DDW model.  With the query-first model we'd need some sort of
extension here, not really sure what it should look like.



Then, there's handling existing qemu (or other software) that is using
the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
should be a goal or not: as others have noted, working actively to
port qemu to the new interface at the same time as making a
comprehensive in-kernel compat layer is arguably redundant work.

That said, if we did want to handle this in an in-kernel compat layer,
here's roughly what you'd need for SPAPR_TCE v2:

- VFIO_IOMMU_SPAPR_TCE_GET_INFO
    I think this should be fairly straightforward; the information you
    need should be in the now generic IOVA window stuff and would just
    need massaging into the expected format.
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY /
  VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
    IIUC, these could be translated into map/unmap operations onto a
    second implicit IOAS which represents the preregistered memory
    areas (to which we'd never connect an actual device).  Along with
    this VFIO_MAP and VFIO_UNMAP operations would need to check for
    this case, verify their addresses against the preregistered space
    and be translated into IOAS_COPY operations from the prereg
    address space instead of raw IOAS_MAP operations.  Fiddly, but not
    fundamentally hard, I think.

For SPAPR_TCE_v1 things are a bit trickier

- VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE
    I suspect you could get away with implementing these as no-ops.
    It wouldn't be strictly correct, but I think software which is
    using the interface correctly should work this way, though
    possibly not optimally.  That might be good enough for this ugly
    old interface.

And... then there's VFIO_EEH_PE_OP.  It's very hard to know what to do
with this because the interface was completely broken for most of its
lifetime.  EEH is a fancy error handling feature of IBM PCI hardware
somewhat similar in concept, though not interface, to PCIe AER.  I have
a very strong impression that while this was a much-touted checkbox
feature for RAS, no-one, ever. actually used it.  As evidenced by the
fact that there was, I believe over a *decade* in which all the
interfaces were completely broken by design, and apparently no-one
noticed.

So, cynically, you could probably get away with making this a no-op as
well.  If you wanted to do it properly... well... that would require
training up yet another person to actually understand this and hoping
they get it done before they run screaming.  This one gets very ugly
because the EEH operations have to operate on the hardware (or
firmware) "Partitionable Endpoints" (PEs) which correspond one to one
with IOMMU groups, but not necessarily with VFIO containers, and
there's not really any sensible way to expose that to users.

You might be able to do this by simply failing this outright if
there's anything other than exactly one IOMMU group bound to the
container / IOAS (which I think might be what VFIO itself does now).
Handling that with a device centric API gets somewhat fiddlier, of
course.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-28 14:53           ` David Gibson
@ 2022-04-28 15:10             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-04-28 15:10 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:

> 2) Costly GUPs.  pseries (the most common ppc machine type) always
> expects a (v)IOMMU.  That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated
> maps/unmaps can be a hot path.  Accounting in that path can be
> prohibitive (and on POWER8 in particular it prevented us from
> optimizing that path the way we wanted).  We had two solutions for
> that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> based on the IOVA window sizes.  That was improved in the v2 which
> used the concept of preregistration.  IIUC iommufd can achieve the
> same effect as preregistration using IOAS_COPY, so this one isn't
> really a problem either.

I think PPC and S390 are solving the same problem here. I think S390
is going to go to a SW nested model where it has an iommu_domain
controlled by iommufd that is populated with the pinned pages, eg
stored in an xarray.

Then the performance map/unmap path is simply copying pages from the
xarray to the real IOPTEs - and this would be modeled as a nested
iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.

Perhaps this is agreeable for PPC too?
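A very rough sketch of what that SW-nested map path could look like; every
name here is invented for illustration (the real S390 design may differ),
while xa_load() and page_to_phys() are the normal kernel helpers:

/*
 * Guest-controlled map path of a SW-nested domain: the parent
 * iommufd-controlled domain holds the pinned pages in an xarray indexed
 * by parent-IOVA page index, and nested map just copies them into the
 * real IOPTEs.
 */
static int sw_nested_map(struct sw_nested_domain *nd, unsigned long nested_iova,
                         unsigned long parent_iova, unsigned long npages)
{
        unsigned long i;

        for (i = 0; i != npages; i++) {
                struct page *page = xa_load(&nd->parent->pinned_pages,
                                            (parent_iova >> PAGE_SHIFT) + i);

                if (!page)
                        return -ENOENT; /* not pre-pinned by the parent */
                nd->set_iopte(nd, nested_iova + i * PAGE_SIZE,
                              page_to_phys(page));
        }
        return 0;
}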

> 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other.  The base addresses
> of each of these are fixed, but the size of each window, the pagesize
> (i.e. granularity) of each window and the number of levels in the
> IOMMU pagetable are runtime configurable.  Because it's true in the
> hardware, it's also true of the vIOMMU interface defined by the IBM
> hypervisor (and adopted by KVM as well).  So, guests can request
> changes in how these windows are handled.  Typical Linux guests will
> use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> can't count on that; the guest can use them however it wants.

As part of nesting iommufd will have a 'create iommu_domain using
iommu driver specific data' primitive.

The driver specific data for PPC can include a description of these
windows so the PPC specific qemu driver can issue this new ioctl
using the information provided by the guest.

The main issue is that internally to the iommu subsystem the
iommu_domain aperture is assumed to be a single window. This kAPI will
have to be improved to model the PPC multi-window iommu_domain.
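Purely as a strawman for what that driver-specific creation data could
carry - every name below is hypothetical, nothing like it exists in this
series:

/* Hypothetical driver-specific alloc data for a PPC multi-window domain */
struct iommu_hwpt_alloc_ppc {
        __u32 num_windows;
        __u32 __reserved;
        struct {
                __aligned_u64 base_iova;        /* fixed base of the window */
                __aligned_u64 size;             /* runtime-configurable size */
                __u32 pgsize_shift;             /* granularity of this window */
                __u32 levels;                   /* IOMMU pagetable levels */
        } windows[];
};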

If this API is not used then the PPC driver should choose some
sensible default windows that make things like DPDK happy.

> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.

At the moment I think I would stick with not including the SPAPR
interfaces in vfio_compat, but there does seem to be a path if someone
with HW wants to build and test them?

> You might be able to do this by simply failing this outright if
> there's anything other than exactly one IOMMU group bound to the
> container / IOAS (which I think might be what VFIO itself does now).
> Handling that with a device centric API gets somewhat fiddlier, of
> course.

Maybe every device gets a copy of the error notification?

ie maybe this should be part of vfio_pci and not part of iommufd to
mirror how AER works?

It feels strange to put device error notification into iommufd - is
that connected to the IOMMU?

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-28 15:10             ` Jason Gunthorpe via iommu
@ 2022-04-29  1:21               ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-04-29  1:21 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 28, 2022 11:11 PM
> 
> 
> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> 2 IOVA
> > windows, which aren't contiguous with each other.  The base addresses
> > of each of these are fixed, but the size of each window, the pagesize
> > (i.e. granularity) of each window and the number of levels in the
> > IOMMU pagetable are runtime configurable.  Because it's true in the
> > hardware, it's also true of the vIOMMU interface defined by the IBM
> > hypervisor (and adopted by KVM as well).  So, guests can request
> > changes in how these windows are handled.  Typical Linux guests will
> > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > can't count on that; the guest can use them however it wants.
> 
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
> 
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.
> 
> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.
> 

From the point of view of nesting, probably each window can be a separate
domain, and then the existing single-window aperture should still work?

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-04-28 14:22           ` Jason Gunthorpe via iommu
@ 2022-04-29  6:00             ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-04-29  6:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Apr 28, 2022 at 11:22:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> > On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > > 
> > > > > +/**
> > > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > > + * @ioas_id: IOAS ID to read ranges from
> > > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > > + * @__reserved: Must be 0
> > > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > > > + *                   of out_num_iovas or the length implied by size.
> > > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > > + *
> > > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > > + * size should include the allocated flex array.
> > > > > + */
> > > > > +struct iommu_ioas_iova_ranges {
> > > > > +	__u32 size;
> > > > > +	__u32 ioas_id;
> > > > > +	__u32 out_num_iovas;
> > > > > +	__u32 __reserved;
> > > > > +	struct iommu_valid_iovas {
> > > > > +		__aligned_u64 start;
> > > > > +		__aligned_u64 last;
> > > > > +	} out_valid_iovas[];
> > > > > +};
> > > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > > 
> > > > Is the information returned by this valid for the lifetime of the IOAS,
> > > > or can it change?  If it can change, what events can change it?
> > > >
> > > > If it *can't* change, then how do we have enough information to
> > > > determine this at ALLOC time, since we don't necessarily know which
> > > > (if any) hardware IOMMU will be attached to it.
> > > 
> > > It is a good point worth documenting. It can change. Particularly
> > > after any device attachment.
> > 
> > Right.. this is vital and needs to be front and centre in the
> > comments/docs here.  Really, I think an interface that *doesn't* have
> > magically changing status would be better (which is why I was
> > advocating that the user set the constraints, and the kernel supplied
> > or failed outright).  Still I recognize that has its own problems.
> 
> That is a neat idea, it could be a nice option, it lets userspace
> further customize the kernel allocator.
> 
> But I don't have a use case in mind? The simplified things I know
> about want to attach their devices then allocate valid IOVA, they
> don't really have a notion about what IOVA regions they are willing to
> accept, or necessarily do hotplug.

The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
emulation code knows the IOVA windows that are expected of the vIOMMU
(because that's a property of the emulated platform), and requests
them of the host IOMMU.  If the host can supply that, you're good
(this doesn't necessarily mean the host windows match exactly, just
that the requested windows fit within the host windows).  If not,
you report an error.  This can be done at any point when the host
windows might change - so try to attach a device that can't support
the requested windows, and it will fail.  Attach a device which
shrinks the host windows but still fits the requested windows within
them, and you're still good to go.

For a typical direct userspace case you don't want that.  However, it
probably *does* make sense for userspace to specify how large a window
it wants.  So some form that allows you to specify size without base
address also makes sense.  In that case the kernel would set a base
address according to the host IOMMU's capabilities, or fail if it
can't supply any window of the requested size.  When to allocate that
base address is a bit unclear though.  If you do it at window request
time, then you might pick something that a later device can't work
with.  If you do it later, it's less clear how to sensibly report it
to userspace.

One option might be to only allow IOAS_MAP (or COPY) operations after
windows are requested, but otherwise you can choose the order.  So,
things that have strict requirements for the windows (vIOMMU
emulation) would request the windows then add devices: they know the
windows they need, if the devices can't work with that, that's what
needs to fail.  A userspace driver, however, would attach the devices
it wants to use, then request a window (without specifying base
address).

A query ioctl to give the largest possible windows in the current
state could still be useful for debugging here, of course, but
wouldn't need to be used in the normal course of operation.
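
Just to make the ordering above concrete, a window-request interface of
roughly the shape below would do.  This is purely illustrative - no such
ioctl or struct exists in this RFC, and every name here is invented:

#include <linux/types.h>

/* Hypothetical, for illustration only: ask the kernel to guarantee an
 * IOVA window.  vIOMMU emulation would issue this with a fixed base
 * before attaching devices; a userspace driver would attach devices
 * first and then ask for a size only, letting the kernel pick the base.
 */
struct iommu_ioas_request_window {
	__u32 size;		/* sizeof(struct iommu_ioas_request_window) */
	__u32 ioas_id;
	__u32 flags;
#define IOMMU_IOAS_REQ_WINDOW_SIZE_ONLY (1 << 0)	/* kernel chooses base */
	__u32 __reserved;
	__aligned_u64 base;	/* ignored with SIZE_ONLY, else required base */
	__aligned_u64 length;
	__aligned_u64 out_base;	/* base actually granted by the kernel */
};

With something like this, a later attach that cannot honour an already
granted window would simply fail instead of silently narrowing the
ranges.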

> What might be interesting is to have some option to load in machine
> specific default ranges - ie the union of every group and every
> iommu_domain. The idea being that after such a call hotplug of a
> device should be very likely to succeed.
> 
> Though I don't have a user in mind..
> 
> > > I added this:
> > > 
> > >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> > >  * is not allowed. out_num_iovas will be set to the total number of iovas and
> > >  * the out_valid_iovas[] will be filled in as space permits. size should include
> > >  * the allocated flex array.
> > >  *
> > >  * The allowed ranges are dependent on the HW path the DMA operation takes, and
> > >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> > >  * full range, and each attached device will narrow the ranges based on that
> > >  * device's HW restrictions.
> > 
> > I think you need to be even more explicit about this: which exact
> > operations on the fd can invalidate exactly which items in the
> > information from this call?  Can it only ever be narrowed, or can it
> > be broadened with any operations?
> 
> I think "attach" is the phrase we are using for that operation - it is
> not a specific IOCTL here because it happens on, say, the VFIO device FD.
> 
> Let's add "detaching a device can widen the ranges. Userspace should
> query ranges after every attach/detach to know what IOVAs are valid
> for mapping."
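
For reference, a minimal userspace sketch of that query-after-attach
pattern against the struct proposed in this patch could look like the
below.  It assumes an already-open iommufd file descriptor and an
existing IOAS ID; the <linux/iommufd.h> header name is also an
assumption:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* assumed uAPI header for this RFC */

/* Grow the flex array until the kernel reports that every range fit;
 * per the discussion above, re-query after every attach/detach. */
static struct iommu_ioas_iova_ranges *query_iova_ranges(int iommufd, __u32 ioas_id)
{
	__u32 num = 4;

	for (;;) {
		struct iommu_ioas_iova_ranges *cmd;
		size_t size = sizeof(*cmd) + num * sizeof(cmd->out_valid_iovas[0]);

		cmd = calloc(1, size);		/* zeroes __reserved */
		if (!cmd)
			return NULL;
		cmd->size = size;
		cmd->ioas_id = ioas_id;
		if (ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, cmd)) {
			free(cmd);
			return NULL;
		}
		if (cmd->out_num_iovas <= num)
			return cmd;		/* all ranges were returned */
		num = cmd->out_num_iovas;	/* retry with a larger array */
		free(cmd);
	}
}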
> 
> > > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > > 
> > > > Since it can only copy a single mapping, what's the benefit of this
> > > > over just repeating an IOAS_MAP in the new IOAS?
> > > 
> > > It causes the underlying pin accounting to be shared and can avoid
> > > calling GUP entirely.
> > 
> > If that's the only purpose, then that needs to be right here in the
> > comments too.  So is the expected best practice to IOAS_MAP everything you
> > might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> > mappings you actually end up wanting into the "real" IOASes for use?
> 
> That is one possibility, yes. qemu seems to be using this to establish
> a clone ioas of an existing operational one which is another usage
> model.

Right, for qemu (or other hypervisors) the obvious choice would be to
create a "staging" IOAS where IOVA == GPA, then COPY that into the various
emulated bus IOASes.  For a userspace driver situation, I'm guessing
you'd map your relevant memory pool into an IOAS, then COPY to the
IOAS you need for whatever specific devices you're using.

> I added this additionally:
> 
>  * This may be used to efficiently clone a subset of an IOAS to another, or as a
>  * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
>  * establishing equivalent new mappings, as internal resources are shared, and
>  * the kernel will pin the user memory only once.

I think adding that helps substantially.
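
To make that pattern concrete, a hedged sketch of the staging flow could
look like the below: map guest RAM (or a driver's memory pool) once into
a staging IOAS with IOVA == GPA, then IOAS_COPY the slice a device needs
into the IOAS it is attached to, so pages are pinned and accounted only
once.  The struct field names follow the IOAS_MAP/IOAS_COPY layout as it
was later merged and should be treated as assumptions relative to this
RFC; error handling is omitted:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* assumed uAPI header */

static int stage_then_copy(int iommufd, __u32 staging_ioas, __u32 dev_ioas,
			   void *hva, __u64 gpa, __u64 length)
{
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = staging_ioas,
		.user_va = (uintptr_t)hva,
		.length = length,
		.iova = gpa,			/* staging IOAS: IOVA == GPA */
	};
	struct iommu_ioas_copy copy = {
		.size = sizeof(copy),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.dst_ioas_id = dev_ioas,
		.src_ioas_id = staging_ioas,
		.length = length,
		.dst_iova = gpa,	/* here the device IOAS is also IOVA == GPA */
		.src_iova = gpa,
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		return -1;
	return ioctl(iommufd, IOMMU_IOAS_COPY, &copy) ? -1 : 0;
}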

> > Seems like it would be nicer for the interface to just figure it out
> > for you: I can see there being sufficient complications with that to
> > have this slightly awkward interface, but I think it needs a rationale
> > to accompany it.
> 
> It is more than a complication: the kernel has no way to accurately know
> when a user pointer is an alias of an existing user pointer or is
> something new because the mm has become incoherent.
> 
> It is possible that uncoordinated modules in userspace could
> experience data corruption if the wrong decision is made - mm
> coherence with pinning is pretty weak in Linux.. Since I dislike that
> kind of unreliable magic I made it explicit.

Fair enough.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-28 15:10             ` Jason Gunthorpe via iommu
@ 2022-04-29  6:20               ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-04-29  6:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

[-- Attachment #1: Type: text/plain, Size: 5929 bytes --]

On Thu, Apr 28, 2022 at 12:10:37PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 12:53:16AM +1000, David Gibson wrote:
> 
> > 2) Costly GUPs.  pseries (the most common ppc machine type) always
> > expects a (v)IOMMU.  That means that unlike the common x86 model of a
> > host with IOMMU, but guests with no-vIOMMU, guest initiated
> > maps/unmaps can be a hot path.  Accounting in that path can be
> > prohibitive (and on POWER8 in particular it prevented us from
> > optimizing that path the way we wanted).  We had two solutions for
> > that, in v1 the explicit ENABLE/DISABLE calls, which preaccounted
> > based on the IOVA window sizes.  That was improved in the v2 which
> > used the concept of preregistration.  IIUC iommufd can achieve the
> > same effect as preregistration using IOAS_COPY, so this one isn't
> > really a problem either.
> 
> I think PPC and S390 are solving the same problem here. I think S390
> is going to go to a SW nested model where it has an iommu_domain
> controlled by iommufd that is populated with the pinned pages, eg
> stored in an xarray.
> 
> Then the performance map/unmap path is simply copying pages from the
> xarray to the real IOPTEs - and this would be modeled as a nested
> iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> 
> Perhaps this is agreeable for PPC too?

Uh.. maybe?  Note that I'm making these comments based on working on
this some years ago (the initial VFIO for ppc implementation in
particular).  I'm no longer actively involved in ppc kernel work.
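
As a side note, the SW-nested model described above boils down to
something like the sketch below (not code from this series): the guest
map path looks pinned pages up in an xarray and writes them into the
real IOPTEs.  It assumes the xarray was populated at
pin/pre-registration time, keyed by IOVA page index:

#include <linux/iommu.h>
#include <linux/io.h>
#include <linux/xarray.h>

/* Minimal sketch of a SW vIOPTE "walker": copy pre-pinned pages from
 * the xarray into the HW iommu_domain for a guest map request. */
static int sw_nested_map(struct xarray *pinned, struct iommu_domain *hw_domain,
			 unsigned long iova, unsigned long npages, int prot)
{
	unsigned long i;

	for (i = 0; i != npages; i++) {
		struct page *page = xa_load(pinned, iova / PAGE_SIZE + i);
		int rc;

		if (!page)
			return -ENOENT;	/* not pre-registered/pinned */
		rc = iommu_map(hw_domain, iova + i * PAGE_SIZE,
			       page_to_phys(page), PAGE_SIZE, prot);
		if (rc)
			return rc;
	}
	return 0;
}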

> > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > windows, which aren't contiguous with each other.  The base addresses
> > of each of these are fixed, but the size of each window, the pagesize
> > (i.e. granularity) of each window and the number of levels in the
> > IOMMU pagetable are runtime configurable.  Because it's true in the
> > hardware, it's also true of the vIOMMU interface defined by the IBM
> > hypervisor (and adopted by KVM as well).  So, guests can request
> > changes in how these windows are handled.  Typical Linux guests will
> > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > can't count on that; the guest can use them however it wants.
> 
> As part of nesting iommufd will have a 'create iommu_domain using
> iommu driver specific data' primitive.
> 
> The driver specific data for PPC can include a description of these
> windows so the PPC specific qemu driver can issue this new ioctl
> using the information provided by the guest.

Hmm.. not sure if that works.  At the moment, qemu (for example) needs
to set up the domains/containers/IOASes as it constructs the machine,
because that's based on the virtual hardware topology.  Initially they
use the default windows (0..2GiB first window, second window
disabled).  Only once the guest kernel is up and running does it issue
the hypercalls to set the final windows as it prefers.  In theory the
guest could change them during runtime though it's unlikely in
practice.  They could change during machine lifetime in practice,
though, if you rebooted from one guest kernel to another that uses a
different configuration.

*Maybe* IOAS construction can be deferred somehow, though I'm not sure
because the assigned devices need to live somewhere.

> The main issue is that internally to the iommu subsystem the
> iommu_domain aperture is assumed to be a single window. This kAPI will
> have to be improved to model the PPC multi-window iommu_domain.

Right.
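
For what it's worth, the "driver specific data" for such a PPC
iommu_domain might look roughly like the struct below.  This is a guess
built only from the DDW description earlier in the thread (fixed bases,
runtime-configurable size, page size and pagetable levels per window);
nothing like it exists in this series and every name is invented:

#include <linux/types.h>

/* Hypothetical creation data for a two-window (DDW) pseries domain. */
struct iommu_hwpt_pseries_data {
	struct {
		__aligned_u64 start_iova;	/* fixed base, e.g. 0 and 1ULL << 60 */
		__aligned_u64 size;		/* runtime configurable */
		__u32 page_shift;		/* IOMMU page size for this window */
		__u32 levels;			/* pagetable levels */
	} window[2];
};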

> If this API is not used then the PPC driver should choose some
> sensible default windows that make things like DPDK happy.
> 
> > Then, there's handling existing qemu (or other software) that is using
> > the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> > should be a goal or not: as others have noted, working actively to
> > port qemu to the new interface at the same time as making a
> > comprehensive in-kernel compat layer is arguably redundant work.
> 
> At the moment I think I would stick with not including the SPAPR
> interfaces in vfio_compat, but there does seem to be a path if someone
> with HW wants to build and test them?
> 
> > You might be able to do this by simply failing this outright if
> > there's anything other than exactly one IOMMU group bound to the
> > container / IOAS (which I think might be what VFIO itself does now).
> > Handling that with a device centric API gets somewhat fiddlier, of
> > course.
> 
> Maybe every device gets a copy of the error notification?

Alas, it's harder than that.  One of the things that can happen on an
EEH fault is that the entire PE gets suspended (blocking both DMA and
MMIO, IIRC) until the proper recovery steps are taken.  Since that's
handled at the hardware/firmware level, it will obviously only affect
the host side PE (== host iommu group).  However the interfaces we
have only allow things to be reported to the guest at the granularity
of a guest side PE (== container/IOAS == guest host bridge in
practice).  So to handle this correctly when guest PE != host PE we'd
need to synchronize suspended / recovery state between all the host
PEs in the guest PE.  That *might* be technically possible, but it's
really damn fiddly.

> ie maybe this should be part of vfio_pci and not part of iommufd to
> mirror how AER works?
> 
> It feels strange to put device error notification into iommufd; is
> that connected to the IOMMU?

Only in that it operates at the granularity of a PE, which is mostly an
IOMMU concept.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-29  1:21               ` Tian, Kevin
@ 2022-04-29  6:22                 ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-04-29  6:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

[-- Attachment #1: Type: text/plain, Size: 2053 bytes --]

On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 28, 2022 11:11 PM
> > 
> > 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adopted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> > 
> > The main issue is that internally to the iommu subsystem the
> > iommu_domain aperture is assumed to be a single window. This kAPI will
> > have to be improved to model the PPC multi-window iommu_domain.
> > 
> 
> From the point of nesting probably each window can be a separate
> domain then the existing aperture should still work?

Maybe.  There might be several different ways to represent it, but the
vital piece is that any individual device (well, group, technically)
must atomically join/leave both windows at once.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-29  6:20               ` David Gibson
@ 2022-04-29 12:48                 ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:48 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote:

> > I think PPC and S390 are solving the same problem here. I think S390
> > is going to go to a SW nested model where it has an iommu_domain
> > controlled by iommufd that is populated with the pinned pages, eg
> > stored in an xarray.
> > 
> > Then the performance map/unmap path is simply copying pages from the
> > xarray to the real IOPTEs - and this would be modeled as a nested
> > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> > 
> > Perhaps this is agreeable for PPC too?
> 
> Uh.. maybe?  Note that I'm making these comments based on working on
> this some years ago (the initial VFIO for ppc implementation in
> particular).  I'm no longer actively involved in ppc kernel work.

OK
 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adopted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> 
> Hmm.. not sure if that works.  At the moment, qemu (for example) needs
> to set up the domains/containers/IOASes as it constructs the machine,
> because that's based on the virtual hardware topology.  Initially they
> use the default windows (0..2GiB first window, second window
> disabled).  Only once the guest kernel is up and running does it issue
> the hypercalls to set the final windows as it prefers.  In theory the
> guest could change them during runtime though it's unlikely in
> practice.  They could change during machine lifetime in practice,
> though, if you rebooted from one guest kernel to another that uses a
> different configuration.
> 
> *Maybe* IOAS construction can be deferred somehow, though I'm not sure
> because the assigned devices need to live somewhere.

This is a general requirement for all the nesting implementations: we
start out with some default nested page table, and then later the VM
does the vIOMMU call to change it. So nesting will have to come along
with some kind of 'switch domains IOCTL'

In this case I would guess PPC could do the same and start out with a
small (nested) iommu_domain and then create the VM's desired
iommu_domain from the hypercall, and switch to it.

It is a bit more CPU work since maps in the lower range would have to
be copied over, but conceptually the model matches the HW nesting.

> > > You might be able to do this by simply failing this outright if
> > > there's anything other than exactly one IOMMU group bound to the
> > > container / IOAS (which I think might be what VFIO itself does now).
> > > Handling that with a device centric API gets somewhat fiddlier, of
> > > course.
> > 
> > Maybe every device gets a copy of the error notification?
> 
> Alas, it's harder than that.  One of the things that can happen on an
> EEH fault is that the entire PE gets suspended (blocking both DMA and
> MMIO, IIRC) until the proper recovery steps are taken.  

I think qemu would have to de-duplicate the duplicated device
notifications and then it can go from a device notification to the
device's iommu_group to the IOAS to the vPE?

A simple serial number in the event would make this pretty simple.

The way back to clear the event would just forward the commands
through a random device in the iommu_group to the PE?
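
(As a purely illustrative aside, the serial-number de-duplication could
be as small as the sketch below on the qemu side; the event layout is
invented, since no such uAPI exists in this series.)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device copy of an EEH event; every device in the
 * affected iommu_group receives one carrying the same serial. */
struct eeh_dev_event {
	uint64_t serial;
	uint32_t dev_id;
};

/* Only the first copy per guest PE should trigger the PE suspension /
 * recovery handling; later copies with the same serial are dropped. */
static bool eeh_event_is_new(uint64_t *pe_last_serial,
			     const struct eeh_dev_event *ev)
{
	if (ev->serial == *pe_last_serial)
		return false;		/* duplicate from another device in the group */
	*pe_last_serial = ev->serial;
	return true;
}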

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-29  6:22                 ` David Gibson
@ 2022-04-29 12:50                   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:50 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, Alex Williamson, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote:
> On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 28, 2022 11:11 PM
> > > 
> > > 
> > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > > 2 IOVA
> > > > windows, which aren't contiguous with each other.  The base addresses
> > > > of each of these are fixed, but the size of each window, the pagesize
> > > > (i.e. granularity) of each window and the number of levels in the
> > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > hypervisor (and adopted by KVM as well).  So, guests can request
> > > > changes in how these windows are handled.  Typical Linux guests will
> > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > can't count on that; the guest can use them however it wants.
> > > 
> > > As part of nesting iommufd will have a 'create iommu_domain using
> > > iommu driver specific data' primitive.
> > > 
> > > The driver specific data for PPC can include a description of these
> > > windows so the PPC specific qemu driver can issue this new ioctl
> > > using the information provided by the guest.
> > > 
> > > The main issue is that internally to the iommu subsystem the
> > > iommu_domain aperture is assumed to be a single window. This kAPI will
> > > have to be improved to model the PPC multi-window iommu_domain.
> > > 
> > 
> > From the point of nesting probably each window can be a separate
> > domain then the existing aperture should still work?
> 
> Maybe.  There might be several different ways to represent it, but the
> vital piece is that any individual device (well, group, technically)
> must atomically join/leave both windows at once.

I'm not keen on the multi-iommu_domains because it means we have to
create the idea that a device can be attached to multiple
iommu_domains, which we don't have at all today.

Since a single iommu_domain allows PPC to implement its special rules,
like the atomicity requirement above.

Jason


^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-04-29  6:00             ` David Gibson
@ 2022-04-29 12:54               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:54 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > But I don't have a use case in mind? The simplified things I know
> > about want to attach their devices then allocate valid IOVA, they
> > don't really have a notion about what IOVA regions they are willing to
> > accept, or necessarily do hotplug.
> 
> The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> emulation code knows the IOVA windows that are expected of the vIOMMU
> (because that's a property of the emulated platform), and requests
> them of the host IOMMU.  If the host can supply that, you're good
> (this doesn't necessarily mean the host windows match exactly, just
> that the requested windows fit within the host windows).  If not,
> you report an error.  This can be done at any point when the host
> windows might change - so try to attach a device that can't support
> the requested windows, and it will fail.  Attaching a device which
> shrinks the windows, but still fits the requested windows within, and
> you're still good to go.

We were just talking about this in another area - Alex said that qemu
doesn't know the IOVA ranges? Are there some vIOMMU cases where it does?

Even if yes, qemu is able to manage this on its own - it doesn't use
the kernel IOVA allocator, so there is not a strong reason to tell the
kernel what the narrowed ranges are.

> > That is one possibility, yes. qemu seems to be using this to establish
> > a clone ioas of an existing operational one which is another usage
> > model.
> 
> Right, for qemu (or other hypervisors) the obvious choice would be to
> create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> emulated bus IOASes.  For a userspace driver situation, I'm guessing
> you'd map your relevant memory pool into an IOAS, then COPY to the
> IOAS you need for whatever specific devices you're using.

qemu seems simpler: it already juggles multiple containers, so it
literally just copies when it instantiates a new container and does the
map into multiple containers.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable
  2022-04-29 12:54               ` Jason Gunthorpe via iommu
@ 2022-04-30 14:44                 ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-04-30 14:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

[-- Attachment #1: Type: text/plain, Size: 4198 bytes --]

On Fri, Apr 29, 2022 at 09:54:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:00:14PM +1000, David Gibson wrote:
> > > But I don't have a use case in mind? The simplified things I know
> > > about want to attach their devices then allocate valid IOVA, they
> > > don't really have a notion about what IOVA regions they are willing to
> > > accept, or necessarily do hotplug.
> > 
> > The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
> > emulation code knows the IOVA windows that are expected of the vIOMMU
> > (because that's a property of the emulated platform), and requests
> > them of the host IOMMU.  If the host can supply that, you're good
> > (this doesn't necessarily mean the host windows match exactly, just
> > that the requested windows fit within the host windows).  If not,
> > you report an error.  This can be done at any point when the host
> > windows might change - so try to attach a device that can't support
> > the requested windows, and it will fail.  Attaching a device which
> > shrinks the windows, but still fits the requested windows within, and
> > you're still good to go.
> 
> We were just talking about this in another area - Alex said that qemu
> doesn't know the IOVA ranges? Are there some vIOMMU cases where it does?

Uh.. what?  We certainly know (or, rather, choose) the IOVA ranges for
ppc.  That is to say we set up the default IOVA ranges at machine
construction (those defaults have changed with machine version a
couple of times).  If the guest uses dynamic DMA windows we then
update those ranges based on the hypercalls, but at any point we know
what the IOVA windows are supposed to be.  I don't really see how x86
or anything else could not know the IOVA ranges.  Who else *could* set
the ranges when implementing a vIOMMU in TCG mode?

For the non-vIOMMU case, IOVA==GPA, so everything qemu knows about
the GPA space it also knows about the IOVA space.  Which, come to
think of it, means memory hotplug also complicates things.

> Even if yes, qemu is able to manage this on its own - it doesn't use
> the kernel IOVA allocator, so there is not a strong reason to tell the
> kernel what the narrowed ranges are.

I don't follow.  The problem for the qemu case here is if you hotplug
a device which narrows down the range to something smaller than the
guest expects.  If qemu has told the kernel the ranges it needs, that
can just fail (which is the best you can do).  If the kernel adds the
device but narrows the ranges, then you may have just put the guest
into a situation where the vIOMMU cannot do what the guest expects it
to.  If qemu can only query the windows, not specify them, then it
won't know that adding a particular device will conflict with its
guest side requirements until after it's already added.  That could
mess up concurrent guest initiated map operations for existing devices
in the same guest side domain, so I don't think reversing the hotplug
after the problem is detected is enough.
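
To make that concrete, the sort of interface I have in mind would let
userspace declare the IOVA ranges it requires up front, so a later
attach that can't honour them fails cleanly.  This is purely an
illustrative sketch - the structs and names below are invented, not an
existing ioctl:

#include <stdint.h>

/* Invented uAPI sketch: declare the IOVA ranges the guest requires. */
struct ioas_required_range {
	uint64_t start;
	uint64_t last;			/* inclusive end of the range */
};

struct ioas_require_ranges {
	uint32_t ioas_id;
	uint32_t num_ranges;
	struct ioas_required_range ranges[];	/* num_ranges entries */
};

/*
 * qemu would fill this in from the vIOMMU/platform properties at
 * machine construction time; afterwards, hotplugging a device whose
 * hardware cannot cover these ranges returns an error instead of
 * silently narrowing the usable IOVA space under the guest.
 */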

> > > That is one possibility, yes. qemu seems to be using this to establish
> > > a clone ioas of an existing operational one, which is another usage
> > > model.
> > 
> > Right, for qemu (or other hypervisors) the obvious choice would be to
> > create a "staging" IOAS where IOVA == GPA, then COPY that into the various
> > emulated bus IOASes.  For a userspace driver situation, I'm guessing
> > you'd map your relevant memory pool into an IOAS, then COPY to the
> > IOAS you need for whatever specific devices you're using.
> 
> qemu seems simpler: it already juggles multiple containers, so it
> literally just copies when it instantiates a new container, where it
> would otherwise do a map in the multi-container case.

I don't follow you.  Are you talking about the vIOMMU or non-vIOMMU
case?  In the vIOMMU case the different containers can be for
different guest side iommu domains with different guest-IOVA spaces,
so you can't just copy from one to another.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-29 12:50                   ` Jason Gunthorpe via iommu
@ 2022-05-02  4:10                     ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-02  4:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin, kvm, Michael S. Tsirkin,
	Jason Wang, Cornelia Huck, Niklas Schnelle, Chaitanya Kulkarni,
	iommu, Daniel Jordan, Alex Williamson, Martins, Joao


On Fri, Apr 29, 2022 at 09:50:30AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:22:56PM +1000, David Gibson wrote:
> > On Fri, Apr 29, 2022 at 01:21:30AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, April 28, 2022 11:11 PM
> > > > 
> > > > 
> > > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for
> > > > 2 IOVA
> > > > > windows, which aren't contiguous with each other.  The base addresses
> > > > > of each of these are fixed, but the size of each window, the pagesize
> > > > > (i.e. granularity) of each window and the number of levels in the
> > > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > > hypervisor (and adopted by KVM as well).  So, guests can request
> > > > > changes in how these windows are handled.  Typical Linux guests will
> > > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > > can't count on that; the guest can use them however it wants.
> > > > 
> > > > As part of nesting iommufd will have a 'create iommu_domain using
> > > > iommu driver specific data' primitive.
> > > > 
> > > > The driver specific data for PPC can include a description of these
> > > > windows so the PPC specific qemu driver can issue this new ioctl
> > > > using the information provided by the guest.
> > > > 
> > > > The main issue is that internally to the iommu subsystem the
> > > > iommu_domain aperture is assumed to be a single window. This kAPI will
> > > > have to be improved to model the PPC multi-window iommu_domain.
> > > > 
> > > 
> > > From the point of nesting probably each window can be a separate
> > > domain then the existing aperture should still work?
> > 
> > Maybe.  There might be several different ways to represent it, but the
> > vital piece is that any individual device (well, group, technically)
> > must atomically join/leave both windows at once.
> 
> I'm not keen on the multi-iommu_domains because it means we have to
> create the idea that a device can be attached to multiple
> iommu_domains, which we don't have at all today.
> 
> The iommu_domain allows PPC to implement its special rules, like the
> atomicity above.

I tend to agree; I think extending the iommu domain concept to
incorporate multiple windows makes more sense than extending to allow
multiple domains per device.  I'm just saying there might be other
ways of representing this, and that's not a sticking point for me as
long as the right properties can be preserved.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-29 12:48                 ` Jason Gunthorpe via iommu
@ 2022-05-02  7:30                   ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-02  7:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins


On Fri, Apr 29, 2022 at 09:48:38AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote:
> 
> > > I think PPC and S390 are solving the same problem here. I think S390
> > > is going to go to a SW nested model where it has an iommu_domain
> > > controlled by iommufd that is populated with the pinned pages, eg
> > > stored in an xarray.
> > > 
> > > Then the performance map/unmap path is simply copying pages from the
> > > xarray to the real IOPTEs - and this would be modeled as a nested
> > > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> > > 
> > > Perhaps this is agreeable for PPC too?
> > 
> > Uh.. maybe?  Note that I'm making these comments based on working on
> > this some years ago (the initial VFIO for ppc implementation in
> > particular).  I'm no longer actively involved in ppc kernel work.
> 
> OK
>  
> > > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > > windows, which aren't contiguous with each other.  The base addresses
> > > > of each of these are fixed, but the size of each window, the pagesize
> > > > (i.e. granularity) of each window and the number of levels in the
> > > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > > hypervisor (and adopted by KVM as well).  So, guests can request
> > > > changes in how these windows are handled.  Typical Linux guests will
> > > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > > can't count on that; the guest can use them however it wants.
> > > 
> > > As part of nesting iommufd will have a 'create iommu_domain using
> > > iommu driver specific data' primitive.
> > > 
> > > The driver specific data for PPC can include a description of these
> > > windows so the PPC specific qemu driver can issue this new ioctl
> > > using the information provided by the guest.
> > 
> > Hmm.. not sure if that works.  At the moment, qemu (for example) needs
> > to set up the domains/containers/IOASes as it constructs the machine,
> > because that's based on the virtual hardware topology.  Initially they
> > use the default windows (0..2GiB first window, second window
> > disabled).  Only once the guest kernel is up and running does it issue
> > the hypercalls to set the final windows as it prefers.  In theory the
> > guest could change them during runtime though it's unlikely in
> > practice.  They could change during machine lifetime in practice,
> > though, if you rebooted from one guest kernel to another that uses a
> > different configuration.
> > 
> > *Maybe* IOAS construction can be deferred somehow, though I'm not sure
> > because the assigned devices need to live somewhere.
> 
> This is a general requirement for all the nesting implementations, we
> start out with some default nested page table and then later the VM
> does the vIOMMU call to change it. So nesting will have to come along
> with some kind of 'switch domains IOCTL'
> 
> In this case I would guess PPC could do the same and start out with a
> small (nested) iommu_domain and then create the VM's desired
> iommu_domain from the hypercall, and switch to it.
> 
> It is a bit more CPU work since maps in the lower range would have to
> be copied over, but conceptually the model matches the HW nesting.

Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
fixed windows, but we fake dynamic windows in the userspace
implementation by flipping the devices over to a new IOAS with the new
windows.  Is that right?

Where exactly would the windows be specified?  My understanding was
that when creating a back-end specific IOAS, that would typically be
for the case where you're using a user / guest managed IO pagetable,
with the backend specifying the format for that.  In the ppc case we'd
need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
operations to manage the mappings.  The PAPR vIOMMU is
paravirtualized, so all updates come via hypercalls, so there's no
user/guest managed data structure.

That should work from the point of view of the userspace and guest
side interfaces.  It might be fiddly from the point of view of the
back end.  The ppc iommu doesn't really have the notion of
configurable domains - instead the address spaces are the hardware or
firmware fixed PEs, so they have a fixed set of devices.  At the bare
metal level it's possible to sort of do domains by making the actual
pagetable pointers for several PEs point to a common place.

However, in the future, nested KVM under PowerVM is likely to be the
norm.  In that situation the L1 as well as the L2 only has the
paravirtualized interfaces, which don't have any notion of domains,
only PEs.  All updates take place via hypercalls which explicitly
specify a PE (strictly speaking they take a "Logical IO Bus Number"
(LIOBN), but those generally map one to one with PEs), so it can't use
shared pointer tricks either.

Now obviously we can choose devices to share a logical iommu domain by
just duplicating mappings across the PEs, but what we can't do is
construct mappings in a domain and *then* atomically move devices into
it.


So, here's an alternative set of interfaces that should work for ppc;
maybe you can tell me whether they also work for x86 and others:

  * Each domain/IOAS has a concept of one or more IOVA windows, which
    each have a base address, size, pagesize (granularity) and optionally
    other flags/attributes.
      * This has some bearing on hardware capabilities, but is
        primarily a software notion
  * MAP/UNMAP operations are only permitted within an existing IOVA
    window (and with addresses aligned to the window's pagesize)
      * This is enforced by software whether or not it is required by
        the underlying hardware
  * Likewise IOAS_COPY operations are only permitted if the source and
    destination windows have compatible attributes
  * A newly created kernel-managed IOAS has *no* IOVA windows
  * A CREATE_WINDOW operation is added
      * This takes a size, pagesize/granularity, optional base address
        and optional additional attributes 
      * If any of the specified attributes are incompatible with the
        underlying hardware, the operation fails
    * "Compatibility" doesn't require an exact match: creating a small
       window within the hardware's much larger valid address range is
       fine, as is creating a window with a pagesize that's a multiple
       of the hardware pagesize
    * Unspecified attributes (including base address) are allocated by
      the kernel, based on the hardware capabilities/configuration
    * The properties of the window (including allocated ones) are
      recorded with the IOAS/domain
    * For IOMMUs like ppc which don't have fixed ranges, this would
      create windows on the backend to match, if possible, otherwise
      fail
  * A REMOVE_WINDOW operation reverses a single CREATE_WINDOW
    * This would also delete any MAPs within that window
  * ATTACH operations recalculate the hardware capabilities (as now),
    then verify them against the already created windows, and fail if
    the existing windows can't be preserved

MAP/UNMAP, CREATE/REMOVE_WINDOW and ATTACH/DETACH operations can
happen in any order, though whether they succeed may depend on the
order in some cases.
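
As a purely illustrative sketch of what the uAPI for the window
operations could look like (invented struct and field names, nothing
that exists today):

#include <stdint.h>

/* Invented sketch of the CREATE_WINDOW / REMOVE_WINDOW payloads. */
struct ioas_create_window {
	uint32_t ioas_id;
	uint32_t flags;		/* e.g. "kernel chooses the base address" */
	uint64_t base;		/* requested base IOVA, if specified      */
	uint64_t size;
	uint64_t pagesize;	/* granularity; 0 == kernel chooses       */
	/* outputs: the attributes actually allocated/recorded */
	uint64_t out_base;
	uint64_t out_pagesize;
};

struct ioas_remove_window {
	uint32_t ioas_id;
	uint32_t __reserved;
	uint64_t base;		/* identifies the window; its MAPs go too */
};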

So, for a userspace driver setup (e.g. DPDK) the typical sequence
would be:

  1. Create IOAS
  2. ATTACH all the devices we want to use
  3. CREATE_WINDOW with the size and pagesize we need, other
     attributes unspecified
       - Because we already attached the devices, this will allocate a
         suitable base address and attributes for us
  4. IOAS_MAP the buffers we want into the window that was allocated
     for us
  5. Do work
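
In code, that sequence might look something like the following (again
with hypothetical wrappers over the operations sketched above; error
handling trimmed):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers wrapping the operations sketched above. */
int ioas_alloc(int iommufd, uint32_t *ioas_id);
int ioas_attach(int iommufd, uint32_t ioas_id, int device_fd);
int ioas_create_window(int iommufd, uint32_t ioas_id, uint64_t size,
		       uint64_t pagesize, uint64_t *out_base);
int ioas_map(int iommufd, uint32_t ioas_id, void *va, uint64_t iova,
	     size_t len);

static int dpdk_style_setup(int iommufd, int device_fd,
			    void *pool_va, size_t pool_size)
{
	uint32_t ioas;
	uint64_t base;

	ioas_alloc(iommufd, &ioas);		/* 1. create IOAS          */
	ioas_attach(iommufd, ioas, device_fd);	/* 2. attach first, so...  */
	/* 3. ...the kernel can pick a base/pagesize that every attached
	 *    device can live with */
	ioas_create_window(iommufd, ioas, pool_size, 0, &base);
	/* 4. map the buffer pool into the window allocated for us */
	return ioas_map(iommufd, ioas, pool_va, base, pool_size);
}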

For qemu or another hypervisor with a ppc (PAPR) guest, the sequence would be:

  1. Create IOAS for each guest side bus
  2. Determine the expected IOVA ranges based on the guest
     configuration (model of vIOMMU, what virtual devices are present)
  3. CREATE_WINDOW ranges for the default guest side windows.  This
     will have base address specified
       - If this fails, we're out of luck, report error and quit
  4. ATTACH devices that are on the guest bus associated with each
     IOAS
       - Again, if this fails, report error and quit
  5. Start machine, boot guest
  6. During runtime:
       - If the guest attempts to create new windows, do new
         CREATE_WINDOW operations.  If they fail, fail the initiating
	 hypercall
       - If the guest maps things into the IO pagetable, IOAS_MAP
         them.  If that fails, fail the hypercall
       - If the user hotplugs a device, try to ATTACH it.  If that
         fails, fail the hotplug operation.
       - If the machine is reset, REMOVE_WINDOW whatever's there and
         CREATE_WINDOW the defaults again

For qemu with an x86 guest, the sequence would be:

  1. Create default IOAS
  2. CREATE_WINDOW for all guest memory blocks
     - If it fails, report and quit
  3. IOAS_MAP guest RAM into default IOAS so that GPA==IOVA
  4. ATTACH all virtual devices to the default IOAS
     - If it fails, report and quit
  5. Start machine, boot guest
  6. During runtime
     - If the guest assigns devices to domains on the vIOMMU, create
       new IOAS for each guest domain, and ATTACH the device to the
       new IOAS
     - If the guest triggers a vIOMMU cache flush, (re)mirror the guest IO
       pagetables into the host using IOAS_MAP or COPY
     - If the user hotplugs a device, ATTACH it to the default
       domain.  If that fails, fail the hotplug
     - If the user hotplugs memory, CREATE_WINDOW in the default IOAS
       for the new memory region.  If that fails, fail the hotplug
     - On system reset, re-ATTACH all devices to default IOAS, remove
       all non-default IOAS


> > > > You might be able to do this by simply failing this outright if
> > > > there's anything other than exactly one IOMMU group bound to the
> > > > container / IOAS (which I think might be what VFIO itself does now).
> > > > Handling that with a device centric API gets somewhat fiddlier, of
> > > > course.
> > > 
> > > Maybe every device gets a copy of the error notification?
> > 
> > Alas, it's harder than that.  One of the things that can happen on an
> > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > MMIO, IIRC) until the proper recovery steps are taken.  
> 
> I think qemu would have to de-duplicate the duplicated device
> notifications and then it can go from a device notification to the
> device's iommu_group to the IOAS to the vPE?

It's not about the notifications.  When the PE goes into
suspended/error state, that changes guest visible behaviour of
devices.  It essentially blocks all IO on the bus (the design
philosophy is to prioritize not propagating bad data above all else).
That state is triggered on (certain sorts of) IO errors and will
remain until explicit recovery steps are taken.

The guest might see that IO blockage before it gets/processes the
notifications (because of concurrency).  On the host side only the host
PE will be affected, but on the guest side we can only report and
recover things at the guest PE level, which might be larger than a
host PE.

So the problem is if we have a guest PE with multiple host PEs, and we
trip an error which puts one of the host PEs into error state, we have
what's essentially an inconsistent, invalid state from the point of
view of the guest interfaces.  The guest has no way to sensibly query
or recover from this.

To make this work, essentially we'd have to trap the error on the
host, then inject errors to put all of the host PEs in the guest PE
into the same error state.  When the guest takes recovery actions we'd
need to duplicate those actions on all the host PEs.  If one of those
steps fails on one of the host PEs we have to work out how to report
that and bring the PEs back into a synchronized state.

Like I said, it might be possible, but it's really, really hairy.  All
for a feature that I'm not convinced anyone has ever used in earnest.

> A simple serial number in the event would make this pretty simple.
> 
> The way back to clear the event would just forward the commands
> through a random device in the iommu_group to the PE?
> 
> Thanks,
> Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-02  7:30                   ` David Gibson
@ 2022-05-05 19:07                     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-05 19:07 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:

> > It is a bit more CPU work since maps in the lower range would have to
> > be copied over, but conceptually the model matches the HW nesting.
> 
> Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> fixed windows, but we fake dynamic windows in the userspace
> implementation by flipping the devices over to a new IOAS with the new
> windows.  Is that right?

Yes

> Where exactly would the windows be specified?  My understanding was
> that when creating a back-end specific IOAS, that would typically be
> for the case where you're using a user / guest managed IO pagetable,
> with the backend specifying the format for that.  In the ppc case we'd
> need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> operations to manage the mappings.  The PAPR vIOMMU is
> paravirtualized, so all updates come via hypercalls, so there's no
> user/guest managed data structure.

When the iommu_domain is created I want to have an
iommu-driver-specific struct, so PPC can customize its iommu_domain
however it likes.
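
For instance (a hypothetical layout, only to illustrate the idea of
driver-specific creation data; nothing like this exists yet):

#include <linux/types.h>

/* Hypothetical PPC-specific payload for iommu_domain creation. */
struct iommu_domain_ppc_data {
	__u64 window_base[2];	/* base IOVA of the low and high windows */
	__u64 window_size[2];	/* size of each window                   */
	__u32 page_shift[2];	/* IOMMU page size of each window        */
	__u32 levels;		/* number of TCE table levels            */
};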

> That should work from the point of view of the userspace and guest
> side interfaces.  It might be fiddly from the point of view of the
> back end.  The ppc iommu doesn't really have the notion of
> configurable domains - instead the address spaces are the hardware or
> firmware fixed PEs, so they have a fixed set of devices.  At the bare
> metal level it's possible to sort of do domains by making the actual
> pagetable pointers for several PEs point to a common place.

I'm not sure I understand this - a domain is just a storage container
for an IO page table; if the HW has IOPTEs then it should be able to
have a domain?

Making page table pointers point to a common IOPTE tree is exactly
what iommu_domains are for - why is that "sort of" for ppc?

> However, in the future, nested KVM under PowerVM is likely to be the
> norm.  In that situation the L1 as well as the L2 only has the
> paravirtualized interfaces, which don't have any notion of domains,
> only PEs.  All updates take place via hypercalls which explicitly
> specify a PE (strictly speaking they take a "Logical IO Bus Number"
> (LIOBN), but those generally map one to one with PEs), so it can't use
> shared pointer tricks either.

How do the paravirtualized interfaces deal with the page table? Do
they call a map/unmap hypercall instead of providing guest IOPTEs?

Assuming yes, I'd expect that:

The iommu_domain for nested PPC is just a log of map/unmap hypervisor
calls to make. Whenever a new PE is attached to that domain it gets
the logged maps replayed to set it up, and when a PE is detached the
log is used to unmap everything.

It is not perfectly memory efficient - and we could perhaps talk about
an API modification to allow re-use of the iommufd data structure
somehow, but I think this is a good logical starting point.

The PE would have to be modeled as an iommu_group.
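
Something like this, purely as a sketch of the idea (not real code,
and the hypercall plumbing is hand-waved):

#include <linux/iommu.h>
#include <linux/list.h>

/* Sketch: a nested-PPC domain that just records map/unmap requests. */
struct ppc_nested_mapping {
	struct list_head node;
	unsigned long iova;
	unsigned long paddr;
	unsigned long size;
};

struct ppc_nested_domain {
	struct iommu_domain domain;
	struct list_head mappings;	/* the "log" of maps to replay */
};

/*
 * On attach of a PE (iommu_group): walk ->mappings and issue the
 * H_PUT_TCE-style hypercalls against that PE's LIOBN; on detach, walk
 * the log again and clear the same IOVA ranges.
 */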

> So, here's an alternative set of interfaces that should work for ppc,
> maybe you can tell me whether they also work for x86 and others:

Fundamentally PPC has to fit into the iommu standard framework of
group and domains, we can talk about modifications, but drifting too
far away is a big problem.

>   * Each domain/IOAS has a concept of one or more IOVA windows, which
>     each have a base address, size, pagesize (granularity) and optionally
>     other flags/attributes.
>       * This has some bearing on hardware capabilities, but is
>         primarily a software notion

The iommu_domain has the aperture; PPC will require extending this to a
list of apertures since it is currently only one window.

Once a domain is created and attached to a group the aperture should
be immutable.
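
i.e. something along these lines - the first struct mirrors roughly
what the kernel has today, the multi-window part is only a sketch:

#include <linux/types.h>

/* Roughly what include/linux/iommu.h has today: a single aperture. */
struct iommu_domain_geometry {
	dma_addr_t aperture_start;	/* first usable IOVA        */
	dma_addr_t aperture_end;	/* last usable IOVA         */
	bool force_aperture;		/* DMA only allowed inside  */
};

/* Sketch of a multi-window extension for PPC-style domains; the list
 * would become immutable once the domain is attached to a group. */
struct iommu_domain_window {
	dma_addr_t start;
	dma_addr_t end;
};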

>   * MAP/UNMAP operations are only permitted within an existing IOVA
>     window (and with addresses aligned to the window's pagesize)
>       * This is enforced by software whether or not it is required by
>         the underlying hardware
>   * Likewise IOAS_COPY operations are only permitted if the source and
>     destination windows have compatible attributes

Already done, domain's aperture restricts all the iommufd operations

>   * A newly created kernel-managed IOAS has *no* IOVA windows

Already done, the iommufd IOAS has no iommu_domains inside it at
creation time.

>   * A CREATE_WINDOW operation is added
>       * This takes a size, pagesize/granularity, optional base address
>         and optional additional attributes 
>       * If any of the specified attributes are incompatible with the
>         underlying hardware, the operation fails

The iommu layer has nothing called a window. The closest thing is a
domain.

I really don't want to try to make a new iommu layer object that is so
unique and special to PPC - we have to figure out how to fit PPC into
the iommu_domain model with reasonable extensions.

> > > > Maybe every device gets a copy of the error notification?
> > > 
> > > Alas, it's harder than that.  One of the things that can happen on an
> > > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > > MMIO, IIRC) until the proper recovery steps are taken.  
> > 
> > I think qemu would have to de-duplicate the duplicated device
> > notifications and then it can go from a device notification to the
> > device's iommu_group to the IOAS to the vPE?
> 
> It's not about the notifications. 

The only thing the kernel can do is relay a notification that something
happened to a PE. The kernel gets an event on a PE basis, and I would
like it to replicate it to all the devices and push it through the
VFIO device FD.

qemu will de-duplicate the replicated events and recover exactly the
same event the kernel saw, delivered at exactly the same time.

If instead you want to have one event per-PE then all that changes in
the kernel is one event is generated instead of N, and qemu doesn't
have to throw away the duplicates.

With either method qemu still gets the same PE centric event,
delivered at the same time, with the same races.

This is not a general mechanism; it is a PPC-specific thing to
communicate a PPC-specific, PE-centric event to userspace. I just
prefer it in VFIO instead of iommufd.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-05 19:07                     ` Jason Gunthorpe via iommu
@ 2022-05-06  5:25                       ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-06  5:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:
> 
> > > It is a bit more CPU work since maps in the lower range would have to
> > > be copied over, but conceptually the model matches the HW nesting.
> > 
> > Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> > fixed windows, but we fake dynamic windows in the userspace
> > implementation by flipping the devices over to a new IOAS with the new
> > windows.  Is that right?
> 
> Yes
> 
> > Where exactly would the windows be specified?  My understanding was
> > that when creating a back-end specific IOAS, that would typically be
> > for the case where you're using a user / guest managed IO pagetable,
> > with the backend specifying the format for that.  In the ppc case we'd
> > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> > operations to manage the mappings.  The PAPR vIOMMU is
> > paravirtualized, so all updates come via hypercalls, so there's no
> > user/guest managed data structure.
> 
> When the iommu_domain is created I want to have a
> iommu-driver-specific struct, so PPC can customize its iommu_domain
> however it likes.

This requires that the client be aware of the host side IOMMU model.
That's true in VFIO now, and it's nasty; I was really hoping we could
*stop* doing that.

Note that I'm talking here *purely* about the non-optimized case where
all updates to the host side IO pagetables are handled by IOAS_MAP /
IOAS_COPY, with no direct hardware access to user or guest managed IO
pagetables.  The optimized case obviously requires end-to-end
agreement on the pagetable format amongst other domain properties.

What I'm hoping is that qemu (or whatever) can use this non-optimized
path as a fallback case where it doesn't need to know the properties of
whatever host side IOMMU models there are.  It just requests what it
needs based on the vIOMMU properties it needs to replicate and the
host kernel either can supply it or can't.

In many cases it should be perfectly possible to emulate a PPC style
vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
aperture that it will encompass wherever the ppc apertures end
up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
host (currently somewhere between awkward and impossible) by placing
the host apertures to cover guest memory.

Admittedly those are pretty niche cases, but allowing for them gives
us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
the future, and AFAICT, ARM are much less conservative than x86 about
maintaining similar hw interfaces over time.  That's why I think
considering these ppc cases will give a more robust interface for
other future possibilities as well.

> > That should work from the point of view of the userspace and guest
> > side interfaces.  It might be fiddly from the point of view of the
> > back end.  The ppc iommu doesn't really have the notion of
> > configurable domains - instead the address spaces are the hardware or
> > firmware fixed PEs, so they have a fixed set of devices.  At the bare
> > metal level it's possible to sort of do domains by making the actual
> > pagetable pointers for several PEs point to a common place.
> 
> I'm not sure I understand this - a domain is just a storage container
> for an IO page table, if the HW has IOPTEs then it should be able to
> have a domain?
> 
> Making page table pointers point to a common IOPTE tree is exactly
> what iommu_domains are for - why is that "sort of" for ppc?

Ok, fair enough, it's only "sort of" in the sense that the hw specs /
docs don't present any equivalent concept.

> > However, in the future, nested KVM under PowerVM is likely to be the
> > norm.  In that situation the L1 as well as the L2 only has the
> > paravirtualized interfaces, which don't have any notion of domains,
> > only PEs.  All updates take place via hypercalls which explicitly
> > specify a PE (strictly speaking they take a "Logical IO Bus Number"
> > (LIOBN), but those generally map one to one with PEs), so it can't use
> > shared pointer tricks either.
> 
> How does the paravirtualized interfaces deal with the page table? Does
> it call a map/unmap hypercall instead of providing guest IOPTEs?

Sort of.  The main interface is H_PUT_TCE ("TCE" - Translation Control
Entry - being IBMese for an IOPTE). This takes an LIOBN (which selects
which PE and aperture), an IOVA and a TCE value - which is a guest
physical address plus some permission bits.  There are some variants
for performance that can set a batch of IOPTEs from a buffer, or clear
a range of IOPTEs, but they're just faster ways of doing the same
thing as a bunch of H_PUT_TCE calls.

You can consider that a map/unmap hypercall, but the size of the
mapping is fixed (the IO pagesize which was previously set for the
aperture).
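
To make the call shape concrete, a rough guest-side sketch of a single
map; it assumes the pseries plpar_hcall_norets()/H_PUT_TCE pair from
<asm/hvcall.h>, and the wrapper itself is invented for illustration:

/*
 * Illustrative only: one call installs exactly one IOPTE of the
 * aperture's fixed IO page size.  "liobn" selects the PE/aperture,
 * "ioba" is the IOVA inside that aperture, and the TCE value is the
 * guest physical address ORed with read/write permission bits.
 */
static long map_one_io_page(unsigned long liobn, unsigned long ioba,
			    unsigned long gpa, unsigned long perms)
{
	return plpar_hcall_norets(H_PUT_TCE, liobn, ioba, gpa | perms);
}

Unmap is essentially the same call with the permission bits cleared, and
the batch/range variants mentioned above just amortize the hypercall cost.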

> Assuming yes, I'd expect that:
> 
> The iommu_domain for nested PPC is just a log of map/unmap hypervsior
> calls to make. Whenever a new PE is attached to that domain it gets
> the logged map's replayed to set it up, and when a PE is detached the
> log is used to unmap everything.

And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
Sure.  It means the changes won't be atomic across the domain, but I
guess that doesn't matter.  I guess you could have the same thing on a
sufficiently complex x86 or ARM system, if you put two devices into
the IOAS that were sufficiently far from each other in the bus
topology that they use a different top-level host IOMMU.

> It is not perfectly memory efficient - and we could perhaps talk about
> a API modification to allow re-use of the iommufd datastructure
> somehow, but I think this is a good logical starting point.

Because the map size is fixed, a "replay log" is effectively
equivalent to a mirror of the entire IO pagetable.

> The PE would have to be modeled as an iommu_group.

Yes, they already are.

> > So, here's an alternative set of interfaces that should work for ppc,
> > maybe you can tell me whether they also work for x86 and others:
> 
> Fundamentally PPC has to fit into the iommu standard framework of
> group and domains, we can talk about modifications, but drifting too
> far away is a big problem.

Well, what I'm trying to do here is to see how big a change to the
model this really requires.  If it's not too much, we gain flexibility
for future unknown IOMMU models as well as for the admittedly niche
ppc case.

> >   * Each domain/IOAS has a concept of one or more IOVA windows, which
> >     each have a base address, size, pagesize (granularity) and optionally
> >     other flags/attributes.
> >       * This has some bearing on hardware capabilities, but is
> >         primarily a software notion
> 
> iommu_domain has the aperture, PPC will require extending this to a
> list of apertures since it is currently only one window.

Yes, that's needed as a minimum.

> Once a domain is created and attached to a group the aperture should
> be immutable.

I realize that's the model right now, but is there a strong rationale
for that immutability?

Come to that, IIUC that's true for the iommu_domain at the lower
level, but not for the IOAS at a higher level.  You've stated that's
*not* immutable, since it can shrink as new devices are added to the
IOAS.  So I guess in that case the IOAS must be mapped by multiple
iommu_domains?

> >   * MAP/UNMAP operations are only permitted within an existing IOVA
> >     window (and with addresses aligned to the window's pagesize)
> >       * This is enforced by software whether or not it is required by
> >         the underlying hardware
> >   * Likewise IOAS_COPY operations are only permitted if the source and
> >     destination windows have compatible attributes
> 
> Already done, domain's aperture restricts all the iommufd operations

Right but that aperture is coming only from the hardware.  What I'm
suggesting here is that userspace can essentially opt into a *smaller*
aperture (or apertures) than the hardware permits.  The value of this
is that if the effective hardware aperture shrinks due to adding
devices, the kernel has the information it needs to determine if this
will be a problem for the userspace client or not.

> >   * A newly created kernel-managed IOAS has *no* IOVA windows
> 
> Already done, the iommufd IOAS has no iommu_domains inside it at
> creation time.

That.. doesn't seem like the same thing at all.  If there are no
domains, there are no restrictions from the hardware, so there's
effectively an unlimited aperture.

> >   * A CREATE_WINDOW operation is added
> >       * This takes a size, pagesize/granularity, optional base address
> >         and optional additional attributes 
> >       * If any of the specified attributes are incompatible with the
> >         underlying hardware, the operation fails
> 
> iommu layer has nothing called a window. The closest thing is a
> domain.

Maybe "window" is a bad name.  You called it "aperture" above (and
I've shifted to that naming) and implied it *was* part of the IOMMU
domain.  That said, it doesn't need to be at the iommu layer - as I
said I'm considering this primarily a software concept and it could be
at the IOAS layer.  The linkage would be that every iommufd operation
which could change either the user-selected windows or the effective
hardware apertures must ensure that all the user windows (still) lie
within the hardware apertures.

> I really don't want to try to make a new iommu layer object that is so
> unique and special to PPC - we have to figure out how to fit PPC into
> the iommu_domain model with reasonable extensions.

So, I'm probably conflating things between the iommu layer and the
iommufd/IOAS layer because it's been a while since I looked at this
code. I'm primarily interested in the interfaces at the iommufd layer,
I don't much care at what layer it's implemented.

Having the iommu layer only care about the hw limitations (let's call
those "apertures") and the iommufd/IOAS layer care about the software
chosen limitations (let's call those "windows") makes reasonable sense
to me: HW advertises apertures according to its capabilities, SW
requests windows according to its needs.  The iommufd layer does its
best to accommodate SW's operations while maintaining the invariant
that the windows must always lie within the apertures.
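
As a concrete illustration of that invariant, a minimal standalone sketch
(all names hypothetical, not proposed kernel code) of the check such a
layer would run whenever either side changes:

#include <stdbool.h>

struct range { unsigned long long start, last; };

/*
 * Every software-chosen window must lie entirely inside at least one
 * hardware aperture; any operation that would break this (a device
 * attach shrinking the apertures, or a CREATE_WINDOW-style request)
 * would be refused.
 */
static bool windows_fit_apertures(const struct range *win, int nwin,
				  const struct range *ap, int nap)
{
	for (int w = 0; w < nwin; w++) {
		bool ok = false;

		for (int a = 0; a < nap; a++) {
			if (win[w].start >= ap[a].start &&
			    win[w].last <= ap[a].last) {
				ok = true;
				break;
			}
		}
		if (!ok)
			return false;
	}
	return true;
}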

At least.. that's the model on x86 and most other hosts.  For a ppc
host, we'd want to have attempts to create windows also attempt
to create apertures on the hardware (or next level hypervisor).  But
with this model, that doesn't require userspace to do anything
different, so that can be localized to the ppc host backend.

> > > > > Maybe every device gets a copy of the error notification?
> > > > 
> > > > Alas, it's harder than that.  One of the things that can happen on an
> > > > EEH fault is that the entire PE gets suspended (blocking both DMA and
> > > > MMIO, IIRC) until the proper recovery steps are taken.  
> > > 
> > > I think qemu would have to de-duplicate the duplicated device
> > > notifications and then it can go from a device notifiation to the
> > > device's iommu_group to the IOAS to the vPE?
> > 
> > It's not about the notifications. 
> 
> The only thing the kernel can do is rely a notification that something
> happened to a PE. The kernel gets an event on the PE basis, I would
> like it to replicate it to all the devices and push it through the
> VFIO device FD.

Again: it's not about the notifications (I don't even really know how
those work in EEH).  The way EEH is specified, the expectation is that
when an error is tripped, the whole PE goes into an error state,
blocking IO for every device in the PE.  Something working with EEH
expects that to happen at the (virtual) hardware level.  So if a guest
trips an error on one device, it expects IO to stop for every other
device in the PE.  But if the hardware's notion of the PE is smaller
than the guest's, the host hardware won't accomplish that itself.

So the kernel would somehow have to replicate the error state from one
host PE to the other host PEs within the domain.  I think it's
theoretically possible, but it's really fiddly. It has to maintain
that replicated state on every step of the recovery process as
well. At last count trying to figure out how to do this correctly has
burnt out 3 or more separate (competent) developers at IBM, so they
either just gave up and ran away, or procrastinated and avoided it
until they got a different job.

Bear in mind that in this scenario things are already not working
quite right at the hw level (or there wouldn't be an error in the
first place), so different results from the same recovery/injection
steps on different host PEs is a very real possibility.  We can't
present that back to the guest running the recovery, because the
interface only allows for a single result.  So every operation's error
path becomes a nightmare of trying to resynchronize the state across
the domain, which involves steps that themselves could fail, meaning
more resync... and so on.

> qemu will de-duplicate the replicates and recover exactly the same
> event the kernel saw, delivered at exactly the same time.

qemu's not doing the recovery, the guest is.  So we can't choose the
interfaces, we just have what PAPR specifies (and like most PAPR
interfaces, it's pretty crap).

> If instead you want to have one event per-PE then all that changes in
> the kernel is one event is generated instead of N, and qemu doesn't
> have to throw away the duplicates.
> 
> With either method qemu still gets the same PE centric event,
> delivered at the same time, with the same races.
> 
> This is not a general mechanism, it is some PPC specific thing to
> communicate a PPC specific PE centric event to userspace. I just
> prefer it in VFIO instead of iommufd.

Oh, yes, this one (unlike the multi-aperture stuff, where I'm not yet
convinced) is definitely ppc specific.  And I'm frankly not convinced
it's of any value even on ppc.  Given I'm not sure anyone has ever
used this in anger, I think a better approach is to just straight up
fail any attempt to do any EEH operation if there's not a 1 to 1
mapping from guest PE (domain) to host PE (group).

Or just ignore it entirely and count on no-one noticing.  EEH is
a largely irrelevant tangent to the actual interface issues I'm
interested in.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-06  5:25                       ` David Gibson
@ 2022-05-06 10:42                         ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-05-06 10:42 UTC (permalink / raw)
  To: David Gibson, Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Friday, May 6, 2022 1:25 PM
> 
> >
> > When the iommu_domain is created I want to have a
> > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > however it likes.
> 
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we could
> *stop* doing that.

That model is inevitable anyway when talking about user page tables,
i.e. when nesting is enabled.

> 
> Note that I'm talking here *purely* about the non-optimized case where
> all updates to the host side IO pagetables are handled by IOAS_MAP /
> IOAS_COPY, with no direct hardware access to user or guest managed IO
> pagetables.  The optimized case obviously requires end-to-end
> agreement on the pagetable format amongst other domain properties.
> 
> What I'm hoping is that qemu (or whatever) can use this non-optimized
> as a fallback case where it does't need to know the properties of
> whatever host side IOMMU models there are.  It just requests what it
> needs based on the vIOMMU properties it needs to replicate and the
> host kernel either can supply it or can't.
> 
> In many cases it should be perfectly possible to emulate a PPC style
> vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
> aperture that it will encompass wherever the ppc apertures end
> up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
> host (currently somewhere between awkward and impossible) by placing
> the host apertures to cover guest memory.
> 
> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative that x86 about
> maintaining similar hw interfaces over time.  That's why I think
> considering these ppc cases will give a more robust interface for
> other future possibilities as well.

These are not niche cases. We already have virtio-iommu, which can work
on both ARM and x86 platforms, i.e. what the current iommufd provides
is already generic enough except on PPC.

Then IMHO the key open question here is:

Can PPC adapt to the current iommufd proposal if it can be
refactored to fit the standard iommu domain/group concepts?

If not, what is the remaining gap after PPC becomes a normal
citizen in the iommu layer, and is it worth solving it in the general
interface or via an iommu-driver-specific domain (given the latter will
exist anyway)?

To close that open question, I'm with Jason:

   "Fundamentally PPC has to fit into the iommu standard framework of
   group and domains, we can talk about modifications, but drifting too
   far away is a big problem."

Directly jumping to the iommufd layer for changes that might be
applied to all platforms sounds counter-intuitive if we haven't first
tried to solve the gap in the iommu layer. Even if one can argue that
certain changes in the iommufd layer would find matching concepts on
other platforms, it still looks somewhat redundant, since those
platforms already work with the current model.

My two cents.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-06  5:25                       ` David Gibson
@ 2022-05-06 12:48                         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-06 12:48 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:

> > When the iommu_domain is created I want to have a
> > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > however it likes.
> 
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we could
> *stop* doing that.

iommufd has two modes. The 'generic interface', which is what this patch
series shows, does not require any device specific knowledge.

The default iommu_domain that the iommu driver creates will be used
here; it is up to the iommu driver to choose something reasonable for
use by applications like DPDK, i.e. PPC should probably pick its biggest
x86-like aperture.

The iommu-driver-specific struct is the "advanced" interface and
allows a user-space IOMMU driver to tightly control the HW with full
HW specific knowledge. This is where all the weird stuff that is not
general should go.

> Note that I'm talking here *purely* about the non-optimized case where
> all updates to the host side IO pagetables are handled by IOAS_MAP /
> IOAS_COPY, with no direct hardware access to user or guest managed IO
> pagetables.  The optimized case obviously requires end-to-end
> agreement on the pagetable format amongst other domain properties.

Sure, this is how things are already..

> What I'm hoping is that qemu (or whatever) can use this non-optimized
> as a fallback case where it does't need to know the properties of
> whatever host side IOMMU models there are.  It just requests what it
> needs based on the vIOMMU properties it needs to replicate and the
> host kernel either can supply it or can't.

There aren't really any negotiable vIOMMU properties beyond the
ranges, and the ranges are not *really* negotiable.

There are lots of dragons with the idea we can actually negotiate
ranges - because asking for the wrong range for what the HW can do
means you don't get anything. Which is completely contrary to the idea
of easy generic support for things like DPDK.

So DPDK can't ask for ranges; that is not generic.

This means we are really talking about a qemu-only API, and IMHO, qemu
is fine to have a PPC specific userspace driver to tweak this PPC
unique thing if the default windows are not acceptable.

IMHO it is no different from imagining an Intel specific userspace
driver that is using userspace IO pagetables to optimize
cross-platform qemu vIOMMU emulation. We should be comfortable with
the idea that accessing the full device-specific feature set requires
a HW specific user space driver.

> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative that x86 about
> maintaining similar hw interfaces over time.  That's why I think
> considering these ppc cases will give a more robust interface for
> other future possibilities as well.

I don't think so - PPC has just done two things that are completely
divergent from everything else - having two IO page tables for the same
end point, and using map/unmap hypercalls instead of a nested page
table.

Everyone else seems to be focused on IOPTEs that are similar to CPU
PTEs, particularly to enable SVA and other tricks, and CPUs don't
have either kind of weirdness.

> You can consider that a map/unmap hypercall, but the size of the
> mapping is fixed (the IO pagesize which was previously set for the
> aperture).

Yes, I would consider that a map/unmap hypercall vs a nested translation.
 
> > Assuming yes, I'd expect that:
> > 
> > The iommu_domain for nested PPC is just a log of map/unmap hypervsior
> > calls to make. Whenever a new PE is attached to that domain it gets
> > the logged map's replayed to set it up, and when a PE is detached the
> > log is used to unmap everything.
> 
> And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
> Sure.  It means the changes won't be atomic across the domain, but I
> guess that doesn't matter.  I guess you could have the same thing on a
> sufficiently complex x86 or ARM system, if you put two devices into
> the IOAS that were sufficiently far from each other in the bus
> topology that they use a different top-level host IOMMU.

Yes, strict atomicity is not needed.

> > It is not perfectly memory efficient - and we could perhaps talk about
> > a API modification to allow re-use of the iommufd datastructure
> > somehow, but I think this is a good logical starting point.
> 
> Because the map size is fixed, a "replay log" is effectively
> equivalent to a mirror of the entire IO pagetable.

So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs
and device attach/detach just sweeps the xarray and does the
hypercalls. Very similar to what we discussed for S390.

It seems OK, this isn't even really overhead in most cases as we
always have to track the mapped PFNs anyhow.
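
A rough sketch of that attach path, assuming the xarray is keyed by IOVA
page index with each entry encoding the mapping; the domain struct and
the hypercall wrapper below are invented for illustration:

#include <linux/xarray.h>

struct ppc_nested_domain {
	struct xarray mappings;	/* IOVA page index -> encoded gpa|perms */
	/* ... */
};

/* Hypothetical wrapper around the map hypercall for one IO page. */
long hcall_map_page(unsigned long liobn, unsigned long index,
		    unsigned long tce);

/* Replay every logged mapping into a newly attached PE. */
static long ppc_replay_mappings(struct ppc_nested_domain *dom,
				unsigned long liobn)
{
	unsigned long index;
	void *entry;
	long ret;

	xa_for_each(&dom->mappings, index, entry) {
		ret = hcall_map_page(liobn, index, (unsigned long)entry);
		if (ret)
			return ret;	/* real code would unwind prior maps */
	}
	return 0;
}

Detach would walk the same xarray and issue the corresponding unmaps.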

> > Once a domain is created and attached to a group the aperture should
> > be immutable.
> 
> I realize that's the model right now, but is there a strong rationale
> for that immutability?

I have a strong preference that iommu_domains have immutable
properties once they are created just for overall sanity. Otherwise
everything becomes a racy mess.

If the iommu_domain changes dynamically then things using the aperture
data get all broken - it is complicated to fix. So it would need a big
reason to do it, I think.
 
> Come to that, IIUC that's true for the iommu_domain at the lower
> level, but not for the IOAS at a higher level.  You've stated that's
> *not* immutable, since it can shrink as new devices are added to the
> IOAS.  So I guess in that case the IOAS must be mapped by multiple
> iommu_domains?

Yes, that's right. The IOAS is expressly mutable because it sits on
multiple domains and multiple groups, each of which contributes to the
aperture. The IOAS aperture is the narrowest of everything, and we
have semi-reasonable semantics here. Here we have all the special code
to juggle this, but even in this case we can't handle an iommu_domain
or group changing asynchronously.
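
In other words the usable IOAS range is the intersection of every
contributor's aperture; a minimal single-range, standalone sketch
(hypothetical names, not the iommufd code):

#include <stdbool.h>

struct range { unsigned long long start, last; };

/*
 * Narrow an IOAS range to the intersection of all contributing
 * apertures; an empty intersection means there is no IOVA that every
 * attached domain/group can map.
 */
static bool ioas_narrowest(const struct range *ap, int n, struct range *out)
{
	struct range cur = { 0, ~0ULL };

	for (int i = 0; i < n; i++) {
		if (ap[i].start > cur.start)
			cur.start = ap[i].start;
		if (ap[i].last < cur.last)
			cur.last = ap[i].last;
	}
	if (cur.start > cur.last)
		return false;
	*out = cur;
	return true;
}

Real tracking would be over lists of ranges rather than one interval,
but the narrowing rule is the same.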

> Right but that aperture is coming only from the hardware.  What I'm
> suggesting here is that userspace can essentially opt into a *smaller*
> aperture (or apertures) than the hardware permits.  The value of this
> is that if the effective hardware aperture shrinks due to adding
> devices, the kernel has the information it needs to determine if this
> will be a problem for the userspace client or not.

We can do this check in userspace without more kernel APIs; userspace
should fetch the ranges and confirm they are still good after
attaching devices.

In general I have no issue with limiting the IOVA allocator in the
kernel; I just don't have a use case of an application that could use
the IOVA allocator (like DPDK) and also needs a limitation.

> > >   * A newly created kernel-managed IOAS has *no* IOVA windows
> > 
> > Already done, the iommufd IOAS has no iommu_domains inside it at
> > creation time.
> 
> That.. doesn't seem like the same thing at all.  If there are no
> domains, there are no restrictions from the hardware, so there's
> effectively an unlimited aperture.

Yes.. You wanted a 0 sized window instead? Why? That breaks what I'm
trying to do to make DPDK/etc portable and dead simple.
 
> > >   * A CREATE_WINDOW operation is added
> > >       * This takes a size, pagesize/granularity, optional base address
> > >         and optional additional attributes 
> > >       * If any of the specified attributes are incompatible with the
> > >         underlying hardware, the operation fails
> > 
> > iommu layer has nothing called a window. The closest thing is a
> > domain.
> 
> Maybe "window" is a bad name.  You called it "aperture" above (and
> I've shifted to that naming) and implied it *was* part of the IOMMU
> domain. 

It is but not as an object that can be mutated - it is just a
property.

You are talking about a window object that exists somehow; I don't
know how this fits beyond some creation attribute of the domain.

> That said, it doesn't need to be at the iommu layer - as I
> said I'm consdiering this primarily a software concept and it could be
> at the IOAS layer.  The linkage would be that every iommufd operation
> which could change either the user-selected windows or the effective
> hardware apertures must ensure that all the user windows (still) lie
> within the hardware apertures.

As Kevin said, we need to start at the iommu_domain layer first -
when we understand how that needs to look then we can imagine what the
uAPI should be.

I don't want to create something in iommufd that is wildly divergent
from what the iommu_domain/etc can support.

> > The only thing the kernel can do is rely a notification that something
> > happened to a PE. The kernel gets an event on the PE basis, I would
> > like it to replicate it to all the devices and push it through the
> > VFIO device FD.
> 
> Again: it's not about the notifications (I don't even really know how
> those work in EEH).

Well, then I don't know what we are talking about - I'm interested in
what uAPI is needed to support this; as far as I can tell that is a
notification that something bad happened and some control knobs?

As I said, I'd prefer this to be on the vfio_device FD vs on iommufd
and would like to find a way to make that work.

> expects that to happen at the (virtual) hardware level.  So if a guest
> trips an error on one device, it expects IO to stop for every other
> device in the PE.  But if hardware's notion of the PE is smaller than
> the guest's the host hardware won't accomplish that itself.

So, why do that to the guest? Shouldn't the PE in the guest be
strictly a subset of the PE in the host, otherwise you hit all these
problems you are talking about?

> used this in anger, I think a better approach is to just straight up
> fail any attempt to do any EEH operation if there's not a 1 to 1
> mapping from guest PE (domain) to host PE (group).

That makes sense

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
@ 2022-05-06 12:48                         ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-05-06 12:48 UTC (permalink / raw)
  To: David Gibson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins

On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:

> > When the iommu_domain is created I want to have a
> > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > however it likes.
> 
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we could
> *stop* doing that.

iommufd has two modes, the 'generic interface' which what this patch
series shows that does not require any device specific knowledge.

The default iommu_domain that the iommu driver creates will be used
here, it is up to the iommu driver to choose something reasonable for
use by applications like DPDK. ie PPC should probably pick its biggest
x86-like aperture.

The iommu-driver-specific struct is the "advanced" interface and
allows a user-space IOMMU driver to tightly control the HW with full
HW specific knowledge. This is where all the weird stuff that is not
general should go.

> Note that I'm talking here *purely* about the non-optimized case where
> all updates to the host side IO pagetables are handled by IOAS_MAP /
> IOAS_COPY, with no direct hardware access to user or guest managed IO
> pagetables.  The optimized case obviously requires end-to-end
> agreement on the pagetable format amongst other domain properties.

Sure, this is how things are already..

> What I'm hoping is that qemu (or whatever) can use this non-optimized
> as a fallback case where it does't need to know the properties of
> whatever host side IOMMU models there are.  It just requests what it
> needs based on the vIOMMU properties it needs to replicate and the
> host kernel either can supply it or can't.

There aren't really any negotiable vIOMMU properties beyond the
ranges, and the ranges are not *really* negotiable.

There are lots of dragons with the idea we can actually negotiate
ranges - because asking for the wrong range for what the HW can do
means you don't get anything. Which is completely contrary to the idea
of easy generic support for things like DPDK.

So DPDK can't ask for ranges, it is not generic.

This means we are really talking about a qemu-only API, and IMHO, qemu
is fine to have a PPC specific userspace driver to tweak this PPC
unique thing if the default windows are not acceptable.

IMHO it is no different from imagining an Intel specific userspace
driver that is using userspace IO pagetables to optimize
cross-platform qemu vIOMMU emulation. We should be comfortable with
the idea that accessing the full device-specific feature set requires
a HW specific user space driver.

> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative that x86 about
> maintaining similar hw interfaces over time.  That's why I think
> considering these ppc cases will give a more robust interface for
> other future possibilities as well.

I don't think so - PPC has just done two things that are completely
divergent from eveything else - having two IO page tables for the same
end point, and using map/unmap hypercalls instead of a nested page
table.

Everyone else seems to be focused on IOPTEs that are similar to CPU
PTEs, particularly to enable SVA and other tricks, and CPU's don't
have either of this weirdness.

> You can consider that a map/unmap hypercall, but the size of the
> mapping is fixed (the IO pagesize which was previously set for the
> aperture).

Yes, I would consider that a map/unmap hypercall vs a nested translation.
 
> > Assuming yes, I'd expect that:
> > 
> > The iommu_domain for nested PPC is just a log of map/unmap hypervsior
> > calls to make. Whenever a new PE is attached to that domain it gets
> > the logged map's replayed to set it up, and when a PE is detached the
> > log is used to unmap everything.
> 
> And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
> Sure.  It means the changes won't be atomic across the domain, but I
> guess that doesn't matter.  I guess you could have the same thing on a
> sufficiently complex x86 or ARM system, if you put two devices into
> the IOAS that were sufficiently far from each other in the bus
> topology that they use a different top-level host IOMMU.

Yes, strict atomicity is not needed.

> > It is not perfectly memory efficient - and we could perhaps talk about
> > a API modification to allow re-use of the iommufd datastructure
> > somehow, but I think this is a good logical starting point.
> 
> Because the map size is fixed, a "replay log" is effectively
> equivalent to a mirror of the entire IO pagetable.

So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs
and device attach/detach just sweeps the xarray and does the
hypercalls. Very similar to what we discussed for S390.

It seems OK, this isn't even really overhead in most cases as we
always have to track the mapped PFNs anyhow.

> > Once a domain is created and attached to a group the aperture should
> > be immutable.
> 
> I realize that's the model right now, but is there a strong rationale
> for that immutability?

I have a strong preference that iommu_domains have immutable
properties once they are created just for overall sanity. Otherwise
everything becomes a racy mess.

If the iommu_domain changes dynamically then things using the aperture
data get all broken - it is complicated to fix. So it would need a big
reason to do it, I think.
 
> Come to that, IIUC that's true for the iommu_domain at the lower
> level, but not for the IOAS at a higher level.  You've stated that's
> *not* immutable, since it can shrink as new devices are added to the
> IOAS.  So I guess in that case the IOAS must be mapped by multiple
> iommu_domains?

Yes, thats right. The IOAS is expressly mutable because it its on
multiple domains and multiple groups each of which contribute to the
aperture. The IOAS aperture is the narrowest of everything, and we
have semi-reasonable semantics here. Here we have all the special code
to juggle this, but even in this case we can't handle an iommu_domain
or group changing asynchronously.

> Right but that aperture is coming only from the hardware.  What I'm
> suggesting here is that userspace can essentially opt into a *smaller*
> aperture (or apertures) than the hardware permits.  The value of this
> is that if the effective hardware aperture shrinks due to adding
> devices, the kernel has the information it needs to determine if this
> will be a problem for the userspace client or not.

We can do this check in userspace without more kernel APIs, userspace
should fetch the ranges and confirm they are still good after
attaching devices.
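
The check on the userspace side is tiny; something along these lines
(illustrative types, not the actual uAPI structs):

struct iova_range { unsigned long start, last; };

/* After an attach, re-fetch the ranges and confirm the window the
 * application already relies on is still inside one of them. */
static bool window_still_valid(const struct iova_range *ranges, int n,
			       unsigned long base, unsigned long last)
{
	for (int i = 0; i < n; i++)
		if (base >= ranges[i].start && last <= ranges[i].last)
			return true;
	return false;
}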

In general I have no issue with limiting the IOVA allocator in the
kernel, I just don't have a use case of an application that could use
the IOVA allocator (like DPDK) and also needs a limitation..

> > >   * A newly created kernel-managed IOAS has *no* IOVA windows
> > 
> > Already done, the iommufd IOAS has no iommu_domains inside it at
> > creation time.
> 
> That.. doesn't seem like the same thing at all.  If there are no
> domains, there are no restrictions from the hardware, so there's
> effectively an unlimited aperture.

Yes.. You wanted a 0 sized window instead? Why? That breaks what I'm
trying to do to make DPDK/etc portable and dead simple.
 
> > >   * A CREATE_WINDOW operation is added
> > >       * This takes a size, pagesize/granularity, optional base address
> > >         and optional additional attributes 
> > >       * If any of the specified attributes are incompatible with the
> > >         underlying hardware, the operation fails
> > 
> > iommu layer has nothing called a window. The closest thing is a
> > domain.
> 
> Maybe "window" is a bad name.  You called it "aperture" above (and
> I've shifted to that naming) and implied it *was* part of the IOMMU
> domain. 

It is but not as an object that can be mutated - it is just a
property.

You are talking about a window object that exists somehow, I don't
know how this fits beyond some creation attribute of the domain..

> That said, it doesn't need to be at the iommu layer - as I
> said I'm considering this primarily a software concept and it could be
> at the IOAS layer.  The linkage would be that every iommufd operation
> which could change either the user-selected windows or the effective
> hardware apertures must ensure that all the user windows (still) lie
> within the hardware apertures.

As Kevin said, we need to start at the iommu_domain layer first -
when we understand how that needs to look then we can imagine what the
uAPI should be.

I don't want to create something in iommufd that is wildly divergent
from what the iommu_domain/etc can support.

> > The only thing the kernel can do is relay a notification that something
> > happened to a PE. The kernel gets an event on the PE basis, I would
> > like it to replicate it to all the devices and push it through the
> > VFIO device FD.
> 
> Again: it's not about the notifications (I don't even really know how
> those work in EEH).

Well, then I don't know what we are talking about - I'm interested in
what uAPI is needed to support this, as far as I can tell that is a
notification something bad happened and some control knobs?

As I said, I'd prefer this to be on the vfio_device FD vs on iommufd
and would like to find a way to make that work.

> expects that to happen at the (virtual) hardware level.  So if a guest
> trips an error on one device, it expects IO to stop for every other
> device in the PE.  But if hardware's notion of the PE is smaller than
> the guest's the host hardware won't accomplish that itself.

So, why do that to the guest? Shouldn't the PE in the guest be
strictly a subset of the PE in the host, otherwise you hit all these
problems you are talking about?

> used this in anger, I think a better approach is to just straight up
> fail any attempt to do any EEH operation if there's not a 1 to 1
> mapping from guest PE (domain) to host PE (group).

That makes sense

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-06 10:42                         ` Tian, Kevin
@ 2022-05-09  3:36                           ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-09  3:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Alex Williamson, Jason Gunthorpe, Martins,
	Joao


[-- Attachment #1.1: Type: text/plain, Size: 4323 bytes --]

On Fri, May 06, 2022 at 10:42:21AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@gibson.dropbear.id.au>
> > Sent: Friday, May 6, 2022 1:25 PM
> > 
> > >
> > > When the iommu_domain is created I want to have a
> > > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > > however it likes.
> > 
> > This requires that the client be aware of the host side IOMMU model.
> > That's true in VFIO now, and it's nasty; I was really hoping we could
> > *stop* doing that.
> 
> that model is inevitable anyway when talking about user page tables,

Right, but I'm explicitly not talking about the user managed page
table case.  I'm talking about the case where the IO pagetable is
still managed by the kernel and we update it via IOAS_MAP and similar
operations.

> i.e. when nesting is enabled.

I don't really follow the connection you're drawing between a user
managed table and nesting.

> > Note that I'm talking here *purely* about the non-optimized case where
> > all updates to the host side IO pagetables are handled by IOAS_MAP /
> > IOAS_COPY, with no direct hardware access to user or guest managed IO
> > pagetables.  The optimized case obviously requires end-to-end
> > agreement on the pagetable format amongst other domain properties.
> > 
> > What I'm hoping is that qemu (or whatever) can use this non-optimized
> > mode as a fallback case where it doesn't need to know the properties of
> > whatever host side IOMMU models there are.  It just requests what it
> > needs based on the vIOMMU properties it needs to replicate and the
> > host kernel either can supply it or can't.
> > 
> > In many cases it should be perfectly possible to emulate a PPC style
> > vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
> > aperture that it will encompass wherever the ppc apertures end
> > up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
> > host (currently somewhere between awkward and impossible) by placing
> > the host apertures to cover guest memory.
> > 
> > Admittedly those are pretty niche cases, but allowing for them gives
> > us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> > the future, and AFAICT, ARM are much less conservative than x86 about
> > maintaining similar hw interfaces over time.  That's why I think
> > considering these ppc cases will give a more robust interface for
> > other future possibilities as well.
> 
> It's not niche cases. We already have virtio-iommu which can work
> on both ARM and x86 platforms, i.e. what current iommufd provides
> is already generic enough except on PPC.
> 
> Then IMHO the key open here is:
> 
> Can PPC adapt to the current iommufd proposal if it can be
> refactored to fit the standard iommu domain/group concepts?

Right...  and I'm still trying to figure out whether it can adapt to
either part of that.  We absolutely need to allow for multiple IOVA
apertures within a domain.  If we have that I *think* we can manage
(if suboptimally), but I'm trying to figure out the corner cases to
make sure I haven't missed something.

> If not, what is the remaining gap after PPC becomes a normal
> citizen in the iommu layer and is it worth solving it in the general
> interface or via iommu-driver-specific domain (given this will
> exist anyway)?
> 
> to close that open I'm with Jason:
> 
>    "Fundamentally PPC has to fit into the iommu standard framework of
>    group and domains, we can talk about modifications, but drifting too
>    far away is a big problem."
> 
> Directly jumping to the iommufd layer for what changes might be
> applied to all platforms sounds counter-intuitive if we haven't tried
> to solve the gap in the iommu layer in the first place. Even if there
> is an argument that certain changes in the iommufd layer can find a
> matching concept on other platforms, it still looks somewhat redundant
> since those platforms already work with the current model.

I don't really follow what you're saying here.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 156 bytes --]

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-06 12:48                         ` Jason Gunthorpe via iommu
@ 2022-05-09  6:01                           ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-09  6:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins


[-- Attachment #1.1: Type: text/plain, Size: 20260 bytes --]

On Fri, May 06, 2022 at 09:48:37AM -0300, Jason Gunthorpe wrote:
> On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> > On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> 
> > > When the iommu_domain is created I want to have a
> > > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > > however it likes.
> > 
> > This requires that the client be aware of the host side IOMMU model.
> > That's true in VFIO now, and it's nasty; I was really hoping we could
> > *stop* doing that.
> 
> iommufd has two modes; the 'generic interface', which is what this patch
> series shows, does not require any device-specific knowledge.

Right, and I'm speaking specifically to that generic interface.  But
I'm thinking particularly about the qemu case where we do have
specific knowledge of the *guest* vIOMMU, but we want to avoid having
specific knowledge of the host IOMMU, because they might not be the same.

It would be good to have a way of seeing if the guest vIOMMU can be
emulated on this host IOMMU without qemu having to have separate
logic for every host IOMMU.

> The default iommu_domain that the iommu driver creates will be used
> here, it is up to the iommu driver to choose something reasonable for
> use by applications like DPDK. ie PPC should probably pick its biggest
> x86-like aperture.

So, using the big aperture means a very high base IOVA
(1<<59)... which means that it won't work at all if you want to attach
any devices that aren't capable of 64-bit DMA.  Using the maximum
possible window size would mean we either potentially waste a lot of
kernel memory on pagetables, or we use an unnecessarily large number of
levels in the pagetable.

Basically we don't have enough information to make a good decision
here.

More generally, the problem with the interface advertising limitations
and it being up to userspace to work out if those are ok or not is
that it's fragile.  It's pretty plausible that some future IOMMU model
will have some new kind of limitation that can't be expressed in the
query structure we invented now.  That means that to add support for
that we need some kind of gate to prevent old userspace using the new
IOMMU (e.g. only allowing the new IOMMU to be used if userspace uses
newly added queries to get the new limitations).  That's true even if
what userspace was actually doing with the IOMMU would fit just fine
into those new limitations.

But if userspace requests the capabilities it wants, and the kernel
acks or nacks that, we can support the new host IOMMU with existing
software just fine.  They won't be able to use any *new* features or
capabilities of the new hardware, of course, but they'll be able to
use what it does that overlaps with what they needed before.

ppc - or more correctly, the POWER and PAPR IOMMU models - is just
acting here as an example of an IOMMU with limitations and
capabilities that don't fit into the current query model.

> The iommu-driver-specific struct is the "advanced" interface and
> allows a user-space IOMMU driver to tightly control the HW with full
> HW specific knowledge. This is where all the weird stuff that is not
> general should go.

Right, but forcing anything more complicated than "give me some IOVA
region" to go through the advanced interface means that qemu (or any
hypervisor where the guest platform need not identically match the
host) has to have n^2 complexity to match each guest IOMMU model to
each host IOMMU model.

> > Note that I'm talking here *purely* about the non-optimized case where
> > all updates to the host side IO pagetables are handled by IOAS_MAP /
> > IOAS_COPY, with no direct hardware access to user or guest managed IO
> > pagetables.  The optimized case obviously requires end-to-end
> > agreement on the pagetable format amongst other domain properties.
> 
> Sure, this is how things are already..
> 
> > What I'm hoping is that qemu (or whatever) can use this non-optimized
> > mode as a fallback case where it doesn't need to know the properties of
> > whatever host side IOMMU models there are.  It just requests what it
> > needs based on the vIOMMU properties it needs to replicate and the
> > host kernel either can supply it or can't.
> 
> There aren't really any negotiable vIOMMU properties beyond the
> ranges, and the ranges are not *really* negotiable.

Errr.. how do you figure?  On ppc the ranges and pagesizes are
definitely negotiable.  I'm not really familiar with other models, but
anything which allows *any* variations in the pagetable structure will
effectively have at least some negotiable properties.

Even if any individual host IOMMU doesn't have negotiable properties
(which ppc demonstrates is false), there's still a negotiation here in
the context that userspace doesn't know (and doesn't care) what
specific host IOMMU model it has.

> There are lots of dragons with the idea we can actually negotiate
> ranges - because asking for the wrong range for what the HW can do
> means you don't get anything. Which is completely contrary to the idea
> of easy generic support for things like DPDK.
>
> So DPDK can't ask for ranges, it is not generic.

Which is why I'm suggesting that the base address be an optional
request.  DPDK *will* care about the size of the range, so it just
requests that and gets told a base address.

Qemu emulating a vIOMMU absolutely does care about the ranges, and if
the HW can't do it, failing outright is the correct behaviour.

> This means we are really talking about a qemu-only API,

Kind of, yes.  I can't immediately think of anything which would need
this fine-grained control over the IOMMU capabilities other than an
emulator/hypervisor emulating a different vIOMMU than the host IOMMU.
qemu is by far the most prominent example of this.

> and IMHO, qemu
> is fine to have a PPC specific userspace driver to tweak this PPC
> unique thing if the default windows are not acceptable.

Not really, because it's the ppc *target* (guest) side which requires
the specific properties, but selecting the "advanced" interface
requires special knowledge on the *host* side.

> IMHO it is no different from imagining an Intel specific userspace
> driver that is using userspace IO pagetables to optimize
> cross-platform qemu vIOMMU emulation.

I'm not quite sure what you have in mind here.  How is it both Intel
specific and cross-platform?

> We should be comfortable with
> the idea that accessing the full device-specific feature set requires
> a HW specific user space driver.
> 
> > Admittedly those are pretty niche cases, but allowing for them gives
> > us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> > ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> > the future, and AFAICT, ARM are much less conservative than x86 about
> > maintaining similar hw interfaces over time.  That's why I think
> > considering these ppc cases will give a more robust interface for
> > other future possibilities as well.
> 
> I don't think so - PPC has just done two things that are completely
> divergent from everything else - having two IO page tables for the same
> end point, and using map/unmap hypercalls instead of a nested page
> table.
> 
> Everyone else seems to be focused on IOPTEs that are similar to CPU
> PTEs, particularly to enable SVA and other tricks, and CPUs don't
> have either of those quirks.

Well, you may have a point there.  Though, my experience makes me
pretty reluctant to ever bet on hw designers *not* inserting some
weird new constraint and assuming sw can just figure it out somehow.

Note, however, that having multiple apertures isn't really ppc-specific.
Anything with an IO hole effectively has separate apertures above and
below the hole.  They're much closer together in address than POWER's
two apertures, but I don't see that it makes any fundamental difference
to the handling of them.
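
Purely as illustration (invented types), an IO hole just turns one
nominal aperture into two:

struct aperture { unsigned long start, last; };

/* Assumes 0 < hole_start and hole_last < max. */
static void split_around_hole(unsigned long max, unsigned long hole_start,
			      unsigned long hole_last, struct aperture out[2])
{
	out[0].start = 0;
	out[0].last  = hole_start - 1;	/* below the hole */
	out[1].start = hole_last + 1;	/* above the hole */
	out[1].last  = max;
}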

> > You can consider that a map/unmap hypercall, but the size of the
> > mapping is fixed (the IO pagesize which was previously set for the
> > aperture).
> 
> Yes, I would consider that a map/unmap hypercall vs a nested translation.
>  
> > > Assuming yes, I'd expect that:
> > > 
> > > The iommu_domain for nested PPC is just a log of map/unmap hypervisor
> > > calls to make. Whenever a new PE is attached to that domain it gets
> > > the logged map's replayed to set it up, and when a PE is detached the
> > > log is used to unmap everything.
> > 
> > And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
> > Sure.  It means the changes won't be atomic across the domain, but I
> > guess that doesn't matter.  I guess you could have the same thing on a
> > sufficiently complex x86 or ARM system, if you put two devices into
> > the IOAS that were sufficiently far from each other in the bus
> > topology that they use a different top-level host IOMMU.
> 
> Yes, strict atomicity is not needed.

Ok, good to know.

> > > It is not perfectly memory efficient - and we could perhaps talk about
> > > a API modification to allow re-use of the iommufd datastructure
> > > somehow, but I think this is a good logical starting point.
> > 
> > Because the map size is fixed, a "replay log" is effectively
> > equivalent to a mirror of the entire IO pagetable.
> 
> So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs
> and device attach/detach just sweeps the xarray and does the
> hypercalls. Very similar to what we discussed for S390.
> 
> It seems OK, this isn't even really overhead in most cases as we
> always have to track the mapped PFNs anyhow.

Fair enough, I think that works.

> > > Once a domain is created and attached to a group the aperture should
> > > be immutable.
> > 
> > I realize that's the model right now, but is there a strong rationale
> > for that immutability?
> 
> I have a strong preference that iommu_domains have immutable
> properties once they are created just for overall sanity. Otherwise
> everything becomes a racy mess.
> 
> If the iommu_domain changes dynamically then things using the aperture
> data get all broken - it is complicated to fix. So it would need a big
> reason to do it, I think.

Ok, that makes sense.  I'm pretty ok with the iommu_domain having
static apertures.  We can't optimally implement ddw that way, but I
think we can do it well enough by creating new domains and copying the
mappings, as you suggested.

> > Come to that, IIUC that's true for the iommu_domain at the lower
> > level, but not for the IOAS at a higher level.  You've stated that's
> > *not* immutable, since it can shrink as new devices are added to the
> > IOAS.  So I guess in that case the IOAS must be mapped by multiple
> > iommu_domains?
> 
> Yes, that's right. The IOAS is expressly mutable because it sits on
> multiple domains and multiple groups each of which contribute to the
> aperture. The IOAS aperture is the narrowest of everything, and we
> have semi-reasonable semantics here. Here we have all the special code
> to juggle this, but even in this case we can't handle an iommu_domain
> or group changing asynchronously.

Ok.

So at the IOAS level, the apertures are already dynamic.  Arguably the
main point of my proposal is that it makes any changes in IOAS
apertures explicit, rather than implicit.  Typically userspace
wouldn't need to rely on the aperture information from a query,
because they know it will never shrink below the parameters they
specified.

Another approach would be to give the required apertures / pagesizes
in the initial creation of the domain/IOAS.  In that case they would
be static for the IOAS, as well as the underlying iommu_domains: any
ATTACH which would be incompatible would fail.

That makes life hard for the DPDK case, though.  Obviously we can
still make the base address optional, but for it to be static the
kernel would have to pick it immediately, before we know what devices
will be attached, which will be a problem on any system where there
are multiple IOMMUs with different constraints.

> > Right but that aperture is coming only from the hardware.  What I'm
> > suggesting here is that userspace can essentially opt into a *smaller*
> > aperture (or apertures) than the hardware permits.  The value of this
> > is that if the effective hardware aperture shrinks due to adding
> > devices, the kernel has the information it needs to determine if this
> > will be a problem for the userspace client or not.
> 
> We can do this check in userspace without more kernel APIs, userspace
> should fetch the ranges and confirm they are still good after
> attaching devices.

Well, yes, but I always think an API which stops you doing the wrong
thing is better than one which requires you take a bunch of extra
steps to use safely.  Plus, as noted above, this breaks down if some
future IOMMU has a new constraint not expressed in the current query
API.

> In general I have no issue with limiting the IOVA allocator in the
> kernel, I just don't have a use case of an application that could use
> the IOVA allocator (like DPDK) and also needs a limitation..

Well, I imagine DPDK has at least the minimal limitation that it needs
the aperture to be a certain minimum size (I'm guessing at least the
size of its pinned hugepage working memory region).  That's a
limitation that's unlikely to fail on modern hardware, but it's there.

> > > >   * A newly created kernel-managed IOAS has *no* IOVA windows
> > > 
> > > Already done, the iommufd IOAS has no iommu_domains inside it at
> > > creation time.
> > 
> > That.. doesn't seem like the same thing at all.  If there are no
> > domains, there are no restrictions from the hardware, so there's
> > effectively an unlimited aperture.
> 
> Yes.. You wanted a 0 sized window instead?

Yes.  Well.. since I'm thinking in terms of multiple windows, I was
thinking of it as "0 windows" rather than a 0-sized window, but
whatever.  Starting with no permitted IOVAs is the relevant point.

> Why?

To force the app to declare its windows with CREATE_WINDOW before
using them.
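
For concreteness, I'm imagining something along these lines - purely
illustrative, not a worked-out uAPI, and every name here is invented:

struct ioas_create_window {
	__u32 size;			/* sizeof(self), for extensibility */
	__u32 flags;
#define IOAS_WINDOW_FIXED_BASE	(1 << 0)  /* base_iova is a requirement */
	__aligned_u64 base_iova;	/* optional requested base address */
	__aligned_u64 window_size;	/* required size of the window */
	__aligned_u64 page_size;	/* required IO page granularity */
};

The kernel either acks the request, and then guarantees it for the life
of the IOAS, or nacks it immediately, before the application has done
any work that depends on it.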

> That breaks what I'm
> trying to do to make DPDK/etc portable and dead simple.

It doesn't break portability at all.  As for simplicity, yes it adds
an extra required step, but the payoff is that it's now impossible to
subtly screw up by failing to recheck your apertures after an ATTACH.
That is, it's taking a step which was implicitly required and
replacing it with one that's explicitly required.

> > > >   * A CREATE_WINDOW operation is added
> > > >       * This takes a size, pagesize/granularity, optional base address
> > > >         and optional additional attributes 
> > > >       * If any of the specified attributes are incompatible with the
> > > >         underlying hardware, the operation fails
> > > 
> > > iommu layer has nothing called a window. The closest thing is a
> > > domain.
> > 
> > Maybe "window" is a bad name.  You called it "aperture" above (and
> > I've shifted to that naming) and implied it *was* part of the IOMMU
> > domain. 
> 
> It is but not as an object that can be mutated - it is just a
> property.
> 
> You are talking about a window object that exists somehow, I don't
> know how this fits beyond some creation attribute of the domain..

At the domain layer, I think we can manage it that way, albeit
suboptimally.

> > That said, it doesn't need to be at the iommu layer - as I
> > said I'm considering this primarily a software concept and it could be
> > at the IOAS layer.  The linkage would be that every iommufd operation
> > which could change either the user-selected windows or the effective
> > hardware apertures must ensure that all the user windows (still) lie
> > within the hardware apertures.
> 
> As Kevin said, we need to start at the iommu_domain layer first -
> when we understand how that needs to look then we can imagine what the
> uAPI should be.

Hm, ok.  I'm looking in the other direction - what uAPIs are needed to
do the things I'm considering in userspace, given the constraints of
the hardware I know about.  I'm seeing what layer we match one to the
other at as secondary.

> I don't want to create something in iommufd that is wildly divergent
> from what the iommu_domain/etc can support.

Fair enough.

> > > The only thing the kernel can do is relay a notification that something
> > > happened to a PE. The kernel gets an event on the PE basis, I would
> > > like it to replicate it to all the devices and push it through the
> > > VFIO device FD.
> > 
> > Again: it's not about the notifications (I don't even really know how
> > those work in EEH).
> 
> Well, then I don't know what we are talking about - I'm interested in
> what uAPI is needed to support this, as far as I can tell that is a
> notification something bad happened and some control knobs?

As far as I remember that's basically right, but the question is the
granularity at which those control knobs operate, given that some of
the knobs can be automatically flipped in case of a hardware error.

> As I said, I'd prefer this to be on the vfio_device FD vs on iommufd
> and would like to find a way to make that work.

It probably does belong on the device FD, but we need to consider the
IOMMU, because these things are operating at the granularity of (at
least) an IOMMU group.

> > expects that to happen at the (virtual) hardware level.  So if a guest
> > trips an error on one device, it expects IO to stop for every other
> > device in the PE.  But if hardware's notion of the PE is smaller than
> > the guest's the host hardware won't accomplish that itself.
> 
> So, why do that to the guest? Shouldn't the PE in the guest be
> strictly a subset of the PE in the host, otherwise you hit all these
> problems you are talking about?

Ah.. yes, I did leave that out.  It's moderately complicated:

- The guest PE can never be a (strict) subset of the host PE.  A PE
  is an IOMMU group in Linux terms, so if a host PE were split
  amongst guest PEs we couldn't supply the isolation semantics that
  the guest kernel will expect.  So one guest PE must always consist
  of 1 or more host PEs.

- We can't choose the guest PE boundaries completely arbitrarily.  The
  PAPR interfaces tie the PE boundaries to (guest) PCI(e) topology in a
  pretty confusing way that comes from the history of IBM hardware.

- To keep things simple, qemu makes the choice that each guest PCI
  Host Bridge (vPHB) is a single (guest) PE.  (Having many independent
  PHBs is routine on POWER systems.)

- So, where the guest PEs lie depends on how the user/configuration
  assigns devices to PHBs in the guest topology.

- We enforce (as we must) that devices from a single host PE can't be
  split across guest PEs.  I think qemu has specific checks for this
  in order to give better errors, but it will also be enforced by
  VFIO: we have one container per PHB (guest PE) so we can't attach
  the same group (host PE) to two of them.

- We don't enforce that we have at *most* one host PE per PHB - and
  that's useful, because lots of scripts / management tools / whatever
  designed for x86 just put all PCI devices onto the same vPHB,
  regardless of how they're grouped on the host.

- There are some additional complications because of the possibility
  of qemu emulated devices, which don't have *any* host PE.

> > used this in anger, I think a better approach is to just straight up
> > fail any attempt to do any EEH operation if there's not a 1 to 1
> > mapping from guest PE (domain) to host PE (group).
> 
> That makes sense

Short of a time machine to murd^H^H^H^Heducate the people who wrote
the EEH interfaces, I think it's the best we have.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 156 bytes --]

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-09  6:01                           ` David Gibson
@ 2022-05-09 14:00                             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-09 14:00 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:

> > The default iommu_domain that the iommu driver creates will be used
> > here, it is up to the iommu driver to choose something reasonable for
> > use by applications like DPDK. ie PPC should probably pick its biggest
> > x86-like aperture.
> 
> So, using the big aperture means a very high base IOVA
> (1<<59)... which means that it won't work at all if you want to attach
> any devices that aren't capable of 64-bit DMA.

I'd expect to include the 32 bit window too..

> Using the maximum possible window size would mean we either
> potentially waste a lot of kernel memory on pagetables, or we use
> unnecessarily large number of levels to the pagetable.

All drivers have this issue to one degree or another. We seem to be
ignoring it - in any case this is a micro optimization, not a
functional need?

> More generally, the problem with the interface advertising limitations
> and it being up to userspace to work out if those are ok or not is
> that it's fragile.  It's pretty plausible that some future IOMMU model
> will have some new kind of limitation that can't be expressed in the
> query structure we invented now.

The basic API is very simple - the driver needs to provide ranges of
IOVA and map/unmap - I don't think we have a future problem here we
need to try and guess and solve today.

Even PPC fits this just fine, the open question for DPDK is more
around optimization, not functional.

> But if userspace requests the capabilities it wants, and the kernel
> acks or nacks that, we can support the new host IOMMU with existing
> software just fine.

No, this just makes it fragile in the other direction because now
userspace has to know what platform specific things to ask for *or it
doesn't work at all*. This is not an improvement for the DPDK cases.

Kernel decides, using all the kernel knowledge it has and tells the
application what it can do - this is the basic simplified interface.

> > The iommu-driver-specific struct is the "advanced" interface and
> > allows a user-space IOMMU driver to tightly control the HW with full
> > HW specific knowledge. This is where all the weird stuff that is not
> > general should go.
> 
> Right, but forcing anything more complicated than "give me some IOVA
> region" to go through the advanced interface means that qemu (or any
> hypervisor where the guest platform need not identically match the
> host) has to have n^2 complexity to match each guest IOMMU model to
> each host IOMMU model.

I wouldn't say n^2, but yes, qemu needs to have a userspace driver for
the platform IOMMU, and yes it needs this to reach optimal
behavior. We already know this is a hard requirement for using nesting
as acceleration, I don't see why apertures are so different.

> Errr.. how do you figure?  On ppc the ranges and pagesizes are
> definitely negotiable.  I'm not really familiar with other models, but
> anything which allows *any* variations in the pagetable structure will
> effectively have at least some negotiable properties.

As above, if you ask for the wrong thing then you don't get
anything. If DPDK asks for something that works on ARM like 0 -> 4G
then PPC and x86 will always fail. How is this improving anything to
require applications to carefully ask for exactly the right platform
specific ranges?

It isn't like there is some hard coded value we can put into DPDK that
will work on every platform. So kernel must pick for DPDK, IMHO. I
don't see any feasible alternative.

> Which is why I'm suggesting that the base address be an optional
> request.  DPDK *will* care about the size of the range, so it just
> requests that and gets told a base address.

We've talked about a size of IOVA address space before, strictly as a
hint, to possibly optimize page table layout, or something, and I'm
fine with that idea. But - we have no driver implementation today, so
I'm not sure what we can really do with this right now..

Kevin, could Intel consume a hint on IOVA space and optimize the number
of IO page table levels?

> > and IMHO, qemu
> > is fine to have a PPC specific userspace driver to tweak this PPC
> > unique thing if the default windows are not acceptable.
> 
> Not really, because it's the ppc *target* (guest) side which requires
> the specific properties, but selecting the "advanced" interface
> requires special knowledge on the *host* side.

The ppc specific driver would be on the generic side of qemu in its
viommu support framework. There is lots of host driver optimization
possible here with knowledge of the underlying host iommu HW. It
should not be connected to the qemu target.

It is not so different from today where qemu has to know about ppc's
special vfio interface generically even to emulate x86.

> > IMHO it is no different from imagining an Intel specific userspace
> > driver that is using userspace IO pagetables to optimize
> > cross-platform qemu vIOMMU emulation.
> 
> I'm not quite sure what you have in mind here.  How is it both Intel
> specific and cross-platform?

It is part of the generic qemu iommu interface layer. For nesting qemu
would copy the guest page table format to the host page table format
in userspace and trigger invalidation - no pinning, no kernel
map/unmap calls. It can only be done with detailed knowledge of the
host iommu since the host iommu io page table format is exposed
directly to userspace.

> Note however, that having multiple apertures isn't really ppc specific.
> Anything with an IO hole effectively has separate apertures above and
> below the hole.  They're much closer together in address than POWER's
> two apertures, but I don't see that makes any fundamental difference
> to the handling of them.

In the iommu core it handled the io holes and things through the group
reserved IOVA list - there isn't actually a limit in the iommu_domain,
it has a flat pagetable format - and in cases like PASID/SVA the group
reserved list doesn't apply at all.

> Another approach would be to give the required apertures / pagesizes
> in the initial creation of the domain/IOAS.  In that case they would
> be static for the IOAS, as well as the underlying iommu_domains: any
> ATTACH which would be incompatible would fail.

This is the device-specific iommu_domain creation path. The domain can
have information defining its aperture.

> That makes life hard for the DPDK case, though.  Obviously we can
> still make the base address optional, but for it to be static the
> kernel would have to pick it immediately, before we know what devices
> will be attached, which will be a problem on any system where there
> are multiple IOMMUs with different constraints.

Which is why the current scheme is fully automatic and we rely on the
iommu driver to automatically select something sane for DPDK/etc
today.

> > In general I have no issue with limiting the IOVA allocator in the
> > kernel, I just don't have a use case of an application that could use
> > the IOVA allocator (like DPDK) and also needs a limitation..
> 
> Well, I imagine DPDK has at least the minimal limitation that it needs
> the aperture to be a certain minimum size (I'm guessing at least the
> size of its pinned hugepage working memory region).  That's a
> limitation that's unlikely to fail on modern hardware, but it's there.

Yes, DPDK does assume there is some fairly large available aperture,
that should be the driver default behavior, IMHO.

> > That breaks what I'm
> > trying to do to make DPDK/etc portable and dead simple.
> 
> It doesn't break portability at all.  As for simplicity, yes it adds
> an extra required step, but the payoff is that it's now impossible to
> subtly screw up by failing to recheck your apertures after an ATTACH.
> That is, it's taking a step which was implicitly required and
> replacing it with one that's explicitly required.

Again, as above, it breaks portability because apps have no hope to
know what window range to ask for to succeed. It cannot just be a hard
coded range.

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-09 14:00                             ` Jason Gunthorpe via iommu
@ 2022-05-10  7:12                               ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-10  7:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	iommu, Daniel Jordan, Kevin Tian, Alex Williamson, Joao Martins


[-- Attachment #1.1: Type: text/plain, Size: 17409 bytes --]

On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> 
> > > The default iommu_domain that the iommu driver creates will be used
> > > here, it is up to the iommu driver to choose something reasonable for
> > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > x86-like aperture.
> > 
> > So, using the big aperture means a very high base IOVA
> > (1<<59)... which means that it won't work at all if you want to attach
> > any devices that aren't capable of 64-bit DMA.
> 
> I'd expect to include the 32 bit window too..

I'm not entirely sure what you mean.  Are you working on the
assumption that we've extended to allowing multiple apertures, so we'd
default to advertising both a small/low aperture and a large/high
aperture?

> > Using the maximum possible window size would mean we either
> > potentially waste a lot of kernel memory on pagetables, or we use
> > unnecessarily large number of levels to the pagetable.
> 
> All drivers have this issue to one degree or another. We seem to be
> ignoring it - in any case this is a micro optimization, not a
> functional need?

Ok, fair point.

> > More generally, the problem with the interface advertising limitations
> > and it being up to userspace to work out if those are ok or not is
> > that it's fragile.  It's pretty plausible that some future IOMMU model
> > will have some new kind of limitation that can't be expressed in the
> > query structure we invented now.
> 
> The basic API is very simple - the driver needs to provide ranges of
> IOVA and map/unmap - I don't think we have a future problem here we
> need to try and guess and solve today.

Well.. maybe.  My experience of encountering hardware doing weird-arse
stuff makes me less sanguine.

> Even PPC fits this just fine, the open question for DPDK is more
> around optimization, not functional.
> 
> > But if userspace requests the capabilities it wants, and the kernel
> > acks or nacks that, we can support the new host IOMMU with existing
> > software just fine.
> 
> No, this just makes it fragile in the other direction because now
> userspace has to know what platform specific things to ask for *or it
> doesn't work at all*. This is not an improvement for the DPDK cases.

Um.. no.  The idea is that userspace requests *what it needs*, not
anything platform specific.  In the case of DPDK that would be nothing
more than the (minimum) aperture size.  Nothing platform specific
about that.

> Kernel decides, using all the kernel knowledge it has and tells the
> application what it can do - this is the basic simplified interface.
> 
> > > The iommu-driver-specific struct is the "advanced" interface and
> > > allows a user-space IOMMU driver to tightly control the HW with full
> > > HW specific knowledge. This is where all the weird stuff that is not
> > > general should go.
> > 
> > Right, but forcing anything more complicated than "give me some IOVA
> > region" to go through the advanced interface means that qemu (or any
> > hypervisor where the guest platform need not identically match the
> > host) has to have n^2 complexity to match each guest IOMMU model to
> > each host IOMMU model.
> 
> I wouldn't say n^2, but yes, qemu needs to have a userspace driver for
> the platform IOMMU, and yes it needs this to reach optimal
> behavior. We already know this is a hard requirement for using nesting
> as acceleration, I don't see why apertures are so different.

For one thing, because we only care about optimal behaviour on the
host ~= guest KVM case.  That means it's not n^2, just (roughly) one
host driver for each matching guest driver.  I'm considering the
general X on Y case - we don't need to optimize it, but it would be
nice for it to work without considering every combination separately.

> > Errr.. how do you figure?  On ppc the ranges and pagesizes are
> > definitely negotiable.  I'm not really familiar with other models, but
> > anything which allows *any* variations in the pagetable structure will
> > effectively have at least some negotiable properties.
> 
> As above, if you ask for the wrong thing then you don't get
> anything. If DPDK asks for something that works on ARM like 0 -> 4G
> then PPC and x86 will always fail. How is this improving anything to
> require applications to carefully ask for exactly the right platform
> specific ranges?

Hm, looks like I didn't sufficiently emphasize that the base address
would be optional for userspace to supply.  So userspace would request
a range *size* only, unless it needs a specific IOVA base address.  It
only requests the latter if it actually needs it - so failing in that
case is correct.  (Qemu, with or without a vIOMMU, is the obvious case
for that, though I could also imagine it for a specialized driver for
some broken device which has weird limitations on what IOVA addresses
it can generate on the bus).

> It isn't like there is some hard coded value we can put into DPDK that
> will work on every platform. So kernel must pick for DPDK, IMHO. I
> don't see any feasible alternative.

Yes, hence *optionally specified* base address only.

> > Which is why I'm suggesting that the base address be an optional
> > request.  DPDK *will* care about the size of the range, so it just
> > requests that and gets told a base address.
> 
> We've talked about a size of IOVA address space before, strictly as a
> hint, to possibly optimize page table layout, or something, and I'm
> fine with that idea. But - we have no driver implementation today, so
> I'm not sure what we can really do with this right now..

You can check that the hardware aperture is at least as large as the
requested range.  Even on the pretty general x86 IOMMU, if the program
wants 2^62 bytes of aperture, I'm pretty sure you won't be able to
supply it.

> Kevin, could Intel consume a hint on IOVA space and optimize the number
> of IO page table levels?
> 
> > > and IMHO, qemu
> > > is fine to have a PPC specific userspace driver to tweak this PPC
> > > unique thing if the default windows are not acceptable.
> > 
> > Not really, because it's the ppc *target* (guest) side which requires
> > the specific properties, but selecting the "advanced" interface
> > requires special knowledge on the *host* side.
> 
> The ppc specific driver would be on the generic side of qemu in its
> viommu support framework. There is lots of host driver optimization
> possible here with knowledge of the underlying host iommu HW. It
> should not be connected to the qemu target.

Thinking through this...

So, I guess we could have basically the same logic I'm suggesting be
in the qemu backend iommu driver instead.  So the target side (machine
type, strictly speaking) would request of the host side the apertures
it needs, and the host side driver would see if it can do that, based
on both specific knowledge of that driver and the query responses.

ppc on x86 should work with that.. at least if the x86 aperture is
large enough to reach up to ppc's high window.  I guess we'd have the
option here of using either the generic host driver or the
x86-specific driver.  The latter would mean qemu maintaining an
x86-format shadow of the io pagetables; mildly tedious, but doable.

x86-without-viommu on ppc could work in at least some cases if the ppc
host driver requests a low window large enough to cover guest memory.
Memory hotplug we can handle by creating a new IOAS using the ppc
specific interface with a new window covering the hotplugged region.

x86-with-viommu on ppc probably can't be done, since I don't think the
ppc low window can be extended far enough to allow for the guest's
expected IOVA range.. but there's probably no way around that.  Unless
there's some way of configuring / advertising a "crippled" version of
the x86 IOMMU with a more limited IOVA range.

Device hotplug is the remaining curly case.  We're set up happily,
then we hotplug a device.  The backend aperture shrinks, the host-side
qemu driver notices this and notices it no longer covers the ranges
that the target-side expects.  So... is there any way of backing out
of this gracefully?  We could detach the device, but in the meantime
ongoing DMA maps from previous devices might have failed.  We could
pre-attach the new device to a new IOAS and check the apertures there
- but when we move it to the final IOAS is it guaranteed that the
apertures will be (at least) the intersection of the old and new
apertures, or is that just the probable outcome?  Or I guess we could
add a pre-attach-query operation of some sort in the kernel to check
what the effect will be before doing the attach for real.

Ok.. you convinced me.  As long as we have some way to handle the
device hotplug case, we can work with this.

> It is not so different from today where qemu has to know about ppc's
> special vfio interface generically even to emulate x86.

Well, yes, and that's a horrible aspect of the current vfio interface.
It arose because of (partly) *my* mistake, for not realizing at the
time that we could reasonably extend the type1 interface to work for
ppc as well.  I'm hoping iommufd doesn't repeat my mistake.

> > > IMHO it is no different from imagining an Intel specific userspace
> > > driver that is using userspace IO pagetables to optimize
> > > cross-platform qemu vIOMMU emulation.
> > 
> > I'm not quite sure what you have in mind here.  How is it both Intel
> > specific and cross-platform?
> 
> It is part of the generic qemu iommu interface layer. For nesting qemu
> would copy the guest page table format to the host page table format
> in userspace and trigger invalidation - no pinning, no kernel
> map/unmap calls. It can only be done with detailed knowledge of the
> host iommu since the host iommu io page table format is exposed
> directly to userspace.

Ok, I see.  That can certainly be done.  I was really hoping we could
have a working, though non-optimized, implementation using just the
generic interface.

> > Note however, that having multiple apertures isn't really ppc specific.
> > Anything with an IO hole effectively has separate apertures above and
> > below the hole.  They're much closer together in address than POWER's
> > two apertures, but I don't see that makes any fundamental difference
> > to the handling of them.
> 
> In the iommu core it handled the io holes and things through the group
> reserved IOVA list - there isn't actually a limit in the iommu_domain,
> it has a flat pagetable format - and in cases like PASID/SVA the group
> reserved list doesn't apply at all.

Sure, but how it's implemented internally doesn't change the user
visible fact: some IOVAs can't be used, some can.  Userspace needs to
know which is which in order to operate correctly, and the only
reasonable way for it to get them is to be told by the kernel.  We
should advertise that in a common way, not have different ways for
"holes" versus "windows".  We can choose either one; I think "windows"
rather than "holes" makes more sense, but it doesn't really matter.
Whichever one we choose, we need more than one of them for both ppc
and x86:

    - x86 has a "window" from 0 to the bottom IO hole, and a window
      from the top of the IO hole to the maximum address describable
      in the IO page table.
    - x86 has a hole for the IO hole (duh), and another hole from the
      maximum IO pagetable address to 2^64-1 (you could call it the
      "out of bounds hole", I guess)

    - ppc has a "window" from 0 to a configurable maximum, and another
      "window" from 2^59 to a configurable maximum
    - ppc has a hole between the two windows, and another from the end
      of the high window to 2^64-1

So either representation, either arch, it's 2 windows, 2 holes.  There
may be other cases that only need 1 of each (SVA, ancient ppc without
the high window, probably others).  Point is there are common cases
that require more than 1.
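
To make that concrete, a "windows" style report could look something like
the sketch below.  The struct and the example values are purely
illustrative - this is not the layout from the series, just the shape of
the information userspace needs:

#include <linux/types.h>

/* Illustrative only: one usable IOVA window, inclusive on both ends */
struct iova_window {
        __aligned_u64 start;
        __aligned_u64 last;
};

/*
 * A ppc-like layout: a low window plus the high window at 1<<59.  The
 * sizes are made-up placeholders; on real hardware both are configurable.
 */
static const struct iova_window ppc_like_windows[2] = {
        { .start = 0,          .last = 0x7fffffffULL },
        { .start = 1ULL << 59, .last = (1ULL << 59) + 0x7fffffffULL },
};

/*
 * An x86-like layout is also two windows: [0, start of IO hole) and
 * [end of IO hole, max IO pagetable address].
 */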

> > Another approach would be to give the required apertures / pagesizes
> > in the initial creation of the domain/IOAS.  In that case they would
> > be static for the IOAS, as well as the underlying iommu_domains: any
> > ATTACH which would be incompatible would fail.
> 
> This is the device-specific iommu_domain creation path. The domain can
> have information defining its aperture.

But that also requires managing the pagetables yourself; I think tying
these two concepts together is inflexible.

> > That makes life hard for the DPDK case, though.  Obviously we can
> > still make the base address optional, but for it to be static the
> > kernel would have to pick it immediately, before we know what devices
> > will be attached, which will be a problem on any system where there
> > are multiple IOMMUs with different constraints.
> 
> Which is why the current scheme is fully automatic and we rely on the
> iommu driver to automatically select something sane for DPDK/etc
> today.

But the cost is that the allowable addresses can change implicitly
with every ATTACH.

> > > In general I have no issue with limiting the IOVA allocator in the
> > > kernel, I just don't have a use case of an application that could use
> > > the IOVA allocator (like DPDK) and also needs a limitation..
> > 
> > Well, I imagine DPDK has at least the minimal limitation that it needs
> > the aperture to be a certain minimum size (I'm guessing at least the
> > size of its pinned hugepage working memory region).  That's a
> > limitation that's unlikely to fail on modern hardware, but it's there.
> 
> Yes, DPDK does assume there is some fairly large available aperture,
> that should be the driver default behavior, IMHO.

Well, sure, but "fairly large" tends to change meaning over time.  The
idea is to ensure that the app's idea of "fairly large" matches the
kernel's idea of "fairly large".

> > > That breaks what I'm
> > > trying to do to make DPDK/etc portable and dead simple.
> > 
> > It doesn't break portability at all.  As for simplicity, yes it adds
> > an extra required step, but the payoff is that it's now impossible to
> > subtly screw up by failing to recheck your apertures after an ATTACH.
> > That is, it's taking a step which was implicitly required and
> > replacing it with one that's explicitly required.
> 
> Again, as above, it breaks portability because apps have no hope to
> know what window range to ask for to succeed. It cannot just be a hard
> coded range.

I see the problem if you have an app where there's a difference
between the smallest window it can cope with versus the largest window
it can take advantage of.  Not sure if that's likely in practice.
AFAIK, DPDK will always require its hugepage memory pool mapped, can't
deal with less, can't benefit from more.  But maybe there's some use
case for this I haven't thought of.


Ok... here's a revised version of my proposal which I think addresses
your concerns and simplifies things.

- No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
  will probably need matching changes)

- By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
  is chosen by the kernel within the aperture(s).  This is closer to
  how mmap() operates, and DPDK and similar shouldn't care about
  having specific IOVAs, even at the individual mapping level.

- IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
  for when you really do want to control the IOVA (qemu, maybe some
  special userspace driver cases)

- ATTACH will fail if the new device would shrink the aperture to
  exclude any already established mappings (I assume this is already
  the case)

- IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
  PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
  MAPs won't use it, but doesn't actually put anything into the IO
  pagetables.
    - Like a regular mapping, ATTACHes that are incompatible with an
      IOMAP_RESERVEed region will fail
    - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
      mapping

So, for DPDK the sequence would be:

1. Create IOAS
2. ATTACH devices
3. IOAS_MAP some stuff
4. Do DMA with the IOVAs that IOAS_MAP returned

(Note, not even any need for QUERY in simple cases)
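
In pseudo-C that sequence looks roughly like this - every name below is a
placeholder for whatever the final ioctls end up being called, it's only
meant to show the shape of the flow:

#include <stddef.h>
#include <stdint.h>

/* Placeholder prototypes only - not the real uAPI */
int ioas_alloc(int iommufd, uint32_t *out_ioas_id);
int device_attach(int device_fd, int iommufd, uint32_t ioas_id);
int ioas_map(int iommufd, uint32_t ioas_id, void *va, size_t len,
             uint64_t *out_iova);            /* kernel picks the IOVA */

static int dpdk_style_setup(int iommufd, int device_fd,
                            void *pool, size_t pool_len, uint64_t *iova)
{
        uint32_t ioas_id;

        if (ioas_alloc(iommufd, &ioas_id))               /* 1. create IOAS */
                return -1;
        if (device_attach(device_fd, iommufd, ioas_id))  /* 2. ATTACH device */
                return -1;
        /* 3. map the hugepage pool; no QUERY, the kernel hands back an IOVA */
        if (ioas_map(iommufd, ioas_id, pool, pool_len, iova))
                return -1;
        return 0;              /* 4. DMA using *iova .. *iova + pool_len - 1 */
}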

For (unoptimized) qemu it would be:

1. Create IOAS
2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
   guest platform
3. ATTACH devices (this will fail if they're not compatible with the
   reserved IOVA regions)
4. Boot the guest

  (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
                               the reserved regions
  (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
                      reserved regions)
  (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
                                 necessary (which might fail)
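
And a sketch of the interesting part of the qemu flow, using the same kind
of placeholder names (IOMAP_FIXED / IOMAP_RESERVE are the flags proposed
above, they don't exist in the series as posted):

#include <stddef.h>
#include <stdint.h>

#define IOMAP_FIXED   (1u << 0)    /* proposed: caller controls the IOVA */
#define IOMAP_RESERVE (1u << 1)    /* proposed: reserve IOVA, no IOPTEs */

/* Placeholder prototype only */
int ioas_map_flags(int iommufd, uint32_t ioas_id, void *va, size_t len,
                   uint64_t iova, unsigned int flags);

/* Step 2: reserve a guest-visible IOVA range before any ATTACH */
static int reserve_guest_window(int fd, uint32_t ioas,
                                uint64_t win_base, size_t win_len)
{
        return ioas_map_flags(fd, ioas, NULL, win_len, win_base,
                              IOMAP_FIXED | IOMAP_RESERVE);
}

/* On a guest map: overmap part of the reserved range with real memory */
static int overmap_guest_mapping(int fd, uint32_t ioas, void *host_va,
                                 size_t len, uint64_t guest_iova)
{
        return ioas_map_flags(fd, ioas, host_va, len, guest_iova, IOMAP_FIXED);
}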

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-10  7:12                               ` David Gibson
@ 2022-05-10 19:00                                 ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-10 19:00 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > 
> > > > The default iommu_domain that the iommu driver creates will be used
> > > > here, it is up to the iommu driver to choose something reasonable for
> > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > x86-like aperture.
> > > 
> > > So, using the big aperture means a very high base IOVA
> > > (1<<59)... which means that it won't work at all if you want to attach
> > > any devices that aren't capable of 64-bit DMA.
> > 
> > I'd expect to include the 32 bit window too..
> 
> I'm not entirely sure what you mean.  Are you working on the
> assumption that we've extended to allowing multiple apertures, so we'd
> default to advertising both a small/low aperture and a large/high
> aperture?

Yes

> > No, this just makes it fragile in the other direction because now
> > userspace has to know what platform specific things to ask for *or it
> > doesn't work at all*. This is not an improvement for the DPDK cases.
> 
> Um.. no.  The idea is that userspace requests *what it needs*, not
> anything platform specific.  In the case of DPDK that would be nothing
> more than the (minimum) aperture size.  Nothing platform specific
> about that.

Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
platform can do more, but it varies how much more, etc.

There is no constant value DPDK could stuff in this request, unless it
needs a really small amount of IOVA, like 1G or something.

> > It isn't like there is some hard coded value we can put into DPDK that
> > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > don't see any feasible alternative.
> 
> Yes, hence *optionally specified* base address only.

Okay, so imagine we've already done this and DPDK is not optionally
specifying anything :)

The structs can be extended so we can add this as an input to creation
when a driver can implement it.
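
ie the usual size-based extension trick, roughly like the below - the
struct and field names here are only for illustration, not the RFC's
actual layout:

#include <linux/types.h>

struct iommu_ioas_alloc_example {
        __u32 size;          /* userspace sets sizeof(); the kernel treats a
                              * missing tail as zero and rejects unknown
                              * non-zero trailing bytes */
        __u32 flags;
        __u32 out_ioas_id;
        __u32 __reserved;
        /*
         * A later revision can append new members here, eg an optional
         * requested aperture size, without breaking old userspace:
         *      __aligned_u64 iova_size_hint;
         */
};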

> > The ppc specific driver would be on the generic side of qemu in its
> > viommu support framework. There is lots of host driver optimization
> > possible here with knowledge of the underlying host iommu HW. It
> > should not be connected to the qemu target.
> 
> Thinking through this...
> 
> So, I guess we could have basically the same logic I'm suggesting be
> in the qemu backend iommu driver instead.  So the target side (machine
> type, strictly speaking) would request of the host side the apertures
> it needs, and the host side driver would see if it can do that, based
> on both specific knowledge of that driver and the query responses.

Yes, this is what I'm thinking

> ppc on x86 should work with that.. at least if the x86 aperture is
> large enough to reach up to ppc's high window.  I guess we'd have the
> option here of using either the generic host driver or the
> x86-specific driver.  The latter would mean qemu maintaining an
> x86-format shadow of the io pagetables; mildly tedious, but doable.

The appeal of having userspace page tables is performance, so it is
tedious to shadow, but it should run faster.

> So... is there any way of backing out of this gracefully?  We could
> detach the device, but in the meantime ongoing DMA maps from
> previous devices might have failed.  

This sounds like a good use case for qemu to communicate ranges - but
as I mentioned before, Alex said qemu didn't know the ranges..

> We could pre-attach the new device to a new IOAS and check the
> apertures there - but when we move it to the final IOAS is it
> guaranteed that the apertures will be (at least) the intersection of
> the old and new apertures, or is that just the probable outcome?

Should be guaranteed

> Ok.. you convinced me.  As long as we have some way to handle the
> device hotplug case, we can work with this.

I like the communicate-ranges approach for hotplug, so long as we can
actually implement it in qemu - I'm a bit unclear on that honestly.

> Ok, I see.  That can certainly be done.  I was really hoping we could
> have a working, though non-optimized, implementation using just the
> generic interface.

Oh, sure, that should largely work too; this is just an additional
direction people may find interesting, and it helps explain why qemu
should have an iommu layer inside.

> "holes" versus "windows".  We can choose either one; I think "windows"
> rather than "holes" makes more sense, but it doesn't really matter.

Yes, I picked windows aka ranges for the uAPI - we translate the holes
from the groups into windows and intersect them with the apertures.
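
To be concrete, that is just interval arithmetic - a rough sketch of
the idea (not the kernel code; the names and types are invented for the
example, and overflow corner cases are ignored):

#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, last; };

/* Windows = one aperture [lo, hi] minus a sorted list of reserved holes */
static size_t holes_to_windows(const struct range *holes, size_t nholes,
                               uint64_t lo, uint64_t hi, struct range *out)
{
        size_t n = 0;
        uint64_t cur = lo;

        for (size_t i = 0; i != nholes && cur <= hi; i++) {
                if (holes[i].last < cur || holes[i].start > hi)
                        continue;       /* hole already behind cur, or past hi */
                if (holes[i].start > cur)
                        out[n++] = (struct range){ cur, holes[i].start - 1 };
                cur = holes[i].last + 1;        /* skip over the hole */
        }
        if (cur <= hi)
                out[n++] = (struct range){ cur, hi };
        return n;
}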

> > > Another approach would be to give the required apertures / pagesizes
> > > in the initial creation of the domain/IOAS.  In that case they would
> > > be static for the IOAS, as well as the underlying iommu_domains: any
> > > ATTACH which would be incompatible would fail.
> > 
> > This is the device-specific iommu_domain creation path. The domain can
> > have information defining its aperture.
> 
> But that also requires managing the pagetables yourself; I think tying
> these two concepts together is inflexible.

Oh, no, those need to be independent for HW nesting already

One device-specific creation path will create the kernel-owned
map/unmap iommu_domain with device-specific parameters to allow it to
be the root of a nest - ie specify the S2 on ARM.

The second device-specific creation path will create the user-owned
iommu_domain with device-specific parameters, with the first as its
parent.

So you get to do both.

> > Which is why the current scheme is fully automatic and we rely on the
> > iommu driver to automatically select something sane for DPDK/etc
> > today.
> 
> But the cost is that the allowable addresses can change implicitly
> with every ATTACH.

Yes, dpdk/etc don't care.
 
> I see the problem if you have an app where there's a difference
> between the smallest window it can cope with versus the largest window
> it can take advantage of.  Not sure if that's likely in pratice.
> AFAIK, DPDK will alway require it's hugepage memory pool mapped, can't
> deal with less, can't benefit from more.  But maybe there's some use
> case for this I haven't thought of.

Other apps I've seen don't even have a fixed memory pool; they just
malloc and can't really predict how much IOVA they need.
"Approximately the same amount as the process VA" is a reasonable
goal for the kernel to default to.

A tunable to allow efficiency from smaller allocations sounds great -
but let's have driver support before adding the uAPI for it.
Intel/AMD/ARM support for using fewer page table levels, for instance,
would be perfect.
 
> Ok... here's a revised version of my proposal which I think addresses
> your concerns and simplfies things.
> 
> - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
>   will probably need matching changes)
> 
> - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
>   is chosen by the kernel within the aperture(s).  This is closer to
>   how mmap() operates, and DPDK and similar shouldn't care about
>   having specific IOVAs, even at the individual mapping level.
>
> - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
>   for when you really do want to control the IOVA (qemu, maybe some
>   special userspace driver cases)

We already did both of these; the flag is called
IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will
select the IOVA internally.
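
From userspace that looks roughly like this (a sketch only - the struct
layout and header location are assumed from the in-progress uAPI, so
check the real iommufd.h):

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>      /* assumed header for the proposed uAPI */

static uint64_t map_buf(int iommufd, uint32_t ioas_id, void *buf,
                        uint64_t len)
{
        struct iommu_ioas_map map = {
                .size = sizeof(map),
                .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                .ioas_id = ioas_id,
                .user_va = (uintptr_t)buf,
                .length = len,
                /* .iova left at 0: the kernel picks an IOVA and writes it
                 * back here.  Adding IOMMU_IOAS_MAP_FIXED_IOVA plus a
                 * caller-chosen .iova is the qemu-style alternative. */
        };

        if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
                err(1, "IOMMU_IOAS_MAP");
        return map.iova;        /* do DMA with this address */
}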

> - ATTACH will fail if the new device would shrink the aperture to
>   exclude any already established mappings (I assume this is already
>   the case)

Yes

> - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
>   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
>   MAPs won't use it, but doesn't actually put anything into the IO
>   pagetables.
>     - Like a regular mapping, ATTACHes that are incompatible with an
>       IOMAP_RESERVEed region will fail
>     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
>       mapping

Yeah, this seems OK. I'm thinking a new API might make sense, because
you don't really want mmap-replacement semantics but rather a permanent
record of which IOVAs must always be valid.

IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
IOMMUFD_CMD_IOAS_IOVA_RANGES:

struct iommu_ioas_require_iova {
        __u32 size;
        __u32 ioas_id;
        __u32 num_iovas;
        __u32 __reserved;
        struct iommu_required_iovas {
                __aligned_u64 start;
                __aligned_u64 last;
        } required_iovas[];
};
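
ie qemu would load it once with the guest's valid IOVA windows,
something like this hand-wavy sketch (the ioctl above is only a
proposal, and iommufd/ioas_id/high_win_size are placeholders):

#include <err.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

static void require_guest_windows(int iommufd, uint32_t ioas_id,
                                  uint64_t high_win_size)
{
        size_t sz = sizeof(struct iommu_ioas_require_iova) +
                    2 * sizeof(struct iommu_required_iovas);
        struct iommu_ioas_require_iova *req = calloc(1, sz);

        if (!req)
                err(1, "calloc");
        req->size = sz;
        req->ioas_id = ioas_id;
        req->num_iovas = 2;
        /* eg a low 32 bit window plus a high window starting at 1 << 59 */
        req->required_iovas[0] =
                (struct iommu_required_iovas){ .start = 0, .last = 0xffffffff };
        req->required_iovas[1] =
                (struct iommu_required_iovas){ .start = 1ULL << 59,
                        .last = (1ULL << 59) + high_win_size - 1 };

        if (ioctl(iommufd, IOMMU_IOA_REQUIRE_IOVA, req))
                err(1, "IOMMU_IOA_REQUIRE_IOVA");
        free(req);
}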

> So, for DPDK the sequence would be:
> 
> 1. Create IOAS
> 2. ATTACH devices
> 3. IOAS_MAP some stuff
> 4. Do DMA with the IOVAs that IOAS_MAP returned
> 
> (Note, not even any need for QUERY in simple cases)

Yes, this is done already

> For (unoptimized) qemu it would be:
> 
> 1. Create IOAS
> 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
>    guest platform
> 3. ATTACH devices (this will fail if they're not compatible with the
>    reserved IOVA regions)
> 4. Boot the guest
> 
>   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
>                                the reserved regions
>   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
>                       reserved regions)
>   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
>                                  necessary (which might fail)

OK, I will take care of it

Thanks,
Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-09 14:00                             ` Jason Gunthorpe via iommu
@ 2022-05-11  2:46                               ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-05-11  2:46 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	Alex Williamson, Daniel Jordan, iommu, Martins, Joao

> From: Jason Gunthorpe
> Sent: Monday, May 9, 2022 10:01 PM
> 
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> 
> > Which is why I'm suggesting that the base address be an optional
> > request.  DPDK *will* care about the size of the range, so it just
> > requests that and gets told a base address.
> 
> We've talked about a size of IOVA address space before, strictly as a
> hint, to possible optimize page table layout, or something, and I'm
> fine with that idea. But - we have no driver implementation today, so
> I'm not sure what we can really do with this right now..
> 
> Kevin could Intel consume a hint on IOVA space and optimize the number
> of IO page table levels?
> 

It could, but it is not implemented now.

^ permalink raw reply	[flat|nested] 244+ messages in thread

* RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-10 19:00                                 ` Jason Gunthorpe via iommu
@ 2022-05-11  3:15                                   ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-05-11  3:15 UTC (permalink / raw)
  To: Jason Gunthorpe, David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, May 11, 2022 3:00 AM
> 
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplfies things.
> >
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> >
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s).  This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
> 
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> select the IOVA internally.
> 
> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
> 
> Yes
> 
> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> >     - Like a regular mapping, ATTACHes that are incompatible with an
> >       IOMAP_RESERVEed region will fail
> >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >       mapping
> 
> Yeah, this seems OK, I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
> 
> IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
> 
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

As a permanent record, do we want to enforce that, once the required
range list is set, all FIXED and non-FIXED allocations must fall within
the listed ranges?

If yes, we can take the end of the last range as the max size of the
IOVA address space in order to optimize the page table layout.

Otherwise we may need another dedicated hint for that optimization.
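
eg for Intel that would roughly mean deriving the number of second-level
page table levels from the highest required address. Illustrative sketch
only (using the 39/48/57-bit widths that correspond to 3/4/5 levels):

/* Illustrative only, not VT-d driver code */
static int levels_for_max_iova(unsigned long long max_iova)
{
        if (max_iova < (1ULL << 39))
                return 3;       /* 39-bit IOVA space */
        if (max_iova < (1ULL << 48))
                return 4;       /* 48-bit IOVA space */
        return 5;               /* 57-bit IOVA space */
}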

> 
> > So, for DPDK the sequence would be:
> >
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> >
> > (Note, not even any need for QUERY in simple cases)
> 
> Yes, this is done already
> 
> > For (unoptimized) qemu it would be:
> >
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest

I suppose the above is only the sample flow for the PPC vIOMMU. For
non-PPC vIOMMUs, regular mappings are required before booting the guest,
and reservation could be done but is not mandatory (at least not with
what the current Qemu vfio code can afford, as it simply replays the
valid ranges of the CPU address space).

> >
> >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part
> of
> >                                the reserved regions
> >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >                       reserved regions)
> >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >                                  necessary (which might fail)
> 
> OK, I will take care of it
> 
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-10 19:00                                 ` Jason Gunthorpe via iommu
@ 2022-05-11  4:40                                   ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-11  4:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > > 
> > > > > The default iommu_domain that the iommu driver creates will be used
> > > > > here, it is up to the iommu driver to choose something reasonable for
> > > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > > x86-like aperture.
> > > > 
> > > > So, using the big aperture means a very high base IOVA
> > > > (1<<59)... which means that it won't work at all if you want to attach
> > > > any devices that aren't capable of 64-bit DMA.
> > > 
> > > I'd expect to include the 32 bit window too..
> > 
> > I'm not entirely sure what you mean.  Are you working on the
> > assumption that we've extended to allowing multiple apertures, so we'd
> > default to advertising both a small/low aperture and a large/high
> > aperture?
> 
> Yes

Ok, that works assuming we can advertise multiple windows.

> > > No, this just makes it fragile in the other direction because now
> > > userspace has to know what platform specific things to ask for *or it
> > > doesn't work at all*. This is not a improvement for the DPDK cases.
> > 
> > Um.. no.  The idea is that userspace requests *what it needs*, not
> > anything platform specific.  In the case of DPDK that would be nothing
> > more than the (minimum) aperture size.  Nothing platform specific
> > about that.
> 
> Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
> platform can do more, but it varies how much more, etc.
> 
> There is no constant value DPDK could stuff in this request, unless it
> needs a really small amount of IOVA, like 1G or something.

Well, my assumption was that DPDK always wanted an IOVA window to
cover its hugepage buffer space.  So not "constant" exactly, but a
value it will know at start up time.  But I think we cover that more
closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > > don't see any feasible alternative.
> > 
> > Yes, hence *optionally specified* base address only.
> 
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
> 
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
> 
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework. There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW. It
> > > should not be connected to the qemu target.
> > 
> > Thinking through this...
> > 
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead.  So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query reponses.
> 
> Yes, this is what I'm thinking
> 
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window.  I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver.  The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
> 
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated
guest, though.

> > So... is there any way of backing out of this gracefully.  We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.  
> 
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context.  Note
that there are at least two very different users of the host IOMMU
backends in qemu: one is the emulation of guest DMA (with or without
a vIOMMU); in that case the details of the guest platform should let
qemu know the ranges.  There's also a VFIO based NVMe backend; that
one is much more like a "normal" userspace driver, where it doesn't
care about the address ranges (because they're not guest visible).

> > We could pre-attach the new device to a new IOAS and check the
> > apertures there - but when we move it to the final IOAS is it
> > guaranteed that the apertures will be (at least) the intersection of
> > the old and new apertures, or is that just the probable outcome. 
> 
> Should be guarenteed

Ok; that would need to be documented.

> > Ok.. you convinced me.  As long as we have some way to handle the
> > device hotplug case, we can work with this.
> 
> I like the communicate ranges for hotplug, so long as we can actually
> implement it in qemu - I'm a bit unclear on that honestly.
> 
> > Ok, I see.  That can certainly be done.  I was really hoping we could
> > have a working, though non-optimized, implementation using just the
> > generic interface.
> 
> Oh, sure that should largely work as well too, this is just an
> additional direction people may find interesting and helps explain why
> qemu should have an iommu layer inside.

> > "holes" versus "windows".  We can choose either one; I think "windows"
> > rather than "holes" makes more sense, but it doesn't really matter.
> 
> Yes, I picked windows aka ranges for the uAPI - we translate the holes
> from the groups into windows and intersect them with the apertures.

Ok.

> > > > Another approach would be to give the required apertures / pagesizes
> > > > in the initial creation of the domain/IOAS.  In that case they would
> > > > be static for the IOAS, as well as the underlying iommu_domains: any
> > > > ATTACH which would be incompatible would fail.
> > > 
> > > This is the device-specific iommu_domain creation path. The domain can
> > > have information defining its aperture.
> > 
> > But that also requires managing the pagetables yourself; I think tying
> > these two concepts together is inflexible.
> 
> Oh, no, those need to be independent for HW nesting already
> 
> One device-specific creation path will create the kernel owned
> map/unmap iommu_domain with device-specific parameters to allow it to
> be the root of a nest - ie specify S2 on ARM.
> 
> The second device-specific creation path will create the user owned
> iommu_domain with device-specific parameters, with the first as a
> parent.
> 
> So you get to do both.

Ah! Good to know.

> > > Which is why the current scheme is fully automatic and we rely on the
> > > iommu driver to automatically select something sane for DPDK/etc
> > > today.
> > 
> > But the cost is that the allowable addresses can change implicitly
> > with every ATTACH.
> 
> Yes, dpdk/etc don't care.

Well... as long as nothing they've already mapped goes away.

> > I see the problem if you have an app where there's a difference
> > between the smallest window it can cope with versus the largest window
> > it can take advantage of.  Not sure if that's likely in pratice.
> > AFAIK, DPDK will alway require it's hugepage memory pool mapped, can't
> > deal with less, can't benefit from more.  But maybe there's some use
> > case for this I haven't thought of.
> 
> Other apps I've seen don't even have a fixed memory pool, they just
> malloc and can't really predict how much IOVA they
> need. "approximately the same amount as a process VA" is a reasonable
> goal for the kernel to default too.

Hm, ok, I guess that makes sense.

> A tunable to allow efficiency from smaller allocations sounds great -
> but let's have driver support before adding the uAPI for
> it. Intel/AMD/ARM support to have fewer page table levels for instance
> would be perfect.
>  
> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplfies things.
> > 
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> > 
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s).  This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
> 
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> select the IOVA internally.

Ok, great.

> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
> 
> Yes

Good to know.

> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> >     - Like a regular mapping, ATTACHes that are incompatible with an
> >       IOMAP_RESERVEed region will fail
> >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >       mapping
> 
> Yeah, this seems OK, I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
> 
> IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
> 
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

Sounds reasonable.

> > So, for DPDK the sequence would be:
> > 
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > 
> > (Note, not even any need for QUERY in simple cases)
> 
> Yes, this is done already
> 
> > For (unoptimized) qemu it would be:
> > 
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest
> > 
> >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> >                                the reserved regions
> >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >                       reserved regions)
> >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >                                  necessary (which might fail)
> 
> OK, I will take care of it

Hooray!  Long contentious thread eventually reaches productive
resolution :).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-05-11 12:54     ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-05-11 12:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu


On 2022/3/19 01:27, Jason Gunthorpe wrote:

> +
> +/**
> + * iommufd_device_attach - Connect a device to an iommu_domain
> + * @idev: device to attach
> + * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
> + *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
> + * @flags: Optional flags
> + *
> + * This connects the device to an iommu_domain, either automatically or manually
> + * selected. Once this completes the device could do DMA.
> + *
> + * The caller should return the resulting pt_id back to userspace.
> + * This function is undone by calling iommufd_device_detach().
> + */
> +int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
> +			  unsigned int flags)
> +{
> +	struct iommufd_hw_pagetable *hwpt;
> +	int rc;
> +
> +	refcount_inc(&idev->obj.users);
> +
> +	hwpt = iommufd_hw_pagetable_from_id(idev->ictx, *pt_id, idev->dev);
> +	if (IS_ERR(hwpt)) {
> +		rc = PTR_ERR(hwpt);
> +		goto out_users;
> +	}
> +
> +	mutex_lock(&hwpt->devices_lock);
> +	/* FIXME: Use a device-centric iommu api. For now check if the
> +	 * hw_pagetable already has a device of the same group joined to tell if
> +	 * we are the first and need to attach the group. */
> +	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> +		phys_addr_t sw_msi_start = 0;
> +
> +		rc = iommu_attach_group(hwpt->domain, idev->group);
> +		if (rc)
> +			goto out_unlock;
> +
> +		/*
> +		 * hwpt is now the exclusive owner of the group so this is the
> +		 * first time enforce is called for this group.
> +		 */
> +		rc = iopt_table_enforce_group_resv_regions(
> +			&hwpt->ioas->iopt, idev->group, &sw_msi_start);
> +		if (rc)
> +			goto out_detach;
> +		rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
> +		if (rc)
> +			goto out_iova;
> +	}
> +
> +	idev->hwpt = hwpt;

Could the below list_empty() check be moved into the above "if" branch?
If the above "if" branch is false, that means there is already a group
attached to hwpt->domain, so hwpt->devices should be non-empty.  Only if
the above "if" branch is true can hwpt->devices possibly be empty.  So
moving it into the above "if" branch may be better?  (A rough sketch of
what I mean follows the quoted code below.)

> +	if (list_empty(&hwpt->devices)) {
> +		rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
> +		if (rc)
> +			goto out_iova;
> +	}
> +	list_add(&idev->devices_item, &hwpt->devices);
> +	mutex_unlock(&hwpt->devices_lock);
> +
> +	*pt_id = idev->hwpt->obj.id;
> +	return 0;
> +
> +out_iova:
> +	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
> +out_detach:
> +	iommu_detach_group(hwpt->domain, idev->group);
> +out_unlock:
> +	mutex_unlock(&hwpt->devices_lock);
> +	iommufd_hw_pagetable_put(idev->ictx, hwpt);
> +out_users:
> +	refcount_dec(&idev->obj.users);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(iommufd_device_attach);
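
Roughly the shape I am thinking of, rearranged from the quoted RFC code
above (untested, just to illustrate):

	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
		phys_addr_t sw_msi_start = 0;

		rc = iommu_attach_group(hwpt->domain, idev->group);
		if (rc)
			goto out_unlock;

		rc = iopt_table_enforce_group_resv_regions(
			&hwpt->ioas->iopt, idev->group, &sw_msi_start);
		if (rc)
			goto out_detach;
		rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
		if (rc)
			goto out_iova;

		/* only this first-group path can see an empty hwpt, so add
		 * the domain to the iopt here rather than unconditionally */
		if (list_empty(&hwpt->devices)) {
			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
						   hwpt->domain);
			if (rc)
				goto out_iova;
		}
	}

	idev->hwpt = hwpt;
	list_add(&idev->devices_item, &hwpt->devices);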

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-11  3:15                                   ` Tian, Kevin
@ 2022-05-11 16:32                                     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-11 16:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: David Gibson, Alex Williamson, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 11, 2022 3:00 AM
> > 
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplfies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> > 
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > select the IOVA internally.
> > 
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> > 
> > Yes
> > 
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > >       IOMAP_RESERVEed region will fail
> > >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >       mapping
> > 
> > Yeah, this seems OK, I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> > 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > 
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
> 
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I would just use this as a guarantee that going forward any
get_ranges will always return ranges that cover the listed required
ranges, i.e. any narrowing of the ranges will be refused.

map/unmap should be restricted only to the get_ranges output.

I wouldn't burn CPU cycles nannying userspace here.
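
For illustration, userspace consumption of such a command could look
like the sketch below. Everything here is hypothetical: the
IOMMU_IOAS_REQUIRE_IOVA request code does not exist in this series, and
the struct is just the layout proposed above, assumed to have been added
to the uapi header:

	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/types.h>

	/* Hypothetical helper; IOMMU_IOAS_REQUIRE_IOVA is made up here */
	static int ioas_require_iova(int iommufd, __u32 ioas_id,
				     __u64 start, __u64 last)
	{
		struct iommu_ioas_require_iova *cmd;
		size_t len = sizeof(*cmd) + sizeof(cmd->required_iovas[0]);
		int rc;

		cmd = calloc(1, len);
		if (!cmd)
			return -1;
		cmd->size = len;
		cmd->ioas_id = ioas_id;
		cmd->num_iovas = 1;
		cmd->required_iovas[0].start = start;
		cmd->required_iovas[0].last = last;

		/* Fails if the current ranges cannot cover [start, last] */
		rc = ioctl(iommufd, IOMMU_IOAS_REQUIRE_IOVA, cmd);
		free(cmd);
		return rc;
	}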

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.

I think this API should not interact with the driver. Its only job is
to prevent devices that would narrow the ranges from attaching.

If we also use it to adjust the aperture of the created iommu_domain
then it loses its usefulness as a guard, since something like qemu
would have to leave room for hotplug as well.

I suppose optimizing the created iommu_domains should be some other
API, with a different set of ranges and the '# of bytes of IOVA' hint
as well.
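
Purely as a strawman for that separate API (nothing like this exists or
has been proposed in code), it might carry no more than a couple of
sizing hints per IOAS:

	/* Strawman only - not a real or proposed uAPI structure */
	struct iommu_ioas_iova_hint {
		__u32 size;
		__u32 ioas_id;
		__aligned_u64 max_iova;   /* highest IOVA the user will map */
		__aligned_u64 iova_bytes; /* expected total bytes of IOVA */
	};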

> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> > the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
> 
> I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).

I think qemu can always do it, it feels like it would simplify error
cases around aperture mismatches.

Jason


* RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-11 16:32                                     ` Jason Gunthorpe via iommu
@ 2022-05-11 23:23                                       ` Tian, Kevin
  -1 siblings, 0 replies; 244+ messages in thread
From: Tian, Kevin @ 2022-05-11 23:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Gibson, Alex Williamson, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 12, 2022 12:32 AM
> 
> On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, May 11, 2022 3:00 AM
> > >
> > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > > Ok... here's a revised version of my proposal which I think addresses
> > > > your concerns and simplfies things.
> > > >
> > > > - No new operations, but IOAS_MAP gets some new flags (and
> IOAS_COPY
> > > >   will probably need matching changes)
> > > >
> > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > > >   is chosen by the kernel within the aperture(s).  This is closer to
> > > >   how mmap() operates, and DPDK and similar shouldn't care about
> > > >   having specific IOVAs, even at the individual mapping level.
> > > >
> > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s
> MAP_FIXED,
> > > >   for when you really do want to control the IOVA (qemu, maybe some
> > > >   special userspace driver cases)
> > >
> > > We already did both of these, the flag is called
> > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > > select the IOVA internally.
> > >
> > > > - ATTACH will fail if the new device would shrink the aperture to
> > > >   exclude any already established mappings (I assume this is already
> > > >   the case)
> > >
> > > Yes
> > >
> > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-
> FIXED)
> > > >   MAPs won't use it, but doesn't actually put anything into the IO
> > > >   pagetables.
> > > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > > >       IOMAP_RESERVEed region will fail
> > > >     - An IOMAP_RESERVEed area can be overmapped with an
> IOMAP_FIXED
> > > >       mapping
> > >
> > > Yeah, this seems OK, I'm thinking a new API might make sense because
> > > you don't really want mmap replacement semantics but a permanent
> > > record of what IOVA must always be valid.
> > >
> > > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > >
> > > struct iommu_ioas_require_iova {
> > >         __u32 size;
> > >         __u32 ioas_id;
> > >         __u32 num_iovas;
> > >         __u32 __reserved;
> > >         struct iommu_required_iovas {
> > >                 __aligned_u64 start;
> > >                 __aligned_u64 last;
> > >         } required_iovas[];
> > > };
> >
> > As a permanent record do we want to enforce that once the required
> > range list is set all FIXED and non-FIXED allocations must be within the
> > list of ranges?
> 
> No, I would just use this as a guarntee that going forward any
> get_ranges will always return ranges that cover the listed required
> ranges. Ie any narrowing of the ranges will be refused.
> 
> map/unmap should only be restricted to the get_ranges output.
> 
> Wouldn't burn CPU cycles to nanny userspace here.

fair enough.

> 
> > If yes we can take the end of the last range as the max size of the iova
> > address space to optimize the page table layout.
> 
> I think this API should not interact with the driver. Its only job is
> to prevent devices from attaching that would narrow the ranges.
> 
> If we also use it to adjust the aperture of the created iommu_domain
> then it looses its usefullness as guard since something like qemu
> would have to leave room for hotplug as well.
> 
> I suppose optimizing the created iommu_domains should be some other
> API, with a different set of ranges and the '# of bytes of IOVA' hint
> as well.

make sense.

> 
> > > > For (unoptimized) qemu it would be:
> > > >
> > > > 1. Create IOAS
> > > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> > > the
> > > >    guest platform
> > > > 3. ATTACH devices (this will fail if they're not compatible with the
> > > >    reserved IOVA regions)
> > > > 4. Boot the guest
> >
> > I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> > vIOMMUs regular mappings are required before booting the guest and
> > reservation might be done but not mandatory (at least not what current
> > Qemu vfio can afford as it simply replays valid ranges in the CPU address
> > space).
> 
> I think qemu can always do it, it feels like it would simplify error
> cases around aperture mismatches.
> 

It could, but that requires more changes in Qemu to define the required
ranges in the platform logic and then convey them from the Qemu address
space to VFIO. I view it as an optimization, hence not necessarily
something to be done immediately.

Thanks
Kevin


* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-11  3:15                                   ` Tian, Kevin
@ 2022-05-13  4:35                                     ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-13  4:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 11, 2022 3:00 AM
> > 
> > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > Ok... here's a revised version of my proposal which I think addresses
> > > your concerns and simplfies things.
> > >
> > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > >   will probably need matching changes)
> > >
> > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > >   is chosen by the kernel within the aperture(s).  This is closer to
> > >   how mmap() operates, and DPDK and similar shouldn't care about
> > >   having specific IOVAs, even at the individual mapping level.
> > >
> > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED,
> > >   for when you really do want to control the IOVA (qemu, maybe some
> > >   special userspace driver cases)
> > 
> > We already did both of these, the flag is called
> > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > select the IOVA internally.
> > 
> > > - ATTACH will fail if the new device would shrink the aperture to
> > >   exclude any already established mappings (I assume this is already
> > >   the case)
> > 
> > Yes
> > 
> > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-FIXED)
> > >   MAPs won't use it, but doesn't actually put anything into the IO
> > >   pagetables.
> > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > >       IOMAP_RESERVEed region will fail
> > >     - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > >       mapping
> > 
> > Yeah, this seems OK, I'm thinking a new API might make sense because
> > you don't really want mmap replacement semantics but a permanent
> > record of what IOVA must always be valid.
> > 
> > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > 
> > struct iommu_ioas_require_iova {
> >         __u32 size;
> >         __u32 ioas_id;
> >         __u32 num_iovas;
> >         __u32 __reserved;
> >         struct iommu_required_iovas {
> >                 __aligned_u64 start;
> >                 __aligned_u64 last;
> >         } required_iovas[];
> > };
> 
> As a permanent record do we want to enforce that once the required
> range list is set all FIXED and non-FIXED allocations must be within the
> list of ranges?

No, I don't think so.  In fact the way I was envisaging this,
non-FIXED mappings will *never* go into the reserved ranges.  This is
for the benefit of use cases that need both mappings where they don't
care about the IOVA and mappings where they do.

Essentially, reserving a region here is saying to the kernel "I want
to manage this IOVA space; make sure nothing else touches it".  That
means both that the kernel must disallow any hw associated changes
(like ATTACH) which would impinge on the reserved region, and also any
IOVA allocations that would take parts away from that space.

Whether we want to restrict FIXED mappings to the reserved regions is
an interesting question.  I wasn't thinking that would be necessary
(just as you can use mmap() MAP_FIXED anywhere).  However.. much as
MAP_FIXED is very dangerous to use if you don't previously reserve
address space, I think IOMAP_FIXED is dangerous if you haven't
previously reserved space.  So maybe it would make sense to only allow
FIXED mappings within reserved regions.

Strictly dividing the IOVA space into kernel managed and user managed
regions does make a certain amount of sense.
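
As a sketch of that split, with every name hypothetical (IOMAP_RESERVE
is only a proposal in this thread, and ioas_map()/attach_device() are
imaginary wrappers around the IOAS_MAP and attach ioctls), a
vIOMMU-style user might do:

	/* Hypothetical flow only - none of these helpers are real */

	/* 1. Claim the guest-visible IOVA window up front */
	ioas_map(iommufd, ioas_id, /* no backing */ NULL,
		 GUEST_IOVA_BASE, GUEST_IOVA_SIZE,
		 IOMAP_FIXED | IOMAP_RESERVE);

	/* 2. A device whose aperture can't cover the window fails here */
	attach_device(iommufd, ioas_id, device_fd);

	/* 3. User-managed (guest-controlled) mappings overmap the window */
	ioas_map(iommufd, ioas_id, host_va, guest_iova, length, IOMAP_FIXED);

	/* 4. Kernel-chosen IOVAs never land inside the reserved window */
	dma_iova = ioas_map(iommufd, ioas_id, buf, 0, buf_len, 0);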

> If yes we can take the end of the last range as the max size of the iova
> address space to optimize the page table layout.
> 
> otherwise we may need another dedicated hint for that optimization.

Right.  With the revised model where reserving windows is optional,
not required, I don't think we can quite re-use this for optimization
hints.  Which is a bit unfortunate.

I can't immediately see a way to tweak this which handles both more
neatly, but I like the idea if we can figure out a way.

> > > So, for DPDK the sequence would be:
> > >
> > > 1. Create IOAS
> > > 2. ATTACH devices
> > > 3. IOAS_MAP some stuff
> > > 4. Do DMA with the IOVAs that IOAS_MAP returned
> > >
> > > (Note, not even any need for QUERY in simple cases)
> > 
> > Yes, this is done already
> > 
> > > For (unoptimized) qemu it would be:
> > >
> > > 1. Create IOAS
> > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> > the
> > >    guest platform
> > > 3. ATTACH devices (this will fail if they're not compatible with the
> > >    reserved IOVA regions)
> > > 4. Boot the guest
> 
> I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> vIOMMUs regular mappings are required before booting the guest and
> reservation might be done but not mandatory (at least not what current
> Qemu vfio can afford as it simply replays valid ranges in the CPU address
> space).

That was a somewhat simplified description.  When we look in more
detail, I think the ppc and x86 models become more similar.  So, in
more detail, I think it would look like this:

1. Create base IOAS
2. Map guest memory into base IOAS so that IOVA==GPA
3. Create IOASes for each vIOMMU domain
4. Reserve windows in domain IOASes where the vIOMMU will allow
   mappings by default
5. ATTACH devices to appropriate IOASes (***)
6. Boot the guest

  On guest map/invalidate:
        Use IOAS_COPY to take mappings from base IOAS and put them
	into the domain IOAS
  On memory hotplug:
        IOAS_MAP new memory block into base IOAS
  On dev hotplug: (***)
        ATTACH devices to appropriate IOAS
  On guest reconfiguration of vIOMMU domains (x86 only):
        DETACH device from base IOAS, attach to vIOMMU domain IOAS
  On guest reconfiguration of vIOMMU apertures (ppc only):
        Alter reserved regions to match vIOMMU

The difference between ppc and x86 is at the places marked (***):
which IOAS each device gets attached to and when. For x86 all devices
live in the base IOAS by default, and only get moved to domain IOASes
when those domains are set up in the vIOMMU.  For POWER each device
starts in a domain IOAS based on its guest PE, and never moves.

[This is still a bit simplified.  In practice, I imagine you'd
 optimize to only create the domain IOASes at the point
 they're needed - on boot for ppc, but only when the vIOMMU is
 configured for x86.  I don't think that really changes the model,
 though.]
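
Spelled out as pseudo-code (every helper below is an imaginary wrapper
over the iommufd ioctls, and the reserve step assumes the IOMAP_RESERVE
idea from earlier in the thread), the common flow would be roughly:

	/* Pseudo-code; wrapper names are made up, not the real API */

	base_ioas = ioas_alloc(iommufd);
	foreach_guest_ram_block(blk)				/* step 2 */
		ioas_map(iommufd, base_ioas, blk->hva, blk->gpa,
			 blk->size, IOMAP_FIXED);		/* IOVA == GPA */

	foreach_viommu_domain(dom) {				/* steps 3-4 */
		dom->ioas = ioas_alloc(iommufd);
		ioas_reserve(iommufd, dom->ioas, dom->aperture_start,
			     dom->aperture_len);
	}

	foreach_passthrough_device(dev)				/* step 5 */
		/* x86: base_ioas; ppc: the device's PE domain IOAS */
		attach_device(iommufd, initial_ioas(dev), dev->fd);

	/* on guest map: reuse the pins of base_ioas (no new GUP) */
	ioas_copy(iommufd, dom->ioas, guest_iova, base_ioas, gpa, len,
		  IOMAP_FIXED);

	/* on x86 vIOMMU domain assignment: move the device between IOASes */
	detach_device(iommufd, dev->fd);
	attach_device(iommufd, dom->ioas, dev->fd);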

A few aspects of the model interact quite nicely here.  Mapping a
large memory guest with IOVA==GPA would probably fail on a ppc host
IOMMU.  But if both guest and host are ppc, then no devices get
attached to that base IOAS, so its apertures don't get restricted by
the host hardware.  That way we get a common model, and the benefits
of GUP sharing via IOAS_COPY, without it failing in the ppc-on-ppc
case.

x86-on-ppc and ppc-on-x86 will probably only work in limited cases
where the various sizes and windows line up, but the possibility isn't
precluded by the model or interfaces.

> > >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part
> > of
> > >                                the reserved regions
> > >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> > >                       reserved regions)
> > >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> > >                                  necessary (which might fail)
> > 
> > OK, I will take care of it
> > 
> > Thanks,
> > Jason
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
  2022-03-18 17:27   ` Jason Gunthorpe via iommu
@ 2022-05-19  9:45     ` Yi Liu
  -1 siblings, 0 replies; 244+ messages in thread
From: Yi Liu @ 2022-05-19  9:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

Hi Jason,

On 2022/3/19 01:27, Jason Gunthorpe wrote:

> +/**
> + * iommufd_device_attach - Connect a device to an iommu_domain
> + * @idev: device to attach
> + * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
> + *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
> + * @flags: Optional flags
> + *
> + * This connects the device to an iommu_domain, either automatically or manually
> + * selected. Once this completes the device could do DMA.
> + *
> + * The caller should return the resulting pt_id back to userspace.
> + * This function is undone by calling iommufd_device_detach().
> + */
> +int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
> +			  unsigned int flags)
> +{
> +	struct iommufd_hw_pagetable *hwpt;
> +	int rc;
> +
> +	refcount_inc(&idev->obj.users);
> +
> +	hwpt = iommufd_hw_pagetable_from_id(idev->ictx, *pt_id, idev->dev);
> +	if (IS_ERR(hwpt)) {
> +		rc = PTR_ERR(hwpt);
> +		goto out_users;
> +	}
> +
> +	mutex_lock(&hwpt->devices_lock);
> +	/* FIXME: Use a device-centric iommu api. For now check if the
> +	 * hw_pagetable already has a device of the same group joined to tell if
> +	 * we are the first and need to attach the group. */
> +	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> +		phys_addr_t sw_msi_start = 0;
> +
> +		rc = iommu_attach_group(hwpt->domain, idev->group);
> +		if (rc)
> +			goto out_unlock;
> +
> +		/*
> +		 * hwpt is now the exclusive owner of the group so this is the
> +		 * first time enforce is called for this group.
> +		 */
> +		rc = iopt_table_enforce_group_resv_regions(
> +			&hwpt->ioas->iopt, idev->group, &sw_msi_start);
> +		if (rc)
> +			goto out_detach;
> +		rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
> +		if (rc)
> +			goto out_iova;
> +	}
> +
> +	idev->hwpt = hwpt;
> +	if (list_empty(&hwpt->devices)) {
> +		rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
> +		if (rc)
> +			goto out_iova;
> +	}
> +	list_add(&idev->devices_item, &hwpt->devices);

Just to double check here: this API doesn't prevent the caller from
calling it multiple times with the same @idev and @pt_id, right? Note
that idev has only one devices_item list head, so if the caller does
attach multiple times the list would get corrupted, right? If so, this
API assumes the caller takes care of it and never makes such a call. Is
that the design here?

> +	mutex_unlock(&hwpt->devices_lock);
> +
> +	*pt_id = idev->hwpt->obj.id;
> +	return 0;
> +
> +out_iova:
> +	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
> +out_detach:
> +	iommu_detach_group(hwpt->domain, idev->group);
> +out_unlock:
> +	mutex_unlock(&hwpt->devices_lock);
> +	iommufd_hw_pagetable_put(idev->ictx, hwpt);
> +out_users:
> +	refcount_dec(&idev->obj.users);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(iommufd_device_attach);
> +
> +void iommufd_device_detach(struct iommufd_device *idev)
> +{
> +	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> +
> +	mutex_lock(&hwpt->devices_lock);
> +	list_del(&idev->devices_item);
> +	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> +		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
> +		iommu_detach_group(hwpt->domain, idev->group);
> +	}
> +	if (list_empty(&hwpt->devices))
> +		iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
> +	mutex_unlock(&hwpt->devices_lock);
> +
> +	iommufd_hw_pagetable_put(idev->ictx, hwpt);
> +	idev->hwpt = NULL;
> +
> +	refcount_dec(&idev->obj.users);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_device_detach);
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index c5c9650cc86818..e5c717231f851e 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -96,6 +96,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
>   enum iommufd_object_type {
>   	IOMMUFD_OBJ_NONE,
>   	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
> +	IOMMUFD_OBJ_DEVICE,
>   	IOMMUFD_OBJ_HW_PAGETABLE,
>   	IOMMUFD_OBJ_IOAS,
>   	IOMMUFD_OBJ_MAX,
> @@ -196,6 +197,7 @@ struct iommufd_hw_pagetable {
>   	struct iommufd_object obj;
>   	struct iommufd_ioas *ioas;
>   	struct iommu_domain *domain;
> +	bool msi_cookie;
>   	/* Head at iommufd_ioas::auto_domains */
>   	struct list_head auto_domains_item;
>   	struct mutex devices_lock;
> @@ -209,4 +211,6 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
>   			      struct iommufd_hw_pagetable *hwpt);
>   void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
>   
> +void iommufd_device_destroy(struct iommufd_object *obj);
> +
>   #endif
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 954cde173c86fc..6a895489fb5b82 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -284,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
>   }
>   
>   static struct iommufd_object_ops iommufd_object_ops[] = {
> +	[IOMMUFD_OBJ_DEVICE] = {
> +		.destroy = iommufd_device_destroy,
> +	},
>   	[IOMMUFD_OBJ_IOAS] = {
>   		.destroy = iommufd_ioas_destroy,
>   	},
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> new file mode 100644
> index 00000000000000..6caac05475e39f
> --- /dev/null
> +++ b/include/linux/iommufd.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2021 Intel Corporation
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __LINUX_IOMMUFD_H
> +#define __LINUX_IOMMUFD_H
> +
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/device.h>
> +
> +struct pci_dev;
> +struct iommufd_device;
> +
> +#if IS_ENABLED(CONFIG_IOMMUFD)
> +struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
> +					       u32 *id);
> +void iommufd_unbind_device(struct iommufd_device *idev);
> +
> +enum {
> +	IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT = 1 << 0,
> +};
> +int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
> +			  unsigned int flags);
> +void iommufd_device_detach(struct iommufd_device *idev);
> +
> +#else /* !CONFIG_IOMMUFD */
> +static inline struct iommufd_device *
> +iommufd_bind_pci_device(int fd, struct pci_dev *pdev, u32 *id)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +static inline void iommufd_unbind_device(struct iommufd_device *idev)
> +{
> +}
> +
> +static inline int iommufd_device_attach(struct iommufd_device *idev,
> +					u32 ioas_id)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void iommufd_device_detach(struct iommufd_device *idev)
> +{
> +}
> +#endif /* CONFIG_IOMMUFD */
> +#endif

-- 
Regards,
Yi Liu


* Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers
  2022-05-19  9:45     ` Yi Liu
@ 2022-05-19 12:35       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-19 12:35 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Keqian Zhu

On Thu, May 19, 2022 at 05:45:06PM +0800, Yi Liu wrote:
> Hi Jason,
> 
> On 2022/3/19 01:27, Jason Gunthorpe wrote:
> 
> > +/**
> > + * iommufd_device_attach - Connect a device to an iommu_domain
> > + * @idev: device to attach
> > + * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
> > + *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
> > + * @flags: Optional flags
> > + *
> > + * This connects the device to an iommu_domain, either automatically or manually
> > + * selected. Once this completes the device could do DMA.
> > + *
> > + * The caller should return the resulting pt_id back to userspace.
> > + * This function is undone by calling iommufd_device_detach().
> > + */
> > +int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
> > +			  unsigned int flags)
> > +{

> Just double check here.
> This API doesn't prevent caller from calling this API multiple times with
> the same @idev and @pt_id. right? Note that idev has only one device_item
> list head. If caller does do multiple callings, then there should be
> problem. right? If so, this API assumes caller should take care of it and
> not do such bad function call. Is this the design here?

Yes, caller must ensure strict pairing, we don't have an assertion to
check it.
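
For illustration, a consumer that wants to guard against a double call
has to carry its own state; roughly like the sketch below (struct my_dev
and its 'attached' flag are made up for the example, not part of the
series):

	/* Hypothetical consumer-side guard, not from the posted series */
	struct my_dev {
		struct iommufd_device *idev;
		bool attached;
	};

	static int my_dev_enable_dma(struct my_dev *mdev, u32 *pt_id,
				     unsigned int flags)
	{
		int rc;

		if (mdev->attached)	/* the pairing rule lives in the caller */
			return -EBUSY;

		rc = iommufd_device_attach(mdev->idev, pt_id, flags);
		if (rc)
			return rc;
		mdev->attached = true;
		return 0;
	}

	static void my_dev_disable_dma(struct my_dev *mdev)
	{
		if (!mdev->attached)
			return;
		iommufd_device_detach(mdev->idev);
		mdev->attached = false;
	}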

Jason


* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-04-28 14:53           ` David Gibson
@ 2022-05-23  6:02             ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 244+ messages in thread
From: Alexey Kardashevskiy @ 2022-05-23  6:02 UTC (permalink / raw)
  To: David Gibson, Alex Williamson
  Cc: Jean-Philippe Brucker, Chaitanya Kulkarni, kvm,
	Michael S. Tsirkin, Jason Wang, Cornelia Huck, Niklas Schnelle,
	Kevin Tian, Daniel Jordan, iommu, Jason Gunthorpe, Joao Martins



On 4/29/22 00:53, David Gibson wrote:
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
>> On Wed, 23 Mar 2022 21:33:42 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>>>
>>>> My overall question here would be whether we can actually achieve a
>>>> compatibility interface that has sufficient feature transparency that we
>>>> can dump vfio code in favor of this interface, or will there be enough
>>>> niche use cases that we need to keep type1 and vfio containers around
>>>> through a deprecation process?
>>>
>>> Other than SPAPR, I think we can.
>>
>> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
>> for POWER support?
> 
> There are a few different levels to consider for dealing with PPC.
> For a suitable long term interface for ppc hosts and guests dropping
> this is fine: the ppc specific iommu model was basically an
> ill-conceived idea from the beginning, because none of us had
> sufficiently understood what things were general and what things were
> iommu model/hw specific.
> 
> ...mostly.  There are several points of divergence for the ppc iommu
> model.
> 
> 1) Limited IOVA windows.  This one turned out to not really be ppc
> specific, and is (rightly) handled generically in the new interface.
> No problem here.
> 
> 2) Costly GUPs.  pseries (the most common ppc machine type) always
> expects a (v)IOMMU.  That means that unlike the common x86 model of a
> host with IOMMU, but guests with no-vIOMMU, guest initiated
> maps/unmaps can be a hot path.  Accounting in that path can be
> prohibitive (and on POWER8 in particular it prevented us from
> optimizing that path the way we wanted).  We had two solutions for
> that: in v1, the explicit ENABLE/DISABLE calls, which preaccounted
> based on the IOVA window sizes.  That was improved in v2, which
> used the concept of preregistration.  IIUC iommufd can achieve the
> same effect as preregistration using IOAS_COPY, so this one isn't
> really a problem either.


I am getting rid of those POWER8-related realmode handlers, as POWER9 has
the MMU enabled when hcalls are handled. The costly GUP problem is still
there, though (which a base IOAS should solve?).
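
To make the preregistration comparison above concrete, the flow David
describes maps onto iommufd roughly as follows from userspace. This is
only a sketch: the ioas_alloc()/ioas_map()/ioas_copy() helpers are
assumed thin wrappers around the IOAS alloc/map/copy ioctls in this
series, not the real uAPI:

#include <stddef.h>

/* Placeholder wrappers around the proposed IOMMU_IOAS_* ioctls; the
 * exact uAPI structs are deliberately omitted here. */
extern int ioas_alloc(int iommufd);
extern int ioas_map(int iommufd, int ioas_id, void *va, size_t len,
		    unsigned long iova);
extern int ioas_copy(int iommufd, int dst_ioas, unsigned long dst_iova,
		     int src_ioas, unsigned long src_iova, size_t len);

/* Pin (GUP + account) all guest RAM exactly once into a "prereg" IOAS
 * that never has a device attached. */
static int setup_prereg(int iommufd, void *guest_ram, size_t ram_size,
			int *prereg_ioas, int *dev_ioas)
{
	*prereg_ioas = ioas_alloc(iommufd);
	*dev_ioas = ioas_alloc(iommufd);
	return ioas_map(iommufd, *prereg_ioas, guest_ram, ram_size, 0);
}

/* The hot vIOMMU map path then copies from the already-pinned prereg
 * IOAS into the IOAS the device is attached to -- no further GUP or
 * accounting per mapping. */
static int map_for_device(int iommufd, int dev_ioas, unsigned long dma_iova,
			  int prereg_ioas, unsigned long ram_offset, size_t len)
{
	return ioas_copy(iommufd, dev_ioas, dma_iova,
			 prereg_ioas, ram_offset, len);
}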


> 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> windows, which aren't contiguous with each other.  The base addresses
> of each of these are fixed, but the size of each window, the pagesize
> (i.e. granularity) of each window and the number of levels in the
> IOMMU pagetable are runtime configurable.  Because it's true in the
> hardware, it's also true of the vIOMMU interface defined by the IBM
> hypervisor (and adopted by KVM as well).  So, guests can request
> changes in how these windows are handled.  Typical Linux guests will
> use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> can't count on that; the guest can use them however it wants.


The guest actually does this already. AIX has always been like that, and
Linux is forced to do it for SR-IOV VFs, since there can be many VFs and
TCEs (== IOPTEs) are a limited resource. Today's pseries IOMMU code first
tries mapping 1:1 (as it has for ages), but if there are not enough TCEs
it removes the first window (which increases the TCE budget), creates a
new 64-bit window (as big as possible, though not necessarily big enough
for 1:1; 64K/2M IOMMU page sizes are allowed) and does map/unmap as
drivers go.
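
In pseudocode, that guest-side decision is roughly the following; all
helper names are invented for illustration, the real logic lives in the
pseries IOMMU setup code (enable_ddw() and friends):

static bool setup_dma_windows(struct pci_dev *pdev, u64 ram_size)
{
	if (tce_budget(pdev) >= tces_needed(ram_size)) {
		/* enough TCEs: one 64-bit window mapping all RAM 1:1 */
		create_ddw(pdev, ram_size, true /* map 1:1 */);
		return true;
	}
	/* otherwise trade the default 32-bit window for a bigger TCE budget */
	remove_default_window(pdev);
	create_ddw(pdev, largest_possible_size(pdev), false);
	return false;	/* drivers map/unmap dynamically from here on */
}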


Which means the guest RAM does not all need to be mapped in the base
IOAS suggested earlier in this thread, as that would mean all memory is
pinned and PowerVM won't be able to swap it out (yes, it can do such a
thing now!). Not sure if we really want to support this or stick to a
simpler design.



> 
> (3) still needs a plan for how to fit it into the /dev/iommufd model.
> This is a secondary reason that in the past I advocated for the user
> requesting specific DMA windows which the kernel would accept or
> refuse, rather than having a query function - it connects easily to
> the DDW model.  With the query-first model we'd need some sort of
> extension here, not really sure what it should look like.
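
For what it's worth, the information such a request/query extension
would need to carry per window is small; a hypothetical sketch (not a
proposed uAPI) of what describes one DDW-style window:

#include <linux/types.h>

struct iommu_dma_window_req {	/* hypothetical, for illustration only */
	__u64 base_iova;	/* fixed by the platform, e.g. 0 or 1ULL << 60 */
	__u64 size;		/* runtime-selectable window size */
	__u32 pgshift;		/* IOMMU page shift, e.g. 16 (64K) or 21 (2M) */
	__u32 levels;		/* number of IOMMU page-table levels */
};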
> 
> 
> 
> Then, there's handling existing qemu (or other software) that is using
> the VFIO SPAPR_TCE interfaces.  First, it's not entirely clear if this
> should be a goal or not: as others have noted, working actively to
> port qemu to the new interface at the same time as making a
> comprehensive in-kernel compat layer is arguably redundant work.
> 
> That said, if we did want to handle this in an in-kernel compat layer,
> here's roughly what you'd need for SPAPR_TCE v2:
> 
> - VFIO_IOMMU_SPAPR_TCE_GET_INFO
>      I think this should be fairly straightforward; the information you
>      need should be in the now generic IOVA window stuff and would just
>      need massaging into the expected format.
> - VFIO_IOMMU_SPAPR_REGISTER_MEMORY /
>    VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>      IIUC, these could be translated into map/unmap operations onto a
>      second implicit IOAS which represents the preregistered memory
>      areas (to which we'd never connect an actual device).  Along with
>      this VFIO_MAP and VFIO_UNMAP operations would need to check for
>      this case, verify their addresses against the preregistered space
>      and be translated into IOAS_COPY operations from the prereg
>      address space instead of raw IOAS_MAP operations.  Fiddly, but not
>      fundamentally hard, I think.
> 
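
A rough sketch of that translation, with invented helper names standing
in for the real iommufd internals (a hidden "prereg" IOAS backs
REGISTER_MEMORY, and DMA_MAP becomes a copy out of it instead of a
fresh pin):

struct compat_ctx {			/* hypothetical compat-layer state */
	struct iommufd_ioas *prereg_ioas;	/* never attached to a device */
	struct iommufd_ioas *dev_ioas;		/* what devices attach to */
};

static int spapr_register_memory(struct compat_ctx *ctx, u64 uva, u64 len)
{
	/* pins + accounts once; the IOAS_MAP equivalent, with iova == uva */
	return ioas_map_user(ctx->prereg_ioas, uva, len, uva);
}

static int spapr_dma_map(struct compat_ctx *ctx, u64 iova, u64 uva, u64 len)
{
	if (!range_is_preregistered(ctx->prereg_ioas, uva, len))
		return -EINVAL;
	/* no GUP here; the IOAS_COPY equivalent, prereg -> device IOAS */
	return ioas_copy_range(ctx->dev_ioas, iova, ctx->prereg_ioas, uva, len);
}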
> For SPAPR_TCE_v1 things are a bit trickier
> 
> - VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE
>      I suspect you could get away with implementing these as no-ops.
>      It wouldn't be strictly correct, but I think software which is
>      using the interface correctly should work this way, though
>      possibly not optimally.  That might be good enough for this ugly
>      old interface.
> 
> And... then there's VFIO_EEH_PE_OP.  It's very hard to know what to do
> with this because the interface was completely broken for most of its
> lifetime.  EEH is a fancy error handling feature of IBM PCI hardware
> somewhat similar in concept, though not interface, to PCIe AER.  I have
> a very strong impression that while this was a much-touted checkbox
> feature for RAS, no-one, ever, actually used it.  As evidenced by the
> fact that there was, I believe over a *decade* in which all the
> interfaces were completely broken by design, and apparently no-one
> noticed.
> 
> So, cynically, you could probably get away with making this a no-op as
> well.  If you wanted to do it properly... well... that would require
> training up yet another person to actually understand this and hoping
> they get it done before they run screaming.  This one gets very ugly
> because the EEH operations have to operate on the hardware (or
> firmware) "Partitionable Endpoints" (PEs) which correspond one to one
> with IOMMU groups, but not necessarily with VFIO containers, but
> there's not really any sensible way to expose that to users.
> 
> You might be able to do this by simply failing this outright if
> there's anything other than exactly one IOMMU group bound to the
> container / IOAS (which I think might be what VFIO itself does now).
> Handling that with a device centric API gets somewhat fiddlier, of
> course.

-- 
Alexey

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-23  6:02             ` Alexey Kardashevskiy
@ 2022-05-24 13:25               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 244+ messages in thread
From: Jason Gunthorpe @ 2022-05-24 13:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, Alex Williamson, Jean-Philippe Brucker,
	Chaitanya Kulkarni, kvm, Michael S. Tsirkin, Jason Wang,
	Cornelia Huck, Niklas Schnelle, Kevin Tian, Daniel Jordan, iommu,
	Joao Martins

On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:

> Which means the guest RAM does not need to be all mapped in that base IOAS
> suggested down this thread as that would mean all memory is pinned and
> powervm won't be able to swap it out (yeah, it can do such thing now!). Not
> sure if we really want to support this or stick to a simpler design.

Huh? How can it swap? Calling GUP is not optional. Either you call GUP
at the start and there is no swap, or you call GUP for each vIOMMU
hypercall.

Since everyone says PPC doesn't call GUP during the hypercall - how is
it working?

Jason

^ permalink raw reply	[flat|nested] 244+ messages in thread


* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-24 13:25               ` Jason Gunthorpe via iommu
@ 2022-05-25  1:39                 ` David Gibson
  -1 siblings, 0 replies; 244+ messages in thread
From: David Gibson @ 2022-05-25  1:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexey Kardashevskiy, Alex Williamson, Jean-Philippe Brucker,
	Chaitanya Kulkarni, kvm, Michael S. Tsirkin, Jason Wang,
	Cornelia Huck, Niklas Schnelle, Kevin Tian, Daniel Jordan, iommu,
	Joao Martins

[-- Attachment #1: Type: text/plain, Size: 1071 bytes --]

On Tue, May 24, 2022 at 10:25:53AM -0300, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
> 
> > Which means the guest RAM does not need to be all mapped in that base IOAS
> > suggested down this thread as that would mean all memory is pinned and
> > powervm won't be able to swap it out (yeah, it can do such thing now!). Not
> > sure if we really want to support this or stick to a simpler design.
> 
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.
> 
> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

The current implementation does GUP during the pre-reserve.  I think
Alexey's talking about a new PowerVM (IBM hypervisor) feature; I don't
know how that works.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
  2022-05-24 13:25               ` Jason Gunthorpe via iommu
@ 2022-05-25  2:09                 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 244+ messages in thread
From: Alexey Kardashevskiy @ 2022-05-25  2:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Gibson, Alex Williamson, Jean-Philippe Brucker,
	Chaitanya Kulkarni, kvm, Michael S. Tsirkin, Jason Wang,
	Cornelia Huck, Niklas Schnelle, Kevin Tian, Daniel Jordan, iommu,
	Joao Martins



On 5/24/22 23:25, Jason Gunthorpe wrote:
> On Mon, May 23, 2022 at 04:02:22PM +1000, Alexey Kardashevskiy wrote:
> 
>> Which means the guest RAM does not need to be all mapped in that base IOAS
>> suggested down this thread as that would mean all memory is pinned and
>> powervm won't be able to swap it out (yeah, it can do such thing now!). Not
>> sure if we really want to support this or stick to a simpler design.
> 
> Huh? How can it swap? Calling GUP is not optional. Either you call GUP
> at the start and there is no swap, or you call GUP for each vIOMMU
> hypercall.

Correct, not optional.


> Since everyone says PPC doesn't call GUP during the hypercall - how is
> it working?

It does not call GUP during hypercalls because all VM pages are GUPed in
advance, at a special memory preregistration step, since we could not
call GUP from a hypercall handler with the MMU off (often the case on
POWER8, where this was developed in the first place). Things are better
with POWER9 (bare metal can do pretty much anything), but the PowerVM
interface with two windows is still there, and this iommufd proposal is
going to be ported on top of PowerVM first.
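
A minimal illustration of that pin-everything-at-preregistration model
using the generic GUP API (just a sketch; the real powerpc code is the
mm_iommu_* preregistration machinery, not this):

#include <linux/mm.h>

/* Called once per memory region from normal process context, where
 * sleeping and faulting are fine, long before any hypercall runs. */
static int prereg_pin(unsigned long uaddr, int nr_pages, struct page **pages)
{
	return pin_user_pages_fast(uaddr, nr_pages,
				   FOLL_WRITE | FOLL_LONGTERM, pages);
}

/* The hypercall handler then only translates already-pinned pages into
 * TCEs; it never calls GUP, so it can run in restricted contexts. */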

I am just saying there is a model where not everything is mapped, and it
has its uses. PowerVM's swapping capability is something new and I do not
really know how it works, though.


-- 
Alexey

^ permalink raw reply	[flat|nested] 244+ messages in thread

end of thread, other threads:[~2022-05-25  2:09 UTC | newest]

Thread overview: 244+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-18 17:27 [PATCH RFC 00/12] IOMMUFD Generic interface Jason Gunthorpe
2022-03-18 17:27 ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 01/12] interval-tree: Add a utility to iterate over spans in an interval tree Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 02/12] iommufd: Overview documentation Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 03/12] iommufd: File descriptor, context, kconfig and makefiles Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-22 14:18   ` Niklas Schnelle
2022-03-22 14:18     ` Niklas Schnelle
2022-03-22 14:50     ` Jason Gunthorpe
2022-03-22 14:50       ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-22 14:28   ` Niklas Schnelle
2022-03-22 14:28     ` Niklas Schnelle
2022-03-22 14:57     ` Jason Gunthorpe
2022-03-22 14:57       ` Jason Gunthorpe via iommu
2022-03-22 15:29       ` Alex Williamson
2022-03-22 15:29         ` Alex Williamson
2022-03-22 16:15         ` Jason Gunthorpe
2022-03-22 16:15           ` Jason Gunthorpe via iommu
2022-03-24  2:11           ` Tian, Kevin
2022-03-24  2:11             ` Tian, Kevin
2022-03-24  2:27             ` Jason Wang
2022-03-24  2:27               ` Jason Wang
2022-03-24  2:42               ` Tian, Kevin
2022-03-24  2:42                 ` Tian, Kevin
2022-03-24  2:57                 ` Jason Wang
2022-03-24  2:57                   ` Jason Wang
2022-03-24  3:15                   ` Tian, Kevin
2022-03-24  3:15                     ` Tian, Kevin
2022-03-24  3:50                     ` Jason Wang
2022-03-24  3:50                       ` Jason Wang
2022-03-24  4:29                       ` Tian, Kevin
2022-03-24  4:29                         ` Tian, Kevin
2022-03-24 11:46                       ` Jason Gunthorpe
2022-03-24 11:46                         ` Jason Gunthorpe via iommu
2022-03-28  1:53                         ` Jason Wang
2022-03-28  1:53                           ` Jason Wang
2022-03-28 12:22                           ` Jason Gunthorpe
2022-03-28 12:22                             ` Jason Gunthorpe via iommu
2022-03-29  4:59                             ` Jason Wang
2022-03-29  4:59                               ` Jason Wang
2022-03-29 11:46                               ` Jason Gunthorpe
2022-03-29 11:46                                 ` Jason Gunthorpe via iommu
2022-03-28 13:14                           ` Sean Mooney
2022-03-28 13:14                             ` Sean Mooney
2022-03-28 14:27                             ` Jason Gunthorpe via iommu
2022-03-28 14:27                               ` Jason Gunthorpe
2022-03-24 20:40           ` Alex Williamson
2022-03-24 20:40             ` Alex Williamson
2022-03-24 22:27             ` Jason Gunthorpe
2022-03-24 22:27               ` Jason Gunthorpe via iommu
2022-03-24 22:41               ` Alex Williamson
2022-03-24 22:41                 ` Alex Williamson
2022-03-22 16:31       ` Niklas Schnelle
2022-03-22 16:31         ` Niklas Schnelle
2022-03-22 16:41         ` Jason Gunthorpe via iommu
2022-03-22 16:41           ` Jason Gunthorpe
2022-03-18 17:27 ` [PATCH RFC 05/12] iommufd: PFN handling for iopt_pages Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-23 15:37   ` Niklas Schnelle
2022-03-23 15:37     ` Niklas Schnelle
2022-03-23 16:09     ` Jason Gunthorpe
2022-03-23 16:09       ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 06/12] iommufd: Algorithms for PFN storage Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-22 22:15   ` Alex Williamson
2022-03-22 22:15     ` Alex Williamson
2022-03-23 18:15     ` Jason Gunthorpe
2022-03-23 18:15       ` Jason Gunthorpe via iommu
2022-03-24  3:09       ` Tian, Kevin
2022-03-24  3:09         ` Tian, Kevin
2022-03-24 12:46         ` Jason Gunthorpe
2022-03-24 12:46           ` Jason Gunthorpe via iommu
2022-03-25 13:34   ` zhangfei.gao
2022-03-25 13:34     ` zhangfei.gao
2022-03-25 17:19     ` Jason Gunthorpe via iommu
2022-03-25 17:19       ` Jason Gunthorpe
2022-04-13 14:02   ` Yi Liu
2022-04-13 14:02     ` Yi Liu
2022-04-13 14:36     ` Jason Gunthorpe
2022-04-13 14:36       ` Jason Gunthorpe via iommu
2022-04-13 14:49       ` Yi Liu
2022-04-13 14:49         ` Yi Liu
2022-04-17 14:56         ` Yi Liu
2022-04-17 14:56           ` Yi Liu
2022-04-18 10:47           ` Yi Liu
2022-04-18 10:47             ` Yi Liu
2022-03-18 17:27 ` [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-23 19:10   ` Alex Williamson
2022-03-23 19:10     ` Alex Williamson
2022-03-23 19:34     ` Jason Gunthorpe
2022-03-23 19:34       ` Jason Gunthorpe via iommu
2022-03-23 20:04       ` Alex Williamson
2022-03-23 20:04         ` Alex Williamson
2022-03-23 20:34         ` Jason Gunthorpe via iommu
2022-03-23 20:34           ` Jason Gunthorpe
2022-03-23 22:54           ` Jason Gunthorpe
2022-03-23 22:54             ` Jason Gunthorpe via iommu
2022-03-24  7:25             ` Tian, Kevin
2022-03-24  7:25               ` Tian, Kevin
2022-03-24 13:46               ` Jason Gunthorpe via iommu
2022-03-24 13:46                 ` Jason Gunthorpe
2022-03-25  2:15                 ` Tian, Kevin
2022-03-25  2:15                   ` Tian, Kevin
2022-03-27  2:32                 ` Tian, Kevin
2022-03-27  2:32                   ` Tian, Kevin
2022-03-27 14:28                   ` Jason Gunthorpe
2022-03-27 14:28                     ` Jason Gunthorpe via iommu
2022-03-28 17:17                 ` Alex Williamson
2022-03-28 17:17                   ` Alex Williamson
2022-03-28 18:57                   ` Jason Gunthorpe
2022-03-28 18:57                     ` Jason Gunthorpe via iommu
2022-03-28 19:47                     ` Jason Gunthorpe via iommu
2022-03-28 19:47                       ` Jason Gunthorpe
2022-03-28 21:26                       ` Alex Williamson
2022-03-28 21:26                         ` Alex Williamson
2022-03-24  6:46           ` Tian, Kevin
2022-03-24  6:46             ` Tian, Kevin
2022-03-30 13:35   ` Yi Liu
2022-03-30 13:35     ` Yi Liu
2022-03-31 12:59     ` Jason Gunthorpe via iommu
2022-03-31 12:59       ` Jason Gunthorpe
2022-04-01 13:30       ` Yi Liu
2022-04-01 13:30         ` Yi Liu
2022-03-31  4:36   ` David Gibson
2022-03-31  4:36     ` David Gibson
2022-03-31  5:41     ` Tian, Kevin
2022-03-31  5:41       ` Tian, Kevin
2022-03-31 12:58     ` Jason Gunthorpe via iommu
2022-03-31 12:58       ` Jason Gunthorpe
2022-04-28  5:58       ` David Gibson
2022-04-28  5:58         ` David Gibson
2022-04-28 14:22         ` Jason Gunthorpe
2022-04-28 14:22           ` Jason Gunthorpe via iommu
2022-04-29  6:00           ` David Gibson
2022-04-29  6:00             ` David Gibson
2022-04-29 12:54             ` Jason Gunthorpe
2022-04-29 12:54               ` Jason Gunthorpe via iommu
2022-04-30 14:44               ` David Gibson
2022-04-30 14:44                 ` David Gibson
2022-03-18 17:27 ` [PATCH RFC 09/12] iommufd: Add a HW pagetable object Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-23 18:10   ` Alex Williamson
2022-03-23 18:10     ` Alex Williamson
2022-03-23 18:15     ` Jason Gunthorpe
2022-03-23 18:15       ` Jason Gunthorpe via iommu
2022-05-11 12:54   ` Yi Liu
2022-05-11 12:54     ` Yi Liu
2022-05-19  9:45   ` Yi Liu
2022-05-19  9:45     ` Yi Liu
2022-05-19 12:35     ` Jason Gunthorpe
2022-05-19 12:35       ` Jason Gunthorpe via iommu
2022-03-18 17:27 ` [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-03-23 22:51   ` Alex Williamson
2022-03-23 22:51     ` Alex Williamson
2022-03-24  0:33     ` Jason Gunthorpe
2022-03-24  0:33       ` Jason Gunthorpe via iommu
2022-03-24  8:13       ` Eric Auger
2022-03-24  8:13         ` Eric Auger
2022-03-24 22:04       ` Alex Williamson
2022-03-24 22:04         ` Alex Williamson
2022-03-24 23:11         ` Jason Gunthorpe
2022-03-24 23:11           ` Jason Gunthorpe via iommu
2022-03-25  3:10           ` Tian, Kevin
2022-03-25  3:10             ` Tian, Kevin
2022-03-25 11:24           ` Joao Martins
2022-03-25 11:24             ` Joao Martins
2022-04-28 14:53         ` David Gibson
2022-04-28 14:53           ` David Gibson
2022-04-28 15:10           ` Jason Gunthorpe
2022-04-28 15:10             ` Jason Gunthorpe via iommu
2022-04-29  1:21             ` Tian, Kevin
2022-04-29  1:21               ` Tian, Kevin
2022-04-29  6:22               ` David Gibson
2022-04-29  6:22                 ` David Gibson
2022-04-29 12:50                 ` Jason Gunthorpe
2022-04-29 12:50                   ` Jason Gunthorpe via iommu
2022-05-02  4:10                   ` David Gibson
2022-05-02  4:10                     ` David Gibson
2022-04-29  6:20             ` David Gibson
2022-04-29  6:20               ` David Gibson
2022-04-29 12:48               ` Jason Gunthorpe
2022-04-29 12:48                 ` Jason Gunthorpe via iommu
2022-05-02  7:30                 ` David Gibson
2022-05-02  7:30                   ` David Gibson
2022-05-05 19:07                   ` Jason Gunthorpe
2022-05-05 19:07                     ` Jason Gunthorpe via iommu
2022-05-06  5:25                     ` David Gibson
2022-05-06  5:25                       ` David Gibson
2022-05-06 10:42                       ` Tian, Kevin
2022-05-06 10:42                         ` Tian, Kevin
2022-05-09  3:36                         ` David Gibson
2022-05-09  3:36                           ` David Gibson
2022-05-06 12:48                       ` Jason Gunthorpe
2022-05-06 12:48                         ` Jason Gunthorpe via iommu
2022-05-09  6:01                         ` David Gibson
2022-05-09  6:01                           ` David Gibson
2022-05-09 14:00                           ` Jason Gunthorpe
2022-05-09 14:00                             ` Jason Gunthorpe via iommu
2022-05-10  7:12                             ` David Gibson
2022-05-10  7:12                               ` David Gibson
2022-05-10 19:00                               ` Jason Gunthorpe
2022-05-10 19:00                                 ` Jason Gunthorpe via iommu
2022-05-11  3:15                                 ` Tian, Kevin
2022-05-11  3:15                                   ` Tian, Kevin
2022-05-11 16:32                                   ` Jason Gunthorpe
2022-05-11 16:32                                     ` Jason Gunthorpe via iommu
2022-05-11 23:23                                     ` Tian, Kevin
2022-05-11 23:23                                       ` Tian, Kevin
2022-05-13  4:35                                   ` David Gibson
2022-05-13  4:35                                     ` David Gibson
2022-05-11  4:40                                 ` David Gibson
2022-05-11  4:40                                   ` David Gibson
2022-05-11  2:46                             ` Tian, Kevin
2022-05-11  2:46                               ` Tian, Kevin
2022-05-23  6:02           ` Alexey Kardashevskiy
2022-05-23  6:02             ` Alexey Kardashevskiy
2022-05-24 13:25             ` Jason Gunthorpe
2022-05-24 13:25               ` Jason Gunthorpe via iommu
2022-05-25  1:39               ` David Gibson
2022-05-25  1:39                 ` David Gibson
2022-05-25  2:09               ` Alexey Kardashevskiy
2022-05-25  2:09                 ` Alexey Kardashevskiy
2022-03-29  9:17     ` Yi Liu
2022-03-29  9:17       ` Yi Liu
2022-03-18 17:27 ` [PATCH RFC 12/12] iommufd: Add a selftest Jason Gunthorpe
2022-03-18 17:27   ` Jason Gunthorpe via iommu
2022-04-12 20:13 ` [PATCH RFC 00/12] IOMMUFD Generic interface Eric Auger
2022-04-12 20:13   ` Eric Auger
2022-04-12 20:22   ` Jason Gunthorpe via iommu
2022-04-12 20:22     ` Jason Gunthorpe
2022-04-12 20:50     ` Eric Auger
2022-04-12 20:50       ` Eric Auger
2022-04-14 10:56 ` Yi Liu
2022-04-14 10:56   ` Yi Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.