* [PATCH v4 00/17] IOMMUFD Generic interface
@ 2022-11-08  0:48 Jason Gunthorpe
  2022-11-08  0:48 ` [PATCH v4 01/17] iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCY Jason Gunthorpe
                   ` (18 more replies)
  0 siblings, 19 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

[
This has been in linux-next for a little while now, and we've completed
the syzkaller run. 1300 hours of CPU time have been invested since the
last report, with no improvement in coverage and no new detections.
syzkaller coverage reached 69% (75%), and review of the misses shows that
a substantial amount is WARN_ONs and other debugging code that is not
expected to be covered.
]

iommufd is the user API to control the IOMMU subsystem as it relates to
managing IO page tables that point at user space memory.

It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
container), which is the VFIO-specific interface for a similar idea.

We see a broad need for extended features, some being highly IOMMU device
specific:
 - Binding iommu_domain's to PASID/SSID
 - Userspace IO page tables, for ARM, x86 and S390
 - Kernel bypassed invalidation of user page tables
 - Re-use of the KVM page table in the IOMMU
 - Dirty page tracking in the IOMMU
 - Runtime Increase/Decrease of IOPTE size
 - PRI support with faults resolved in userspace

Many of these HW features exist to support VM use cases - for instance the
combination of PASID, PRI and Userspace IO Page Tables allows an
implementation of DMA Shared Virtual Addressing (vSVA) within a
guest. Dirty tracking enables VM live migration with SR-IOV devices, and
PASID support allows creating "scalable IOV" devices, among other things.

As these features are fundamental to a VM platform they need to be
uniformly exposed to all the driver families that do DMA into VMs, which
is currently VFIO and VDPA.

The pre-v1 series proposed re-using the VFIO type 1 data structure;
however, it was suggested that if we are doing this big update then we
should also come up with an improved data structure that solves the
limitations that VFIO type1 has. Notably this addresses:

 - Multiple IOAS/'containers' and multiple domains inside a single FD

 - Single-pin operation no matter how many domains and containers use
   a page

 - A fine grained locking scheme supporting user managed concurrency for
   multi-threaded map/unmap

 - A pre-registration mechanism to optimize vIOMMU use cases by
   pre-pinning pages

 - Extended ioctl API that can manage these new objects and exposes
   domains directly to user space

 - Domains that are sharable between subsystems, eg VFIO and VDPA

The bulk of this code is a new data structure design to track how the
IOVAs are mapped to PFNs.

iommufd intends to be general and consumable by any driver that wants to
DMA to userspace. From a driver perspective it can largely be dropped in
in place of iommu_attach_device() and provides a uniform full feature set
to all consumers.

As this is a larger project, this series is the first step. It provides
the iommufd "generic interface", which is designed to be suitable for
applications like DPDK and VMM flows that are not optimized to specific
HW scenarios. It is close to being a drop-in replacement for the existing
VFIO type 1 and supports existing qemu-based VM flows.
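
For reference, a minimal userspace flow against the generic interface is
roughly the following (a sketch assuming the uAPI proposed in
include/uapi/linux/iommufd.h in this series; "buf", "len" and "iova" are
placeholders):

	int fd = open("/dev/iommu", O_RDWR);

	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	ioctl(fd, IOMMU_IOAS_ALLOC, &alloc);

	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE |
			 IOMMU_IOAS_MAP_FIXED_IOVA,
		.ioas_id = alloc.out_ioas_id,
		.user_va = (uintptr_t)buf,
		.length = len,
		.iova = iova,
	};
	ioctl(fd, IOMMU_IOAS_MAP, &map);

The device is then bound and attached to the IOAS through the owning
subsystem's uAPI (VFIO or VDPA), as covered by the follow-on series below.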

Several follow-on series are being prepared:

- Patches integrating with qemu in native mode:
  https://github.com/yiliu1765/qemu/commits/qemu-iommufd-6.0-rc2

- A completed integration with VFIO now exists that covers "emulated" mdev
  use cases and can pass testing with qemu/etc in compatibility mode:
  https://github.com/jgunthorpe/linux/commits/vfio_iommufd

- A draft providing system iommu dirty tracking on top of iommufd,
  including iommu driver implementations:
  https://github.com/jpemartins/linux/commits/x86-iommufd

  This pairs with patches for providing a similar API to support VFIO-device
  tracking to give a complete vfio solution:
  https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com/

- Userspace page tables aka 'nested translation' for ARM and Intel iommu
  drivers:
  https://github.com/nicolinc/iommufd/commits/iommufd_nesting

- "device centric" vfio series to expose the vfio_device FD directly as a
  normal cdev, and provide an extended API allowing dynamically changing
  the IOAS binding:
  https://github.com/yiliu1765/iommufd/commits/iommufd-v6.0-rc2-nesting-0901

- Drafts for PASID and PRI interfaces are included above as well

Overall enough work is done now to show the merit of the new API design
and at least draft solutions to many of the main problems.

Several people have contributed directly to this work: Eric Auger, Joao
Martins, Kevin Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have
participated in the discussions that led here and provided ideas. Thanks
to all!

The v1/v2 iommufd series has been used to guide a large amount of preparatory
work that has now been merged. The general theme is to organize things in
a way that makes injecting iommufd natural:

 - VFIO live migration support with mlx5 and hisi_acc drivers.
   These series need a dirty tracking solution to be really usable.
   https://lore.kernel.org/kvm/20220224142024.147653-1-yishaih@nvidia.com/
   https://lore.kernel.org/kvm/20220308184902.2242-1-shameerali.kolothum.thodi@huawei.com/

 - Significantly rework the VFIO gvt mdev and remove struct
   mdev_parent_ops
   https://lore.kernel.org/lkml/20220411141403.86980-1-hch@lst.de/

 - Rework how PCIe no-snoop blocking works
   https://lore.kernel.org/kvm/0-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com/

 - Consolidate dma ownership into the iommu core code
   https://lore.kernel.org/linux-iommu/20220418005000.897664-1-baolu.lu@linux.intel.com/

 - Make all vfio driver interfaces use struct vfio_device consistently
   https://lore.kernel.org/kvm/0-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com/

 - Remove the vfio_group from the kvm/vfio interface
   https://lore.kernel.org/kvm/0-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com/

 - Simplify locking in vfio
   https://lore.kernel.org/kvm/0-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com/

 - Remove the vfio notifier scheme that faces drivers
   https://lore.kernel.org/kvm/0-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com/

 - Improve the driver facing API for vfio pin/unpin pages to make the
   presence of struct page clear
   https://lore.kernel.org/kvm/20220723020256.30081-1-nicolinc@nvidia.com/

 - Clean up in the Intel IOMMU driver
   https://lore.kernel.org/linux-iommu/20220301020159.633356-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220510023407.2759143-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220514014322.2927339-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220706025524.2904370-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220702015610.2849494-1-baolu.lu@linux.intel.com/

 - Rework s390 vfio drivers
   https://lore.kernel.org/kvm/20220707135737.720765-1-farman@linux.ibm.com/

 - Normalize vfio ioctl handling
   https://lore.kernel.org/kvm/0-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com/

 - VFIO API for dirty tracking (aka dma logging) managed inside a PCI
   device, with mlx5 implementation
   https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com

 - Introduce a struct device sysfs presence for struct vfio_device
   https://lore.kernel.org/kvm/20220901143747.32858-1-kevin.tian@intel.com/

 - Complete restructuring the vfio mdev model
   https://lore.kernel.org/kvm/20220822062208.152745-1-hch@lst.de/

 - Isolate VFIO container code in preparation for iommufd to provide an
   alternative implementation of it all
   https://lore.kernel.org/kvm/0-v1-a805b607f1fb+17b-vfio_container_split_jgg@nvidia.com

 - Simplify and consolidate iommu_domain/device compatibility checking
   https://lore.kernel.org/linux-iommu/cover.1666042872.git.nicolinc@nvidia.com/

 - Align iommu SVA support with the domain-centric model
   https://lore.kernel.org/all/20221031005917.45690-1-baolu.lu@linux.intel.com/

This is about 233 patches applied since March; thank you to everyone
involved in all this work!

Currently there are a number of supporting series still in progress:

 - DMABUF exporter support for VFIO to allow PCI P2P with VFIO
   https://lore.kernel.org/r/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com

 - Start to provide iommu_domain ops for POWER
   https://lore.kernel.org/all/20220714081822.3717693-1-aik@ozlabs.ru/

However, these are not necessary for this series to advance.

This is on github: https://github.com/jgunthorpe/linux/commits/iommufd

v4:
 - Rebase to v6.1-rc3, include the iommu branch with the needed EINVAL
   patch series and also the SVA rework
 - All bug fixes and comments with no API or behavioral changes
 - gvt tests are passing again
 - Syzkaller is no longer finding issues and achieved high coverage of
   69%(75%)
 - Coverity has been run by two people
 - new "nth failure" test that systematically sweeps all error unwind paths
   looking for splats
 - All fixes noted in the mailing list
   If you sent an email and I didn't reply, please ping it; I have lost it.
 - The selftest patch has been broken into three to make the additional
   modification to the main code clearer
 - The interdiff is 1.8k lines for the main code, with another 3k of
   test suite changes
v3: https://lore.kernel.org/r/0-v3-402a7d6459de+24b-iommufd_jgg@nvidia.com
 - Rebase to v6.1-rc1
 - Improve documentation
 - Use EXPORT_SYMBOL_NS
 - Fix W1, checkpatch stuff
 - Revise pages.c to resolve the FIXMEs. Create an
   interval_tree_double_span_iter which allows a simple expression of the
   previously problematic algorithms
 - Consistently use the word 'access' instead of user to refer to an
   access from an in-kernel user (eg vfio mdev)
 - Support two forms of rlimit accounting and make the vfio-compatible one
   the default in compatibility mode (following series)
 - Support old VFIO type1 by disabling huge pages and implementing a
   simple algorithm to split a struct iopt_area
 - Full implementation of access support, test coverage and optimizations
 - Complete COPY to be able to copy across contiguous areas. Improve
   all the algorithms around contiguous areas with a dedicated iterator
 - Functional ENFORCED_COHERENT support
 - Support multi-device groups
 - Lots of smaller changes (the interdiff is 5k lines)
v2: https://lore.kernel.org/r/0-v2-f9436d0bde78+4bb-iommufd_jgg@nvidia.com
 - Rebase to v6.0-rc3
 - Improve comments
 - Change to an iterative destruction approach to avoid cycles
 - Near rewrite of the vfio facing implementation, supported by a complete
   implementation on the vfio side
 - New IOMMU_IOAS_ALLOW_IOVAS API as discussed. Allows userspace to
   assert that ranges of IOVA must always be mappable. To be used by a VMM
   that has promised a guest a certain availability of IOVA. May help
   guide PPC's multi-window implementation.
 - Rework how unmap_iova works, user can unmap the whole ioas now
 - The no-snoop / wbinvd support is implemented
 - Bug fixes
 - Test suite improvements
 - Lots of smaller changes (the interdiff is 3k lines)
v1: https://lore.kernel.org/r/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com

Jason Gunthorpe (15):
  iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCY
  interval-tree: Add a utility to iterate over spans in an interval tree
  iommufd: File descriptor, context, kconfig and makefiles
  kernel/user: Allow user::locked_vm to be usable for iommufd
  iommufd: PFN handling for iopt_pages
  iommufd: Algorithms for PFN storage
  iommufd: Data structure to provide IOVA to PFN mapping
  iommufd: IOCTLs for the io_pagetable
  iommufd: Add a HW pagetable object
  iommufd: Add kAPI toward external drivers for physical devices
  iommufd: Add kAPI toward external drivers for kernel access
  iommufd: vfio container FD ioctl compatibility
  iommufd: Add a selftest
  iommufd: Add some fault injection points
  iommufd: Add additional invariant assertions

Kevin Tian (1):
  iommufd: Document overview of iommufd

Lu Baolu (1):
  iommu: Add device-centric DMA ownership interfaces

 .clang-format                                 |    3 +
 Documentation/userspace-api/index.rst         |    1 +
 .../userspace-api/ioctl/ioctl-number.rst      |    1 +
 Documentation/userspace-api/iommufd.rst       |  222 ++
 MAINTAINERS                                   |   12 +
 drivers/iommu/Kconfig                         |    1 +
 drivers/iommu/Makefile                        |    2 +-
 drivers/iommu/amd/iommu.c                     |    2 +
 drivers/iommu/intel/iommu.c                   |   16 +-
 drivers/iommu/iommu.c                         |  124 +-
 drivers/iommu/iommufd/Kconfig                 |   23 +
 drivers/iommu/iommufd/Makefile                |   13 +
 drivers/iommu/iommufd/device.c                |  748 +++++++
 drivers/iommu/iommufd/double_span.h           |   98 +
 drivers/iommu/iommufd/hw_pagetable.c          |   57 +
 drivers/iommu/iommufd/io_pagetable.c          | 1214 +++++++++++
 drivers/iommu/iommufd/io_pagetable.h          |  241 +++
 drivers/iommu/iommufd/ioas.c                  |  390 ++++
 drivers/iommu/iommufd/iommufd_private.h       |  307 +++
 drivers/iommu/iommufd/iommufd_test.h          |   93 +
 drivers/iommu/iommufd/main.c                  |  419 ++++
 drivers/iommu/iommufd/pages.c                 | 1884 +++++++++++++++++
 drivers/iommu/iommufd/selftest.c              |  853 ++++++++
 drivers/iommu/iommufd/vfio_compat.c           |  452 ++++
 include/linux/interval_tree.h                 |   58 +
 include/linux/iommu.h                         |   17 +
 include/linux/iommufd.h                       |  102 +
 include/linux/sched/user.h                    |    2 +-
 include/uapi/linux/iommufd.h                  |  332 +++
 kernel/user.c                                 |    1 +
 lib/Kconfig                                   |    4 +
 lib/interval_tree.c                           |  132 ++
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/iommu/.gitignore      |    3 +
 tools/testing/selftests/iommu/Makefile        |   12 +
 tools/testing/selftests/iommu/config          |    2 +
 tools/testing/selftests/iommu/iommufd.c       | 1627 ++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |  580 +++++
 tools/testing/selftests/iommu/iommufd_utils.h |  278 +++
 39 files changed, 10294 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/userspace-api/iommufd.rst
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/device.c
 create mode 100644 drivers/iommu/iommufd/double_span.h
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/ioas.c
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 drivers/iommu/iommufd/pages.c
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c
 create mode 100644 include/linux/iommufd.h
 create mode 100644 include/uapi/linux/iommufd.h
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c
 create mode 100644 tools/testing/selftests/iommu/iommufd_fail_nth.c
 create mode 100644 tools/testing/selftests/iommu/iommufd_utils.h


base-commit: 69e61edebea030f177de7a23b8d5d9b8c4a90bda
-- 
2.38.1



* [PATCH v4 01/17] iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCY
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
@ 2022-11-08  0:48 ` Jason Gunthorpe
  2022-11-08  0:48 ` [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces Jason Gunthorpe
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This queries whether a domain linked to a device should expect to support
enforce_cache_coherency() so iommufd can negotiate the rules for when a
domain should be shared or not.

For iommufd a device that declares IOMMU_CAP_ENFORCE_CACHE_COHERENCY will
not be attached to a domain that does not support it.
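
As a sketch of the consumer side (not part of this patch; iommufd's
actual policy comes later in the series, and use_enforced_coherent_domain()
is only a placeholder here):

	/*
	 * A device that can enforce cache coherency should only share
	 * iommu_domains where enforce_cache_coherency() has succeeded.
	 */
	if (device_iommu_capable(dev, IOMMU_CAP_ENFORCE_CACHE_COHERENCY))
		use_enforced_coherent_domain(dev);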

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/amd/iommu.c   |  2 ++
 drivers/iommu/intel/iommu.c | 16 +++++++++++-----
 include/linux/iommu.h       |  5 +++++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 45299eb7e8e306..240c535e317cc7 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2278,6 +2278,8 @@ static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
 		return false;
 	case IOMMU_CAP_PRE_BOOT_PROTECTION:
 		return amdr_ivrs_remap_support;
+	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
+		return true;
 	default:
 		break;
 	}
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f298e51d5aa67a..157c9727411076 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4450,14 +4450,20 @@ static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 
 static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
 {
-	if (cap == IOMMU_CAP_CACHE_COHERENCY)
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
 		return true;
-	if (cap == IOMMU_CAP_INTR_REMAP)
+	case IOMMU_CAP_INTR_REMAP:
 		return irq_remapping_enabled == 1;
-	if (cap == IOMMU_CAP_PRE_BOOT_PROTECTION)
+	case IOMMU_CAP_PRE_BOOT_PROTECTION:
 		return dmar_platform_optin();
-
-	return false;
+	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
+		return ecap_sc_support(info->iommu->ecap);
+	default:
+		return false;
+	}
 }
 
 static struct iommu_device *intel_iommu_probe_device(struct device *dev)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 68d7d304cdb761..a09fd32d8cc273 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -124,6 +124,11 @@ enum iommu_cap {
 	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
 	IOMMU_CAP_PRE_BOOT_PROTECTION,	/* Firmware says it used the IOMMU for
 					   DMA protection and we should too */
+	/*
+	 * Per-device flag indicating if enforce_cache_coherency() will work on
+	 * this device.
+	 */
+	IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
 };
 
 /* These are the possible reserved region types */
-- 
2.38.1



* [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
  2022-11-08  0:48 ` [PATCH v4 01/17] iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCY Jason Gunthorpe
@ 2022-11-08  0:48 ` Jason Gunthorpe
  2022-11-11  5:37   ` Tian, Kevin
  2022-11-14 13:33   ` Eric Auger
  2022-11-08  0:48 ` [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree Jason Gunthorpe
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

From: Lu Baolu <baolu.lu@linux.intel.com>

These complement the group interfaces used by VFIO and are for use by
iommufd. The main difference is that multiple devices in the same group
can all share the ownership by passing the same ownership pointer.

Move the common code into shared functions.
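
A sketch of the intended usage, based on the declarations added below
("ictx" stands in for whatever cookie the caller uses as the owner):

	/*
	 * The first claim in a group forces it onto the blocking domain;
	 * other devices of the same group can claim only with the same
	 * owner cookie.
	 */
	ret = iommu_device_claim_dma_owner(dev, ictx);
	if (ret)
		return ret;
	...
	iommu_device_release_dma_owner(dev);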

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 124 +++++++++++++++++++++++++++++++++---------
 include/linux/iommu.h |  12 ++++
 2 files changed, 110 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6ca377f4fbf9e9..4cb14e44e40f83 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3108,41 +3108,52 @@ static int __iommu_group_alloc_blocking_domain(struct iommu_group *group)
 	return 0;
 }
 
+static int __iommu_take_dma_ownership(struct iommu_group *group, void *owner)
+{
+	int ret;
+
+	if (WARN_ON(!owner))
+		return -EINVAL;
+
+	if ((group->domain && group->domain != group->default_domain) ||
+	    !xa_empty(&group->pasid_array))
+		return -EBUSY;
+
+	ret = __iommu_group_alloc_blocking_domain(group);
+	if (ret)
+		return ret;
+	ret = __iommu_group_set_domain(group, group->blocking_domain);
+	if (ret)
+		return ret;
+
+	group->owner = owner;
+	group->owner_cnt++;
+	return 0;
+}
+
 /**
  * iommu_group_claim_dma_owner() - Set DMA ownership of a group
  * @group: The group.
  * @owner: Caller specified pointer. Used for exclusive ownership.
  *
- * This is to support backward compatibility for vfio which manages
- * the dma ownership in iommu_group level. New invocations on this
- * interface should be prohibited.
+ * This is to support backward compatibility for vfio which manages the dma
+ * ownership in iommu_group level. New invocations on this interface should be
+ * prohibited. Only a single owner may exist for a group.
  */
 int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner)
 {
 	int ret = 0;
 
+	if (WARN_ON(!owner))
+		return -EINVAL;
+
 	mutex_lock(&group->mutex);
 	if (group->owner_cnt) {
 		ret = -EPERM;
 		goto unlock_out;
-	} else {
-		if ((group->domain && group->domain != group->default_domain) ||
-		    !xa_empty(&group->pasid_array)) {
-			ret = -EBUSY;
-			goto unlock_out;
-		}
-
-		ret = __iommu_group_alloc_blocking_domain(group);
-		if (ret)
-			goto unlock_out;
-
-		ret = __iommu_group_set_domain(group, group->blocking_domain);
-		if (ret)
-			goto unlock_out;
-		group->owner = owner;
 	}
 
-	group->owner_cnt++;
+	ret = __iommu_take_dma_ownership(group, owner);
 unlock_out:
 	mutex_unlock(&group->mutex);
 
@@ -3151,30 +3162,91 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner)
 EXPORT_SYMBOL_GPL(iommu_group_claim_dma_owner);
 
 /**
- * iommu_group_release_dma_owner() - Release DMA ownership of a group
- * @group: The group.
+ * iommu_device_claim_dma_owner() - Set DMA ownership of a device
+ * @dev: The device.
+ * @owner: Caller specified pointer. Used for exclusive ownership.
  *
- * Release the DMA ownership claimed by iommu_group_claim_dma_owner().
+ * Claim the DMA ownership of a device. Multiple devices in the same group may
+ * concurrently claim ownership if they present the same owner value. Returns 0
+ * on success and error code on failure
  */
-void iommu_group_release_dma_owner(struct iommu_group *group)
+int iommu_device_claim_dma_owner(struct device *dev, void *owner)
 {
-	int ret;
+	struct iommu_group *group = iommu_group_get(dev);
+	int ret = 0;
+
+	if (!group)
+		return -ENODEV;
 
 	mutex_lock(&group->mutex);
+	if (group->owner_cnt) {
+		if (group->owner != owner) {
+			ret = -EPERM;
+			goto unlock_out;
+		}
+		group->owner_cnt++;
+		goto unlock_out;
+	}
+
+	ret = __iommu_take_dma_ownership(group, owner);
+unlock_out:
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_device_claim_dma_owner);
+
+static void __iommu_release_dma_ownership(struct iommu_group *group)
+{
+	int ret;
+
 	if (WARN_ON(!group->owner_cnt || !group->owner ||
 		    !xa_empty(&group->pasid_array)))
-		goto unlock_out;
+		return;
 
 	group->owner_cnt = 0;
 	group->owner = NULL;
 	ret = __iommu_group_set_domain(group, group->default_domain);
 	WARN(ret, "iommu driver failed to attach the default domain");
+}
 
-unlock_out:
+/**
+ * iommu_group_release_dma_owner() - Release DMA ownership of a group
+ * @group: The group.
+ *
+ * Release the DMA ownership claimed by iommu_group_claim_dma_owner().
+ */
+void iommu_group_release_dma_owner(struct iommu_group *group)
+{
+	mutex_lock(&group->mutex);
+	__iommu_release_dma_ownership(group);
 	mutex_unlock(&group->mutex);
 }
 EXPORT_SYMBOL_GPL(iommu_group_release_dma_owner);
 
+/**
+ * iommu_device_release_dma_owner() - Release DMA ownership of a device
+ * @dev: The device.
+ *
+ * Release the DMA ownership claimed by iommu_device_claim_dma_owner().
+ */
+void iommu_device_release_dma_owner(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+
+	mutex_lock(&group->mutex);
+	if (group->owner_cnt > 1) {
+		group->owner_cnt--;
+		goto unlock_out;
+	}
+	__iommu_release_dma_ownership(group);
+unlock_out:
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_device_release_dma_owner);
+
 /**
  * iommu_group_dma_owner_claimed() - Query group dma ownership status
  * @group: The group.
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a09fd32d8cc273..1690c334e51631 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -707,6 +707,9 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner);
 void iommu_group_release_dma_owner(struct iommu_group *group);
 bool iommu_group_dma_owner_claimed(struct iommu_group *group);
 
+int iommu_device_claim_dma_owner(struct device *dev, void *owner);
+void iommu_device_release_dma_owner(struct device *dev);
+
 struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 					    struct mm_struct *mm);
 int iommu_attach_device_pasid(struct iommu_domain *domain,
@@ -1064,6 +1067,15 @@ static inline bool iommu_group_dma_owner_claimed(struct iommu_group *group)
 	return false;
 }
 
+static inline void iommu_device_release_dma_owner(struct device *dev)
+{
+}
+
+static inline int iommu_device_claim_dma_owner(struct device *dev, void *owner)
+{
+	return -ENODEV;
+}
+
 static inline struct iommu_domain *
 iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm)
 {
-- 
2.38.1



* [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
  2022-11-08  0:48 ` [PATCH v4 01/17] iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCY Jason Gunthorpe
  2022-11-08  0:48 ` [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces Jason Gunthorpe
@ 2022-11-08  0:48 ` Jason Gunthorpe
  2022-11-15 14:14   ` Eric Auger
  2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The span iterator travels over the indexes of the interval_tree, not the
nodes, and classifies spans of indexes as either 'used' or 'hole'.

'used' spans are fully covered by nodes in the tree and 'hole' spans have
no node intersecting the span.

This is done greedily such that spans are maximally sized and every
iteration step switches between used/hole.

As an example a trivial allocator can be written as:

	for (interval_tree_span_iter_first(&span, itree, 0, ULONG_MAX);
	     !interval_tree_span_iter_done(&span);
	     interval_tree_span_iter_next(&span))
		if (span.is_hole &&
		    span.last_hole - span.start_hole >= allocation_size - 1)
			return span.start_hole;

With all the tricky boundary conditions handled by the library code.
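
With the interval_tree_for_each_span() helper added in this patch the
same allocator reads as:

	struct interval_tree_span_iter span;

	interval_tree_for_each_span(&span, itree, 0, ULONG_MAX)
		if (span.is_hole &&
		    span.last_hole - span.start_hole >= allocation_size - 1)
			return span.start_hole;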

The following iommufd patches have several algorithms for its overlapping
node interval trees that are significantly simplified with this kind of
iteration primitive. As it seems generally useful, put it into lib/.

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .clang-format                 |   1 +
 include/linux/interval_tree.h |  58 +++++++++++++++
 lib/Kconfig                   |   4 ++
 lib/interval_tree.c           | 132 ++++++++++++++++++++++++++++++++++
 4 files changed, 195 insertions(+)

diff --git a/.clang-format b/.clang-format
index 1247d54f9e49fa..96d07786dcfb46 100644
--- a/.clang-format
+++ b/.clang-format
@@ -440,6 +440,7 @@ ForEachMacros:
   - 'inet_lhash2_for_each_icsk'
   - 'inet_lhash2_for_each_icsk_continue'
   - 'inet_lhash2_for_each_icsk_rcu'
+  - 'interval_tree_for_each_span'
   - 'intlist__for_each_entry'
   - 'intlist__for_each_entry_safe'
   - 'kcore_copy__for_each_phdr'
diff --git a/include/linux/interval_tree.h b/include/linux/interval_tree.h
index 288c26f50732d7..2b8026a3990645 100644
--- a/include/linux/interval_tree.h
+++ b/include/linux/interval_tree.h
@@ -27,4 +27,62 @@ extern struct interval_tree_node *
 interval_tree_iter_next(struct interval_tree_node *node,
 			unsigned long start, unsigned long last);
 
+/**
+ * struct interval_tree_span_iter - Find used and unused spans.
+ * @start_hole: Start of an interval for a hole when is_hole == 1
+ * @last_hole: Inclusive end of an interval for a hole when is_hole == 1
+ * @start_used: Start of a used interval when is_hole == 0
+ * @last_used: Inclusive end of a used interval when is_hole == 0
+ * @is_hole: 0 == used, 1 == is_hole, -1 == done iteration
+ *
+ * This iterator travels over spans in an interval tree. It does not return
+ * nodes but classifies each span as either a hole, where no nodes intersect, or
+ * a used, which is fully covered by nodes. Each iteration step toggles between
+ * hole and used until the entire range is covered. The returned spans always
+ * fully cover the requested range.
+ *
+ * The iterator is greedy, it always returns the largest hole or used possible,
+ * consolidating all consecutive nodes.
+ *
+ * Use interval_tree_span_iter_done() to detect end of iteration.
+ */
+struct interval_tree_span_iter {
+	/* private: not for use by the caller */
+	struct interval_tree_node *nodes[2];
+	unsigned long first_index;
+	unsigned long last_index;
+
+	/* public: */
+	union {
+		unsigned long start_hole;
+		unsigned long start_used;
+	};
+	union {
+		unsigned long last_hole;
+		unsigned long last_used;
+	};
+	int is_hole;
+};
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *state,
+				   struct rb_root_cached *itree,
+				   unsigned long first_index,
+				   unsigned long last_index);
+void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter,
+				     struct rb_root_cached *itree,
+				     unsigned long new_index);
+void interval_tree_span_iter_next(struct interval_tree_span_iter *state);
+
+static inline bool
+interval_tree_span_iter_done(struct interval_tree_span_iter *state)
+{
+	return state->is_hole == -1;
+}
+
+#define interval_tree_for_each_span(span, itree, first_index, last_index)      \
+	for (interval_tree_span_iter_first(span, itree,                        \
+					   first_index, last_index);           \
+	     !interval_tree_span_iter_done(span);                              \
+	     interval_tree_span_iter_next(span))
+
 #endif	/* _LINUX_INTERVAL_TREE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 9bbf8a4b2108e6..c6c323fd251721 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -479,6 +479,10 @@ config INTERVAL_TREE
 
 	  for more information.
 
+config INTERVAL_TREE_SPAN_ITER
+	bool
+	depends on INTERVAL_TREE
+
 config XARRAY_MULTI
 	bool
 	help
diff --git a/lib/interval_tree.c b/lib/interval_tree.c
index 593ce56ece5050..d2882db8fa2a07 100644
--- a/lib/interval_tree.c
+++ b/lib/interval_tree.c
@@ -15,3 +15,135 @@ EXPORT_SYMBOL_GPL(interval_tree_insert);
 EXPORT_SYMBOL_GPL(interval_tree_remove);
 EXPORT_SYMBOL_GPL(interval_tree_iter_first);
 EXPORT_SYMBOL_GPL(interval_tree_iter_next);
+
+#ifdef CONFIG_INTERVAL_TREE_SPAN_ITER
+static void
+interval_tree_span_iter_next_gap(struct interval_tree_span_iter *state)
+{
+	struct interval_tree_node *cur = state->nodes[1];
+
+	/*
+	 * Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a
+	 * contiguous span of nodes. This makes nodes[0]->last the end of that
+	 * contiguous span of valid indexes that started at the original
+	 * nodes[1]->start. nodes[1] is now the next node and a hole is between
+	 * nodes[0] and [1].
+	 */
+	state->nodes[0] = cur;
+	do {
+		if (cur->last > state->nodes[0]->last)
+			state->nodes[0] = cur;
+		cur = interval_tree_iter_next(cur, state->first_index,
+					      state->last_index);
+	} while (cur && (state->nodes[0]->last >= cur->start ||
+			 state->nodes[0]->last + 1 == cur->start));
+	state->nodes[1] = cur;
+}
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *iter,
+				   struct rb_root_cached *itree,
+				   unsigned long first_index,
+				   unsigned long last_index)
+{
+	iter->first_index = first_index;
+	iter->last_index = last_index;
+	iter->nodes[0] = NULL;
+	iter->nodes[1] =
+		interval_tree_iter_first(itree, first_index, last_index);
+	if (!iter->nodes[1]) {
+		/* No nodes intersect the span, whole span is hole */
+		iter->start_hole = first_index;
+		iter->last_hole = last_index;
+		iter->is_hole = 1;
+		return;
+	}
+	if (iter->nodes[1]->start > first_index) {
+		/* Leading hole on first iteration */
+		iter->start_hole = first_index;
+		iter->last_hole = iter->nodes[1]->start - 1;
+		iter->is_hole = 1;
+		interval_tree_span_iter_next_gap(iter);
+		return;
+	}
+
+	/* Starting inside a used */
+	iter->start_used = first_index;
+	iter->is_hole = 0;
+	interval_tree_span_iter_next_gap(iter);
+	iter->last_used = iter->nodes[0]->last;
+	if (iter->last_used >= last_index) {
+		iter->last_used = last_index;
+		iter->nodes[0] = NULL;
+		iter->nodes[1] = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_first);
+
+void interval_tree_span_iter_next(struct interval_tree_span_iter *iter)
+{
+	if (!iter->nodes[0] && !iter->nodes[1]) {
+		iter->is_hole = -1;
+		return;
+	}
+
+	if (iter->is_hole) {
+		iter->start_used = iter->last_hole + 1;
+		iter->last_used = iter->nodes[0]->last;
+		if (iter->last_used >= iter->last_index) {
+			iter->last_used = iter->last_index;
+			iter->nodes[0] = NULL;
+			iter->nodes[1] = NULL;
+		}
+		iter->is_hole = 0;
+		return;
+	}
+
+	if (!iter->nodes[1]) {
+		/* Trailing hole */
+		iter->start_hole = iter->nodes[0]->last + 1;
+		iter->last_hole = iter->last_index;
+		iter->nodes[0] = NULL;
+		iter->is_hole = 1;
+		return;
+	}
+
+	/* must have both nodes[0] and [1], interior hole */
+	iter->start_hole = iter->nodes[0]->last + 1;
+	iter->last_hole = iter->nodes[1]->start - 1;
+	iter->is_hole = 1;
+	interval_tree_span_iter_next_gap(iter);
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_next);
+
+/*
+ * Advance the iterator index to a specific position. The returned used/hole is
+ * updated to start at new_index. This is faster than calling
+ * interval_tree_span_iter_first() as it can avoid full searches in several
+ * cases where the iterator is already set.
+ */
+void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter,
+				     struct rb_root_cached *itree,
+				     unsigned long new_index)
+{
+	if (iter->is_hole == -1)
+		return;
+
+	iter->first_index = new_index;
+	if (new_index > iter->last_index) {
+		iter->is_hole = -1;
+		return;
+	}
+
+	/* Rely on the union aliasing hole/used */
+	if (iter->start_hole <= new_index && new_index <= iter->last_hole) {
+		iter->start_hole = new_index;
+		return;
+	}
+	if (new_index == iter->last_hole + 1)
+		interval_tree_span_iter_next(iter);
+	else
+		interval_tree_span_iter_first(iter, itree, new_index,
+					      iter->last_index);
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_advance);
+#endif
-- 
2.38.1



* [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2022-11-08  0:48 ` [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree Jason Gunthorpe
@ 2022-11-08  0:48 ` Jason Gunthorpe
  2022-11-08  3:45   ` Bagas Sanjaya
                     ` (3 more replies)
  2022-11-08  0:48 ` [PATCH v4 05/17] iommufd: File descriptor, context, kconfig and makefiles Jason Gunthorpe
                   ` (14 subsequent siblings)
  18 siblings, 4 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

From: Kevin Tian <kevin.tian@intel.com>

Add iommufd into the documentation tree, and supply initial documentation.
Much of this is linked from code comments by kdoc.

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 222 ++++++++++++++++++++++++
 2 files changed, 223 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index c78da9ce0ec44e..f16337bdb8520f 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
    ebpf/index
    ioctl/index
    iommu
+   iommufd
    media/index
    netlink/index
    sysfs-platform_profile
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 00000000000000..64a135f3055adc
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=======
+IOMMUFD
+=======
+
+:Author: Jason Gunthorpe
+:Author: Kevin Tian
+
+Overview
+========
+
+IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
+IO page tables from userspace using file descriptors. It intends to be general
+and consumable by any driver that wants to expose DMA to userspace. These
+drivers are eventually expected to deprecate any internal IOMMU logic they may have
+(e.g. vfio_iommu_type1.c).
+
+At minimum iommufd provides universal support of managing I/O address spaces and
+I/O page tables for all IOMMUs, with room in the design to add non-generic
+features to cater to specific hardware functionality.
+
+In this context the capital letter (IOMMUFD) refers to the subsystem while the
+small letter (iommufd) refers to the file descriptors created via /dev/iommu for
+use by userspace.
+
+Key Concepts
+============
+
+User Visible Objects
+--------------------
+
+Following IOMMUFD objects are exposed to userspace:
+
+- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS), allowing map/unmap
+  of user space memory into ranges of I/O Virtual Address (IOVA).
+
+  The IOAS is a functional replacement for the VFIO container, and like the VFIO
+  container it copies an IOVA map to a list of iommu_domains held within it.
+
+- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
+  external driver.
+
+- IOMMUFD_OBJ_HW_PAGETABLE, representing an actual hardware I/O page table
+  (i.e. a single struct iommu_domain) managed by the iommu driver.
+
+  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and
+  it will synchronize its mapping with each member HW_PAGETABLE.
+
+All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
+
+The diagram below shows relationship between user-visible objects and kernel
+datastructures (external to iommufd), with numbers referred to operations
+creating the objects and links::
+
+  _________________________________________________________
+ |                         iommufd                         |
+ |       [1]                                               |
+ |  _________________                                      |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |        [3]                 [2]      |
+ | |                 |    ____________         __________  |
+ | |      IOAS       |<--|            |<------|          | |
+ | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
+ | |                 |   |____________|       |__________| |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |_________________|         |                   |       |
+ |         |                   |                   |       |
+ |_________|___________________|___________________|_______|
+           |                   |                   |
+           |              _____v______      _______v_____
+           | PFN storage |            |    |             |
+           |------------>|iommu_domain|    |struct device|
+                         |____________|    |_____________|
+
+1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can
+   hold multiple IOAS objects. IOAS is the most generic object and does not
+   expose interfaces that are specific to single IOMMU drivers. All operations
+   on the IOAS must operate equally on each of the iommu_domains inside of it.
+
+2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
+   to bind a device to an iommufd. The driver is expected to implement a proper
+   set of ioctls to allow userspace to initiate the binding operation.
+   Successful completion of this operation establishes the desired DMA ownership
+   over the device. The driver must also set the driver_managed_dma flag and
+   must not touch the device until this operation succeeds.
+
+3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
+   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
+   allows userspace to initiate the attaching operation. If a compatible
+   pagetable already exists then it is reused for the attachment. Otherwise a
+   new pagetable object and iommu_domain is created. Successful completion of
+   this operation sets up the linkages among IOAS, device and iommu_domain. Once
+   this completes the device could do DMA.
+
+   Every iommu_domain inside the IOAS is also represented to userspace as a
+   HW_PAGETABLE object.
+
+   .. note::
+
+      Future IOMMUFD updates will provide an API to create and manipulate the
+      HW_PAGETABLE directly.
+
+A device can only bind to one iommufd (due to the DMA ownership claim) and can
+attach to at most one IOAS object (no support of PASID yet).
+
+Currently only PCI devices are allowed to use IOMMUFD.
+
+Kernel Datastructure
+--------------------
+
+User visible objects are backed by following datastructures:
+
+- iommufd_ioas for IOMMUFD_OBJ_IOAS.
+- iommufd_device for IOMMUFD_OBJ_DEVICE.
+- iommufd_hw_pagetable for IOMMUFD_OBJ_HW_PAGETABLE.
+
+Several terminologies when looking at these datastructures:
+
+- Automatic domain - refers to an iommu domain created automatically when
+  attaching a device to an IOAS object. This is compatible with the semantics of
+  VFIO type1.
+
+- Manual domain - refers to an iommu domain designated by the user as the
+  target pagetable to be attached to by a device. Though currently there are
+  no uAPIs to directly create such domain, the datastructure and algorithms
+  are ready for handling that use case.
+
+- In-kernel user - refers to something like a VFIO mdev that is using the
+  IOMMUFD access interface to access the IOAS. This starts by creating an
+  iommufd_access object that is similar to the domain binding a physical device
+  would do. The access object will then allow converting IOVA ranges into struct
+  page * lists, or doing direct read/write to an IOVA.
+
+iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are
+mapped to memory pages, composed of:
+
+- struct io_pagetable holding the IOVA map
+- struct iopt_areas representing populated portions of IOVA
+- struct iopt_pages representing the storage of PFNs
+- struct iommu_domain representing the IO page table in the IOMMU
+- struct iopt_pages_access representing in-kernel users of PFNs
+- struct xarray pinned_pfns holding a list of pages pinned by in-kernel users
+
+Each iopt_pages represents a logical linear array of full PFNs. The PFNs are
+ultimately derived from userspace VAs via an mm_struct. Once they have been
+pinned the PFN is stored in IOPTEs of an iommu_domain or inside the pinned_pages
+xarray if they have been pinned through an iommufd_access.
+
+PFNs have to be copied between all combinations of storage locations, depending
+on what domains are present and what kinds of in-kernel "software access" users
+exist. The mechanism ensures that a page is pinned only once.
+
+An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a
+list of iommu_domains that mirror the IOVA to PFN map.
+
+Multiple io_pagetable-s, through their iopt_area-s, can share a single
+iopt_pages which avoids multi-pinning and double accounting of page
+consumption.
+
+iommufd_ioas is sharable between subsystems, e.g. VFIO and VDPA, as long as
+devices managed by different subsystems are bound to a same iommufd.
+
+IOMMUFD User API
+================
+
+.. kernel-doc:: include/uapi/linux/iommufd.h
+
+IOMMUFD Kernel API
+==================
+
+The IOMMUFD kAPI is device-centric with group-related tricks managed behind the
+scene. This allows the external drivers calling such kAPI to implement a simple
+device-centric uAPI for connecting its device to an iommufd, instead of
+explicitly imposing the group semantics in its uAPI as VFIO does.
+
+.. kernel-doc:: drivers/iommu/iommufd/device.c
+   :export:
+
+VFIO and IOMMUFD
+----------------
+
+Connecting a VFIO device to iommufd can be done in two ways.
+
+First is a VFIO compatible way by directly implementing the /dev/vfio/vfio
+container IOCTLs by mapping them into io_pagetable operations. Doing so allows
+the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to
+/dev/iommufd or extending VFIO to SET_CONTAINER using an iommufd instead of a
+container fd.
+
+The second approach directly extends VFIO to support a new set of device-centric
+user API based on aforementioned IOMMUFD kernel API. It requires userspace
+change but better matches the IOMMUFD API semantics and easier to support new
+iommufd features when comparing it to the first approach.
+
+Currently both approaches are still work-in-progress.
+
+There are still a few gaps to be resolved to catch up with VFIO type1, as
+documented in iommufd_vfio_check_extension().
+
+Future TODOs
+============
+
+Currently IOMMUFD supports only kernel-managed I/O page table, similar to VFIO
+type1. New features on the radar include:
+
+ - Binding iommu_domain's to PASID/SSID
+ - Userspace page tables, for ARM, x86 and S390
+ - Kernel bypassed invalidation of user page tables
+ - Re-use of the KVM page table in the IOMMU
+ - Dirty page tracking in the IOMMU
+ - Runtime Increase/Decrease of IOPTE size
+ - PRI support with faults resolved in userspace
-- 
2.38.1



* [PATCH v4 05/17] iommufd: File descriptor, context, kconfig and makefiles
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
@ 2022-11-08  0:48 ` Jason Gunthorpe
  2022-11-11  6:07   ` Tian, Kevin
  2022-11-08  0:48 ` [PATCH v4 06/17] kernel/user: Allow user::locked_vm to be usable for iommufd Jason Gunthorpe
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the basic infrastructure of a new miscdevice to hold the iommufd
IOCTL API.

It provides:
 - A miscdevice to create file descriptors to run the IOCTL interface over

 - A table based ioctl dispatch and centralized extendable pre-validation
   step

 - An xarray mapping userspace IDs to kernel objects. The design has
   multiple inter-related objects held within a single IOMMUFD fd

 - A simple usage count to build a graph of object relations and protect
   against hostile userspace racing ioctls

The only IOCTL provided in this patch is the generic 'destroy any object
by handle' operation.
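
From userspace that looks roughly like the following (a sketch assuming
the uAPI added in include/uapi/linux/iommufd.h below; "object_id" is any
previously returned object ID):

	struct iommu_destroy destroy = {
		.size = sizeof(destroy),
		.id = object_id,
	};
	/* Fails with EBUSY while the object still has users */
	ioctl(iommufd, IOMMU_DESTROY, &destroy);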

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  12 +
 drivers/iommu/Kconfig                         |   1 +
 drivers/iommu/Makefile                        |   2 +-
 drivers/iommu/iommufd/Kconfig                 |  12 +
 drivers/iommu/iommufd/Makefile                |   5 +
 drivers/iommu/iommufd/iommufd_private.h       | 109 ++++++
 drivers/iommu/iommufd/main.c                  | 342 ++++++++++++++++++
 include/linux/iommufd.h                       |  31 ++
 include/uapi/linux/iommufd.h                  |  55 +++
 10 files changed, 569 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 include/linux/iommufd.h
 create mode 100644 include/uapi/linux/iommufd.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 5f81e2a24a5c04..eb045fc495a4e3 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -105,6 +105,7 @@ Code  Seq#    Include File                                           Comments
 '8'   all                                                            SNP8023 advanced NIC card
                                                                      <mailto:mcr@solidum.com>
 ';'   64-7F  linux/vfio.h
+';'   80-FF  linux/iommufd.h
 '='   00-3f  uapi/linux/ptp_clock.h                                  <mailto:richardcochran@gmail.com>
 '@'   00-0F  linux/radeonfb.h                                        conflict!
 '@'   00-0F  drivers/video/aty/aty128fb.c                            conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 379945f82a6438..c0a93779731d7e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10717,6 +10717,18 @@ F:	drivers/iommu/dma-iommu.h
 F:	drivers/iommu/iova.c
 F:	include/linux/iova.h
 
+IOMMUFD
+M:	Jason Gunthorpe <jgg@nvidia.com>
+M:	Kevin Tian <kevin.tian@intel.com>
+L:	iommu@lists.linux.dev
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
+F:	Documentation/userspace-api/iommufd.rst
+F:	drivers/iommu/iommufd/
+F:	include/linux/iommufd.h
+F:	include/uapi/linux/iommufd.h
+F:	tools/testing/selftests/iommu/
+
 IOMMU SUBSYSTEM
 M:	Joerg Roedel <joro@8bytes.org>
 M:	Will Deacon <will@kernel.org>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dc5f7a156ff5ec..319966cde5cf6c 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -188,6 +188,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
 	bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 7fbf6a3376620e..f461d065138564 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/
+obj-y += amd/ intel/ arm/ iommufd/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 00000000000000..6d65d0f06f169f
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+	tristate "IOMMU Userspace API"
+	select INTERVAL_TREE
+	select INTERVAL_TREE_SPAN_ITER
+	select IOMMU_API
+	default n
+	help
+	  Provides /dev/iommu the user API to control the IOMMU subsystem as
+	  it relates to managing IO page tables that point at user space memory.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index 00000000000000..a07a8cffe937c6
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+iommufd-y := \
+	main.o
+
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
new file mode 100644
index 00000000000000..d523e7967b1440
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __IOMMUFD_PRIVATE_H
+#define __IOMMUFD_PRIVATE_H
+
+#include <linux/rwsem.h>
+#include <linux/xarray.h>
+#include <linux/refcount.h>
+#include <linux/uaccess.h>
+
+struct iommufd_ctx {
+	struct file *file;
+	struct xarray objects;
+};
+
+struct iommufd_ucmd {
+	struct iommufd_ctx *ictx;
+	void __user *ubuffer;
+	u32 user_size;
+	void *cmd;
+};
+
+/* Copy the response in ucmd->cmd back to userspace. */
+static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
+				       size_t cmd_len)
+{
+	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
+			 min_t(size_t, ucmd->user_size, cmd_len)))
+		return -EFAULT;
+	return 0;
+}
+
+enum iommufd_object_type {
+	IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+};
+
+/* Base struct for all objects with a userspace ID handle. */
+struct iommufd_object {
+	struct rw_semaphore destroy_rwsem;
+	refcount_t users;
+	enum iommufd_object_type type;
+	unsigned int id;
+};
+
+static inline bool iommufd_lock_obj(struct iommufd_object *obj)
+{
+	if (!down_read_trylock(&obj->destroy_rwsem))
+		return false;
+	if (!refcount_inc_not_zero(&obj->users)) {
+		up_read(&obj->destroy_rwsem);
+		return false;
+	}
+	return true;
+}
+
+struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
+					  enum iommufd_object_type type);
+static inline void iommufd_put_object(struct iommufd_object *obj)
+{
+	refcount_dec(&obj->users);
+	up_read(&obj->destroy_rwsem);
+}
+
+/**
+ * iommufd_ref_to_users() - Switch from destroy_rwsem to users refcount
+ *        protection
+ * @obj: Object to release
+ *
+ * Objects have two refcount protections (destroy_rwsem and the refcount_t
+ * users). Holding either of these will prevent the object from being destroyed.
+ *
+ * Depending on the use case, one protection or the other is appropriate.  In
+ * most cases references are being protected by the destroy_rwsem. This allows
+ * orderly destruction of the object because iommufd_object_destroy_user() will
+ * wait for it to become unlocked. However, as a rwsem, it cannot be held across
+ * a system call return. So cases that have longer term needs must switch
+ * to the weaker users refcount_t.
+ *
+ * With users protection iommufd_object_destroy_user() will return -EBUSY to
+ * userspace and refuse to destroy the object.
+ */
+static inline void iommufd_ref_to_users(struct iommufd_object *obj)
+{
+	up_read(&obj->destroy_rwsem);
+	/* iommufd_lock_obj() obtains users as well */
+}
+void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj);
+void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
+				      struct iommufd_object *obj);
+void iommufd_object_finalize(struct iommufd_ctx *ictx,
+			     struct iommufd_object *obj);
+bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
+				 struct iommufd_object *obj);
+struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
+					     size_t size,
+					     enum iommufd_object_type type);
+
+#define iommufd_object_alloc(ictx, ptr, type)                                  \
+	container_of(_iommufd_object_alloc(                                    \
+			     ictx,                                             \
+			     sizeof(*(ptr)) + BUILD_BUG_ON_ZERO(               \
+						      offsetof(typeof(*(ptr)), \
+							       obj) != 0),     \
+			     type),                                            \
+		     typeof(*(ptr)), obj)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
new file mode 100644
index 00000000000000..3a705cadb85020
--- /dev/null
+++ b/drivers/iommu/iommufd/main.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ *
+ * iommufd provides control over the IOMMU HW objects created by IOMMU kernel
+ * drivers. IOMMU HW objects revolve around IO page tables that map incoming DMA
+ * addresses (IOVA) to CPU addresses.
+ */
+#define pr_fmt(fmt) "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/bug.h>
+#include <uapi/linux/iommufd.h>
+#include <linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+struct iommufd_object_ops {
+	void (*destroy)(struct iommufd_object *obj);
+};
+static const struct iommufd_object_ops iommufd_object_ops[];
+
+struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
+					     size_t size,
+					     enum iommufd_object_type type)
+{
+	struct iommufd_object *obj;
+	int rc;
+
+	obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+	obj->type = type;
+	init_rwsem(&obj->destroy_rwsem);
+	refcount_set(&obj->users, 1);
+
+	/*
+	 * Reserve an ID in the xarray but do not publish the pointer yet since
+	 * the caller hasn't initialized it yet. Once the pointer is published
+	 * in the xarray and visible to other threads we can't reliably destroy
+	 * it anymore, so the caller must complete all errorable operations
+	 * before calling iommufd_object_finalize().
+	 */
+	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
+		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
+	if (rc)
+		goto out_free;
+	return obj;
+out_free:
+	kfree(obj);
+	return ERR_PTR(rc);
+}
+
+/*
+ * Allow concurrent access to the object.
+ *
+ * Once another thread can see the object pointer it can prevent object
+ * destruction. Except for special kernel-only objects there is no in-kernel way
+ * to reliably destroy a single object. Thus all APIs that are creating objects
+ * must use iommufd_object_abort() to handle their errors and only call
+ * iommufd_object_finalize() once object creation cannot fail.
+ */
+void iommufd_object_finalize(struct iommufd_ctx *ictx,
+			     struct iommufd_object *obj)
+{
+	void *old;
+
+	old = xa_store(&ictx->objects, obj->id, obj, GFP_KERNEL);
+	/* obj->id was returned from xa_alloc() so the xa_store() cannot fail */
+	WARN_ON(old);
+}
+
+/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */
+void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj)
+{
+	void *old;
+
+	old = xa_erase(&ictx->objects, obj->id);
+	WARN_ON(old);
+	kfree(obj);
+}
+
+/*
+ * Abort an object that has been fully initialized and needs destroy, but has
+ * not been finalized.
+ */
+void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
+				      struct iommufd_object *obj)
+{
+	iommufd_object_ops[obj->type].destroy(obj);
+	iommufd_object_abort(ictx, obj);
+}
+
+struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
+					  enum iommufd_object_type type)
+{
+	struct iommufd_object *obj;
+
+	xa_lock(&ictx->objects);
+	obj = xa_load(&ictx->objects, id);
+	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
+	    !iommufd_lock_obj(obj))
+		obj = ERR_PTR(-ENOENT);
+	xa_unlock(&ictx->objects);
+	return obj;
+}
+
+/*
+ * The caller holds a users refcount and wants to destroy the object. Returns
+ * true if the object was destroyed. In all cases the caller no longer has a
+ * reference on obj.
+ */
+bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
+				 struct iommufd_object *obj)
+{
+	/*
+	 * The purpose of the destroy_rwsem is to ensure deterministic
+	 * destruction of objects used by external drivers and destroyed by this
+	 * function. Any temporary increment of the refcount must hold the read
+	 * side of this, such as during ioctl execution.
+	 */
+	down_write(&obj->destroy_rwsem);
+	xa_lock(&ictx->objects);
+	refcount_dec(&obj->users);
+	if (!refcount_dec_if_one(&obj->users)) {
+		xa_unlock(&ictx->objects);
+		up_write(&obj->destroy_rwsem);
+		return false;
+	}
+	__xa_erase(&ictx->objects, obj->id);
+	xa_unlock(&ictx->objects);
+	up_write(&obj->destroy_rwsem);
+
+	iommufd_object_ops[obj->type].destroy(obj);
+	kfree(obj);
+	return true;
+}
+
+static int iommufd_destroy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_destroy *cmd = ucmd->cmd;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	iommufd_ref_to_users(obj);
+	/* See iommufd_ref_to_users() */
+	if (!iommufd_object_destroy_user(ucmd->ictx, obj))
+		return -EBUSY;
+	return 0;
+}
+
+static int iommufd_fops_open(struct inode *inode, struct file *filp)
+{
+	struct iommufd_ctx *ictx;
+
+	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL_ACCOUNT);
+	if (!ictx)
+		return -ENOMEM;
+
+	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT);
+	ictx->file = filp;
+	filp->private_data = ictx;
+	return 0;
+}
+
+static int iommufd_fops_release(struct inode *inode, struct file *filp)
+{
+	struct iommufd_ctx *ictx = filp->private_data;
+	struct iommufd_object *obj;
+
+	/* Destroy the graph from depth first */
+	while (!xa_empty(&ictx->objects)) {
+		unsigned int destroyed = 0;
+		unsigned long index;
+
+		xa_for_each(&ictx->objects, index, obj) {
+			/*
+			 * Since we are in release elevated users must come from
+			 * other objects holding the users. We will eventually
+			 * destroy the object that holds this one and the next
+			 * pass will progress it.
+			 */
+			if (!refcount_dec_if_one(&obj->users))
+				continue;
+			destroyed++;
+			xa_erase(&ictx->objects, index);
+			iommufd_object_ops[obj->type].destroy(obj);
+			kfree(obj);
+		}
+		/* Bug related to users refcount */
+		if (WARN_ON(!destroyed))
+			break;
+	}
+	kfree(ictx);
+	return 0;
+}
+
+union ucmd_buffer {
+	struct iommu_destroy destroy;
+};
+
+struct iommufd_ioctl_op {
+	unsigned int size;
+	unsigned int min_size;
+	unsigned int ioctl_num;
+	int (*execute)(struct iommufd_ucmd *ucmd);
+};
+
+#define IOCTL_OP(_ioctl, _fn, _struct, _last)                                  \
+	[_IOC_NR(_ioctl) - IOMMUFD_CMD_BASE] = {                               \
+		.size = sizeof(_struct) +                                      \
+			BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) <          \
+					  sizeof(_struct)),                    \
+		.min_size = offsetofend(_struct, _last),                       \
+		.ioctl_num = _ioctl,                                           \
+		.execute = _fn,                                                \
+	}
+static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
+	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+};
+
+static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
+			       unsigned long arg)
+{
+	const struct iommufd_ioctl_op *op;
+	struct iommufd_ucmd ucmd = {};
+	union ucmd_buffer buf;
+	unsigned int nr;
+	int ret;
+
+	ucmd.ictx = filp->private_data;
+	ucmd.ubuffer = (void __user *)arg;
+	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
+	if (ret)
+		return ret;
+
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return -ENOIOCTLCMD;
+	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
+	if (op->ioctl_num != cmd)
+		return -ENOIOCTLCMD;
+	if (ucmd.user_size < op->min_size)
+		return -EINVAL;
+
+	ucmd.cmd = &buf;
+	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
+				    ucmd.user_size);
+	if (ret)
+		return ret;
+	ret = op->execute(&ucmd);
+	return ret;
+}
+
+static const struct file_operations iommufd_fops = {
+	.owner = THIS_MODULE,
+	.open = iommufd_fops_open,
+	.release = iommufd_fops_release,
+	.unlocked_ioctl = iommufd_fops_ioctl,
+};
+
+/**
+ * iommufd_ctx_get - Get a context reference
+ * @ictx: Context to get
+ *
+ * The caller must already hold a valid reference to ictx.
+ */
+void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+	get_file(ictx->file);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_ctx_get, IOMMUFD);
+
+/**
+ * iommufd_ctx_from_file - Acquires a reference to the iommufd context
+ * @file: File to obtain the reference from
+ *
+ * Returns a pointer to the iommufd_ctx, otherwise an ERR_PTR. The struct file
+ * remains owned by the caller and the caller must still do fput(). On success
+ * the caller is responsible for calling iommufd_ctx_put().
+ */
+struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
+{
+	struct iommufd_ctx *ictx;
+
+	if (file->f_op != &iommufd_fops)
+		return ERR_PTR(-EBADFD);
+	ictx = file->private_data;
+	iommufd_ctx_get(ictx);
+	return ictx;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_ctx_from_file, IOMMUFD);
+
+/**
+ * iommufd_ctx_put - Put back a reference
+ * @ictx: Context to put back
+ */
+void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+	fput(ictx->file);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_ctx_put, IOMMUFD);
+
+static const struct iommufd_object_ops iommufd_object_ops[] = {
+};
+
+static struct miscdevice iommu_misc_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "iommu",
+	.fops = &iommufd_fops,
+	.nodename = "iommu",
+	.mode = 0660,
+};
+
+static int __init iommufd_init(void)
+{
+	int ret;
+
+	ret = misc_register(&iommu_misc_dev);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+static void __exit iommufd_exit(void)
+{
+	misc_deregister(&iommu_misc_dev);
+}
+
+module_init(iommufd_init);
+module_exit(iommufd_exit);
+
+MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 00000000000000..d1817472c27373
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+
+struct iommufd_ctx;
+struct file;
+
+void iommufd_ctx_get(struct iommufd_ctx *ictx);
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
+void iommufd_ctx_put(struct iommufd_ctx *ictx);
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
new file mode 100644
index 00000000000000..2ad06b27a35fe5
--- /dev/null
+++ b/include/uapi/linux/iommufd.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_H
+#define _UAPI_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl interface follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    a non-zero value in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics overflowed.
+ *
+ * Individual ioctls may return additional errnos beyond this list.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+#endif
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 06/17] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2022-11-08  0:48 ` [PATCH v4 05/17] iommufd: File descriptor, context, kconfig and makefiles Jason Gunthorpe
@ 2022-11-08  0:48 ` Jason Gunthorpe
  2022-11-08  0:49 ` [PATCH v4 07/17] iommufd: PFN handling for iopt_pages Jason Gunthorpe
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:48 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Following the pattern of io_uring, perf, skb, and bpf, iommufd will use
user->locked_vm for accounting pinned pages. Ensure the field is included
in the struct and export free_uid(), since iommufd is modular.

user->locked_vm is the correct counter to use for this ulimit because it is
per-user, and the security sandboxing of locked pages is not supposed to
be per-process. Other places (vfio, vdpa and infiniband) have used
mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this
is only per-process and inconsistent with the new FOLL_LONGTERM users in
the kernel.

Concurrent work is underway to try to put this in a cgroup, so everything
can be consistent and the kernel can provide a FOLL_LONGTERM limit that
actually provides security.
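
For illustration, the io_uring-style accounting this enables looks roughly
like the sketch below. This is a simplified example, not the exact iommufd
code (the real charging logic lands in a later patch of this series), and
example_charge_locked_vm() is a made-up name:

	/*
	 * Charge npages against the per-user RLIMIT_MEMLOCK budget using
	 * user_struct::locked_vm. The matching uncharge on unpin would be an
	 * atomic_long_sub(); error handling around the pinning itself is
	 * elided.
	 */
	static int example_charge_locked_vm(struct user_struct *user,
					    unsigned long npages)
	{
		unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		unsigned long cur, new;

		do {
			cur = atomic_long_read(&user->locked_vm);
			new = cur + npages;
			if (new > lock_limit)
				return -ENOMEM;
		} while (atomic_long_cmpxchg(&user->locked_vm, cur, new) != cur);
		return 0;
	}

Because the charge is held against the user_struct for the lifetime of the
pin, the module needs a reference on it, which is why free_uid() is
exported below.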

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/linux/sched/user.h | 2 +-
 kernel/user.c              | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index f054d0360a7533..4cc52698e214e2 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -25,7 +25,7 @@ struct user_struct {
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
 	defined(CONFIG_NET) || defined(CONFIG_IO_URING) || \
-	defined(CONFIG_VFIO_PCI_ZDEV_KVM)
+	defined(CONFIG_VFIO_PCI_ZDEV_KVM) || IS_ENABLED(CONFIG_IOMMUFD)
 	atomic_long_t locked_vm;
 #endif
 #ifdef CONFIG_WATCH_QUEUE
diff --git a/kernel/user.c b/kernel/user.c
index e2cf8c22b539a7..d667debeafd609 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
 		free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 07/17] iommufd: PFN handling for iopt_pages
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2022-11-08  0:48 ` [PATCH v4 06/17] kernel/user: Allow user::locked_vm to be usable for iommufd Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-11  9:56   ` Tian, Kevin
  2022-11-11 11:09   ` Tian, Kevin
  2022-11-08  0:49 ` [PATCH v4 08/17] iommufd: Algorithms for PFN storage Jason Gunthorpe
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The top of the data structure provides an IO Address Space (IOAS) that is
similar to a VFIO container. The IOAS allows map/unmap of memory into
ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables)
and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to
access the PFNs that those IOVA areas cover.

The IO Address Space (IOAS) datastructure is composed of:
 - struct io_pagetable holding the IOVA map
 - struct iopt_areas representing populated portions of IOVA
 - struct iopt_pages representing the storage of PFNs
 - struct iommu_domain representing each IO page table in the system IOMMU
 - struct iopt_pages_access representing in-kernel accesses of PFNs (ie
   VFIO mdevs)
 - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
   accesses

This patch introduces the lowest part of the datastructure - the movement
of PFNs in a tiered storage scheme:
 1) iopt_pages::pinned_pfns xarray
 2) Multiple iommu_domains
 3) The origin of the PFNs, i.e. the userspace pointer

PFNs have to be copied between all combinations of tiers, depending on the
configuration.

The interface is an iterator called a 'pfn_reader' which determines which
tier each PFN is stored in and loads it into a list of PFNs held in a
struct pfn_batch.

Each step of the iterator will fill up the pfn_batch, then the caller can
use the pfn_batch to send the PFNs to the required destination. Repeating
this loop will read all the PFNs in an IOVA range.
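
As an illustrative sketch of that loop (not verbatim code from this patch;
error unwinding of a partially mapped range is elided, and 'domain', 'area',
'pages', 'start_index' and 'last_index' are assumed to be in scope), a
caller that fills an iommu_domain with the helpers added here would look
roughly like:

	struct pfn_reader pfns;
	int rc;

	rc = pfn_reader_first(&pfns, pages, start_index, last_index);
	if (rc)
		return rc;
	while (!pfn_reader_done(&pfns)) {
		/* Send the current batch to its destination tier */
		rc = batch_to_domain(&pfns.batch, domain, area,
				     pfns.batch_start_index);
		if (rc)
			break;
		rc = pfn_reader_next(&pfns);
		if (rc)
			break;
	}
	if (!rc)
		rc = pfn_reader_update_pinned(&pfns);
	pfn_reader_destroy(&pfns);
	return rc;

Real callers of this pattern appear in the following patches of the series.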

The pfn_reader and pfn_batch also keep track of the pinned page accounting.

While PFNs are always stored and accessed as full PAGE_SIZE units, the
iommu_domain tier can store with a sub-page offset/length to support
IOMMUs with a smaller IOPTE size than PAGE_SIZE.
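
As a concrete illustration (hypothetical numbers): with a 64KiB PAGE_SIZE
and an IOMMU using 4KiB IOPTEs, an area whose IOVA begins 0x1000 bytes into
its first CPU page records page_offset = 0x1000. The domain mapping for that
first index then starts at PFN_PHYS(pfn) + 0x1000 and covers only the
remaining PAGE_SIZE - 0x1000 bytes of the page, while subsequent indexes are
mapped at full PAGE_SIZE granularity. This works because the IOMMU's smaller
IOPTE size lets it map at finer granularity than the CPU page size.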

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .clang-format                           |   1 +
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/double_span.h     |  98 +++
 drivers/iommu/iommufd/io_pagetable.h    | 109 +++
 drivers/iommu/iommufd/iommufd_private.h |  24 +
 drivers/iommu/iommufd/pages.c           | 999 ++++++++++++++++++++++++
 include/linux/iommufd.h                 |   7 +
 7 files changed, 1240 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/double_span.h
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/pages.c

diff --git a/.clang-format b/.clang-format
index 96d07786dcfb46..501241f8977664 100644
--- a/.clang-format
+++ b/.clang-format
@@ -440,6 +440,7 @@ ForEachMacros:
   - 'inet_lhash2_for_each_icsk'
   - 'inet_lhash2_for_each_icsk_continue'
   - 'inet_lhash2_for_each_icsk_rcu'
+  - 'interval_tree_for_each_double_span'
   - 'interval_tree_for_each_span'
   - 'intlist__for_each_entry'
   - 'intlist__for_each_entry_safe'
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index a07a8cffe937c6..05a0e91e30afad 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
-	main.o
+	main.o \
+	pages.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/double_span.h b/drivers/iommu/iommufd/double_span.h
new file mode 100644
index 00000000000000..1d7fa31f9b6cc2
--- /dev/null
+++ b/drivers/iommu/iommufd/double_span.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef __IOMMUFD_DOUBLE_SPAN_H
+#define __IOMMUFD_DOUBLE_SPAN_H
+
+#include <linux/interval_tree.h>
+
+struct interval_tree_double_span_iter {
+	struct rb_root_cached *itrees[2];
+	struct interval_tree_span_iter spans[2];
+	union {
+		unsigned long start_hole;
+		unsigned long start_used;
+	};
+	union {
+		unsigned long last_hole;
+		unsigned long last_used;
+	};
+	/* 0 = hole, 1 = used span[0], 2 = used span[1], -1 done iteration */
+	int is_used;
+};
+
+static void interval_tree_double_span_iter_update(
+	struct interval_tree_double_span_iter *iter)
+{
+	unsigned long last_hole = ULONG_MAX;
+	unsigned int i;
+
+	for (i = 0; i != ARRAY_SIZE(iter->spans); i++) {
+		if (interval_tree_span_iter_done(&iter->spans[i])) {
+			iter->is_used = -1;
+			return;
+		}
+
+		if (iter->spans[i].is_hole) {
+			last_hole = min(last_hole, iter->spans[i].last_hole);
+			continue;
+		}
+
+		iter->is_used = i + 1;
+		iter->start_used = iter->spans[i].start_used;
+		iter->last_used = min(iter->spans[i].last_used, last_hole);
+		return;
+	}
+
+	iter->is_used = 0;
+	iter->start_hole = iter->spans[0].start_hole;
+	iter->last_hole =
+		min(iter->spans[0].last_hole, iter->spans[1].last_hole);
+}
+
+static void interval_tree_double_span_iter_first(
+	struct interval_tree_double_span_iter *iter,
+	struct rb_root_cached *itree1, struct rb_root_cached *itree2,
+	unsigned long first_index, unsigned long last_index)
+{
+	unsigned int i;
+
+	iter->itrees[0] = itree1;
+	iter->itrees[1] = itree2;
+	for (i = 0; i != ARRAY_SIZE(iter->spans); i++)
+		interval_tree_span_iter_first(&iter->spans[i], iter->itrees[i],
+					      first_index, last_index);
+	interval_tree_double_span_iter_update(iter);
+}
+
+static void
+interval_tree_double_span_iter_next(struct interval_tree_double_span_iter *iter)
+{
+	unsigned int i;
+
+	if (iter->is_used == -1 ||
+	    iter->last_hole == iter->spans[0].last_index) {
+		iter->is_used = -1;
+		return;
+	}
+
+	for (i = 0; i != ARRAY_SIZE(iter->spans); i++)
+		interval_tree_span_iter_advance(
+			&iter->spans[i], iter->itrees[i], iter->last_hole + 1);
+	interval_tree_double_span_iter_update(iter);
+}
+
+static inline bool
+interval_tree_double_span_iter_done(struct interval_tree_double_span_iter *state)
+{
+	return state->is_used == -1;
+}
+
+#define interval_tree_for_each_double_span(span, itree1, itree2, first_index, \
+					   last_index)                        \
+	for (interval_tree_double_span_iter_first(span, itree1, itree2,       \
+						  first_index, last_index);   \
+	     !interval_tree_double_span_iter_done(span);                      \
+	     interval_tree_double_span_iter_next(span))
+
+#endif
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
new file mode 100644
index 00000000000000..b74bf01ffc52c2
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ */
+#ifndef __IO_PAGETABLE_H
+#define __IO_PAGETABLE_H
+
+#include <linux/interval_tree.h>
+#include <linux/mutex.h>
+#include <linux/kref.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+
+struct iommu_domain;
+
+/*
+ * Each io_pagetable is composed of intervals of areas which cover regions of
+ * the iova that are backed by something. iova not covered by areas is not
+ * populated in the page table. Each area is fully populated with pages.
+ *
+ * iovas are in byte units, but must be iopt->iova_alignment aligned.
+ *
+ * pages can be NULL; this means some other thread is still working on setting
+ * up or tearing down the area. When observed under the write side of the
+ * domain_rwsem a NULL pages must mean the area is still being set up and no
+ * domains are filled.
+ *
+ * storage_domain points at an arbitrary iommu_domain that is holding the PFNs
+ * for this area. It is locked by the pages->mutex. This simplifies the locking
+ * as the pages code can rely on the storage_domain without having to get the
+ * iopt->domains_rwsem.
+ *
+ * The io_pagetable::iova_rwsem protects node
+ * The iopt_pages::mutex protects pages_node
+ * iopt and iommu_prot are immutable
+ * The pages::mutex protects num_accesses
+ */
+struct iopt_area {
+	struct interval_tree_node node;
+	struct interval_tree_node pages_node;
+	struct io_pagetable *iopt;
+	struct iopt_pages *pages;
+	struct iommu_domain *storage_domain;
+	/* How many bytes into the first page the area starts */
+	unsigned int page_offset;
+	/* IOMMU_READ, IOMMU_WRITE, etc */
+	int iommu_prot;
+	unsigned int num_accesses;
+};
+
+static inline unsigned long iopt_area_index(struct iopt_area *area)
+{
+	return area->pages_node.start;
+}
+
+static inline unsigned long iopt_area_last_index(struct iopt_area *area)
+{
+	return area->pages_node.last;
+}
+
+static inline unsigned long iopt_area_iova(struct iopt_area *area)
+{
+	return area->node.start;
+}
+
+static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
+{
+	return area->node.last;
+}
+
+enum {
+	IOPT_PAGES_ACCOUNT_NONE = 0,
+	IOPT_PAGES_ACCOUNT_USER = 1,
+	IOPT_PAGES_ACCOUNT_MM = 2,
+};
+
+/*
+ * This holds a pinned page list for multiple areas of IO address space. The
+ * pages always originate from a linear chunk of userspace VA. Multiple
+ * io_pagetable's, through their iopt_area's, can share a single iopt_pages
+ * which avoids multi-pinning and double accounting of page consumption.
+ *
+ * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
+ * the start of the uptr and extend to npages. pages are pinned dynamically
+ * according to the intervals in the access_itree and domains_itree, npinned
+ * records the current number of pages pinned.
+ */
+struct iopt_pages {
+	struct kref kref;
+	struct mutex mutex;
+	size_t npages;
+	size_t npinned;
+	size_t last_npinned;
+	struct task_struct *source_task;
+	struct mm_struct *source_mm;
+	struct user_struct *source_user;
+	void __user *uptr;
+	bool writable:1;
+	u8 account_mode;
+
+	struct xarray pinned_pfns;
+	/* Of iopt_pages_access::node */
+	struct rb_root_cached access_itree;
+	/* Of iopt_area::pages_node */
+	struct rb_root_cached domains_itree;
+};
+
+#endif
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d523e7967b1440..6ed6f99ee439fa 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -14,6 +14,30 @@ struct iommufd_ctx {
 	struct xarray objects;
 };
 
+/*
+ * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
+ * domains and permits sharing of PFNs between io_pagetable instances. This
+ * supports both a design where IOAS's are 1:1 with a domain (eg because the
+ * domain is HW customized), or where the IOAS is 1:N with multiple generic
+ * domains.  The io_pagetable holds an interval tree of iopt_areas which point
+ * to shared iopt_pages which hold the pfns mapped to the page table.
+ *
+ * The locking order is domains_rwsem -> iova_rwsem -> pages::mutex
+ */
+struct io_pagetable {
+	struct rw_semaphore domains_rwsem;
+	struct xarray domains;
+	unsigned int next_domain_id;
+
+	struct rw_semaphore iova_rwsem;
+	struct rb_root_cached area_itree;
+	/* IOVA that cannot become reserved, struct iopt_allowed */
+	struct rb_root_cached allowed_itree;
+	/* IOVA that cannot be allocated, struct iopt_reserved */
+	struct rb_root_cached reserved_itree;
+	u8 disable_large_pages;
+};
+
 struct iommufd_ucmd {
 	struct iommufd_ctx *ictx;
 	void __user *ubuffer;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
new file mode 100644
index 00000000000000..87e2bd78ae91e8
--- /dev/null
+++ b/drivers/iommu/iommufd/pages.c
@@ -0,0 +1,999 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The iopt_pages is the center of the storage and motion of PFNs. Each
+ * iopt_pages represents a logical linear array of full PFNs. The array is 0
+ * based and has npages in it. Accessors use 'index' to refer to the entry in
+ * this logical array, regardless of its storage location.
+ *
+ * PFNs are stored in a tiered scheme:
+ *  1) iopt_pages::pinned_pfns xarray
+ *  2) An iommu_domain
+ *  3) The origin of the PFNs, i.e. the userspace pointer
+ *
+ * PFNs have to be copied between all combinations of tiers, depending on the
+ * configuration.
+ *
+ * When a PFN is taken out of the userspace pointer it is pinned exactly once.
+ * The storage locations of the PFN's index are tracked in the two interval
+ * trees. If no interval includes the index then it is not pinned.
+ *
+ * If access_itree includes the PFN's index then an in-kernel access has
+ * requested the page. The PFN is stored in the xarray so other requestors can
+ * continue to find it.
+ *
+ * If the domains_itree includes the PFN's index then an iommu_domain is storing
+ * the PFN and it can be read back using iommu_iova_to_phys(). To avoid
+ * duplicating storage the xarray is not used if only iommu_domains are using
+ * the PFN's index.
+ *
+ * As a general principle this is designed so that destroy never fails. This
+ * means removing an iommu_domain or releasing an in-kernel access will not fail
+ * due to insufficient memory. In practice this means some cases have to hold
+ * PFNs in the xarray even though they are also being stored in an iommu_domain.
+ *
+ * While the iopt_pages can use an iommu_domain as storage, it does not have an
+ * IOVA itself. Instead the iopt_area represents a range of IOVA and uses the
+ * iopt_pages as the PFN provider. Multiple iopt_areas can share the iopt_pages
+ * and reference their own slice of the PFN array, with sub page granularity.
+ *
+ * In this file the term 'last' indicates an inclusive and closed interval, eg
+ * [0,0] refers to a single PFN. 'end' means an open range, eg [0,0) refers to
+ * no PFNs.
+ *
+ * Be cautious of overflow; an IOVA can go all the way up to U64_MAX, so
+ * last_iova + 1 can overflow. An iopt_pages index will always be much less than
+ * ULONG_MAX, so last_index + 1 cannot overflow.
+ */
+#include <linux/overflow.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+#include <linux/highmem.h>
+#include <linux/kthread.h>
+#include <linux/iommufd.h>
+
+#include "io_pagetable.h"
+#include "double_span.h"
+
+#define TEMP_MEMORY_LIMIT 65536
+#define BATCH_BACKUP_SIZE 32
+
+/*
+ * More memory makes pin_user_pages() and the batching more efficient, but as
+ * this is only a performance optimization don't try too hard to get it. A 64k
+ * allocation can hold about 26M of 4k pages and 13G of 2M pages in a
+ * pfn_batch. Various destroy paths cannot fail and provide a small amount of
+ * stack memory as a backup contingency. If backup_len is given this cannot
+ * fail.
+ */
+static void *temp_kmalloc(size_t *size, void *backup, size_t backup_len)
+{
+	void *res;
+
+	if (WARN_ON(*size == 0))
+		return NULL;
+
+	if (*size < backup_len)
+		return backup;
+	*size = min_t(size_t, *size, TEMP_MEMORY_LIMIT);
+	res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+	if (res)
+		return res;
+	*size = PAGE_SIZE;
+	if (backup_len) {
+		res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (res)
+			return res;
+		*size = backup_len;
+		return backup;
+	}
+	return kmalloc(*size, GFP_KERNEL);
+}
+
+static void iopt_pages_add_npinned(struct iopt_pages *pages, size_t npages)
+{
+	pages->npinned += npages;
+}
+
+static void iopt_pages_sub_npinned(struct iopt_pages *pages, size_t npages)
+{
+	pages->npinned -= npages;
+}
+
+static void iopt_pages_err_unpin(struct iopt_pages *pages,
+				 unsigned long start_index,
+				 unsigned long last_index,
+				 struct page **page_list)
+{
+	unsigned long npages = last_index - start_index + 1;
+
+	unpin_user_pages(page_list, npages);
+	iopt_pages_sub_npinned(pages, npages);
+}
+
+/*
+ * index is the number of PAGE_SIZE units from the start of the area's
+ * iopt_pages. If the iova is sub page-size then the area has an iova that
+ * covers a portion of the first and last pages in the range.
+ */
+static unsigned long iopt_area_index_to_iova(struct iopt_area *area,
+					     unsigned long index)
+{
+	index -= iopt_area_index(area);
+	if (index == 0)
+		return iopt_area_iova(area);
+	return iopt_area_iova(area) - area->page_offset + index * PAGE_SIZE;
+}
+
+static unsigned long iopt_area_index_to_iova_last(struct iopt_area *area,
+						  unsigned long index)
+{
+	if (index == iopt_area_last_index(area))
+		return iopt_area_last_iova(area);
+	return iopt_area_iova(area) - area->page_offset +
+	       (index - iopt_area_index(area) + 1) * PAGE_SIZE - 1;
+}
+
+static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
+			       size_t size)
+{
+	size_t ret;
+
+	ret = iommu_unmap(domain, iova, size);
+	/*
+	 * It is a logic error in this code or a driver bug if the IOMMU unmaps
+	 * something other than exactly as requested. This implies that the
+	 * iommu driver may not fail unmap for reasons beyond bad arguments.
+	 * Particularly, the iommu driver may not do a memory allocation on the
+	 * unmap path.
+	 */
+	WARN_ON(ret != size);
+}
+
+static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
+						     unsigned long index)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(&pages->domains_itree, index, index);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, pages_node);
+}
+
+/*
+ * A simple datastructure to hold a vector of PFNs, optimized for contiguous
+ * PFNs. This is used as a temporary holding memory for shuttling pfns from one
+ * place to another. Generally everything is made more efficient if operations
+ * work on the largest possible grouping of pfns. eg fewer lock/unlock cycles,
+ * better cache locality, etc
+ */
+struct pfn_batch {
+	unsigned long *pfns;
+	u16 *npfns;
+	unsigned int array_size;
+	unsigned int end;
+	unsigned int total_pfns;
+};
+
+static void batch_clear(struct pfn_batch *batch)
+{
+	batch->total_pfns = 0;
+	batch->end = 0;
+	batch->pfns[0] = 0;
+	batch->npfns[0] = 0;
+}
+
+/*
+ * Carry means we carry a portion of the final hugepage over to the front of the
+ * batch.
+ */
+static void batch_clear_carry(struct pfn_batch *batch, unsigned int keep_pfns)
+{
+	if (!keep_pfns)
+		return batch_clear(batch);
+
+	batch->total_pfns = keep_pfns;
+	batch->npfns[0] = keep_pfns;
+	batch->pfns[0] = batch->pfns[batch->end - 1] +
+			 (batch->npfns[batch->end - 1] - keep_pfns);
+	batch->end = 0;
+}
+
+static void batch_skip_carry(struct pfn_batch *batch, unsigned int skip_pfns)
+{
+	if (!batch->total_pfns)
+		return;
+	skip_pfns = min(batch->total_pfns, skip_pfns);
+	batch->pfns[0] += skip_pfns;
+	batch->npfns[0] -= skip_pfns;
+	batch->total_pfns -= skip_pfns;
+}
+
+static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
+			size_t backup_len)
+{
+	const size_t elmsz = sizeof(*batch->pfns) + sizeof(*batch->npfns);
+	size_t size = max_pages * elmsz;
+
+	batch->pfns = temp_kmalloc(&size, backup, backup_len);
+	if (!batch->pfns)
+		return -ENOMEM;
+	batch->array_size = size / elmsz;
+	batch->npfns = (u16 *)(batch->pfns + batch->array_size);
+	batch_clear(batch);
+	return 0;
+}
+
+static int batch_init(struct pfn_batch *batch, size_t max_pages)
+{
+	return __batch_init(batch, max_pages, NULL, 0);
+}
+
+static void batch_init_backup(struct pfn_batch *batch, size_t max_pages,
+			      void *backup, size_t backup_len)
+{
+	__batch_init(batch, max_pages, backup, backup_len);
+}
+
+static void batch_destroy(struct pfn_batch *batch, void *backup)
+{
+	if (batch->pfns != backup)
+		kfree(batch->pfns);
+}
+
+/* true if the pfn could be added, false otherwise */
+static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+{
+	/* FIXME: U16 is too small */
+	if (batch->end &&
+	    pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
+	    batch->npfns[batch->end - 1] != U16_MAX) {
+		batch->npfns[batch->end - 1]++;
+		batch->total_pfns++;
+		return true;
+	}
+	if (batch->end == batch->array_size)
+		return false;
+	batch->total_pfns++;
+	batch->pfns[batch->end] = pfn;
+	batch->npfns[batch->end] = 1;
+	batch->end++;
+	return true;
+}
+
+/*
+ * Fill the batch with pfns from the domain. When the batch is full, or it
+ * reaches last_index, the function will return. The caller should use
+ * batch->total_pfns to determine the starting point for the next iteration.
+ */
+static void batch_from_domain(struct pfn_batch *batch,
+			      struct iommu_domain *domain,
+			      struct iopt_area *area, unsigned long start_index,
+			      unsigned long last_index)
+{
+	unsigned int page_offset = 0;
+	unsigned long iova;
+	phys_addr_t phys;
+
+	iova = iopt_area_index_to_iova(area, start_index);
+	if (start_index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	while (start_index <= last_index) {
+		/*
+		 * This is pretty slow, it would be nice to get the page size
+		 * back from the driver, or have the driver directly fill the
+		 * batch.
+		 */
+		phys = iommu_iova_to_phys(domain, iova) - page_offset;
+		if (!batch_add_pfn(batch, PHYS_PFN(phys)))
+			return;
+		iova += PAGE_SIZE - page_offset;
+		page_offset = 0;
+		start_index++;
+	}
+}
+
+static struct page **raw_pages_from_domain(struct iommu_domain *domain,
+					   struct iopt_area *area,
+					   unsigned long start_index,
+					   unsigned long last_index,
+					   struct page **out_pages)
+{
+	unsigned int page_offset = 0;
+	unsigned long iova;
+	phys_addr_t phys;
+
+	iova = iopt_area_index_to_iova(area, start_index);
+	if (start_index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	while (start_index <= last_index) {
+		phys = iommu_iova_to_phys(domain, iova) - page_offset;
+		*(out_pages++) = pfn_to_page(PHYS_PFN(phys));
+		iova += PAGE_SIZE - page_offset;
+		page_offset = 0;
+		start_index++;
+	}
+	return out_pages;
+}
+
+/* Continues reading a domain until we reach a discontiguity in the pfns. */
+static void batch_from_domain_continue(struct pfn_batch *batch,
+				       struct iommu_domain *domain,
+				       struct iopt_area *area,
+				       unsigned long start_index,
+				       unsigned long last_index)
+{
+	unsigned int array_size = batch->array_size;
+
+	batch->array_size = batch->end;
+	batch_from_domain(batch, domain, area, start_index, last_index);
+	batch->array_size = array_size;
+}
+
+/*
+ * This is part of the VFIO compatibility support for VFIO_TYPE1_IOMMU. That
+ * mode permits splitting a mapped area up, and then one of the splits is
+ * unmapped. Doing this normally would cause us to violate our invariant of
+ * pairing map/unmap. Thus, to support old VFIO compatibility disable support
+ * for batching consecutive PFNs.
+ */
+static int batch_iommu_map_small(struct iommu_domain *domain,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t size, int prot)
+{
+	unsigned long start_iova = iova;
+	int rc;
+
+	while (size) {
+		rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot);
+		if (rc)
+			goto err_unmap;
+		iova += PAGE_SIZE;
+		paddr += PAGE_SIZE;
+		size -= PAGE_SIZE;
+	}
+	return 0;
+
+err_unmap:
+	if (start_iova != iova)
+		iommu_unmap_nofail(domain, start_iova, iova - start_iova);
+	return rc;
+}
+
+static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
+			   struct iopt_area *area, unsigned long start_index)
+{
+	bool disable_large_pages = area->iopt->disable_large_pages;
+	unsigned long last_iova = iopt_area_last_iova(area);
+	unsigned int page_offset = 0;
+	unsigned long start_iova;
+	unsigned long next_iova;
+	unsigned int cur = 0;
+	unsigned long iova;
+	int rc;
+
+	/* The first index might be a partial page */
+	if (start_index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	next_iova = iova = start_iova =
+		iopt_area_index_to_iova(area, start_index);
+	while (cur < batch->end) {
+		next_iova = min(last_iova + 1,
+				next_iova + batch->npfns[cur] * PAGE_SIZE -
+					page_offset);
+		if (disable_large_pages)
+			rc = batch_iommu_map_small(
+				domain, iova,
+				PFN_PHYS(batch->pfns[cur]) + page_offset,
+				next_iova - iova, area->iommu_prot);
+		else
+			rc = iommu_map(domain, iova,
+				       PFN_PHYS(batch->pfns[cur]) + page_offset,
+				       next_iova - iova, area->iommu_prot);
+		if (rc)
+			goto err_unmap;
+		iova = next_iova;
+		page_offset = 0;
+		cur++;
+	}
+	return 0;
+err_unmap:
+	if (start_iova != iova)
+		iommu_unmap_nofail(domain, start_iova, iova - start_iova);
+	return rc;
+}
+
+static void batch_from_xarray(struct pfn_batch *batch, struct xarray *xa,
+			      unsigned long start_index,
+			      unsigned long last_index)
+{
+	XA_STATE(xas, xa, start_index);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		if (!batch_add_pfn(batch, xa_to_value(entry)) ||
+		    start_index == last_index)
+			break;
+		start_index++;
+	}
+	rcu_read_unlock();
+}
+
+static void batch_from_xarray_clear(struct pfn_batch *batch, struct xarray *xa,
+				    unsigned long start_index,
+				    unsigned long last_index)
+{
+	XA_STATE(xas, xa, start_index);
+	void *entry;
+
+	xas_lock(&xas);
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		if (!batch_add_pfn(batch, xa_to_value(entry)))
+			break;
+		xas_store(&xas, NULL);
+		if (start_index == last_index)
+			break;
+		start_index++;
+	}
+	xas_unlock(&xas);
+}
+
+static void clear_xarray(struct xarray *xa, unsigned long start_index,
+			 unsigned long last_index)
+{
+	XA_STATE(xas, xa, start_index);
+	void *entry;
+
+	xas_lock(&xas);
+	xas_for_each(&xas, entry, last_index)
+		xas_store(&xas, NULL);
+	xas_unlock(&xas);
+}
+
+static int pages_to_xarray(struct xarray *xa, unsigned long start_index,
+			   unsigned long last_index, struct page **pages)
+{
+	struct page **end_pages = pages + (last_index - start_index) + 1;
+	XA_STATE(xas, xa, start_index);
+
+	do {
+		void *old;
+
+		xas_lock(&xas);
+		while (pages != end_pages) {
+			old = xas_store(&xas, xa_mk_value(page_to_pfn(*pages)));
+			if (xas_error(&xas))
+				break;
+			WARN_ON(old);
+			pages++;
+			xas_next(&xas);
+		}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	if (xas_error(&xas)) {
+		if (xas.xa_index != start_index)
+			clear_xarray(xa, start_index, xas.xa_index - 1);
+		return xas_error(&xas);
+	}
+	return 0;
+}
+
+static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
+			     size_t npages)
+{
+	struct page **end = pages + npages;
+
+	for (; pages != end; pages++)
+		if (!batch_add_pfn(batch, page_to_pfn(*pages)))
+			break;
+}
+
+static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
+			unsigned int offset, size_t npages)
+{
+	unsigned int cur = 0;
+
+	while (offset) {
+		if (batch->npfns[cur] > offset)
+			break;
+		offset -= batch->npfns[cur];
+		cur++;
+	}
+
+	while (npages) {
+		size_t to_unpin =
+			min_t(size_t, npages, batch->npfns[cur] - offset);
+
+		unpin_user_page_range_dirty_lock(
+			pfn_to_page(batch->pfns[cur] + offset), to_unpin,
+			pages->writable);
+		iopt_pages_sub_npinned(pages, to_unpin);
+		cur++;
+		offset = 0;
+		npages -= to_unpin;
+	}
+}
+
+static void copy_data_page(struct page *page, void *data, unsigned long offset,
+			   size_t length, unsigned int flags)
+{
+	void *mem;
+
+	mem = kmap_local_page(page);
+	if (flags & IOMMUFD_ACCESS_RW_WRITE) {
+		memcpy(mem + offset, data, length);
+		set_page_dirty_lock(page);
+	} else {
+		memcpy(data, mem + offset, length);
+	}
+	kunmap_local(mem);
+}
+
+static unsigned long batch_rw(struct pfn_batch *batch, void *data,
+			      unsigned long offset, unsigned long length,
+			      unsigned int flags)
+{
+	unsigned long copied = 0;
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	while (cur < batch->end) {
+		unsigned long bytes = min(length, PAGE_SIZE - offset);
+
+		copy_data_page(pfn_to_page(batch->pfns[cur] + npage), data,
+			       offset, bytes, flags);
+		offset = 0;
+		length -= bytes;
+		data += bytes;
+		copied += bytes;
+		npage++;
+		if (npage == batch->npfns[cur]) {
+			npage = 0;
+			cur++;
+		}
+		if (!length)
+			break;
+	}
+	return copied;
+}
+
+/* pfn_reader_user is just the pin_user_pages() path */
+struct pfn_reader_user {
+	struct page **upages;
+	size_t upages_len;
+	unsigned long upages_start;
+	unsigned long upages_end;
+	unsigned int gup_flags;
+	int locked;
+};
+
+static void pfn_reader_user_init(struct pfn_reader_user *user,
+				 struct iopt_pages *pages)
+{
+	user->upages = NULL;
+	user->upages_start = 0;
+	user->upages_end = 0;
+	user->locked = -1;
+
+	if (pages->writable) {
+		user->gup_flags = FOLL_LONGTERM | FOLL_WRITE;
+	} else {
+		/* Still need to break COWs on read */
+		user->gup_flags = FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE;
+	}
+}
+
+static void pfn_reader_user_destroy(struct pfn_reader_user *user,
+				    struct iopt_pages *pages)
+{
+	if (user->locked != -1) {
+		if (user->locked)
+			mmap_read_unlock(pages->source_mm);
+		if (pages->source_mm != current->mm)
+			mmput(pages->source_mm);
+		user->locked = 0;
+	}
+
+	kfree(user->upages);
+	user->upages = NULL;
+}
+
+static int pfn_reader_user_pin(struct pfn_reader_user *user,
+			       struct iopt_pages *pages,
+			       unsigned long start_index,
+			       unsigned long last_index)
+{
+	bool remote_mm = pages->source_mm != current->mm;
+	unsigned long npages;
+	uintptr_t uptr;
+	long rc;
+
+	if (!user->upages) {
+		/* All undone in pfn_reader_destroy() */
+		user->upages_len =
+			(last_index - start_index + 1) * sizeof(*user->upages);
+		user->upages = temp_kmalloc(&user->upages_len, NULL, 0);
+		if (!user->upages)
+			return -ENOMEM;
+	}
+
+	if (user->locked == -1) {
+		/*
+		 * The majority of usages will run the map task within the mm
+		 * providing the pages, so we can optimize into
+		 * pin_user_pages_fast().
+		 */
+		if (remote_mm) {
+			if (!mmget_not_zero(pages->source_mm))
+				return -EFAULT;
+		}
+		user->locked = 0;
+	}
+
+	npages = min_t(unsigned long, last_index - start_index + 1,
+		       user->upages_len / sizeof(*user->upages));
+
+	uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
+	if (!remote_mm)
+		rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
+					 user->upages);
+	else {
+		if (!user->locked) {
+			mmap_read_lock(pages->source_mm);
+			user->locked = 1;
+		}
+		/*
+		 * FIXME: last NULL can be &pfns->locked once the GUP patch
+		 * is merged.
+		 */
+		rc = pin_user_pages_remote(pages->source_mm, uptr, npages,
+					   user->gup_flags, user->upages, NULL,
+					   NULL);
+	}
+	if (rc <= 0) {
+		if (WARN_ON(!rc))
+			return -EFAULT;
+		return rc;
+	}
+	iopt_pages_add_npinned(pages, rc);
+	user->upages_start = start_index;
+	user->upages_end = start_index + rc;
+	return 0;
+}
+
+/* This is the "modern" and faster accounting method used by io_uring */
+static int incr_user_locked_vm(struct iopt_pages *pages, unsigned long npages)
+{
+	unsigned long lock_limit;
+	unsigned long cur_pages;
+	unsigned long new_pages;
+
+	lock_limit = task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
+		     PAGE_SHIFT;
+	npages = pages->npinned - pages->last_npinned;
+	do {
+		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
+		new_pages = cur_pages + npages;
+		if (new_pages > lock_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+	return 0;
+}
+
+static void decr_user_locked_vm(struct iopt_pages *pages, unsigned long npages)
+{
+	if (WARN_ON(atomic_long_read(&pages->source_user->locked_vm) < npages))
+		return;
+	atomic_long_sub(npages, &pages->source_user->locked_vm);
+}
+
+/* This is the accounting method used for compatibility with VFIO */
+static int update_mm_locked_vm(struct iopt_pages *pages, unsigned long npages,
+			       bool inc, struct pfn_reader_user *user)
+{
+	bool do_put = false;
+	int rc;
+
+	if (user && user->locked) {
+		mmap_read_unlock(pages->source_mm);
+		user->locked = 0;
+		/* If we had the lock then we also have a get */
+	} else if ((!user || !user->upages) &&
+		   pages->source_mm != current->mm) {
+		if (!mmget_not_zero(pages->source_mm))
+			return -EINVAL;
+		do_put = true;
+	}
+
+	mmap_write_lock(pages->source_mm);
+	rc = __account_locked_vm(pages->source_mm, npages, inc,
+				 pages->source_task, false);
+	mmap_write_unlock(pages->source_mm);
+
+	if (do_put)
+		mmput(pages->source_mm);
+	return rc;
+}
+
+static int do_update_pinned(struct iopt_pages *pages, unsigned long npages,
+			    bool inc, struct pfn_reader_user *user)
+{
+	int rc = 0;
+
+	switch (pages->account_mode) {
+	case IOPT_PAGES_ACCOUNT_NONE:
+		break;
+	case IOPT_PAGES_ACCOUNT_USER:
+		if (inc)
+			rc = incr_user_locked_vm(pages, npages);
+		else
+			decr_user_locked_vm(pages, npages);
+		break;
+	case IOPT_PAGES_ACCOUNT_MM:
+		rc = update_mm_locked_vm(pages, npages, inc, user);
+		break;
+	}
+	if (rc)
+		return rc;
+
+	pages->last_npinned = pages->npinned;
+	if (inc)
+		atomic64_add(npages, &pages->source_mm->pinned_vm);
+	else
+		atomic64_sub(npages, &pages->source_mm->pinned_vm);
+	return 0;
+}
+
+static void update_unpinned(struct iopt_pages *pages)
+{
+	if (WARN_ON(pages->npinned > pages->last_npinned))
+		return;
+	if (pages->npinned == pages->last_npinned)
+		return;
+	do_update_pinned(pages, pages->last_npinned - pages->npinned, false,
+			 NULL);
+}
+
+/*
+ * Changes in the number of pages pinned are done after the pages have been read
+ * and processed. If the user exceeded the limit then the error unwind will unpin
+ * everything that was just pinned. This is because it is expensive to calculate
+ * how many pages we have already pinned within a range to generate an accurate
+ * prediction in advance of doing the work to actually pin them.
+ */
+static int pfn_reader_user_update_pinned(struct pfn_reader_user *user,
+					 struct iopt_pages *pages)
+{
+	unsigned long npages;
+	bool inc;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->npinned == pages->last_npinned)
+		return 0;
+
+	if (pages->npinned < pages->last_npinned) {
+		npages = pages->last_npinned - pages->npinned;
+		inc = false;
+	} else {
+		npages = pages->npinned - pages->last_npinned;
+		inc = true;
+	}
+	return do_update_pinned(pages, npages, inc, user);
+}
+
+/*
+ * PFNs are stored in three places, in order of preference:
+ * - The iopt_pages xarray. This is only populated if there is an
+ *   iopt_pages_access
+ * - The iommu_domain under an area
+ * - The original PFN source, ie pages->source_mm
+ *
+ * This iterator reads the pfns optimizing to load according to the
+ * above order.
+ */
+struct pfn_reader {
+	struct iopt_pages *pages;
+	struct interval_tree_double_span_iter span;
+	struct pfn_batch batch;
+	unsigned long batch_start_index;
+	unsigned long batch_end_index;
+	unsigned long last_index;
+
+	struct pfn_reader_user user;
+};
+
+static int pfn_reader_update_pinned(struct pfn_reader *pfns)
+{
+	return pfn_reader_user_update_pinned(&pfns->user, pfns->pages);
+}
+
+/*
+ * The batch can contain a mixture of pages that are still in use and pages that
+ * need to be unpinned. Unpin only pages that are not held anywhere else.
+ */
+static void pfn_reader_unpin(struct pfn_reader *pfns)
+{
+	unsigned long last = pfns->batch_end_index - 1;
+	unsigned long start = pfns->batch_start_index;
+	struct interval_tree_double_span_iter span;
+	struct iopt_pages *pages = pfns->pages;
+
+	lockdep_assert_held(&pages->mutex);
+
+	interval_tree_for_each_double_span(&span, &pages->access_itree,
+					   &pages->domains_itree, start, last) {
+		if (span.is_used)
+			continue;
+
+		batch_unpin(&pfns->batch, pages, span.start_hole - start,
+			    span.last_hole - span.start_hole + 1);
+	}
+}
+
+/* Process a single span in the access_itree */
+static int pfn_reader_fill_span(struct pfn_reader *pfns)
+{
+	struct interval_tree_double_span_iter *span = &pfns->span;
+	unsigned long start_index = pfns->batch_end_index;
+	struct iopt_area *area;
+	int rc;
+
+	if (span->is_used == 1) {
+		batch_from_xarray(&pfns->batch, &pfns->pages->pinned_pfns,
+				  start_index, span->last_used);
+		return 0;
+	}
+
+	if (span->is_used == 2) {
+		/*
+		 * Pull as many pages from the first domain we find in the
+		 * target span. If it is too small then we will be called again
+		 * and we'll find another area.
+		 */
+		area = iopt_pages_find_domain_area(pfns->pages, start_index);
+		if (WARN_ON(!area))
+			return -EINVAL;
+
+		/* The storage_domain cannot change without the pages mutex */
+		batch_from_domain(
+			&pfns->batch, area->storage_domain, area, start_index,
+			min(iopt_area_last_index(area), span->last_used));
+		return 0;
+	}
+
+	if (start_index >= pfns->user.upages_end) {
+		rc = pfn_reader_user_pin(&pfns->user, pfns->pages, start_index,
+					 span->last_hole);
+		if (rc)
+			return rc;
+	}
+
+	batch_from_pages(&pfns->batch,
+			 pfns->user.upages +
+				 (start_index - pfns->user.upages_start),
+			 pfns->user.upages_end - start_index);
+	return 0;
+}
+
+static bool pfn_reader_done(struct pfn_reader *pfns)
+{
+	return pfns->batch_start_index == pfns->last_index + 1;
+}
+
+static int pfn_reader_next(struct pfn_reader *pfns)
+{
+	int rc;
+
+	batch_clear(&pfns->batch);
+	pfns->batch_start_index = pfns->batch_end_index;
+
+	while (pfns->batch_end_index != pfns->last_index + 1) {
+		unsigned int npfns = pfns->batch.total_pfns;
+
+		rc = pfn_reader_fill_span(pfns);
+		if (rc)
+			return rc;
+
+		if (WARN_ON(!pfns->batch.total_pfns))
+			return -EINVAL;
+
+		pfns->batch_end_index =
+			pfns->batch_start_index + pfns->batch.total_pfns;
+		if (pfns->batch_end_index == pfns->span.last_used + 1)
+			interval_tree_double_span_iter_next(&pfns->span);
+
+		/* Batch is full */
+		if (npfns == pfns->batch.total_pfns)
+			return 0;
+	}
+	return 0;
+}
+
+static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
+			   unsigned long start_index, unsigned long last_index)
+{
+	int rc;
+
+	lockdep_assert_held(&pages->mutex);
+
+	pfns->pages = pages;
+	pfns->batch_start_index = start_index;
+	pfns->batch_end_index = start_index;
+	pfns->last_index = last_index;
+	pfn_reader_user_init(&pfns->user, pages);
+	rc = batch_init(&pfns->batch, last_index - start_index + 1);
+	if (rc)
+		return rc;
+	interval_tree_double_span_iter_first(&pfns->span, &pages->access_itree,
+					     &pages->domains_itree, start_index,
+					     last_index);
+	return 0;
+}
+
+/*
+ * There are many assertions regarding the state of pages->npinned vs
+ * pages->last_npinned, for instance something like unmapping a domain must only
+ * decrement the npinned, and pfn_reader_destroy() must be called only after all
+ * the pins are updated. This is fine for success flows, but error flows
+ * sometimes need to release the pins held inside the pfn_reader before going on
+ * to complete unmapping and releasing pins held in domains.
+ */
+static void pfn_reader_release_pins(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+
+	if (pfns->user.upages_end > pfns->batch_end_index) {
+		size_t npages = pfns->user.upages_end - pfns->batch_end_index;
+
+		/* Any pages not transferred to the batch are just unpinned */
+		unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
+						      pfns->user.upages_start),
+				 npages);
+		iopt_pages_sub_npinned(pages, npages);
+		pfns->user.upages_end = pfns->batch_end_index;
+	}
+	if (pfns->batch_start_index != pfns->batch_end_index) {
+		pfn_reader_unpin(pfns);
+		pfns->batch_start_index = pfns->batch_end_index;
+	}
+}
+
+static void pfn_reader_destroy(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+
+	pfn_reader_release_pins(pfns);
+	pfn_reader_user_destroy(&pfns->user, pfns->pages);
+	batch_destroy(&pfns->batch, NULL);
+	WARN_ON(pages->last_npinned != pages->npinned);
+}
+
+static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
+			    unsigned long start_index, unsigned long last_index)
+{
+	int rc;
+
+	rc = pfn_reader_init(pfns, pages, start_index, last_index);
+	if (rc)
+		return rc;
+	rc = pfn_reader_next(pfns);
+	if (rc) {
+		pfn_reader_destroy(pfns);
+		return rc;
+	}
+	return 0;
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index d1817472c27373..26e09d539737bb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -13,6 +13,13 @@
 struct iommufd_ctx;
 struct file;
 
+enum {
+	IOMMUFD_ACCESS_RW_READ = 0,
+	IOMMUFD_ACCESS_RW_WRITE = 1 << 0,
+	/* Set if the caller is in a kthread; rw will then use kthread_use_mm() */
+	IOMMUFD_ACCESS_RW_KTHREAD = 1 << 1,
+};
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 08/17] iommufd: Algorithms for PFN storage
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 07/17] iommufd: PFN handling for iopt_pages Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-14  5:50   ` Tian, Kevin
  2022-11-14 19:19   ` [PATCH v4 8/17] " Jason Gunthorpe
  2022-11-08  0:49 ` [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping Jason Gunthorpe
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

An iopt_pages represents a logical linear list of full PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and accesses.

The three storage tiers are managed to meet these objectives:

 - If no iommu_domain or in-kernel access exists then minimal memory
   should be consumed by iommufd
 - If a page has been pinned then an iopt_pages will not pin it again
 - If an in-kernel access exists then the xarray must provide the backing
   storage to avoid allocations on domain removals
 - Otherwise any iommu_domain will be used for storage

In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.

The external interface for pages has several logical operations:

  iopt_area_fill_domain() will load the PFNs from storage into a single
  domain. This is used when attaching a new domain to an existing IOAS.

  iopt_area_fill_domains() will load the PFNs from storage into multiple
  domains. This is used when creating a new IOVA map in an existing IOAS

  iopt_pages_add_access() creates an iopt_pages_access that tracks an
  in-kernel access of PFNs. This is some external driver that might be
  accessing the IOVA using the CPU, or programming PFNs with the DMA
  API, e.g. a VFIO mdev.

  iopt_pages_rw_access() directly performs a memcpy on the PFNs, without
  the overhead of iopt_pages_add_access() (a usage sketch follows this
  list).

  iopt_pages_fill_xarray() will load PFNs into the xarray and return a
  'struct page *' array. It is used by iopt_pages_access's to extract PFNs
  for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
  is known the xarray is already filled.
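
As a usage sketch only (example_rmw() is hypothetical; the byte offset and
buffer size are placeholders), a transient in-kernel consumer could drive the
rw interface like this, adding IOMMUFD_ACCESS_RW_KTHREAD when calling from a
kthread:

  /* Hypothetical helper: read-modify-write 64 bytes at a byte offset */
  static int example_rmw(struct iopt_pages *pages, unsigned long start_byte)
  {
          char buf[64];
          int rc;

          rc = iopt_pages_rw_access(pages, start_byte, buf, sizeof(buf),
                                    IOMMUFD_ACCESS_RW_READ);
          if (rc)
                  return rc;
          /* ... modify buf ... */
          return iopt_pages_rw_access(pages, start_byte, buf, sizeof(buf),
                                      IOMMUFD_ACCESS_RW_WRITE);
  }

A longer-lived user would instead bracket its use with iopt_pages_add_access()
and iopt_pages_remove_access() so the PFNs stay pinned in the xarray between
accesses.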

As an iopt_pages can be referred to in slices by many areas and accesses, it
uses interval trees to keep track of which storage tiers currently hold the
PFNs. On a page-by-page basis any request for a PFN is satisfied from one of
the storage tiers and the PFN is copied to the target domain/array. For
example, if an access covers page indexes 0-99 and a domain covers 100-255,
a request for 50-150 is filled from the xarray for 50-99 and read back from
the domain for 100-150.

Unfill actions are similar: on a page-by-page basis domains are unmapped,
xarray entries are freed, or struct pages are fully put back.

Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs always
generates the largest contiguous PFN range possible, and this gathering can
cross storage tier boundaries. For cases like 'fill domains' care is taken to
avoid duplicated work: PFNs are read once and pushed into all domains.
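
Purely for illustration (the toy_* names are hypothetical), the gathering step
amounts to run-length encoding physically contiguous PFNs. A simplified model
of batch_add_pfn(), ignoring the per-run counter width limit that pages.c
enforces via MAX_NPFNS, looks like:

  #define TOY_BATCH_RUNS 512

  struct toy_batch {
          unsigned long pfns[TOY_BATCH_RUNS]; /* first PFN of each run */
          unsigned int npfns[TOY_BATCH_RUNS]; /* number of PFNs in each run */
          unsigned int end;                   /* number of runs stored */
  };

  /* Returns false when the batch is full and must be flushed by the caller */
  static bool toy_batch_add_pfn(struct toy_batch *batch, unsigned long pfn)
  {
          /* Extend the previous run if this PFN is physically contiguous */
          if (batch->end &&
              pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1]) {
                  batch->npfns[batch->end - 1]++;
                  return true;
          }
          if (batch->end == TOY_BATCH_RUNS)
                  return false;
          batch->pfns[batch->end] = pfn;
          batch->npfns[batch->end] = 1;
          batch->end++;
          return true;
  }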

The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.

This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the pfns lists instead of working with one driver call per page
to read, and one driver call per contiguous range to store.

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/io_pagetable.h |  74 +++
 drivers/iommu/iommufd/pages.c        | 823 ++++++++++++++++++++++++++-
 2 files changed, 893 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index b74bf01ffc52c2..aebb9a496258ce 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -49,6 +49,15 @@ struct iopt_area {
 	unsigned int num_accesses;
 };
 
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain);
+void iopt_area_unmap_domain(struct iopt_area *area,
+			    struct iommu_domain *domain);
+
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
 	return area->pages_node.start;
@@ -69,6 +78,39 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
 	return area->node.last;
 }
 
+static inline size_t iopt_area_length(struct iopt_area *area)
+{
+	return (area->node.last - area->node.start) + 1;
+}
+
+#define __make_iopt_iter(name)                                                 \
+	static inline struct iopt_##name *iopt_##name##_iter_first(            \
+		struct io_pagetable *iopt, unsigned long start,                \
+		unsigned long last)                                            \
+	{                                                                      \
+		struct interval_tree_node *node;                               \
+									       \
+		lockdep_assert_held(&iopt->iova_rwsem);                        \
+		node = interval_tree_iter_first(&iopt->name##_itree, start,    \
+						last);                         \
+		if (!node)                                                     \
+			return NULL;                                           \
+		return container_of(node, struct iopt_##name, node);           \
+	}                                                                      \
+	static inline struct iopt_##name *iopt_##name##_iter_next(             \
+		struct iopt_##name *last_node, unsigned long start,            \
+		unsigned long last)                                            \
+	{                                                                      \
+		struct interval_tree_node *node;                               \
+									       \
+		node = interval_tree_iter_next(&last_node->node, start, last); \
+		if (!node)                                                     \
+			return NULL;                                           \
+		return container_of(node, struct iopt_##name, node);           \
+	}
+
+__make_iopt_iter(area)
+
 enum {
 	IOPT_PAGES_ACCOUNT_NONE = 0,
 	IOPT_PAGES_ACCOUNT_USER = 1,
@@ -106,4 +148,36 @@ struct iopt_pages {
 	struct rb_root_cached domains_itree;
 };
 
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable);
+void iopt_release_pages(struct kref *kref);
+static inline void iopt_put_pages(struct iopt_pages *pages)
+{
+	kref_put(&pages->kref, iopt_release_pages);
+}
+
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages);
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages);
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last);
+
+int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages,
+			unsigned int flags);
+void iopt_pages_remove_access(struct iopt_area *area, unsigned long start,
+			    unsigned long last);
+int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
+			 void *data, unsigned long length, unsigned int flags);
+
+/*
+ * Each interval represents an active iopt_access_pages(), it acts as an
+ * interval lock that keeps the PFNs pinned and stored in the xarray.
+ */
+struct iopt_pages_access {
+	struct interval_tree_node node;
+	refcount_t refcount;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 87e2bd78ae91e8..d8f3f2a07e8378 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -151,6 +151,18 @@ static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iopt_area_unmap_domain_range(struct iopt_area *area,
+					 struct iommu_domain *domain,
+					 unsigned long start_index,
+					 unsigned long last_index)
+{
+	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
+
+	iommu_unmap_nofail(domain, start_iova,
+			   iopt_area_index_to_iova_last(area, last_index) -
+				   start_iova + 1);
+}
+
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
 						     unsigned long index)
 {
@@ -171,7 +183,7 @@ static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
  */
 struct pfn_batch {
 	unsigned long *pfns;
-	u16 *npfns;
+	u32 *npfns;
 	unsigned int array_size;
 	unsigned int end;
 	unsigned int total_pfns;
@@ -221,7 +233,7 @@ static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
 	if (!batch->pfns)
 		return -ENOMEM;
 	batch->array_size = size / elmsz;
-	batch->npfns = (u16 *)(batch->pfns + batch->array_size);
+	batch->npfns = (u32 *)(batch->pfns + batch->array_size);
 	batch_clear(batch);
 	return 0;
 }
@@ -246,10 +258,11 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
 /* true if the pfn could be added, false otherwise */
 static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
 {
-	/* FIXME: U16 is too small */
+	const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));
+
 	if (batch->end &&
 	    pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
-	    batch->npfns[batch->end - 1] != U16_MAX) {
+	    batch->npfns[batch->end - 1] != MAX_NPFNS) {
 		batch->npfns[batch->end - 1]++;
 		batch->total_pfns++;
 		return true;
@@ -997,3 +1010,805 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
 	}
 	return 0;
 }
+
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable)
+{
+	struct iopt_pages *pages;
+
+	/*
+	 * The iommu API uses size_t as the length, and this check also
+	 * protects the DIV_ROUND_UP below from overflow.
+	 */
+	if (length > SIZE_MAX - PAGE_SIZE || length == 0)
+		return ERR_PTR(-EINVAL);
+
+	pages = kzalloc(sizeof(*pages), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&pages->kref);
+	xa_init_flags(&pages->pinned_pfns, XA_FLAGS_ACCOUNT);
+	mutex_init(&pages->mutex);
+	pages->source_mm = current->mm;
+	mmgrab(pages->source_mm);
+	pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+	pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+	pages->access_itree = RB_ROOT_CACHED;
+	pages->domains_itree = RB_ROOT_CACHED;
+	pages->writable = writable;
+	if (capable(CAP_IPC_LOCK))
+		pages->account_mode = IOPT_PAGES_ACCOUNT_NONE;
+	else
+		pages->account_mode = IOPT_PAGES_ACCOUNT_USER;
+	pages->source_task = current->group_leader;
+	get_task_struct(current->group_leader);
+	pages->source_user = get_uid(current_user());
+	return pages;
+}
+
+void iopt_release_pages(struct kref *kref)
+{
+	struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
+
+	WARN_ON(!RB_EMPTY_ROOT(&pages->access_itree.rb_root));
+	WARN_ON(!RB_EMPTY_ROOT(&pages->domains_itree.rb_root));
+	WARN_ON(pages->npinned);
+	WARN_ON(!xa_empty(&pages->pinned_pfns));
+	mmdrop(pages->source_mm);
+	mutex_destroy(&pages->mutex);
+	put_task_struct(pages->source_task);
+	free_uid(pages->source_user);
+	kfree(pages);
+}
+
+static void
+iopt_area_unpin_domain(struct pfn_batch *batch, struct iopt_area *area,
+		       struct iopt_pages *pages, struct iommu_domain *domain,
+		       unsigned long start_index, unsigned long last_index,
+		       unsigned long *unmapped_end_index,
+		       unsigned long real_last_index)
+{
+	while (start_index <= last_index) {
+		unsigned long batch_last_index;
+
+		if (*unmapped_end_index <= last_index) {
+			unsigned long start =
+				max(start_index, *unmapped_end_index);
+
+			batch_from_domain(batch, domain, area, start,
+					  last_index);
+			batch_last_index = start + batch->total_pfns - 1;
+		} else {
+			batch_last_index = last_index;
+		}
+
+		/*
+		 * unmaps must always 'cut' at a place where the pfns are not
+		 * contiguous to pair with the maps that always install
+		 * contiguous pages. Thus, if we have to stop unpinning in the
+		 * middle of the domains we need to keep reading pfns until we
+		 * find a cut point to do the unmap. The pfns we read are
+		 * carried over and either skipped or integrated into the next
+		 * batch.
+		 */
+		if (batch_last_index == last_index &&
+		    last_index != real_last_index)
+			batch_from_domain_continue(batch, domain, area,
+						   last_index + 1,
+						   real_last_index);
+
+		if (*unmapped_end_index <= batch_last_index) {
+			iopt_area_unmap_domain_range(
+				area, domain, *unmapped_end_index,
+				start_index + batch->total_pfns - 1);
+			*unmapped_end_index = start_index + batch->total_pfns;
+		}
+
+		/* unpin must follow unmap */
+		batch_unpin(batch, pages, 0,
+			    batch_last_index - start_index + 1);
+		start_index = batch_last_index + 1;
+
+		batch_clear_carry(batch,
+				  *unmapped_end_index - batch_last_index - 1);
+	}
+}
+
+static void __iopt_area_unfill_domain(struct iopt_area *area,
+				      struct iopt_pages *pages,
+				      struct iommu_domain *domain,
+				      unsigned long last_index)
+{
+	struct interval_tree_double_span_iter span;
+	unsigned long start_index = iopt_area_index(area);
+	unsigned long unmapped_end_index = start_index;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+
+	lockdep_assert_held(&pages->mutex);
+
+	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
+	interval_tree_for_each_double_span(&span, &pages->domains_itree,
+					   &pages->access_itree, start_index,
+					   last_index) {
+		if (span.is_used) {
+			batch_skip_carry(&batch,
+					 span.last_used - span.start_used + 1);
+			continue;
+		}
+		iopt_area_unpin_domain(&batch, area, pages, domain,
+				       span.start_hole, span.last_hole,
+				       &unmapped_end_index, last_index);
+	}
+	if (unmapped_end_index != last_index + 1)
+		iopt_area_unmap_domain_range(area, domain, unmapped_end_index,
+					     last_index);
+	WARN_ON(batch.total_pfns);
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+static void iopt_area_unfill_partial_domain(struct iopt_area *area,
+					    struct iopt_pages *pages,
+					    struct iommu_domain *domain,
+					    unsigned long end_index)
+{
+	if (end_index != iopt_area_index(area))
+		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+}
+
+/**
+ * iopt_area_unmap_domain() - Unmap without unpinning PFNs in a domain
+ * @area: The IOVA range to unmap
+ * @domain: The domain to unmap
+ *
+ * The caller must know that unpinning is not required, usually because there
+ * are other domains in the iopt.
+ */
+void iopt_area_unmap_domain(struct iopt_area *area, struct iommu_domain *domain)
+{
+	iommu_unmap_nofail(domain, iopt_area_iova(area),
+			   iopt_area_length(area));
+}
+
+/**
+ * iopt_area_unfill_domain() - Unmap and unpin PFNs in a domain
+ * @area: IOVA area to use
+ * @pages: page supplier for the area (area->pages is NULL)
+ * @domain: Domain to unmap from
+ *
+ * The domain should be removed from the domains_itree before calling. The
+ * domain will always be unmapped, but the PFNs may not be unpinned if there are
+ * still accesses.
+ */
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain)
+{
+	__iopt_area_unfill_domain(area, pages, domain,
+				  iopt_area_last_index(area));
+}
+
+/**
+ * iopt_area_fill_domain() - Map PFNs from the area into a domain
+ * @area: IOVA area to use
+ * @domain: Domain to load PFNs into
+ *
+ * Read the pfns from the area's underlying iopt_pages and map them into the
+ * given domain. Called when attaching a new domain to an io_pagetable.
+ */
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
+{
+	unsigned long done_end_index;
+	struct pfn_reader pfns;
+	int rc;
+
+	lockdep_assert_held(&area->pages->mutex);
+
+	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		return rc;
+
+	while (!pfn_reader_done(&pfns)) {
+		done_end_index = pfns.batch_start_index;
+		rc = batch_to_domain(&pfns.batch, domain, area,
+				     pfns.batch_start_index);
+		if (rc)
+			goto out_unmap;
+		done_end_index = pfns.batch_end_index;
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+
+	rc = pfn_reader_update_pinned(&pfns);
+	if (rc)
+		goto out_unmap;
+	goto out_destroy;
+
+out_unmap:
+	pfn_reader_release_pins(&pfns);
+	iopt_area_unfill_partial_domain(area, area->pages, domain,
+					done_end_index);
+out_destroy:
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+/**
+ * iopt_area_fill_domains() - Install PFNs into the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area creation. The area is freshly created and not inserted in
+ * the domains_itree yet. PFNs are read and loaded into every domain held in the
+ * area's io_pagetable and the area is installed in the domains_itree.
+ *
+ * On failure all domains are left unchanged.
+ */
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	unsigned long done_first_end_index;
+	unsigned long done_all_end_index;
+	struct iommu_domain *domain;
+	unsigned long unmap_index;
+	struct pfn_reader pfns;
+	unsigned long index;
+	int rc;
+
+	lockdep_assert_held(&area->iopt->domains_rwsem);
+
+	if (xa_empty(&area->iopt->domains))
+		return 0;
+
+	mutex_lock(&pages->mutex);
+	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		done_first_end_index = pfns.batch_end_index;
+		done_all_end_index = pfns.batch_start_index;
+		xa_for_each(&area->iopt->domains, index, domain) {
+			rc = batch_to_domain(&pfns.batch, domain, area,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_unmap;
+		}
+		done_all_end_index = done_first_end_index;
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+	rc = pfn_reader_update_pinned(&pfns);
+	if (rc)
+		goto out_unmap;
+
+	area->storage_domain = xa_load(&area->iopt->domains, 0);
+	interval_tree_insert(&area->pages_node, &pages->domains_itree);
+	goto out_destroy;
+
+out_unmap:
+	pfn_reader_release_pins(&pfns);
+	xa_for_each(&area->iopt->domains, unmap_index, domain) {
+		unsigned long end_index;
+
+		if (unmap_index < index)
+			end_index = done_first_end_index;
+		else
+			end_index = done_all_end_index;
+
+		/*
+		 * The area is not yet part of the domains_itree so we have to
+		 * manage the unpinning specially. The last domain does the
+		 * unpin, every other domain is just unmapped.
+		 */
+		if (unmap_index != area->iopt->next_domain_id - 1) {
+			if (end_index != iopt_area_index(area))
+				iopt_area_unmap_domain_range(
+					area, domain, iopt_area_index(area),
+					end_index - 1);
+		} else {
+			iopt_area_unfill_partial_domain(area, pages, domain,
+							end_index);
+		}
+	}
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_area_unfill_domains() - unmap PFNs from the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area destruction. This unmaps the iova's covered by all the
+ * area's domains and releases the PFNs.
+ */
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct io_pagetable *iopt = area->iopt;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	mutex_lock(&pages->mutex);
+	if (!area->storage_domain)
+		goto out_unlock;
+
+	xa_for_each(&iopt->domains, index, domain)
+		if (domain != area->storage_domain)
+			iopt_area_unmap_domain_range(
+				area, domain, iopt_area_index(area),
+				iopt_area_last_index(area));
+
+	interval_tree_remove(&area->pages_node, &pages->domains_itree);
+	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	area->storage_domain = NULL;
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
+
+static void iopt_pages_unpin_xarray(struct pfn_batch *batch,
+				    struct iopt_pages *pages,
+				    unsigned long start_index,
+				    unsigned long end_index)
+{
+	while (start_index <= end_index) {
+		batch_from_xarray_clear(batch, &pages->pinned_pfns, start_index,
+					end_index);
+		batch_unpin(batch, pages, 0, batch->total_pfns);
+		start_index += batch->total_pfns;
+		batch_clear(batch);
+	}
+}
+
+/**
+ * iopt_pages_unfill_xarray() - Update the xarray after removing an access
+ * @pages: The pages to act on
+ * @start_index: Starting PFN index
+ * @last_index: Last PFN index
+ *
+ * Called when an iopt_pages_access is removed, removes pages from the itree.
+ * The access should already be removed from the access_itree.
+ */
+void iopt_pages_unfill_xarray(struct iopt_pages *pages,
+			      unsigned long start_index,
+			      unsigned long last_index)
+{
+	struct interval_tree_double_span_iter span;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+	bool batch_inited = false;
+
+	lockdep_assert_held(&pages->mutex);
+
+	interval_tree_for_each_double_span(&span, &pages->access_itree,
+					   &pages->domains_itree, start_index,
+					   last_index) {
+		if (!span.is_used) {
+			if (!batch_inited) {
+				batch_init_backup(&batch,
+						  last_index - start_index + 1,
+						  backup, sizeof(backup));
+				batch_inited = true;
+			}
+			iopt_pages_unpin_xarray(&batch, pages, span.start_hole,
+						span.last_hole);
+		} else if (span.is_used == 2) {
+			/* Covered by a domain */
+			clear_xarray(&pages->pinned_pfns, span.start_used,
+				     span.last_used);
+		}
+		/* Otherwise covered by an existing access */
+	}
+	if (batch_inited)
+		batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+/**
+ * iopt_pages_fill_from_xarray() - Fast path for reading PFNs
+ * @pages: The pages to act on
+ * @start_index: The first page index in the range
+ * @last_index: The last page index in the range
+ * @out_pages: The output array to return the pages
+ *
+ * This can be called if the caller is holding a refcount on an
+ * iopt_pages_access that is known to have already been filled. It quickly reads
+ * the pages directly from the xarray.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages,
+				 unsigned long start_index,
+				 unsigned long last_index,
+				 struct page **out_pages)
+{
+	XA_STATE(xas, &pages->pinned_pfns, start_index);
+	void *entry;
+
+	rcu_read_lock();
+	while (start_index <= last_index) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		*(out_pages++) = pfn_to_page(xa_to_value(entry));
+		start_index++;
+	}
+	rcu_read_unlock();
+}
+
+static int iopt_pages_fill_from_domain(struct iopt_pages *pages,
+				       unsigned long start_index,
+				       unsigned long last_index,
+				       struct page **out_pages)
+{
+	while (start_index != last_index + 1) {
+		unsigned long domain_last;
+		struct iopt_area *area;
+
+		area = iopt_pages_find_domain_area(pages, start_index);
+		if (WARN_ON(!area))
+			return -EINVAL;
+
+		domain_last = min(iopt_area_last_index(area), last_index);
+		out_pages = raw_pages_from_domain(area->storage_domain, area,
+						  start_index, domain_last,
+						  out_pages);
+		start_index = domain_last + 1;
+	}
+	return 0;
+}
+
+static int iopt_pages_fill_from_mm(struct iopt_pages *pages,
+				   struct pfn_reader_user *user,
+				   unsigned long start_index,
+				   unsigned long last_index,
+				   struct page **out_pages)
+{
+	unsigned long cur_index = start_index;
+	int rc;
+
+	while (cur_index != last_index + 1) {
+		user->upages = out_pages + (cur_index - start_index);
+		rc = pfn_reader_user_pin(user, pages, cur_index, last_index);
+		if (rc)
+			goto out_unpin;
+		cur_index = user->upages_end;
+	}
+	return 0;
+
+out_unpin:
+	if (start_index != cur_index)
+		iopt_pages_err_unpin(pages, start_index, cur_index - 1,
+				     out_pages);
+	return rc;
+}
+
+/**
+ * iopt_pages_fill_xarray() - Read PFNs
+ * @pages: The pages to act on
+ * @start_index: The first page index in the range
+ * @last_index: The last page index in the range
+ * @out_pages: The output array to return the pages, may be NULL
+ *
+ * This populates the xarray and returns the pages in out_pages. As the slow
+ * path this is able to copy pages from other storage tiers into the xarray.
+ *
+ * On failure the xarray is left unchanged.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
+			   unsigned long last_index, struct page **out_pages)
+{
+	struct interval_tree_double_span_iter span;
+	unsigned long xa_end = start_index;
+	struct pfn_reader_user user;
+	int rc;
+
+	lockdep_assert_held(&pages->mutex);
+
+	pfn_reader_user_init(&user, pages);
+	user.upages_len = (last_index - start_index + 1) * sizeof(*out_pages);
+	interval_tree_for_each_double_span(&span, &pages->access_itree,
+					   &pages->domains_itree, start_index,
+					   last_index) {
+		struct page **cur_pages;
+
+		if (span.is_used == 1) {
+			cur_pages = out_pages + (span.start_used - start_index);
+			iopt_pages_fill_from_xarray(pages, span.start_used,
+						    span.last_used, cur_pages);
+			continue;
+		}
+
+		if (span.is_used == 2) {
+			cur_pages = out_pages + (span.start_used - start_index);
+			iopt_pages_fill_from_domain(pages, span.start_used,
+						    span.last_used, cur_pages);
+			rc = pages_to_xarray(&pages->pinned_pfns,
+					     span.start_used, span.last_used,
+					     cur_pages);
+			if (rc)
+				goto out_clean_xa;
+			xa_end = span.last_used + 1;
+			continue;
+		}
+
+		/* hole */
+		cur_pages = out_pages + (span.start_hole - start_index);
+		rc = iopt_pages_fill_from_mm(pages, &user, span.start_hole,
+					     span.last_hole, cur_pages);
+		if (rc)
+			goto out_clean_xa;
+		rc = pages_to_xarray(&pages->pinned_pfns, span.start_hole,
+				     span.last_hole, cur_pages);
+		if (rc) {
+			iopt_pages_err_unpin(pages, span.start_hole,
+					     span.last_hole, cur_pages);
+			goto out_clean_xa;
+		}
+		xa_end = span.last_hole + 1;
+	}
+	rc = pfn_reader_user_update_pinned(&user, pages);
+	if (rc)
+		goto out_clean_xa;
+	user.upages = NULL;
+	pfn_reader_user_destroy(&user, pages);
+	return 0;
+
+out_clean_xa:
+	if (start_index != xa_end)
+		iopt_pages_unfill_xarray(pages, start_index, xa_end - 1);
+	user.upages = NULL;
+	pfn_reader_user_destroy(&user, pages);
+	return rc;
+}
+
+/*
+ * This can do everything and is fully coherent with what a iommu_domain would
+ * see.
+ */
+static int iopt_pages_rw_slow(struct iopt_pages *pages,
+			      unsigned long start_index,
+			      unsigned long last_index, unsigned long offset,
+			      void *data, unsigned long length,
+			      unsigned int flags)
+{
+	struct pfn_reader pfns;
+	int rc;
+
+	mutex_lock(&pages->mutex);
+
+	rc = pfn_reader_first(&pfns, pages, start_index, last_index);
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		unsigned long done;
+
+		done = batch_rw(&pfns.batch, data, offset, length, flags);
+		data += done;
+		length -= done;
+		offset = 0;
+		pfn_reader_unpin(&pfns);
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_destroy;
+	}
+	if (WARN_ON(length != 0))
+		rc = -EINVAL;
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/*
+ * A medium speed path that still allows DMA decoherence, but doesn't do any
+ * memory allocations or interval tree searches.
+ */
+static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
+			      unsigned long offset, void *data,
+			      unsigned long length, unsigned int flags)
+{
+	struct page *page = NULL;
+	int rc;
+
+	if (!mmget_not_zero(pages->source_mm))
+		return iopt_pages_rw_slow(pages, index, index, offset, data,
+					  length, flags);
+
+	mmap_read_lock(pages->source_mm);
+	rc = pin_user_pages_remote(
+		pages->source_mm, (uintptr_t)(pages->uptr + index * PAGE_SIZE),
+		1, (flags & IOMMUFD_ACCESS_RW_WRITE) ? FOLL_WRITE : 0, &page,
+		NULL, NULL);
+	mmap_read_unlock(pages->source_mm);
+	if (rc != 1) {
+		if (WARN_ON(rc >= 0))
+			rc = -EINVAL;
+		goto out_mmput;
+	}
+	copy_data_page(page, data, offset, length, flags);
+	unpin_user_page(page);
+	rc = 0;
+
+out_mmput:
+	mmput(pages->source_mm);
+	return rc;
+}
+
+/**
+ * iopt_pages_rw_access - Copy to/from a linear slice of the pages
+ * @pages: pages to act on
+ * @start_byte: First byte of pages to copy to/from
+ * @data: Kernel buffer to get/put the data
+ * @length: Number of bytes to copy
+ * @flags: IOMMUFD_ACCESS_RW_* flags
+ *
+ * This will find each page in the range, kmap it and then memcpy to/from
+ * the given kernel buffer.
+ */
+int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
+			 void *data, unsigned long length, unsigned int flags)
+{
+	unsigned long start_index = start_byte / PAGE_SIZE;
+	unsigned long last_index = (start_byte + length - 1) / PAGE_SIZE;
+	bool change_mm = current->mm != pages->source_mm;
+	int rc = 0;
+
+	if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
+		return -EPERM;
+
+	if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) {
+		if (start_index == last_index)
+			return iopt_pages_rw_page(pages, start_index,
+						  start_byte % PAGE_SIZE, data,
+						  length, flags);
+		return iopt_pages_rw_slow(pages, start_index, last_index,
+					  start_byte % PAGE_SIZE, data, length,
+					  flags);
+	}
+
+	/*
+	 * Try to copy using copy_to_user(). We do this as a fast path and
+	 * ignore any pinning decoherence, unlike a real DMA path.
+	 */
+	if (change_mm) {
+		if (!mmget_not_zero(pages->source_mm))
+			return iopt_pages_rw_slow(pages, start_index,
+						  last_index,
+						  start_byte % PAGE_SIZE, data,
+						  length, flags);
+		kthread_use_mm(pages->source_mm);
+	}
+
+	if (flags & IOMMUFD_ACCESS_RW_WRITE) {
+		if (copy_to_user(pages->uptr + start_byte, data, length))
+			rc = -EFAULT;
+	} else {
+		if (copy_from_user(data, pages->uptr + start_byte, length))
+			rc = -EFAULT;
+	}
+
+	if (change_mm) {
+		kthread_unuse_mm(pages->source_mm);
+		mmput(pages->source_mm);
+	}
+
+	return rc;
+}
+
+static struct iopt_pages_access *
+iopt_pages_get_exact_access(struct iopt_pages *pages, unsigned long index,
+			    unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* There can be overlapping ranges in this interval tree */
+	for (node = interval_tree_iter_first(&pages->access_itree, index, last);
+	     node; node = interval_tree_iter_next(node, index, last))
+		if (node->start == index && node->last == last)
+			return container_of(node, struct iopt_pages_access,
+					    node);
+	return NULL;
+}
+
+/**
+ * iopt_pages_add_access() - Record an in-kernel access for PFNs
+ * @pages: The source of PFNs
+ * @start_index: First page index
+ * @last_index: Inclusive last page index
+ * @out_pages: Output list of struct page's representing the PFNs
+ * @flags: IOMMUFD_ACCESS_RW_* flags
+ *
+ * Record that an in-kernel access will be accessing the pages, ensure they are
+ * pinned, and return the PFNs as a simple list of 'struct page *'.
+ *
+ * This should be undone through a matching call to iopt_pages_remove_access()
+ */
+int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start_index,
+			  unsigned long last_index, struct page **out_pages,
+			  unsigned int flags)
+{
+	struct iopt_pages_access *access;
+	int rc;
+
+	if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
+		return -EPERM;
+
+	access = iopt_pages_get_exact_access(pages, start_index, last_index);
+	if (access) {
+		refcount_inc(&access->refcount);
+		iopt_pages_fill_from_xarray(pages, start_index, last_index,
+					    out_pages);
+		return 0;
+	}
+
+	access = kzalloc(sizeof(*access), GFP_KERNEL_ACCOUNT);
+	if (!access)
+		return -ENOMEM;
+
+	rc = iopt_pages_fill_xarray(pages, start_index, last_index, out_pages);
+	if (rc)
+		goto out_free;
+
+	access->node.start = start_index;
+	access->node.last = last_index;
+	refcount_set(&access->refcount, 1);
+	interval_tree_insert(&access->node, &pages->access_itree);
+	return 0;
+
+out_free:
+	kfree(access);
+	return rc;
+}
+
+/**
+ * iopt_pages_remove_access() - Release an in-kernel access for PFNs
+ * @area: The source of PFNs
+ * @start_index: First page index
+ * @last_index: Inclusive last page index
+ *
+ * Undo iopt_pages_add_access() and unpin the pages if necessary. The caller
+ * must stop using the PFNs before calling this.
+ */
+void iopt_pages_remove_access(struct iopt_area *area, unsigned long start_index,
+			      unsigned long last_index)
+{
+	struct iopt_pages_access *access;
+	struct iopt_pages *pages = area->pages;
+
+	mutex_lock(&pages->mutex);
+	access = iopt_pages_get_exact_access(pages, start_index, last_index);
+	if (WARN_ON(!access))
+		goto out_unlock;
+
+	WARN_ON(area->num_accesses == 0);
+	area->num_accesses--;
+
+	if (!refcount_dec_and_test(&access->refcount))
+		goto out_unlock;
+
+	interval_tree_remove(&access->node, &pages->access_itree);
+	iopt_pages_unfill_xarray(pages, start_index, last_index);
+	kfree(access);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 08/17] iommufd: Algorithms for PFN storage Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-14  7:28   ` Tian, Kevin
  2022-11-08  0:49 ` [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the remainder of the IOAS data structure. Provide an object called
an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
along with a list of iommu_domains that mirror the IOVA to PFN map.

At the top this is a simple interval tree of iopt_areas indicating the map
of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
on the attached domains there is a minimum alignment for areas (which may
be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't
be mapped, and an interval tree of allowed IOVA that can always be mapped.
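
The real struct is defined in this patch's header changes; as a rough
orientation sketch only (types approximated from how the fields are used in
the code below, other members omitted), the lookup state amounts to:

  struct io_pagetable {
          struct rw_semaphore     domains_rwsem;
          struct xarray           domains;        /* attached iommu_domains */
          unsigned int            next_domain_id;

          struct rw_semaphore     iova_rwsem;
          struct rb_root_cached   area_itree;     /* IOVA -> iopt_area */
          struct rb_root_cached   allowed_itree;  /* IOVA that may be allocated */
          struct rb_root_cached   reserved_itree; /* IOVA that must not be used */
          unsigned long           iova_alignment; /* minimum alignment for areas */
          /* other fields omitted */
  };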

The concept of an 'access' refers to something like a VFIO mdev that is
accessing the IOVA and using a 'struct page *' for CPU based access.

Externally an API is provided that matches the requirements of the IOCTL
interface for map/unmap and domain attachment.
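
As a hedged sketch of how a caller above this layer might drive it (the
example_map_then_unmap() helper is hypothetical; ictx, iopt, uptr and length
would come from the IOCTL layer added later in the series):

  static int example_map_then_unmap(struct iommufd_ctx *ictx,
                                    struct io_pagetable *iopt,
                                    void __user *uptr, unsigned long length)
  {
          unsigned long iova, unmapped;
          int rc;

          /* Pin uptr/length and map it at a kernel-chosen IOVA in every domain */
          rc = iopt_map_user_pages(ictx, iopt, &iova, uptr, length,
                                   IOMMU_READ | IOMMU_WRITE, IOPT_ALLOC_IOVA);
          if (rc)
                  return rc;

          /* ... DMA through the attached domains ... */

          return iopt_unmap_iova(iopt, iova, length, &unmapped);
  }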

The API provides a 'copy' primitive to establish a new IOVA map in a
different IOAS from an existing mapping by re-using the iopt_pages. This
is the basic mechanism to provide single pinning.

This is designed to support a pre-registration flow where userspace would
set up a dummy IOAS with no domains, map in memory and then establish an
access to pin all PFNs into the xarray.

Copy can then be used to create new IOVA mappings in a different IOAS,
with iommu_domains attached. Upon copy the PFNs will be read out of the
xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
overheads.
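
One plausible way the helpers added here compose to implement that copy
(example_copy() is hypothetical and the IOMMU_READ | IOMMU_WRITE prot is just
illustrative):

  static int example_copy(struct io_pagetable *src_iopt, unsigned long src_iova,
                          struct io_pagetable *dst_iopt, unsigned long length,
                          unsigned long *dst_iova)
  {
          LIST_HEAD(pages_list);
          int rc;

          /* Take references on the iopt_pages backing the source range */
          rc = iopt_get_pages(src_iopt, src_iova, length, &pages_list);
          if (rc)
                  return rc;

          /* Create new areas over the same pages in the destination */
          rc = iopt_map_pages(dst_iopt, &pages_list, length, dst_iova,
                              IOMMU_READ | IOMMU_WRITE, IOPT_ALLOC_IOVA);
          /* On success the new areas own the references taken above */
          iopt_free_pages_list(&pages_list);
          return rc;
  }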

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 .clang-format                           |    1 +
 drivers/iommu/iommufd/Makefile          |    1 +
 drivers/iommu/iommufd/io_pagetable.c    | 1188 +++++++++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.h    |   55 ++
 drivers/iommu/iommufd/iommufd_private.h |   52 +
 5 files changed, 1297 insertions(+)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c

diff --git a/.clang-format b/.clang-format
index 501241f8977664..78aba4a10b1bbc 100644
--- a/.clang-format
+++ b/.clang-format
@@ -444,6 +444,7 @@ ForEachMacros:
   - 'interval_tree_for_each_span'
   - 'intlist__for_each_entry'
   - 'intlist__for_each_entry_safe'
+  - 'iopt_for_each_contig_area'
   - 'kcore_copy__for_each_phdr'
   - 'key_for_each'
   - 'key_for_each_safe'
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 05a0e91e30afad..b66a8c47ff55ec 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	io_pagetable.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
new file mode 100644
index 00000000000000..86effe8ed1dce5
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -0,0 +1,1188 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The io_pagetable is the top of the data structure that maps IOVAs to PFNs.
+ * The PFNs can be placed into an iommu_domain, or returned to the caller as a
+ * page list for access by an in-kernel user.
+ *
+ * The data structure uses the iopt_pages to optimize the storage of the PFNs
+ * between the domains and xarray.
+ */
+#include <linux/iommufd.h>
+#include <linux/lockdep.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+
+#include "io_pagetable.h"
+
+struct iopt_pages_list {
+	struct iopt_pages *pages;
+	struct iopt_area *area;
+	struct list_head next;
+	unsigned long start_byte;
+	unsigned long length;
+};
+
+struct iopt_area *iopt_area_contig_init(struct iopt_area_contig_iter *iter,
+					struct io_pagetable *iopt,
+					unsigned long iova,
+					unsigned long last_iova)
+{
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	iter->cur_iova = iova;
+	iter->last_iova = last_iova;
+	iter->area = iopt_area_iter_first(iopt, iova, iova);
+	if (!iter->area)
+		return NULL;
+	if (!iter->area->pages) {
+		iter->area = NULL;
+		return NULL;
+	}
+	return iter->area;
+}
+
+struct iopt_area *iopt_area_contig_next(struct iopt_area_contig_iter *iter)
+{
+	unsigned long last_iova;
+
+	if (!iter->area)
+		return NULL;
+	last_iova = iopt_area_last_iova(iter->area);
+	if (iter->last_iova <= last_iova)
+		return NULL;
+
+	iter->cur_iova = last_iova + 1;
+	iter->area = iopt_area_iter_next(iter->area, iter->cur_iova,
+					 iter->last_iova);
+	if (!iter->area)
+		return NULL;
+	if (iter->cur_iova != iopt_area_iova(iter->area) ||
+	    !iter->area->pages) {
+		iter->area = NULL;
+		return NULL;
+	}
+	return iter->area;
+}
+
+static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
+		return false;
+
+	span->start_hole = ALIGN(span->start_hole, iova_alignment) |
+			   page_offset;
+	if (span->start_hole > span->last_hole ||
+	    span->last_hole - span->start_hole < length - 1)
+		return false;
+	return true;
+}
+
+static bool __alloc_iova_check_used(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (span->is_hole || span->last_used - span->start_used < length - 1)
+		return false;
+
+	span->start_used = ALIGN(span->start_used, iova_alignment) |
+			   page_offset;
+	if (span->start_used > span->last_used ||
+	    span->last_used - span->start_used < length - 1)
+		return false;
+	return true;
+}
+
+/*
+ * Automatically find a block of IOVA that is not being used and not reserved.
+ * Does not return a 0 IOVA even if it is valid.
+ */
+static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
+			   unsigned long uptr, unsigned long length)
+{
+	struct interval_tree_span_iter reserved_span;
+	unsigned long page_offset = uptr % PAGE_SIZE;
+	struct interval_tree_span_iter allowed_span;
+	struct interval_tree_span_iter area_span;
+	unsigned long iova_alignment;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	/* Protect roundup_pow_of_two() from overflow */
+	if (length == 0 || length >= ULONG_MAX / 2)
+		return -EOVERFLOW;
+
+	/*
+	 * Keep alignment present in the uptr when building the IOVA, this
+	 * increases the chance we can map a THP.
+	 */
+	if (!uptr)
+		iova_alignment = roundup_pow_of_two(length);
+	else
+		iova_alignment = min_t(unsigned long,
+				       roundup_pow_of_two(length),
+				       1UL << __ffs64(uptr));
+
+	if (iova_alignment < iopt->iova_alignment)
+		return -EINVAL;
+
+	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
+				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
+		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
+			allowed_span.start_used = PAGE_SIZE;
+			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
+			allowed_span.is_hole = false;
+		}
+
+		if (!__alloc_iova_check_used(&allowed_span, length,
+					     iova_alignment, page_offset))
+			continue;
+
+		interval_tree_for_each_span(&area_span, &iopt->area_itree,
+					    allowed_span.start_used,
+					    allowed_span.last_used) {
+			if (!__alloc_iova_check_hole(&area_span, length,
+						     iova_alignment,
+						     page_offset))
+				continue;
+
+			interval_tree_for_each_span(&reserved_span,
+						    &iopt->reserved_itree,
+						    area_span.start_used,
+						    area_span.last_used) {
+				if (!__alloc_iova_check_hole(
+					    &reserved_span, length,
+					    iova_alignment, page_offset))
+					continue;
+
+				*iova = reserved_span.start_hole;
+				return 0;
+			}
+		}
+	}
+	return -ENOSPC;
+}
+
+static int iopt_check_iova(struct io_pagetable *iopt, unsigned long iova,
+			   unsigned long length)
+{
+	unsigned long last;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	if ((iova & (iopt->iova_alignment - 1)))
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &last))
+		return -EOVERFLOW;
+
+	/* No reserved IOVA intersects the range */
+	if (iopt_reserved_iter_first(iopt, iova, last))
+		return -ENOENT;
+
+	/* Check that there is not already a mapping in the range */
+	if (iopt_area_iter_first(iopt, iova, last))
+		return -EADDRINUSE;
+	return 0;
+}
+
+/*
+ * The area takes a slice of the pages from start_byte to start_byte + length
+ */
+static int iopt_insert_area(struct io_pagetable *iopt, struct iopt_area *area,
+			    struct iopt_pages *pages, unsigned long iova,
+			    unsigned long start_byte, unsigned long length,
+			    int iommu_prot)
+{
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
+		return -EPERM;
+
+	area->iommu_prot = iommu_prot;
+	area->page_offset = start_byte % PAGE_SIZE;
+	if (area->page_offset & (iopt->iova_alignment - 1))
+		return -EINVAL;
+
+	area->node.start = iova;
+	if (check_add_overflow(iova, length - 1, &area->node.last))
+		return -EOVERFLOW;
+
+	area->pages_node.start = start_byte / PAGE_SIZE;
+	if (check_add_overflow(start_byte, length - 1, &area->pages_node.last))
+		return -EOVERFLOW;
+	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
+	if (WARN_ON(area->pages_node.last >= pages->npages))
+		return -EOVERFLOW;
+
+	/*
+	 * The area is inserted with a NULL pages indicating it is not fully
+	 * initialized yet.
+	 */
+	area->iopt = iopt;
+	interval_tree_insert(&area->node, &iopt->area_itree);
+	return 0;
+}
+
+static int iopt_alloc_area_pages(struct io_pagetable *iopt,
+				 struct list_head *pages_list,
+				 unsigned long length, unsigned long *dst_iova,
+				 int iommu_prot, unsigned int flags)
+{
+	struct iopt_pages_list *elm;
+	unsigned long iova;
+	int rc = 0;
+
+	list_for_each_entry(elm, pages_list, next) {
+		elm->area = kzalloc(sizeof(*elm->area), GFP_KERNEL_ACCOUNT);
+		if (!elm->area)
+			return -ENOMEM;
+	}
+
+	down_write(&iopt->iova_rwsem);
+	if ((length & (iopt->iova_alignment - 1)) || !length) {
+		rc = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (flags & IOPT_ALLOC_IOVA) {
+		/* Use the first entry to guess the ideal IOVA alignment */
+		elm = list_first_entry(pages_list, struct iopt_pages_list,
+				       next);
+		rc = iopt_alloc_iova(
+			iopt, dst_iova,
+			(uintptr_t)elm->pages->uptr + elm->start_byte, length);
+		if (rc)
+			goto out_unlock;
+	} else {
+		rc = iopt_check_iova(iopt, *dst_iova, length);
+		if (rc)
+			goto out_unlock;
+	}
+
+	/*
+	 * Areas are created with a NULL pages so that the IOVA space is
+	 * reserved and we can unlock the iova_rwsem.
+	 */
+	iova = *dst_iova;
+	list_for_each_entry(elm, pages_list, next) {
+		rc = iopt_insert_area(iopt, elm->area, elm->pages, iova,
+				      elm->start_byte, elm->length, iommu_prot);
+		if (rc)
+			goto out_unlock;
+		iova += elm->length;
+	}
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	return rc;
+}
+
+static void iopt_abort_area(struct iopt_area *area)
+{
+	if (area->iopt) {
+		down_write(&area->iopt->iova_rwsem);
+		interval_tree_remove(&area->node, &area->iopt->area_itree);
+		up_write(&area->iopt->iova_rwsem);
+	}
+	kfree(area);
+}
+
+void iopt_free_pages_list(struct list_head *pages_list)
+{
+	struct iopt_pages_list *elm;
+
+	while ((elm = list_first_entry_or_null(pages_list,
+					       struct iopt_pages_list, next))) {
+		if (elm->area)
+			iopt_abort_area(elm->area);
+		if (elm->pages)
+			iopt_put_pages(elm->pages);
+		list_del(&elm->next);
+		kfree(elm);
+	}
+}
+
+static int iopt_fill_domains_pages(struct list_head *pages_list)
+{
+	struct iopt_pages_list *undo_elm;
+	struct iopt_pages_list *elm;
+	int rc;
+
+	list_for_each_entry(elm, pages_list, next) {
+		rc = iopt_area_fill_domains(elm->area, elm->pages);
+		if (rc)
+			goto err_undo;
+	}
+	return 0;
+
+err_undo:
+	list_for_each_entry(undo_elm, pages_list, next) {
+		if (undo_elm == elm)
+			break;
+		iopt_area_unfill_domains(undo_elm->area, undo_elm->pages);
+	}
+	return rc;
+}
+
+int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
+		   unsigned long length, unsigned long *dst_iova,
+		   int iommu_prot, unsigned int flags)
+{
+	struct iopt_pages_list *elm;
+	int rc;
+
+	rc = iopt_alloc_area_pages(iopt, pages_list, length, dst_iova,
+				   iommu_prot, flags);
+	if (rc)
+		return rc;
+
+	down_read(&iopt->domains_rwsem);
+	rc = iopt_fill_domains_pages(pages_list);
+	if (rc)
+		goto out_unlock_domains;
+
+	down_write(&iopt->iova_rwsem);
+	list_for_each_entry(elm, pages_list, next) {
+		/*
+		 * area->pages must be set inside the domains_rwsem to ensure
+		 * any newly added domains will get filled. Moves the reference
+		 * in from the list.
+		 */
+		elm->area->pages = elm->pages;
+		elm->pages = NULL;
+		elm->area = NULL;
+	}
+	up_write(&iopt->iova_rwsem);
+out_unlock_domains:
+	up_read(&iopt->domains_rwsem);
+	return rc;
+}
+
+/**
+ * iopt_map_user_pages() - Map a user VA to an iova in the io page table
+ * @ictx: iommufd_ctx the iopt is part of
+ * @iopt: io_pagetable to act on
+ * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
+ *        the chosen iova on output. Otherwise it is the iova to map to on input
+ * @uptr: User VA to map
+ * @length: Number of bytes to map
+ * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
+ * @flags: IOPT_ALLOC_IOVA or zero
+ *
+ * iova, uptr, and length must be aligned to iova_alignment. For domain backed
+ * page tables this will pin the pages and load them into the domain at iova.
+ * For non-domain page tables this will only set up a lazy reference and the
+ * caller must use iopt_access_pages() to touch them.
+ *
+ * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
+ * destroyed.
+ */
+int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+			unsigned long *iova, void __user *uptr,
+			unsigned long length, int iommu_prot,
+			unsigned int flags)
+{
+	struct iopt_pages_list elm = {};
+	LIST_HEAD(pages_list);
+	int rc;
+
+	elm.pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
+	if (IS_ERR(elm.pages))
+		return PTR_ERR(elm.pages);
+	if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
+	    elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
+		elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
+	elm.start_byte = uptr - elm.pages->uptr;
+	elm.length = length;
+	list_add(&elm.next, &pages_list);
+
+	rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
+	if (rc) {
+		if (elm.area)
+			iopt_abort_area(elm.area);
+		if (elm.pages)
+			iopt_put_pages(elm.pages);
+		return rc;
+	}
+	return 0;
+}
+
+int iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+		   unsigned long length, struct list_head *pages_list)
+{
+	struct iopt_area_contig_iter iter;
+	unsigned long last_iova;
+	struct iopt_area *area;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
+		struct iopt_pages_list *elm;
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+
+		elm = kzalloc(sizeof(*elm), GFP_KERNEL_ACCOUNT);
+		if (!elm) {
+			rc = -ENOMEM;
+			goto err_free;
+		}
+		elm->start_byte = iopt_area_start_byte(area, iter.cur_iova);
+		elm->pages = area->pages;
+		elm->length = (last - iter.cur_iova) + 1;
+		kref_get(&elm->pages->kref);
+		list_add_tail(&elm->next, pages_list);
+	}
+	if (!iopt_area_contig_done(&iter)) {
+		rc = -ENOENT;
+		goto err_free;
+	}
+	up_read(&iopt->iova_rwsem);
+	return 0;
+err_free:
+	up_read(&iopt->iova_rwsem);
+	iopt_free_pages_list(pages_list);
+	return rc;
+}
+
+static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
+				 unsigned long end, unsigned long *unmapped)
+{
+	struct iopt_area *area;
+	unsigned long unmapped_bytes = 0;
+	int rc = -ENOENT;
+
+	/*
+	 * The domains_rwsem must be held in read mode any time any area->pages
+	 * is NULL. This prevents domain attach/detach from running
+	 * concurrently with cleaning up the area.
+	 */
+	down_read(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	while ((area = iopt_area_iter_first(iopt, start, end))) {
+		unsigned long area_last = iopt_area_last_iova(area);
+		unsigned long area_first = iopt_area_iova(area);
+		struct iopt_pages *pages;
+
+		/* Userspace should not race map/unmap's of the same area */
+		if (!area->pages) {
+			rc = -EBUSY;
+			goto out_unlock_iova;
+		}
+
+		if (area_first < start || area_last > end) {
+			rc = -ENOENT;
+			goto out_unlock_iova;
+		}
+
+		/*
+		 * num_accesses writers must hold the iova_rwsem too, so we can
+		 * safely read it under the write side of the iova_rwsem
+		 * without the pages->mutex.
+		 */
+		if (area->num_accesses) {
+			start = area_first;
+			area->prevent_access = true;
+			up_write(&iopt->iova_rwsem);
+			up_read(&iopt->domains_rwsem);
+			/* Later patch calls back to drivers to unmap */
+			return -EBUSY;
+		}
+
+		pages = area->pages;
+		area->pages = NULL;
+		up_write(&iopt->iova_rwsem);
+
+		iopt_area_unfill_domains(area, pages);
+		iopt_abort_area(area);
+		iopt_put_pages(pages);
+
+		unmapped_bytes += area_last - area_first + 1;
+
+		down_write(&iopt->iova_rwsem);
+	}
+	if (unmapped_bytes)
+		rc = 0;
+
+out_unlock_iova:
+	up_write(&iopt->iova_rwsem);
+	up_read(&iopt->domains_rwsem);
+	if (unmapped)
+		*unmapped = unmapped_bytes;
+	return rc;
+}
+
+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ * @unmapped: Return number of bytes unmapped
+ *
+ * The requested range must be a superset of existing ranges.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped)
+{
+	unsigned long iova_end;
+
+	if (!length)
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return -EOVERFLOW;
+
+	return iopt_unmap_iova_range(iopt, iova, iova_end, unmapped);
+}
+
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped)
+{
+	int rc;
+
+	rc = iopt_unmap_iova_range(iopt, 0, ULONG_MAX, unmapped);
+	/* If the IOVAs are empty then unmap all succeeds */
+	if (rc == -ENOENT)
+		return 0;
+	return rc;
+}
+
+/* The caller must always free all the nodes in the allowed_iova rb_root. */
+int iopt_set_allow_iova(struct io_pagetable *iopt,
+			struct rb_root_cached *allowed_iova)
+{
+	struct iopt_allowed *allowed;
+
+	down_write(&iopt->iova_rwsem);
+	swap(*allowed_iova, iopt->allowed_itree);
+
+	for (allowed = iopt_allowed_iter_first(iopt, 0, ULONG_MAX); allowed;
+	     allowed = iopt_allowed_iter_next(allowed, 0, ULONG_MAX)) {
+		if (iopt_reserved_iter_first(iopt, allowed->node.start,
+					     allowed->node.last)) {
+			swap(*allowed_iova, iopt->allowed_itree);
+			up_write(&iopt->iova_rwsem);
+			return -EADDRINUSE;
+		}
+	}
+	up_write(&iopt->iova_rwsem);
+	return 0;
+}
+
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner)
+{
+	struct iopt_reserved *reserved;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	if (iopt_area_iter_first(iopt, start, last) ||
+	    iopt_allowed_iter_first(iopt, start, last))
+		return -EADDRINUSE;
+
+	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL_ACCOUNT);
+	if (!reserved)
+		return -ENOMEM;
+	reserved->node.start = start;
+	reserved->node.last = last;
+	reserved->owner = owner;
+	interval_tree_insert(&reserved->node, &iopt->reserved_itree);
+	return 0;
+}
+
+static void __iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+	struct iopt_reserved *reserved, *next;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	for (reserved = iopt_reserved_iter_first(iopt, 0, ULONG_MAX); reserved;
+	     reserved = next) {
+		next = iopt_reserved_iter_next(reserved, 0, ULONG_MAX);
+
+		if (reserved->owner == owner) {
+			interval_tree_remove(&reserved->node,
+					     &iopt->reserved_itree);
+			kfree(reserved);
+		}
+	}
+}
+
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+	down_write(&iopt->iova_rwsem);
+	__iopt_remove_reserved_iova(iopt, owner);
+	up_write(&iopt->iova_rwsem);
+}
+
+void iopt_init_table(struct io_pagetable *iopt)
+{
+	init_rwsem(&iopt->iova_rwsem);
+	init_rwsem(&iopt->domains_rwsem);
+	iopt->area_itree = RB_ROOT_CACHED;
+	iopt->allowed_itree = RB_ROOT_CACHED;
+	iopt->reserved_itree = RB_ROOT_CACHED;
+	xa_init_flags(&iopt->domains, XA_FLAGS_ACCOUNT);
+	xa_init_flags(&iopt->access_list, XA_FLAGS_ALLOC);
+
+	/*
+	 * iopts start as SW tables that can use the entire size_t IOVA space
+	 * due to the use of size_t in the APIs. They have no alignment
+	 * restriction.
+	 */
+	iopt->iova_alignment = 1;
+}
+
+void iopt_destroy_table(struct io_pagetable *iopt)
+{
+	struct interval_tree_node *node;
+
+	while ((node = interval_tree_iter_first(&iopt->allowed_itree, 0,
+						ULONG_MAX))) {
+		interval_tree_remove(node, &iopt->allowed_itree);
+		kfree(container_of(node, struct iopt_allowed, node));
+	}
+
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_itree.rb_root));
+	WARN_ON(!xa_empty(&iopt->domains));
+	WARN_ON(!xa_empty(&iopt->access_list));
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
+}
+
+/**
+ * iopt_unfill_domain() - Unfill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to unfill
+ *
+ * This is used when removing a domain from the iopt. Every area in the iopt
+ * will be unmapped from the domain. The domain must already be removed from the
+ * domains xarray.
+ */
+static void iopt_unfill_domain(struct io_pagetable *iopt,
+			       struct iommu_domain *domain)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	/*
+	 * Some other domain is holding all the pfns still, rapidly unmap this
+	 * domain.
+	 */
+	if (iopt->next_domain_id != 0) {
+		/* Pick an arbitrary remaining domain to act as storage */
+		struct iommu_domain *storage_domain =
+			xa_load(&iopt->domains, 0);
+
+		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			struct iopt_pages *pages = area->pages;
+
+			if (!pages)
+				continue;
+
+			mutex_lock(&pages->mutex);
+			if (area->storage_domain == domain)
+				area->storage_domain = storage_domain;
+			mutex_unlock(&pages->mutex);
+
+			iopt_area_unmap_domain(area, domain);
+		}
+		return;
+	}
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (!pages)
+			continue;
+
+		mutex_lock(&pages->mutex);
+		interval_tree_remove(&area->pages_node, &pages->domains_itree);
+		WARN_ON(area->storage_domain != domain);
+		area->storage_domain = NULL;
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+}
+
+/**
+ * iopt_fill_domain() - Fill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to fill
+ *
+ * Fill the domain with PFNs from every area in the iopt. On failure the domain
+ * is left unchanged.
+ */
+static int iopt_fill_domain(struct io_pagetable *iopt,
+			    struct iommu_domain *domain)
+{
+	struct iopt_area *end_area;
+	struct iopt_area *area;
+	int rc;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (!pages)
+			continue;
+
+		mutex_lock(&pages->mutex);
+		rc = iopt_area_fill_domain(area, domain);
+		if (rc) {
+			mutex_unlock(&pages->mutex);
+			goto out_unfill;
+		}
+		if (!area->storage_domain) {
+			WARN_ON(iopt->next_domain_id != 0);
+			area->storage_domain = domain;
+			interval_tree_insert(&area->pages_node,
+					     &pages->domains_itree);
+		}
+		mutex_unlock(&pages->mutex);
+	}
+	return 0;
+
+out_unfill:
+	end_area = area;
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (area == end_area)
+			break;
+		if (!pages)
+			continue;
+		mutex_lock(&pages->mutex);
+		if (iopt->next_domain_id == 0) {
+			interval_tree_remove(&area->pages_node,
+					     &pages->domains_itree);
+			area->storage_domain = NULL;
+		}
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+	return rc;
+}
+
+/* Check that all existing areas conform to an increased page size */
+static int iopt_check_iova_alignment(struct io_pagetable *iopt,
+				     unsigned long new_iova_alignment)
+{
+	unsigned long align_mask = new_iova_alignment - 1;
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
+		if ((iopt_area_iova(area) & align_mask) ||
+		    (iopt_area_length(area) & align_mask) ||
+		    (area->page_offset & align_mask))
+			return -EADDRINUSE;
+	return 0;
+}
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain)
+{
+	const struct iommu_domain_geometry *geometry = &domain->geometry;
+	struct iommu_domain *iter_domain;
+	unsigned int new_iova_alignment;
+	unsigned long index;
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each(&iopt->domains, index, iter_domain) {
+		if (WARN_ON(iter_domain == domain)) {
+			rc = -EEXIST;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The io page size drives the iova_alignment. Internally the iopt_pages
+	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
+	 * objects into the iommu_domain.
+	 *
+	 * An iommu_domain must always be able to accept PAGE_SIZE to be
+	 * compatible as we can't guarantee higher contiguity.
+	 */
+	new_iova_alignment = max_t(unsigned long,
+				   1UL << __ffs(domain->pgsize_bitmap),
+				   iopt->iova_alignment);
+	if (new_iova_alignment > PAGE_SIZE) {
+		rc = -EINVAL;
+		goto out_unlock;
+	}
+	if (new_iova_alignment != iopt->iova_alignment) {
+		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
+		if (rc)
+			goto out_unlock;
+	}
+
+	/* No area exists that is outside the allowed domain aperture */
+	if (geometry->aperture_start != 0) {
+		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
+				       domain);
+		if (rc)
+			goto out_reserved;
+	}
+	if (geometry->aperture_end != ULONG_MAX) {
+		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
+				       ULONG_MAX, domain);
+		if (rc)
+			goto out_reserved;
+	}
+
+	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
+	if (rc)
+		goto out_reserved;
+
+	rc = iopt_fill_domain(iopt, domain);
+	if (rc)
+		goto out_release;
+
+	iopt->iova_alignment = new_iova_alignment;
+	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
+	iopt->next_domain_id++;
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return 0;
+out_release:
+	xa_release(&iopt->domains, iopt->next_domain_id);
+out_reserved:
+	__iopt_remove_reserved_iova(iopt, domain);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+static int iopt_calculate_iova_alignment(struct io_pagetable *iopt)
+{
+	unsigned long new_iova_alignment;
+	struct iommufd_access *access;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	if (iopt->disable_large_pages)
+		new_iova_alignment = PAGE_SIZE;
+	else
+		new_iova_alignment = 1;
+
+	xa_for_each(&iopt->domains, index, domain)
+		new_iova_alignment = max_t(unsigned long,
+					   1UL << __ffs(domain->pgsize_bitmap),
+					   new_iova_alignment);
+	xa_for_each(&iopt->access_list, index, access)
+		new_iova_alignment = max_t(unsigned long,
+					   access->iova_alignment,
+					   new_iova_alignment);
+
+	if (new_iova_alignment > iopt->iova_alignment) {
+		int rc;
+
+		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
+		if (rc)
+			return rc;
+	}
+	iopt->iova_alignment = new_iova_alignment;
+	return 0;
+}
+
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain)
+{
+	struct iommu_domain *iter_domain = NULL;
+	unsigned long index;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each(&iopt->domains, index, iter_domain)
+		if (iter_domain == domain)
+			break;
+	if (WARN_ON(iter_domain != domain) || index >= iopt->next_domain_id)
+		goto out_unlock;
+
+	/*
+	 * Compress the xarray to keep it linear by swapping the entry to erase
+	 * with the tail entry and shrinking the tail.
+	 */
+	iopt->next_domain_id--;
+	iter_domain = xa_erase(&iopt->domains, iopt->next_domain_id);
+	if (index != iopt->next_domain_id)
+		xa_store(&iopt->domains, index, iter_domain, GFP_KERNEL);
+
+	iopt_unfill_domain(iopt, domain);
+	__iopt_remove_reserved_iova(iopt, domain);
+
+	WARN_ON(iopt_calculate_iova_alignment(iopt));
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
+/**
+ * iopt_area_split - Split an area into two parts at iova
+ * @area: The area to split
+ * @iova: Becomes the last IOVA of a new area
+ *
+ * This splits an area into two. It is part of the VFIO compatibility to allow
+ * poking a hole in the mapping. The two areas continue to point at the same
+ * iopt_pages, just with different starting bytes.
+ */
+static int iopt_area_split(struct iopt_area *area, unsigned long iova)
+{
+	unsigned long alignment = area->iopt->iova_alignment;
+	unsigned long last_iova = iopt_area_last_iova(area);
+	unsigned long start_iova = iopt_area_iova(area);
+	unsigned long new_start = iova + 1;
+	struct io_pagetable *iopt = area->iopt;
+	struct iopt_pages *pages = area->pages;
+	struct iopt_area *lhs;
+	struct iopt_area *rhs;
+	int rc;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	if (iova == start_iova || iova == last_iova)
+		return 0;
+
+	if (!pages || area->prevent_access)
+		return -EBUSY;
+
+	if (new_start & (alignment - 1) ||
+	    iopt_area_start_byte(area, new_start) & (alignment - 1))
+		return -EINVAL;
+
+	lhs = kzalloc(sizeof(*area), GFP_KERNEL_ACCOUNT);
+	if (!lhs)
+		return -ENOMEM;
+
+	rhs = kzalloc(sizeof(*area), GFP_KERNEL_ACCOUNT);
+	if (!rhs) {
+		rc = -ENOMEM;
+		goto err_free_lhs;
+	}
+
+	mutex_lock(&pages->mutex);
+	/*
+	 * Splitting is not permitted if a domain could have been mapped with
+	 * huge pages.
+	 */
+	if (area->storage_domain && !iopt->disable_large_pages) {
+		rc = -EINVAL;
+		goto err_unlock;
+	}
+
+	interval_tree_remove(&area->node, &iopt->area_itree);
+	rc = iopt_insert_area(iopt, lhs, area->pages, start_iova,
+			      iopt_area_start_byte(area, start_iova),
+			      (new_start - 1) - start_iova + 1,
+			      area->iommu_prot);
+	if (WARN_ON(rc))
+		goto err_insert;
+
+	rc = iopt_insert_area(iopt, rhs, area->pages, new_start,
+			      iopt_area_start_byte(area, new_start),
+			      last_iova - new_start + 1, area->iommu_prot);
+	if (WARN_ON(rc))
+		goto err_remove_lhs;
+
+	lhs->storage_domain = area->storage_domain;
+	lhs->num_accesses = area->num_accesses;
+	lhs->pages = area->pages;
+	rhs->storage_domain = area->storage_domain;
+	rhs->num_accesses = area->num_accesses;
+	rhs->pages = area->pages;
+	kref_get(&rhs->pages->kref);
+	kfree(area);
+	mutex_unlock(&pages->mutex);
+
+	/*
+	 * No change to domains or accesses because the underlying pages have
+	 * not changed
+	 */
+	return 0;
+
+err_remove_lhs:
+	interval_tree_remove(&lhs->node, &iopt->area_itree);
+err_insert:
+	interval_tree_insert(&area->node, &iopt->area_itree);
+err_unlock:
+	mutex_unlock(&pages->mutex);
+	kfree(rhs);
+err_free_lhs:
+	kfree(lhs);
+	return rc;
+}
+
+int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas,
+		  size_t num_iovas)
+{
+	int rc = 0;
+	int i;
+
+	down_write(&iopt->iova_rwsem);
+	for (i = 0; i < num_iovas; i++) {
+		struct iopt_area *area;
+
+		area = iopt_area_iter_first(iopt, iovas[i], iovas[i]);
+		if (!area)
+			continue;
+		rc = iopt_area_split(area, iovas[i]);
+		if (rc)
+			break;
+	}
+	up_write(&iopt->iova_rwsem);
+	return rc;
+}
+
+void iopt_enable_large_pages(struct io_pagetable *iopt)
+{
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	WRITE_ONCE(iopt->disable_large_pages, false);
+	rc = iopt_calculate_iova_alignment(iopt);
+	WARN_ON(rc);
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
+int iopt_disable_large_pages(struct io_pagetable *iopt)
+{
+	int rc = 0;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	if (iopt->disable_large_pages)
+		goto out_unlock;
+
+	/* Won't do it if domains already have pages mapped in them */
+	if (!xa_empty(&iopt->domains) &&
+	    !RB_EMPTY_ROOT(&iopt->area_itree.rb_root)) {
+		rc = -EINVAL;
+		goto out_unlock;
+	}
+
+	WRITE_ONCE(iopt->disable_large_pages, true);
+	rc = iopt_calculate_iova_alignment(iopt);
+	if (rc)
+		WRITE_ONCE(iopt->disable_large_pages, false);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+int iopt_add_access(struct io_pagetable *iopt, struct iommufd_access *access)
+{
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	rc = xa_alloc(&iopt->access_list, &access->iopt_access_list_id, access,
+		      xa_limit_16b, GFP_KERNEL_ACCOUNT);
+	if (rc)
+		goto out_unlock;
+
+	rc = iopt_calculate_iova_alignment(iopt);
+	if (rc) {
+		xa_erase(&iopt->access_list, access->iopt_access_list_id);
+		goto out_unlock;
+	}
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+void iopt_remove_access(struct io_pagetable *iopt,
+			struct iommufd_access *access)
+{
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	WARN_ON(xa_erase(&iopt->access_list, access->iopt_access_list_id) !=
+		access);
+	WARN_ON(iopt_calculate_iova_alignment(iopt));
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
+/* Narrow the valid IOVA space to exclude a group's reserved ranges. */
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct device *device,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start)
+{
+	struct iommu_resv_region *resv;
+	struct iommu_resv_region *tmp;
+	LIST_HEAD(group_resv_regions);
+	int rc;
+
+	down_write(&iopt->iova_rwsem);
+	rc = iommu_get_group_resv_regions(group, &group_resv_regions);
+	if (rc)
+		goto out_unlock;
+
+	list_for_each_entry(resv, &group_resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+
+		/*
+		 * The presence of any 'real' MSI regions should take precedence
+		 * over the software-managed one if the IOMMU driver happens to
+		 * advertise both types.
+		 */
+		if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
+			*sw_msi_start = 0;
+			sw_msi_start = NULL;
+		}
+		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
+			*sw_msi_start = resv->start;
+
+		rc = iopt_reserve_iova(iopt, resv->start,
+				       resv->length - 1 + resv->start, device);
+		if (rc)
+			goto out_reserved;
+	}
+	rc = 0;
+	goto out_free_resv;
+
+out_reserved:
+	__iopt_remove_reserved_iova(iopt, device);
+out_free_resv:
+	list_for_each_entry_safe(resv, tmp, &group_resv_regions, list)
+		kfree(resv);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index aebb9a496258ce..9a1c8a5ae3c883 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -46,9 +46,19 @@ struct iopt_area {
 	unsigned int page_offset;
 	/* IOMMU_READ, IOMMU_WRITE, etc */
 	int iommu_prot;
+	bool prevent_access : 1;
 	unsigned int num_accesses;
 };
 
+struct iopt_allowed {
+	struct interval_tree_node node;
+};
+
+struct iopt_reserved {
+	struct interval_tree_node node;
+	void *owner;
+};
+
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
 void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
 
@@ -83,6 +93,24 @@ static inline size_t iopt_area_length(struct iopt_area *area)
 	return (area->node.last - area->node.start) + 1;
 }
 
+/*
+ * The byte offset from the start of the iopt_pages at which the iova begins.
+ * iopt_area_start_byte() / PAGE_SIZE encodes the starting page index
+ * iopt_area_start_byte() % PAGE_SIZE encodes the offset within that page
+ */
+static inline unsigned long iopt_area_start_byte(struct iopt_area *area,
+						 unsigned long iova)
+{
+	return (iova - iopt_area_iova(area)) + area->page_offset +
+	       iopt_area_index(area) * PAGE_SIZE;
+}
+
+static inline unsigned long iopt_area_iova_to_index(struct iopt_area *area,
+						    unsigned long iova)
+{
+	return iopt_area_start_byte(area, iova) / PAGE_SIZE;
+}
+
 #define __make_iopt_iter(name)                                                 \
 	static inline struct iopt_##name *iopt_##name##_iter_first(            \
 		struct io_pagetable *iopt, unsigned long start,                \
@@ -110,6 +138,33 @@ static inline size_t iopt_area_length(struct iopt_area *area)
 	}
 
 __make_iopt_iter(area)
+__make_iopt_iter(allowed)
+__make_iopt_iter(reserved)
+
+struct iopt_area_contig_iter {
+	unsigned long cur_iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+};
+struct iopt_area *iopt_area_contig_init(struct iopt_area_contig_iter *iter,
+					struct io_pagetable *iopt,
+					unsigned long iova,
+					unsigned long last_iova);
+struct iopt_area *iopt_area_contig_next(struct iopt_area_contig_iter *iter);
+
+static inline bool iopt_area_contig_done(struct iopt_area_contig_iter *iter)
+{
+	return iter->area && iter->last_iova <= iopt_area_last_iova(iter->area);
+}
+
+/*
+ * Iterate over a contiguous list of areas that span the iova,last_iova range.
+ * The caller must check iopt_area_contig_done() after the loop to see if
+ * contiguous areas existed.
+ */
+#define iopt_for_each_contig_area(iter, area, iopt, iova, last_iova)          \
+	for (area = iopt_area_contig_init(iter, iopt, iova, last_iova); area; \
+	     area = iopt_area_contig_next(iter))
 
 enum {
 	IOPT_PAGES_ACCOUNT_NONE = 0,
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 6ed6f99ee439fa..6b7e0982b8c802 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,9 +9,14 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+struct iommu_domain;
+struct iommu_group;
+
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
+
+	u8 account_mode;
 };
 
 /*
@@ -27,6 +32,7 @@ struct iommufd_ctx {
 struct io_pagetable {
 	struct rw_semaphore domains_rwsem;
 	struct xarray domains;
+	struct xarray access_list;
 	unsigned int next_domain_id;
 
 	struct rw_semaphore iova_rwsem;
@@ -36,7 +42,45 @@ struct io_pagetable {
 	/* IOVA that cannot be allocated, struct iopt_reserved */
 	struct rb_root_cached reserved_itree;
 	u8 disable_large_pages;
+	unsigned long iova_alignment;
+};
+
+void iopt_init_table(struct io_pagetable *iopt);
+void iopt_destroy_table(struct io_pagetable *iopt);
+int iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+		   unsigned long length, struct list_head *pages_list);
+void iopt_free_pages_list(struct list_head *pages_list);
+enum {
+	IOPT_ALLOC_IOVA = 1 << 0,
 };
+int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+			unsigned long *iova, void __user *uptr,
+			unsigned long length, int iommu_prot,
+			unsigned int flags);
+int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
+		   unsigned long length, unsigned long *dst_iova,
+		   int iommu_prot, unsigned int flags);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped);
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain);
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain);
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct device *device,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start);
+int iopt_set_allow_iova(struct io_pagetable *iopt,
+			struct rb_root_cached *allowed_iova);
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner);
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
+int iopt_cut_iova(struct io_pagetable *iopt, unsigned long *iovas,
+		  size_t num_iovas);
+void iopt_enable_large_pages(struct io_pagetable *iopt);
+int iopt_disable_large_pages(struct io_pagetable *iopt);
 
 struct iommufd_ucmd {
 	struct iommufd_ctx *ictx;
@@ -130,4 +174,12 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 			     type),                                            \
 		     typeof(*(ptr)), obj)
 
+struct iommufd_access {
+	unsigned long iova_alignment;
+	u32 iopt_access_list_id;
+};
+
+int iopt_add_access(struct io_pagetable *iopt, struct iommufd_access *access);
+void iopt_remove_access(struct io_pagetable *iopt,
+			struct iommufd_access *access);
 #endif
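
For illustration, a minimal sketch of how a consumer might drive this API;
the wrapper function and its context below are hypothetical, the real
callers are the IOAS ioctls and device attach paths added in the following
patches:

/*
 * Sketch only: map a user buffer at a kernel-chosen IOVA, then tear the
 * mapping down again. 'ictx' and 'uptr' are assumed to come from the caller.
 */
static int example_map_then_unmap(struct iommufd_ctx *ictx,
				  struct io_pagetable *iopt,
				  void __user *uptr, unsigned long length)
{
	unsigned long iova;
	int rc;

	/*
	 * IOPT_ALLOC_IOVA: let the allocator pick an IOVA that avoids the
	 * reserved ranges and respects iova_alignment.
	 */
	rc = iopt_map_user_pages(ictx, iopt, &iova, uptr, length,
				 IOMMU_READ | IOMMU_WRITE, IOPT_ALLOC_IOVA);
	if (rc)
		return rc;

	/* The pages are now pinned and loaded into any attached domains */

	return iopt_unmap_iova(iopt, iova, length, NULL);
}
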
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08 13:27   ` Bagas Sanjaya
  2022-11-14  7:46   ` Tian, Kevin
  2022-11-08  0:49 ` [PATCH v4 11/17] iommufd: Add a HW pagetable object Jason Gunthorpe
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple users that are not
virtual machine monitors, like DPDK:
 - Universal simple support for all IOMMUs (no PPC special path)
 - An IOVA allocator that considers the aperture and the allowed/reserved
   ranges
 - io_pagetable allows any number of iommu_domains to be connected to the
   IOAS
 - Automatic allocation and re-use of iommu_domains

There is also room in the design to add non-generic features that cater
to specific HW functionality.
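
As a rough userspace sketch of the basic flow (error handling omitted, and
it assumes the character device is exposed as /dev/iommu):

#include <stddef.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

int example(void *buf, size_t len)
{
	int fd = open("/dev/iommu", O_RDWR);
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.user_va = (uintptr_t)buf,
		.length = len,
	};

	ioctl(fd, IOMMU_IOAS_ALLOC, &alloc);

	/* Without IOMMU_IOAS_MAP_FIXED_IOVA the kernel picks map.iova */
	map.ioas_id = alloc.out_ioas_id;
	ioctl(fd, IOMMU_IOAS_MAP, &map);

	/* Devices attached to this IOAS can now DMA to buf at map.iova */
	return 0;
}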

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/ioas.c            | 384 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  33 ++
 drivers/iommu/iommufd/main.c            |  45 +++
 include/uapi/linux/iommufd.h            | 241 +++++++++++++++
 5 files changed, 704 insertions(+)
 create mode 100644 drivers/iommu/iommufd/ioas.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
 	io_pagetable.o \
+	ioas.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
new file mode 100644
index 00000000000000..7671456e86413a
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/interval_tree.h>
+#include <linux/iommufd.h>
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+	int rc;
+
+	rc = iopt_unmap_all(&ioas->iopt, NULL);
+	WARN_ON(rc && rc != -ENOENT);
+	iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas;
+
+	ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(ioas))
+		return ioas;
+
+	iopt_init_table(&ioas->iopt);
+	return ioas;
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_alloc *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	if (cmd->flags)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_ioas_alloc(ucmd->ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	cmd->out_ioas_id = ioas->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_table;
+	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+	return 0;
+
+out_table:
+	iommufd_object_abort_and_destroy(ucmd->ictx, &ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_iova_range __user *ranges;
+	struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	struct interval_tree_span_iter span;
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	max_iovas = cmd->num_iovas;
+	ranges = u64_to_user_ptr(cmd->allowed_iovas);
+	cmd->num_iovas = 0;
+	cmd->out_iova_alignment = ioas->iopt.iova_alignment;
+	interval_tree_for_each_span(&span, &ioas->iopt.reserved_itree, 0,
+				    ULONG_MAX) {
+		if (!span.is_hole)
+			continue;
+		if (cmd->num_iovas < max_iovas) {
+			struct iommu_iova_range elm = {
+				.start = span.start_hole,
+				.last = span.last_hole,
+			};
+
+			if (copy_to_user(&ranges[cmd->num_iovas], &elm,
+					 sizeof(elm))) {
+				rc = -EFAULT;
+				goto out_put;
+			}
+		}
+		cmd->num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_ioas_load_iovas(struct rb_root_cached *itree,
+				   struct iommu_iova_range __user *ranges,
+				   u32 num)
+{
+	u32 i;
+
+	for (i = 0; i != num; i++) {
+		struct iommu_iova_range range;
+		struct iopt_allowed *allowed;
+
+		if (copy_from_user(&range, ranges + i, sizeof(range)))
+			return -EFAULT;
+
+		if (range.start >= range.last)
+			return -EINVAL;
+
+		if (interval_tree_iter_first(itree, range.start, range.last))
+			return -EINVAL;
+
+		allowed = kzalloc(sizeof(*allowed), GFP_KERNEL_ACCOUNT);
+		if (!allowed)
+			return -ENOMEM;
+		allowed->node.start = range.start;
+		allowed->node.last = range.last;
+
+		interval_tree_insert(&allowed->node, itree);
+	}
+	return 0;
+}
+
+int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_allow_iovas *cmd = ucmd->cmd;
+	struct rb_root_cached allowed_iova = RB_ROOT_CACHED;
+	struct interval_tree_node *node;
+	struct iommufd_ioas *ioas;
+	struct io_pagetable *iopt;
+	int rc = 0;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	iopt = &ioas->iopt;
+
+	rc = iommufd_ioas_load_iovas(&allowed_iova,
+				     u64_to_user_ptr(cmd->allowed_iovas),
+				     cmd->num_iovas);
+	if (rc)
+		goto out_free;
+
+	rc = iopt_set_allow_iova(iopt, &allowed_iova);
+out_free:
+	while ((node = interval_tree_iter_first(&allowed_iova, 0, ULONG_MAX))) {
+		interval_tree_remove(node, &allowed_iova);
+		kfree(container_of(node, struct iopt_allowed, node));
+	}
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int conv_iommu_prot(u32 map_flags)
+{
+	int iommu_prot;
+
+	/*
+	 * We provide no manual cache coherency ioctls to userspace and most
+	 * architectures make the CPU ops for cache flushing privileged.
+	 * Therefore we require the underlying IOMMU to support CPU coherent
+	 * operation. Support for IOMMU_CACHE is enforced by the
+	 * IOMMU_CAP_CACHE_COHERENCY test during bind.
+	 */
+	iommu_prot = IOMMU_CACHE;
+	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
+		iommu_prot |= IOMMU_WRITE;
+	if (map_flags & IOMMU_IOAS_MAP_READABLE)
+		iommu_prot |= IOMMU_READ;
+	return iommu_prot;
+}
+
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_map *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	unsigned int flags = 0;
+	unsigned long iova;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)) ||
+	    cmd->__reserved)
+		return -EOPNOTSUPP;
+	if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->iova;
+	rc = iopt_map_user_pages(ucmd->ictx, &ioas->iopt, &iova,
+				 u64_to_user_ptr(cmd->user_va), cmd->length,
+				 conv_iommu_prot(cmd->flags), flags);
+	if (rc)
+		goto out_put;
+
+	cmd->iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_copy *cmd = ucmd->cmd;
+	struct iommufd_ioas *src_ioas;
+	struct iommufd_ioas *dst_ioas;
+	unsigned int flags = 0;
+	LIST_HEAD(pages_list);
+	unsigned long iova;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)))
+		return -EOPNOTSUPP;
+	if (cmd->length >= ULONG_MAX || cmd->src_iova >= ULONG_MAX ||
+	    cmd->dst_iova >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
+	if (IS_ERR(src_ioas))
+		return PTR_ERR(src_ioas);
+	rc = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, cmd->length,
+			    &pages_list);
+	iommufd_put_object(&src_ioas->obj);
+	if (rc)
+		return rc;
+
+	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
+	if (IS_ERR(dst_ioas)) {
+		rc = PTR_ERR(dst_ioas);
+		goto out_pages;
+	}
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->dst_iova;
+	rc = iopt_map_pages(&dst_ioas->iopt, &pages_list, cmd->length, &iova,
+			    conv_iommu_prot(cmd->flags), flags);
+	if (rc)
+		goto out_put_dst;
+
+	cmd->dst_iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put_dst:
+	iommufd_put_object(&dst_ioas->obj);
+out_pages:
+	iopt_free_pages_list(&pages_list);
+	return rc;
+}
+
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	unsigned long unmapped = 0;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (cmd->iova == 0 && cmd->length == U64_MAX) {
+		rc = iopt_unmap_all(&ioas->iopt, &unmapped);
+		if (rc)
+			goto out_put;
+	} else {
+		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
+			rc = -EOVERFLOW;
+			goto out_put;
+		}
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length,
+				     &unmapped);
+		if (rc)
+			goto out_put;
+	}
+
+	cmd->length = unmapped;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+int iommufd_option_rlimit_mode(struct iommu_option *cmd,
+			       struct iommufd_ctx *ictx)
+{
+	if (cmd->op == IOMMU_OPTION_OP_GET) {
+		cmd->val64 = ictx->account_mode == IOPT_PAGES_ACCOUNT_MM;
+		return 0;
+	}
+	if (cmd->op == IOMMU_OPTION_OP_SET) {
+		int rc = 0;
+
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+
+		xa_lock(&ictx->objects);
+		if (!xa_empty(&ictx->objects)) {
+			rc = -EBUSY;
+		} else {
+			if (cmd->val64 == 0)
+				ictx->account_mode = IOPT_PAGES_ACCOUNT_USER;
+			else if (cmd->val64 == 1)
+				ictx->account_mode = IOPT_PAGES_ACCOUNT_MM;
+			else
+				rc = -EINVAL;
+		}
+		xa_unlock(&ictx->objects);
+
+		return rc;
+	}
+	return -EOPNOTSUPP;
+}
+
+static int iommufd_ioas_option_huge_pages(struct iommu_option *cmd,
+					  struct iommufd_ioas *ioas)
+{
+	if (cmd->op == IOMMU_OPTION_OP_GET) {
+		cmd->val64 = !ioas->iopt.disable_large_pages;
+		return 0;
+	}
+	if (cmd->op == IOMMU_OPTION_OP_SET) {
+		if (cmd->val64 == 0)
+			return iopt_disable_large_pages(&ioas->iopt);
+		if (cmd->val64 == 1) {
+			iopt_enable_large_pages(&ioas->iopt);
+			return 0;
+		}
+		return -EINVAL;
+	}
+	return -EOPNOTSUPP;
+}
+
+int iommufd_ioas_option(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_option *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc = 0;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->object_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	switch (cmd->option_id) {
+	case IOMMU_OPTION_HUGE_PAGES:
+		rc = iommufd_ioas_option_huge_pages(cmd, ioas);
+		break;
+	default:
+		rc = -EOPNOTSUPP;
+	}
+
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 6b7e0982b8c802..68c12115400285 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -11,6 +11,7 @@
 
 struct iommu_domain;
 struct iommu_group;
+struct iommu_option;
 
 struct iommufd_ctx {
 	struct file *file;
@@ -102,6 +103,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_IOAS,
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -174,6 +176,37 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 			     type),                                            \
 		     typeof(*(ptr)), obj)
 
+/*
+ * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
+ * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
+ * mapping is copied into all of the associated domains and made available to
+ * in-kernel users.
+ */
+struct iommufd_ioas {
+	struct iommufd_object obj;
+	struct io_pagetable iopt;
+};
+
+static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
+						    u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_IOAS),
+			    struct iommufd_ioas, obj);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_ioas_destroy(struct iommufd_object *obj);
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
+int iommufd_option_rlimit_mode(struct iommu_option *cmd,
+			       struct iommufd_ctx *ictx);
+
 struct iommufd_access {
 	unsigned long iova_alignment;
 	u32 iopt_access_list_id;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 3a705cadb85020..266109045537ed 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -202,8 +202,36 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static int iommufd_option(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_option *cmd = ucmd->cmd;
+	int rc;
+
+	switch (cmd->option_id) {
+	case IOMMU_OPTION_RLIMIT_MODE:
+		rc = iommufd_option_rlimit_mode(cmd, ucmd->ictx);
+		break;
+	case IOMMU_OPTION_HUGE_PAGES:
+		rc = iommufd_ioas_option(ucmd);
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+	if (rc)
+		return rc;
+	if (copy_to_user(&((struct iommu_option __user *)ucmd->ubuffer)->val64,
+			 &cmd->val64, sizeof(cmd->val64)))
+		return -EFAULT;
+	return 0;
+}
+
 union ucmd_buffer {
 	struct iommu_destroy destroy;
+	struct iommu_ioas_alloc alloc;
+	struct iommu_ioas_allow_iovas allow_iovas;
+	struct iommu_ioas_iova_ranges iova_ranges;
+	struct iommu_ioas_map map;
+	struct iommu_ioas_unmap unmap;
 };
 
 struct iommufd_ioctl_op {
@@ -224,6 +252,20 @@ struct iommufd_ioctl_op {
 	}
 static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
+		 struct iommu_ioas_alloc, out_ioas_id),
+	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
+		 struct iommu_ioas_allow_iovas, allowed_iovas),
+	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
+		 src_iova),
+	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
+		 struct iommu_ioas_iova_ranges, out_iova_alignment),
+	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
+		 iova),
+	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
+		 length),
+	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
+		 val64),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -310,6 +352,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_NS_GPL(iommufd_ctx_put, IOMMUFD);
 
 static const struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_IOAS] = {
+		.destroy = iommufd_ioas_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2ad06b27a35fe5..a872a50d800a41 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -37,6 +37,13 @@
 enum {
 	IOMMUFD_CMD_BASE = 0x80,
 	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_ALLOW_IOVAS,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_OPTION,
 };
 
 /**
@@ -52,4 +59,238 @@ struct iommu_destroy {
 };
 #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
 
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_iova_range
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @num_iovas: Input/Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to the output array of struct iommu_iova_range
+ * @out_iova_alignment: Minimum alignment required for mapping IOVA
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
+ * is not allowed. num_iovas will be set to the total number of iovas and
+ * the allowed_iovas[] will be filled in as space permits.
+ *
+ * The allowed ranges are dependent on the HW path the DMA operation takes, and
+ * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
+ * full range, and each attached device will narrow the ranges based on that
+ * device's HW restrictions. Detaching a device can widen the ranges. Userspace
+ * should query ranges after every attach/detach to know what IOVAs are valid
+ * for mapping.
+ *
+ * On input num_iovas is the length of the allowed_iovas array. On output it is
+ * the total number of IOVA ranges in the IOAS. The ioctl will return -EMSGSIZE
+ * and set num_iovas to the required value if num_iovas is too small. In this
+ * case the caller should allocate a larger output array and re-issue the ioctl.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+	__aligned_u64 out_iova_alignment;
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * struct iommu_ioas_allow_iovas - ioctl(IOMMU_IOAS_ALLOW_IOVAS)
+ * @size: sizeof(struct iommu_ioas_allow_iovas)
+ * @ioas_id: IOAS ID to allow IOVAs from
+ * @num_iovas: Input/Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to array of struct iommu_iova_range
+ *
+ * Ensure a range of IOVAs are always available for allocation. If this call
+ * succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges
+ * that are narrower than the ranges provided here. This call will fail if
+ * IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.
+ *
+ * When an IOAS is first created the IOVA_RANGES will be maximally sized, and as
+ * devices are attached the IOVA will narrow based on the device restrictions.
+ * When an allowed range is specified any narrowing will be refused, i.e. device
+ * attachment can fail if the device requires limiting within the allowed range.
+ *
+ * Automatic IOVA allocation is also impacted by this call. MAP will only
+ * allocate within the allowed IOVAs if they are present.
+ *
+ * This call replaces the entire allowed list with the given list.
+ */
+struct iommu_ioas_allow_iovas {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+};
+#define IOMMU_IOAS_ALLOW_IOVAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOW_IOVAS)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location based on
+ * the reserved and allowed lists will be automatically selected and returned in
+ * iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ *
+ * This may be used to efficiently clone a subset of an IOAS to another, or as a
+ * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
+ * establishing equivalent new mappings, as internal resources are shared, and
+ * the kernel will pin the user memory only once.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap, and return back the bytes unmapped
+ *
+ * Unmap an IOVA range. The iova/length must be a superset of a previously
+ * mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or
+ * truncating ranges is not allowed. Passing iova 0 and length U64_MAX will
+ * unmap everything.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_option
+ * @IOMMU_OPTION_RLIMIT_MODE:
+ *    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
+ *    to invoke this. Value 0 (default) is user-based accounting, 1 uses
+ *    process-based accounting. Global option, object_id must be 0
+ * @IOMMU_OPTION_HUGE_PAGES:
+ *    Value 1 (default) allows contiguous pages to be combined when generating
+ *    iommu mappings. Value 0 disables combining, everything is mapped to
+ *    PAGE_SIZE. This can be useful for benchmarking.  This is a per-IOAS
+ *    option, the object_id must be the IOAS ID.
+ */
+enum iommufd_option {
+	IOMMU_OPTION_RLIMIT_MODE = 0,
+	IOMMU_OPTION_HUGE_PAGES = 1,
+};
+
+/**
+ * enum iommufd_option_ops
+ * @IOMMU_OPTION_OP_SET: Set the option's value
+ * @IOMMU_OPTION_OP_GET: Get the option's value
+ */
+enum iommufd_option_ops {
+	IOMMU_OPTION_OP_SET = 0,
+	IOMMU_OPTION_OP_GET = 1,
+};
+
+/**
+ * @size: sizeof(struct iommu_option)
+ * @option_id: One of enum iommufd_option
+ * @op: One of enum iommufd_option_ops
+ * @__reserved: Must be 0
+ * @object_id: ID of the object if required
+ * @val64: Option value to set or value returned on get
+ *
+ * Change a simple option value. This multiplexer allows controlling options
+ * on objects. IOMMU_OPTION_OP_SET will set an option and IOMMU_OPTION_OP_GET
+ * will return the current value.
+ */
+struct iommu_option {
+	__u32 size;
+	__u32 option_id;
+	__u16 op;
+	__u16 __reserved;
+	__u32 object_id;
+	__aligned_u64 val64;
+};
+#define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
 #endif
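
The -EMSGSIZE convention for IOMMU_IOAS_IOVA_RANGES documented above can be
handled with a simple two-call pattern. A sketch, assuming fd is an open
iommufd file descriptor, get_allowed_ranges() is a hypothetical helper, and
at least one allowed range exists; real code should also retry, since a
device attach between the two calls can change the count:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static struct iommu_iova_range *get_allowed_ranges(int fd, uint32_t ioas_id,
						   uint32_t *num)
{
	struct iommu_ioas_iova_ranges cmd = {
		.size = sizeof(cmd),
		.ioas_id = ioas_id,
	};
	struct iommu_iova_range *ranges;

	/* First call with num_iovas == 0 only learns the required count */
	if (ioctl(fd, IOMMU_IOAS_IOVA_RANGES, &cmd) && errno != EMSGSIZE)
		return NULL;

	ranges = calloc(cmd.num_iovas, sizeof(*ranges));
	if (!ranges)
		return NULL;
	cmd.allowed_iovas = (uintptr_t)ranges;
	if (ioctl(fd, IOMMU_IOAS_IOVA_RANGES, &cmd)) {
		free(ranges);
		return NULL;
	}
	*num = cmd.num_iovas;
	return ranges;
}
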
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 11/17] iommufd: Add a HW pagetable object
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08  0:49 ` [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices Jason Gunthorpe
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The hw_pagetable object exposes internal struct iommu_domain objects to
userspace. An iommu_domain is required when any DMA device attaches to an
IOAS to control the io page table through the iommu driver.

For compatibility with VFIO the hw_pagetable is automatically created when
a DMA device is attached to the IOAS. If a compatible iommu_domain already
exists then the hw_pagetable associated with it is used for the
attachment.

In the initial series there is no iommufd uAPI for the hw_pagetable
object. The next patch provides driver-facing APIs for IO page table
attachment that allow drivers to accept either an IOAS or a hw_pagetable
ID and to return the hw_pagetable ID that was auto-selected from an IOAS.
The expectation is that the driver will provide uAPI through its
own FD for attaching its device to iommufd. This allows userspace to learn
the mapping of devices to iommu_domains and to override the automatic
attachment.

The future HW-specific interface will allow userspace to create
hw_pagetable objects using iommu_domains with IOMMU-driver-specific
parameters. This infrastructure will allow linking those domains to IOASs
and devices.
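
A rough sketch of the auto-selection this enables; the real attach path,
with its locking and error handling, arrives with the device support in the
next patch, so the function below and its arguments are hypothetical:

static struct iommufd_hw_pagetable *
example_get_hwpt(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
		 struct device *dev, struct iommu_group *group)
{
	struct iommufd_hw_pagetable *hwpt;

	mutex_lock(&ioas->mutex);
	/* Prefer re-using an iommu_domain already linked to this IOAS */
	list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
		if (!iommu_attach_group(hwpt->domain, group)) {
			mutex_unlock(&ioas->mutex);
			return hwpt;
		}
	}

	/*
	 * Nothing compatible, wrap a freshly allocated iommu_domain.
	 * Attaching the new domain and linking it into hwpt_list is omitted.
	 */
	hwpt = iommufd_hw_pagetable_alloc(ictx, ioas, dev);
	mutex_unlock(&ioas->mutex);
	return hwpt;
}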

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |  1 +
 drivers/iommu/iommufd/hw_pagetable.c    | 57 +++++++++++++++++++++++++
 drivers/iommu/iommufd/ioas.c            |  3 ++
 drivers/iommu/iommufd/iommufd_private.h | 33 ++++++++++++++
 drivers/iommu/iommufd/main.c            |  3 ++
 5 files changed, 97 insertions(+)
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2b4f36f1b72f9d..e13e971aa28c60 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
new file mode 100644
index 00000000000000..43d473989a0667
--- /dev/null
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommu.h>
+
+#include "iommufd_private.h"
+
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_hw_pagetable *hwpt =
+		container_of(obj, struct iommufd_hw_pagetable, obj);
+
+	WARN_ON(!list_empty(&hwpt->devices));
+
+	iommu_domain_free(hwpt->domain);
+	refcount_dec(&hwpt->ioas->obj.users);
+	mutex_destroy(&hwpt->devices_lock);
+}
+
+/**
+ * iommufd_hw_pagetable_alloc() - Get an iommu_domain for a device
+ * @ictx: iommufd context
+ * @ioas: IOAS to associate the domain with
+ * @dev: Device to get an iommu_domain for
+ *
+ * Allocate a new iommu_domain and return it as a hw_pagetable.
+ */
+struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
+			   struct device *dev)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(hwpt))
+		return hwpt;
+
+	hwpt->domain = iommu_domain_alloc(dev->bus);
+	if (!hwpt->domain) {
+		rc = -ENOMEM;
+		goto out_abort;
+	}
+
+	INIT_LIST_HEAD(&hwpt->devices);
+	INIT_LIST_HEAD(&hwpt->hwpt_item);
+	mutex_init(&hwpt->devices_lock);
+	/* Pairs with iommufd_hw_pagetable_destroy() */
+	refcount_inc(&ioas->obj.users);
+	hwpt->ioas = ioas;
+	return hwpt;
+
+out_abort:
+	iommufd_object_abort(ictx, &hwpt->obj);
+	return ERR_PTR(rc);
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 7671456e86413a..64e6d0f73e39aa 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -17,6 +17,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
 	rc = iopt_unmap_all(&ioas->iopt, NULL);
 	WARN_ON(rc && rc != -ENOENT);
 	iopt_destroy_table(&ioas->iopt);
+	mutex_destroy(&ioas->mutex);
 }
 
 struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
@@ -28,6 +29,8 @@ struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
 		return ioas;
 
 	iopt_init_table(&ioas->iopt);
+	INIT_LIST_HEAD(&ioas->hwpt_list);
+	mutex_init(&ioas->mutex);
 	return ioas;
 }
 
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 68c12115400285..d0272efcd8038e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -103,6 +103,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 };
 
@@ -181,10 +182,20 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
  * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
  * mapping is copied into all of the associated domains and made available to
  * in-kernel users.
+ *
+ * Every iommu_domain that is created is wrapped in an iommufd_hw_pagetable
+ * object. When we go to attach a device to an IOAS we need to get an
+ * iommu_domain and wrapping iommufd_hw_pagetable for it.
+ *
+ * An iommu_domain & iommufd_hw_pagetable will be automatically selected
+ * for a device based on the hwpt_list. If no suitable iommu_domain
+ * is found a new iommu_domain will be created.
  */
 struct iommufd_ioas {
 	struct iommufd_object obj;
 	struct io_pagetable iopt;
+	struct mutex mutex;
+	struct list_head hwpt_list;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
@@ -207,6 +218,28 @@ int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
 int iommufd_option_rlimit_mode(struct iommu_option *cmd,
 			       struct iommufd_ctx *ictx);
 
+/*
+ * A HW pagetable is called an iommu_domain inside the kernel. This user object
+ * allows directly creating and inspecting the domains. Domains that have kernel
+ * owned page tables will be associated with an iommufd_ioas that provides the
+ * IOVA to PFN map.
+ */
+struct iommufd_hw_pagetable {
+	struct iommufd_object obj;
+	struct iommufd_ioas *ioas;
+	struct iommu_domain *domain;
+	bool auto_domain : 1;
+	/* Head at iommufd_ioas::hwpt_list */
+	struct list_head hwpt_item;
+	struct mutex devices_lock;
+	struct list_head devices;
+};
+
+struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
+			   struct device *dev);
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
+
 struct iommufd_access {
 	unsigned long iova_alignment;
 	u32 iopt_access_list_id;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 266109045537ed..3eab714b8e12a3 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -355,6 +355,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
+	[IOMMUFD_OBJ_HW_PAGETABLE] = {
+		.destroy = iommufd_hw_pagetable_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 11/17] iommufd: Add a HW pagetable object Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08 14:34   ` Yi Liu
  2022-11-14  7:59   ` Tian, Kevin
  2022-11-08  0:49 ` [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access Jason Gunthorpe
                   ` (6 subsequent siblings)
  18 siblings, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:

iommufd_device_bind() / iommufd_device_unbind()
  Register the device with iommufd and establish security isolation.

iommufd_device_attach() / iommufd_device_detach()
  Connect a bound device to a page table

Binding a device creates a device object ID in the uAPI; however, the
generic API provides no IOCTLs to manipulate it.
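
For orientation, here is a minimal sketch (illustrative only, not part of
the patch) of how a consuming driver might use these calls. "struct
my_state" and my_open() are hypothetical names; the iommufd_device_*
signatures are the ones added below, and the driver is assumed to set
driver_managed_dma as required.

static int my_open(struct my_state *st, struct iommufd_ctx *ictx,
		   struct device *dev, u32 ioas_id)
{
	u32 device_id, pt_id = ioas_id;
	int rc;

	/* Take ownership of the device's group/RID */
	st->idev = iommufd_device_bind(ictx, dev, &device_id);
	if (IS_ERR(st->idev))
		return PTR_ERR(st->idev);

	/* Attach to the IOAS; pt_id is updated to the HW_PAGETABLE ID */
	rc = iommufd_device_attach(st->idev, &pt_id, 0);
	if (rc) {
		iommufd_device_unbind(st->idev);
		return rc;
	}
	/* device_id and pt_id would be reported back to userspace */
	return 0;
}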

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/device.c          | 402 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |   5 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  13 +
 5 files changed, 424 insertions(+)
 create mode 100644 drivers/iommu/iommufd/device.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index e13e971aa28c60..ca28a135b9675f 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	device.o \
 	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
new file mode 100644
index 00000000000000..a3bf3c07d3f800
--- /dev/null
+++ b/drivers/iommu/iommufd/device.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/irqdomain.h>
+
+#include "iommufd_private.h"
+
+/*
+ * An iommufd_device object represents the binding relationship between a
+ * consuming driver and the iommufd. These objects are created/destroyed by
+ * external drivers, not by userspace.
+ */
+struct iommufd_device {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_hw_pagetable *hwpt;
+	/* Head at iommufd_hw_pagetable::devices */
+	struct list_head devices_item;
+	/* always the physical device */
+	struct device *dev;
+	struct iommu_group *group;
+	bool enforce_cache_coherency;
+};
+
+void iommufd_device_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_device *idev =
+		container_of(obj, struct iommufd_device, obj);
+
+	iommu_device_release_dma_owner(idev->dev);
+	iommu_group_put(idev->group);
+	iommufd_ctx_put(idev->ictx);
+}
+
+/**
+ * iommufd_device_bind - Bind a physical device to an iommu fd
+ * @ictx: iommufd file descriptor
+ * @dev: Pointer to a physical PCI device struct
+ * @id: Output ID number to return to userspace for this device
+ *
+ * A successful bind establishes an ownership over the device and returns
+ * struct iommufd_device pointer, otherwise returns error pointer.
+ *
+ * A driver using this API must set driver_managed_dma and must not touch
+ * the device until this routine succeeds and establishes ownership.
+ *
+ * Binding a PCI device places the entire RID under iommufd control.
+ *
+ * The caller must undo this with iommufd_device_unbind()
+ */
+struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
+					   struct device *dev, u32 *id)
+{
+	struct iommufd_device *idev;
+	struct iommu_group *group;
+	int rc;
+
+	/*
+	 * iommufd always sets IOMMU_CACHE because we offer no way for userspace
+	 * to restore cache coherency.
+	 */
+	if (!device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY))
+		return ERR_PTR(-EINVAL);
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return ERR_PTR(-ENODEV);
+
+	rc = iommu_device_claim_dma_owner(dev, ictx);
+	if (rc)
+		goto out_group_put;
+
+	idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_release_owner;
+	}
+	idev->ictx = ictx;
+	iommufd_ctx_get(ictx);
+	idev->dev = dev;
+	idev->enforce_cache_coherency =
+		device_iommu_capable(dev, IOMMU_CAP_ENFORCE_CACHE_COHERENCY);
+	/* The calling driver is a user until iommufd_device_unbind() */
+	refcount_inc(&idev->obj.users);
+	/* group refcount moves into iommufd_device */
+	idev->group = group;
+
+	/*
+	 * If the caller fails after this success it must call
+	 * iommufd_device_unbind() which is safe since we hold this refcount.
+	 * This also means the device is a leaf in the graph and no other object
+	 * can take a reference on it.
+	 */
+	iommufd_object_finalize(ictx, &idev->obj);
+	*id = idev->obj.id;
+	return idev;
+
+out_release_owner:
+	iommu_device_release_dma_owner(dev);
+out_group_put:
+	iommu_group_put(group);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_bind, IOMMUFD);
+
+void iommufd_device_unbind(struct iommufd_device *idev)
+{
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(idev->ictx, &idev->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
+
+static int iommufd_device_setup_msi(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *hwpt,
+				    phys_addr_t sw_msi_start,
+				    unsigned int flags)
+{
+	int rc;
+
+	/*
+	 * IOMMU_CAP_INTR_REMAP means that the platform is isolating MSI, and it
+	 * creates the MSI window by default in the iommu domain. Nothing
+	 * further to do.
+	 */
+	if (device_iommu_capable(idev->dev, IOMMU_CAP_INTR_REMAP))
+		return 0;
+
+	/*
+	 * On ARM systems that set the global IRQ_DOMAIN_FLAG_MSI_REMAP every
+	 * allocated iommu_domain will block interrupts by default and this
+	 * special flow is needed to turn them back on. iommu_dma_prepare_msi()
+	 * will install pages into our domain after request_irq() to make this
+	 * work.
+	 *
+	 * FIXME: This is conceptually broken for iommufd since we want to allow
+	 * userspace to change the domains, eg switch from an identity IOAS to a
+	 * DMA IOAS. There is currently no way to create a MSI window that
+	 * matches what the IRQ layer actually expects in a newly created
+	 * domain.
+	 */
+	if (irq_domain_check_msi_remap()) {
+		if (WARN_ON(!sw_msi_start))
+			return -EPERM;
+		/*
+		 * iommu_get_msi_cookie() can only be called once per domain,
+		 * it returns -EBUSY on later calls.
+		 */
+		if (hwpt->msi_cookie)
+			return 0;
+		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
+		if (rc)
+			return rc;
+		hwpt->msi_cookie = true;
+		return 0;
+	}
+
+	/*
+	 * Otherwise the platform has a MSI window that is not isolated. For
+	 * historical compat with VFIO allow a module parameter to ignore the
+	 * insecurity.
+	 */
+	if (!(flags & IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT))
+		return -EPERM;
+	else
+		dev_warn(
+			idev->dev,
+			"Device interrupts cannot be isolated by the IOMMU, this platform is insecure. Use an \"allow_unsafe_interrupts\" module parameter to override\n");
+
+	return 0;
+}
+
+static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
+					   struct iommu_group *group)
+{
+	struct iommufd_device *cur_dev;
+
+	list_for_each_entry(cur_dev, &hwpt->devices, devices_item)
+		if (cur_dev->group == group)
+			return true;
+	return false;
+}
+
+static int iommufd_device_do_attach(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *hwpt,
+				    unsigned int flags)
+{
+	phys_addr_t sw_msi_start = 0;
+	int rc;
+
+	mutex_lock(&hwpt->devices_lock);
+
+	/*
+	 * Try to upgrade the domain we have, it is an iommu driver bug to
+	 * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail
+	 * enforce_cache_coherency when there are no devices attached to the
+	 * domain.
+	 */
+	if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) {
+		if (hwpt->domain->ops->enforce_cache_coherency)
+			hwpt->enforce_cache_coherency =
+				hwpt->domain->ops->enforce_cache_coherency(
+					hwpt->domain);
+		if (!hwpt->enforce_cache_coherency) {
+			WARN_ON(list_empty(&hwpt->devices));
+			rc = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
+	rc = iopt_table_enforce_group_resv_regions(&hwpt->ioas->iopt, idev->dev,
+						   idev->group, &sw_msi_start);
+	if (rc)
+		goto out_unlock;
+
+	rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
+	if (rc)
+		goto out_iova;
+
+	/*
+	 * FIXME: Hack around missing a device-centric iommu api, only attach to
+	 * the group once for the first device that is in the group.
+	 */
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		rc = iommu_attach_group(hwpt->domain, idev->group);
+		if (rc)
+			goto out_iova;
+
+		if (list_empty(&hwpt->devices)) {
+			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
+						   hwpt->domain);
+			if (rc)
+				goto out_detach;
+		}
+	}
+
+	idev->hwpt = hwpt;
+	refcount_inc(&hwpt->obj.users);
+	list_add(&idev->devices_item, &hwpt->devices);
+	mutex_unlock(&hwpt->devices_lock);
+	return 0;
+
+out_detach:
+	iommu_detach_group(hwpt->domain, idev->group);
+out_iova:
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+out_unlock:
+	mutex_unlock(&hwpt->devices_lock);
+	return rc;
+}
+
+/*
+ * When automatically managing the domains we search for a compatible domain in
+ * the iopt and if one is found use it, otherwise create a new domain.
+ * Automatic domain selection will never pick a manually created domain.
+ */
+static int iommufd_device_auto_get_domain(struct iommufd_device *idev,
+					  struct iommufd_ioas *ioas,
+					  unsigned int flags)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	/*
+	 * There is no differentiation when domains are allocated, so any domain
+	 * that is willing to attach to the device is interchangeable with any
+	 * other.
+	 */
+	mutex_lock(&ioas->mutex);
+	list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
+		if (!hwpt->auto_domain)
+			continue;
+
+		rc = iommufd_device_do_attach(idev, hwpt, flags);
+
+		/*
+		 * -EINVAL means the domain is incompatible with the device.
+		 * Other error codes should propagate to userspace as failure.
+		 * Success means the domain is attached.
+		 */
+		if (rc == -EINVAL)
+			continue;
+		goto out_unlock;
+	}
+
+	hwpt = iommufd_hw_pagetable_alloc(idev->ictx, ioas, idev->dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_unlock;
+	}
+	hwpt->auto_domain = true;
+
+	rc = iommufd_device_do_attach(idev, hwpt, flags);
+	if (rc)
+		goto out_abort;
+	list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
+
+	mutex_unlock(&ioas->mutex);
+	iommufd_object_finalize(idev->ictx, &hwpt->obj);
+	return 0;
+
+out_abort:
+	iommufd_object_abort_and_destroy(idev->ictx, &hwpt->obj);
+out_unlock:
+	mutex_unlock(&ioas->mutex);
+	return rc;
+}
+
+/**
+ * iommufd_device_attach - Connect a device to an iommu_domain
+ * @idev: device to attach
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
+ *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
+ * @flags: Optional flags
+ *
+ * This connects the device to an iommu_domain, either automatically or manually
+ * selected. Once this completes the device can do DMA.
+ *
+ * The caller should return the resulting pt_id back to userspace.
+ * This function is undone by calling iommufd_device_detach().
+ */
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags)
+{
+	struct iommufd_object *pt_obj;
+	int rc;
+
+	pt_obj = iommufd_get_object(idev->ictx, *pt_id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(pt_obj))
+		return PTR_ERR(pt_obj);
+
+	switch (pt_obj->type) {
+	case IOMMUFD_OBJ_HW_PAGETABLE: {
+		struct iommufd_hw_pagetable *hwpt =
+			container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+
+		rc = iommufd_device_do_attach(idev, hwpt, flags);
+		if (rc)
+			goto out_put_pt_obj;
+
+		mutex_lock(&hwpt->ioas->mutex);
+		list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
+		mutex_unlock(&hwpt->ioas->mutex);
+		break;
+	}
+	case IOMMUFD_OBJ_IOAS: {
+		struct iommufd_ioas *ioas =
+			container_of(pt_obj, struct iommufd_ioas, obj);
+
+		rc = iommufd_device_auto_get_domain(idev, ioas, flags);
+		if (rc)
+			goto out_put_pt_obj;
+		break;
+	}
+	default:
+		rc = -EINVAL;
+		goto out_put_pt_obj;
+	}
+
+	refcount_inc(&idev->obj.users);
+	*pt_id = idev->hwpt->obj.id;
+	rc = 0;
+
+out_put_pt_obj:
+	iommufd_put_object(pt_obj);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
+
+void iommufd_device_detach(struct iommufd_device *idev)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+	mutex_lock(&hwpt->ioas->mutex);
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
+		iommu_detach_group(hwpt->domain, idev->group);
+	}
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+	mutex_unlock(&hwpt->devices_lock);
+	mutex_unlock(&hwpt->ioas->mutex);
+
+	if (hwpt->auto_domain)
+		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+	else
+		refcount_dec(&hwpt->obj.users);
+
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d0272efcd8038e..1f8c73aca176e2 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -103,6 +103,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 };
@@ -229,6 +230,8 @@ struct iommufd_hw_pagetable {
 	struct iommufd_ioas *ioas;
 	struct iommu_domain *domain;
 	bool auto_domain : 1;
+	bool enforce_cache_coherency : 1;
+	bool msi_cookie : 1;
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
 	struct mutex devices_lock;
@@ -240,6 +243,8 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct device *dev);
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
+void iommufd_device_destroy(struct iommufd_object *obj);
+
 struct iommufd_access {
 	unsigned long iova_alignment;
 	u32 iopt_access_list_id;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 3eab714b8e12a3..8a114ddbdfcde2 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -352,6 +352,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_NS_GPL(iommufd_ctx_put, IOMMUFD);
 
 static const struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_DEVICE] = {
+		.destroy = iommufd_device_destroy,
+	},
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 26e09d539737bb..31efacd8a46cce 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -9,10 +9,23 @@
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/err.h>
+#include <linux/device.h>
 
+struct iommufd_device;
 struct iommufd_ctx;
 struct file;
 
+struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
+					   struct device *dev, u32 *id);
+void iommufd_device_unbind(struct iommufd_device *idev);
+
+enum {
+	IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT = 1 << 0,
+};
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags);
+void iommufd_device_detach(struct iommufd_device *idev);
+
 enum {
 	IOMMUFD_ACCESS_RW_READ = 0,
 	IOMMUFD_ACCESS_RW_WRITE = 1 << 0,
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (11 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-14  8:25   ` Tian, Kevin
  2022-11-08  0:49 ` [PATCH v4 14/17] iommufd: vfio container FD ioctl compatibility Jason Gunthorpe
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Kernel access is the mode that VFIO "mdevs" use. In this case there is no
struct device and no IOMMU connection. iommufd acts as a record keeper for
accesses and returns the actual struct pages back to the caller to use
however they need. eg with kmap or the DMA API.

Each caller must create a struct iommufd_access with
iommufd_access_create(), similar to how iommufd_device_bind() works. Using
this struct the caller can access blocks of IOVA using
iommufd_access_pin_pages() or iommufd_access_rw().

Callers must provide a callback that immediately unpins any IOVA being
used within a range. This happens if userspace unmaps the IOVA under the
pin.

The implementation forwards the access requests directly to the iopt
infrastructure that manages the iopt_pages_access.
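
For orientation, a minimal sketch (illustrative only, not part of the
patch) of a hypothetical mdev-style consumer follows. "struct my_mdev",
my_stop_dma() and the field names are made up; the iommufd_access_* calls
and the ops layout match the kAPI added below.

static void my_unmap(void *data, unsigned long iova, unsigned long length)
{
	struct my_mdev *m = data;

	/* Must synchronously stop DMA to and unpin this IOVA range */
	my_stop_dma(m, iova, length);
	iommufd_access_unpin_pages(m->access, iova, length);
}

static const struct iommufd_access_ops my_access_ops = {
	.needs_pin_pages = 1,
	.unmap = my_unmap,
};

static int my_start(struct my_mdev *m, struct iommufd_ctx *ictx, u32 ioas_id)
{
	m->access = iommufd_access_create(ictx, ioas_id, &my_access_ops, m);
	if (IS_ERR(m->access))
		return PTR_ERR(m->access);

	/* Pin the single page backing a hypothetical ring at m->ring_iova */
	return iommufd_access_pin_pages(m->access, m->ring_iova, PAGE_SIZE,
					&m->ring_page, IOMMUFD_ACCESS_RW_WRITE);
}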

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 303 ++++++++++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    |   8 +-
 drivers/iommu/iommufd/iommufd_private.h |  10 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  43 +++-
 5 files changed, 364 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index a3bf3c07d3f800..da365d0cf4cf19 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -6,6 +6,7 @@
 #include <linux/iommu.h>
 #include <linux/irqdomain.h>
 
+#include "io_pagetable.h"
 #include "iommufd_private.h"
 
 /*
@@ -400,3 +401,305 @@ void iommufd_device_detach(struct iommufd_device *idev)
 	refcount_dec(&idev->obj.users);
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);
+
+void iommufd_access_destroy_object(struct iommufd_object *obj)
+{
+	struct iommufd_access *access =
+		container_of(obj, struct iommufd_access, obj);
+
+	iopt_remove_access(&access->ioas->iopt, access);
+	iommufd_ctx_put(access->ictx);
+	refcount_dec(&access->ioas->obj.users);
+}
+
+struct iommufd_access *
+iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+		      const struct iommufd_access_ops *ops, void *data)
+{
+	struct iommufd_access *access;
+	struct iommufd_object *obj;
+	int rc;
+
+	/*
+	 * There is no uAPI for the access object, but to keep things symmetric
+	 * use the object infrastructure anyhow.
+	 */
+	access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
+	if (IS_ERR(access))
+		return access;
+
+	access->data = data;
+	access->ops = ops;
+	access->ictx = ictx;
+
+	obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(obj)) {
+		rc = PTR_ERR(obj);
+		goto out_abort;
+	}
+	access->ioas = container_of(obj, struct iommufd_ioas, obj);
+	iommufd_ref_to_users(obj);
+
+	if (ops->needs_pin_pages)
+		access->iova_alignment = PAGE_SIZE;
+	else
+		access->iova_alignment = 1;
+	rc = iopt_add_access(&access->ioas->iopt, access);
+	if (rc)
+		goto out_put_ioas;
+
+	/* The calling driver is a user until iommufd_access_destroy() */
+	refcount_inc(&access->obj.users);
+	iommufd_ctx_get(ictx);
+	iommufd_object_finalize(ictx, &access->obj);
+	return access;
+out_put_ioas:
+	refcount_dec(&access->ioas->obj.users);
+out_abort:
+	iommufd_object_abort(ictx, &access->obj);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_create, IOMMUFD);
+
+void iommufd_access_destroy(struct iommufd_access *access)
+{
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(access->ictx, &access->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_destroy, IOMMUFD);
+
+/**
+ * iommufd_access_notify_unmap - Notify users of an iopt to stop using it
+ * @iopt: iopt to work on
+ * @iova: Starting iova in the iopt
+ * @length: Number of bytes
+ *
+ * After this function returns there should be no users attached to the pages
+ * linked to this iopt that intersect with iova,length. Anyone that has attached
+ * a user through iopt_access_pages() needs to detach it through
+ * iommufd_access_unpin_pages() before this function returns.
+ *
+ * The unmap callback may not call or wait for a iommufd_access_destroy() to
+ * complete. Once iommufd_access_destroy() returns no ops are running and no
+ * future ops will be called.
+ */
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+				 unsigned long length)
+{
+	struct iommufd_ioas *ioas =
+		container_of(iopt, struct iommufd_ioas, iopt);
+	struct iommufd_access *access;
+	unsigned long index;
+
+	xa_lock(&ioas->iopt.access_list);
+	xa_for_each(&ioas->iopt.access_list, index, access) {
+		if (!iommufd_lock_obj(&access->obj))
+			continue;
+		xa_unlock(&ioas->iopt.access_list);
+
+		access->ops->unmap(access->data, iova, length);
+
+		iommufd_put_object(&access->obj);
+		xa_lock(&ioas->iopt.access_list);
+	}
+	xa_unlock(&ioas->iopt.access_list);
+}
+
+/**
+ * iommufd_access_unpin_pages() - Undo iommufd_access_pin_pages
+ * @access: IOAS access to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to access
+ *
+ * Undo iommufd_access_pin_pages(). The caller must stop accessing the pages
+ * before calling this. The iova/length must exactly match the pinned range.
+ */
+void iommufd_access_unpin_pages(struct iommufd_access *access,
+				unsigned long iova, unsigned long length)
+{
+	struct io_pagetable *iopt = &access->ioas->iopt;
+	struct iopt_area_contig_iter iter;
+	unsigned long last_iova;
+	struct iopt_area *area;
+
+	if (WARN_ON(!length) ||
+	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
+		return;
+
+	down_read(&iopt->iova_rwsem);
+	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
+		iopt_pages_remove_access(
+			area, iopt_area_iova_to_index(area, iter.cur_iova),
+			iopt_area_iova_to_index(
+				area,
+				min(last_iova, iopt_area_last_iova(area))));
+	up_read(&iopt->iova_rwsem);
+	WARN_ON(!iopt_area_contig_done(&iter));
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, IOMMUFD);
+
+static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter,
+					bool first)
+{
+	if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE)
+		return false;
+
+	if (!iopt_area_contig_done(iter) &&
+	    (iopt_area_start_byte(iter->area, iopt_area_last_iova(iter->area)) %
+	     PAGE_SIZE) != (PAGE_SIZE - 1))
+		return false;
+	return true;
+}
+
+static bool check_area_prot(struct iopt_area *area, unsigned int flags)
+{
+	if (flags & IOMMUFD_ACCESS_RW_WRITE)
+		return area->iommu_prot & IOMMU_WRITE;
+	return area->iommu_prot & IOMMU_READ;
+}
+
+/**
+ * iommufd_access_pin_pages() - Return a list of pages under the iova
+ * @access: IOAS access to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to access
+ * @out_pages: Output page list
+ * @flags: IOMMUFD_ACCESS_RW_* flags
+ *
+ * Reads @length bytes starting at iova and returns the struct page * pointers.
+ * These can be kmap'd by the caller for CPU access.
+ *
+ * The caller must perform iommufd_access_unpin_pages() when done to balance this.
+ *
+ * This API always requires a page aligned iova. This happens naturally if the
+ * ioas alignment is >= PAGE_SIZE and the iova is PAGE_SIZE aligned. However
+ * smaller alignments have corner cases where this API can fail on otherwise
+ * aligned iova.
+ */
+int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
+			     unsigned long length, struct page **out_pages,
+			     unsigned int flags)
+{
+	struct io_pagetable *iopt = &access->ioas->iopt;
+	struct iopt_area_contig_iter iter;
+	unsigned long last_iova;
+	struct iopt_area *area;
+	bool first = true;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+		unsigned long last_index = iopt_area_iova_to_index(area, last);
+		unsigned long index =
+			iopt_area_iova_to_index(area, iter.cur_iova);
+
+		if (area->prevent_access ||
+		    !iopt_area_contig_is_aligned(&iter, first)) {
+			rc = -EINVAL;
+			goto err_remove;
+		}
+
+		if (!check_area_prot(area, flags)) {
+			rc = -EPERM;
+			goto err_remove;
+		}
+		first = false;
+
+		mutex_lock(&area->pages->mutex);
+		rc = iopt_pages_add_access(area->pages, index, last_index,
+					   out_pages, flags);
+		if (rc) {
+			mutex_unlock(&area->pages->mutex);
+			goto err_remove;
+		}
+		area->num_accesses++;
+		mutex_unlock(&area->pages->mutex);
+		out_pages += last_index - index + 1;
+	}
+	if (!iopt_area_contig_done(&iter)) {
+		rc = -ENOENT;
+		goto err_remove;
+	}
+
+	up_read(&iopt->iova_rwsem);
+	return 0;
+
+err_remove:
+	if (iova < iter.cur_iova) {
+		last_iova = iter.cur_iova - 1;
+		iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
+			iopt_pages_remove_access(
+				area,
+				iopt_area_iova_to_index(area, iter.cur_iova),
+				iopt_area_iova_to_index(
+					area, min(last_iova,
+						  iopt_area_last_iova(area))));
+	}
+	up_read(&iopt->iova_rwsem);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_pin_pages, IOMMUFD);
+
+/**
+ * iommufd_access_rw - Read or write data under the iova
+ * @access: IOAS access to act on
+ * @iova: Starting IOVA
+ * @data: Kernel buffer to copy to/from
+ * @length: Number of bytes to access
+ * @flags: IOMMUFD_ACCESS_RW_* flags
+ *
+ * Copy kernel data to or from the IOVA range given by iova/length. If flags
+ * indicates IOMMUFD_ACCESS_RW_KTHREAD then a large copy can be optimized
+ * by changing it into copy_to/from_user().
+ */
+int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t length, unsigned int flags)
+{
+	struct io_pagetable *iopt = &access->ioas->iopt;
+	struct iopt_area_contig_iter iter;
+	struct iopt_area *area;
+	unsigned long last_iova;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+		unsigned long bytes = (last - iter.cur_iova) + 1;
+
+		if (area->prevent_access) {
+			rc = -EINVAL;
+			goto err_out;
+		}
+
+		if (!check_area_prot(area, flags)) {
+			rc = -EPERM;
+			goto err_out;
+		}
+
+		rc = iopt_pages_rw_access(
+			area->pages, iopt_area_start_byte(area, iter.cur_iova),
+			data, bytes, flags);
+		if (rc)
+			goto err_out;
+		data += bytes;
+	}
+	if (!iopt_area_contig_done(&iter))
+		rc = -ENOENT;
+err_out:
+	up_read(&iopt->iova_rwsem);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD);
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 86effe8ed1dce5..a7d16cd14ea200 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -468,6 +468,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 	 * is NULL. This prevents domain attach/detach from running
 	 * concurrently with cleaning up the area.
 	 */
+again:
 	down_read(&iopt->domains_rwsem);
 	down_write(&iopt->iova_rwsem);
 	while ((area = iopt_area_iter_first(iopt, start, end))) {
@@ -496,8 +497,11 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 			area->prevent_access = true;
 			up_write(&iopt->iova_rwsem);
 			up_read(&iopt->domains_rwsem);
-			/* Later patch calls back to drivers to unmap */
-			return -EBUSY;
+			iommufd_access_notify_unmap(iopt, area_first,
+						    iopt_area_length(area));
+			if (WARN_ON(READ_ONCE(area->num_accesses)))
+				return -EDEADLOCK;
+			goto again;
 		}
 
 		pages = area->pages;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 1f8c73aca176e2..d7d606d459d4c8 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -65,6 +65,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length, unsigned long *unmapped);
 int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
 
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+				 unsigned long length);
 int iopt_table_add_domain(struct io_pagetable *iopt,
 			  struct iommu_domain *domain);
 void iopt_table_remove_domain(struct io_pagetable *iopt,
@@ -106,6 +108,7 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
+	IOMMUFD_OBJ_ACCESS,
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -246,6 +249,11 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 void iommufd_device_destroy(struct iommufd_object *obj);
 
 struct iommufd_access {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_ioas *ioas;
+	const struct iommufd_access_ops *ops;
+	void *data;
 	unsigned long iova_alignment;
 	u32 iopt_access_list_id;
 };
@@ -253,4 +261,6 @@ struct iommufd_access {
 int iopt_add_access(struct io_pagetable *iopt, struct iommufd_access *access);
 void iopt_remove_access(struct io_pagetable *iopt,
 			struct iommufd_access *access);
+void iommufd_access_destroy_object(struct iommufd_object *obj);
+
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 8a114ddbdfcde2..c8cc0953dea13a 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -352,6 +352,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_NS_GPL(iommufd_ctx_put, IOMMUFD);
 
 static const struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_ACCESS] = {
+		.destroy = iommufd_access_destroy_object,
+	},
 	[IOMMUFD_OBJ_DEVICE] = {
 		.destroy = iommufd_device_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 31efacd8a46cce..fb9a4c275cca86 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -9,10 +9,12 @@
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/err.h>
-#include <linux/device.h>
 
+struct device;
 struct iommufd_device;
+struct page;
 struct iommufd_ctx;
+struct iommufd_access;
 struct file;
 
 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
@@ -26,6 +28,11 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
 			  unsigned int flags);
 void iommufd_device_detach(struct iommufd_device *idev);
 
+struct iommufd_access_ops {
+	u8 needs_pin_pages : 1;
+	void (*unmap)(void *data, unsigned long iova, unsigned long length);
+};
+
 enum {
 	IOMMUFD_ACCESS_RW_READ = 0,
 	IOMMUFD_ACCESS_RW_WRITE = 1 << 0,
@@ -33,11 +40,24 @@ enum {
 	IOMMUFD_ACCESS_RW_KTHREAD = 1 << 1,
 };
 
+struct iommufd_access *
+iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+		      const struct iommufd_access_ops *ops, void *data);
+void iommufd_access_destroy(struct iommufd_access *access);
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
 struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
 void iommufd_ctx_put(struct iommufd_ctx *ictx);
+
+int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
+			     unsigned long length, struct page **out_pages,
+			     unsigned int flags);
+void iommufd_access_unpin_pages(struct iommufd_access *access,
+				unsigned long iova, unsigned long length);
+int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, unsigned int flags);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -47,5 +67,26 @@ static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
 }
+
+static inline int iommufd_access_pin_pages(struct iommufd_access *access,
+					   unsigned long iova,
+					   unsigned long length,
+					   struct page **out_pages,
+					   unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iommufd_access_unpin_pages(struct iommufd_access *access,
+					      unsigned long iova,
+					      unsigned long length)
+{
+}
+
+static inline int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 14/17] iommufd: vfio container FD ioctl compatibility
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (12 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08  0:49 ` [PATCH v4 15/17] iommufd: Add a selftest Jason Gunthorpe
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations.

A userspace application can test against iommufd and confirm compatibility
then simply make a small change to open /dev/iommu instead of
/dev/vfio/vfio.

For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and
then all applications will use the compatibility path with no code
changes. A later series allows /dev/vfio/vfio to be directly provided by
iommufd, which allows the rlimit mode to work the same as well.

This series just provides the iommufd side of compatibility. Actually
linking this to VFIO_SET_CONTAINER is a followup series, with a link in
the cover letter.

Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.

Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
features while still using the VFIO style map/unmap ioctls.
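
For orientation, an illustrative userspace flow (not part of the patch)
is sketched below. It assumes a compatibility IOAS already exists, e.g.
because a device was attached; map_one_page() is a made-up helper, while
the ioctls and structures are the ones handled by iommufd_vfio_ioctl()
in this patch.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

static int map_one_page(int iommufd, void *buf, uint64_t iova)
{
	struct iommu_vfio_ioas get = {
		.size = sizeof(get),
		.op = IOMMU_VFIO_IOAS_GET,
	};
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,
		.iova = iova,
		.size = 4096,
	};

	/* Learn which IOAS the compatibility ioctls operate on */
	if (ioctl(iommufd, IOMMU_VFIO_IOAS, &get))
		return -1;

	/* Legacy type1 map, routed through iommufd_vfio_ioctl() */
	return ioctl(iommufd, VFIO_IOMMU_MAP_DMA, &map);
}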

While this is enough to operate qemu, it has a few differences:

 - Resource limits rely on memory cgroups to bound what userspace can do
   instead of the module parameter dma_entry_limit.

 - VFIO P2P is not implemented. The DMABUF patches for vfio are a start at
   a solution where iommufd would import a special DMABUF. This is to avoid
   further propagating the follow_pfn() security problem.

 - A full audit for pedantic compatibility details (eg errnos, etc) has
   not yet been done

 - powerpc SPAPR is left out, as it is not connected to the iommu_domain
   framework. It seems interest in SPAPR is minimal as it is currently
   non-working in v6.1-rc1. They will have to convert to the iommu
   subsystem framework to enjoy iommufd.

The following are not going to be implemented and we expect to remove them
from VFIO type1:

 - SW access 'dirty tracking'. As discussed in the cover letter this will
   be done in VFIO.

 - VFIO_TYPE1_NESTING_IOMMU
    https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/

 - VFIO_DMA_MAP_FLAG_VADDR
    https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/iommufd_private.h |   6 +
 drivers/iommu/iommufd/main.c            |  16 +-
 drivers/iommu/iommufd/vfio_compat.c     | 452 ++++++++++++++++++++++++
 include/linux/iommufd.h                 |   7 +
 include/uapi/linux/iommufd.h            |  36 ++
 6 files changed, 514 insertions(+), 6 deletions(-)
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index ca28a135b9675f..2fdff04000b326 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -5,6 +5,7 @@ iommufd-y := \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
-	pages.o
+	pages.o \
+	vfio_compat.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d7d606d459d4c8..5bb9788fc5aef1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -18,6 +18,7 @@ struct iommufd_ctx {
 	struct xarray objects;
 
 	u8 account_mode;
+	struct iommufd_ioas *vfio_ioas;
 };
 
 /*
@@ -92,6 +93,9 @@ struct iommufd_ucmd {
 	void *cmd;
 };
 
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg);
+
 /* Copy the response in ucmd->cmd back to userspace. */
 static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 				       size_t cmd_len)
@@ -222,6 +226,8 @@ int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
 int iommufd_option_rlimit_mode(struct iommu_option *cmd,
 			       struct iommufd_ctx *ictx);
 
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
+
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
  * allows directly creating and inspecting the domains. Domains that have kernel
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index c8cc0953dea13a..f76b2c8dcde8e5 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -133,6 +133,8 @@ bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
 		return false;
 	}
 	__xa_erase(&ictx->objects, obj->id);
+	if (ictx->vfio_ioas && &ictx->vfio_ioas->obj == obj)
+		ictx->vfio_ioas = NULL;
 	xa_unlock(&ictx->objects);
 	up_write(&obj->destroy_rwsem);
 
@@ -266,27 +268,31 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
 		 val64),
+	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
+		 __reserved),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
 			       unsigned long arg)
 {
+	struct iommufd_ctx *ictx = filp->private_data;
 	const struct iommufd_ioctl_op *op;
 	struct iommufd_ucmd ucmd = {};
 	union ucmd_buffer buf;
 	unsigned int nr;
 	int ret;
 
-	ucmd.ictx = filp->private_data;
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return iommufd_vfio_ioctl(ictx, cmd, arg);
+
+	ucmd.ictx = ictx;
 	ucmd.ubuffer = (void __user *)arg;
 	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
 	if (ret)
 		return ret;
 
-	nr = _IOC_NR(cmd);
-	if (nr < IOMMUFD_CMD_BASE ||
-	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
-		return -ENOIOCTLCMD;
 	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
 	if (op->ioctl_num != cmd)
 		return -ENOIOCTLCMD;
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
new file mode 100644
index 00000000000000..dbef3274803336
--- /dev/null
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -0,0 +1,452 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/file.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <uapi/linux/vfio.h>
+#include <uapi/linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+static struct iommufd_ioas *get_compat_ioas(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas = ERR_PTR(-ENODEV);
+
+	xa_lock(&ictx->objects);
+	if (!ictx->vfio_ioas || !iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		goto out_unlock;
+	ioas = ictx->vfio_ioas;
+out_unlock:
+	xa_unlock(&ictx->objects);
+	return ioas;
+}
+
+/**
+ * iommufd_vfio_compat_ioas_id - Return the IOAS ID that vfio should use
+ * @ictx: Context to operate on
+ * @out_ioas_id: The ioas_id the caller should use
+ *
+ * The compatibility IOAS is the IOAS that the vfio compatibility ioctls operate
+ * on since they do not have an IOAS ID input in their ABI. Only attaching a
+ * group should cause a default creation of the internal ioas, this returns the
+ * existing ioas if it has already been assigned somehow.
+ */
+int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, u32 *out_ioas_id)
+{
+	struct iommufd_ioas *ioas = NULL;
+	struct iommufd_ioas *out_ioas;
+
+	ioas = iommufd_ioas_alloc(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	xa_lock(&ictx->objects);
+	if (ictx->vfio_ioas && iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		out_ioas = ictx->vfio_ioas;
+	else {
+		out_ioas = ioas;
+		ictx->vfio_ioas = ioas;
+	}
+	xa_unlock(&ictx->objects);
+
+	*out_ioas_id = out_ioas->obj.id;
+	if (out_ioas != ioas) {
+		iommufd_put_object(&out_ioas->obj);
+		iommufd_object_abort(ictx, &ioas->obj);
+		return 0;
+	}
+	/*
+	 * An automatically created compat IOAS is treated as a userspace
+	 * created object. Userspace can learn the ID via IOMMU_VFIO_IOAS_GET,
+	 * and if not manually destroyed it will be destroyed automatically
+	 * at iommufd release.
+	 */
+	iommufd_object_finalize(ictx, &ioas->obj);
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_vfio_compat_ioas_id, IOMMUFD_VFIO);
+
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_vfio_ioas *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+	switch (cmd->op) {
+	case IOMMU_VFIO_IOAS_GET:
+		ioas = get_compat_ioas(ucmd->ictx);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		cmd->ioas_id = ioas->obj.id;
+		iommufd_put_object(&ioas->obj);
+		return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+	case IOMMU_VFIO_IOAS_SET:
+		ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = ioas;
+		xa_unlock(&ucmd->ictx->objects);
+		iommufd_put_object(&ioas->obj);
+		return 0;
+
+	case IOMMU_VFIO_IOAS_CLEAR:
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = NULL;
+		xa_unlock(&ucmd->ictx->objects);
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int iommufd_vfio_map_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				void __user *arg)
+{
+	u32 supported_flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+	struct vfio_iommu_type1_dma_map map;
+	int iommu_prot = IOMMU_CACHE;
+	struct iommufd_ioas *ioas;
+	unsigned long iova;
+	int rc;
+
+	if (copy_from_user(&map, arg, minsz))
+		return -EFAULT;
+
+	if (map.argsz < minsz || map.flags & ~supported_flags)
+		return -EINVAL;
+
+	if (map.flags & VFIO_DMA_MAP_FLAG_READ)
+		iommu_prot |= IOMMU_READ;
+	if (map.flags & VFIO_DMA_MAP_FLAG_WRITE)
+		iommu_prot |= IOMMU_WRITE;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	/*
+	 * Maps created through the legacy interface always use VFIO compatible
+	 * rlimit accounting. If the user wishes to use the faster user based
+	 * rlimit accounting then they must use the new interface.
+	 */
+	iova = map.iova;
+	rc = iopt_map_user_pages(ictx, &ioas->iopt, &iova, u64_to_user_ptr(map.vaddr),
+				 map.size, iommu_prot, 0);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				  void __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+	/*
+	 * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is obsoleted by the new
+	 * dirty tracking direction:
+	 *  https://lore.kernel.org/kvm/20220731125503.142683-1-yishaih@nvidia.com/
+	 *  https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.martins@oracle.com/
+	 */
+	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	struct vfio_iommu_type1_dma_unmap unmap;
+	struct iommufd_ioas *ioas;
+	unsigned long unmapped;
+	int rc;
+
+	if (copy_from_user(&unmap, arg, minsz))
+		return -EFAULT;
+
+	if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
+		return -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL) {
+		if (unmap.iova != 0 || unmap.size != 0) {
+			rc = -EINVAL;
+			goto err_put;
+		}
+		rc = iopt_unmap_all(&ioas->iopt, &unmapped);
+	} else {
+		if (READ_ONCE(ioas->iopt.disable_large_pages)) {
+			unsigned long iovas[] = { unmap.iova + unmap.size - 1,
+						  unmap.iova - 1 };
+
+			rc = iopt_cut_iova(&ioas->iopt, iovas,
+					   unmap.iova ? 2 : 1);
+			if (rc)
+				goto err_put;
+		}
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size,
+				     &unmapped);
+	}
+	unmap.size = unmapped;
+	if (copy_to_user(arg, &unmap, minsz))
+		rc = -EFAULT;
+
+err_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_cc_iommu(struct iommufd_ctx *ictx)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = 1;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	mutex_lock(&ioas->mutex);
+	list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
+		if (!hwpt->enforce_cache_coherency) {
+			rc = 0;
+			break;
+		}
+	}
+	mutex_unlock(&ioas->mutex);
+
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_check_extension(struct iommufd_ctx *ictx,
+					unsigned long type)
+{
+	switch (type) {
+	case VFIO_TYPE1_IOMMU:
+	case VFIO_TYPE1v2_IOMMU:
+	case VFIO_UNMAP_ALL:
+		return 1;
+
+	case VFIO_DMA_CC_IOMMU:
+		return iommufd_vfio_cc_iommu(ictx);
+
+	/*
+	 * This is obsolete, and to be removed from VFIO. It was an incomplete
+	 * idea that got merged.
+	 * https://lore.kernel.org/kvm/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
+	 */
+	case VFIO_TYPE1_NESTING_IOMMU:
+		return 0;
+
+	/*
+	 * VFIO_DMA_MAP_FLAG_VADDR
+	 * https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
+	 * https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/
+	 *
+	 * It is hard to see how this could be implemented safely.
+	 */
+	case VFIO_UPDATE_VADDR:
+	default:
+		return 0;
+	}
+}
+
+static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type)
+{
+	struct iommufd_ioas *ioas = NULL;
+	int rc = 0;
+
+	if (type != VFIO_TYPE1_IOMMU && type != VFIO_TYPE1v2_IOMMU)
+		return -EINVAL;
+
+	/* VFIO fails the set_iommu if there is no group */
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	if (type == VFIO_TYPE1_IOMMU)
+		rc = iopt_disable_large_pages(&ioas->iopt);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static unsigned long iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+	struct io_pagetable *iopt = &ioas->iopt;
+	unsigned long pgsize_bitmap = ULONG_MAX;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	down_read(&iopt->domains_rwsem);
+	xa_for_each(&iopt->domains, index, domain)
+		pgsize_bitmap &= domain->pgsize_bitmap;
+
+	/* See vfio_update_pgsize_bitmap() */
+	if (pgsize_bitmap & ~PAGE_MASK) {
+		pgsize_bitmap &= PAGE_MASK;
+		pgsize_bitmap |= PAGE_SIZE;
+	}
+	pgsize_bitmap = max(pgsize_bitmap, ioas->iopt.iova_alignment);
+	up_read(&iopt->domains_rwsem);
+	return pgsize_bitmap;
+}
+
+static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas,
+				 struct vfio_info_cap_header __user *cur,
+				 size_t avail)
+{
+	struct vfio_iommu_type1_info_cap_iova_range __user *ucap_iovas =
+		container_of(cur,
+			     struct vfio_iommu_type1_info_cap_iova_range __user,
+			     header);
+	struct vfio_iommu_type1_info_cap_iova_range cap_iovas = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE,
+			.version = 1,
+		},
+	};
+	struct interval_tree_span_iter span;
+
+	interval_tree_for_each_span(&span, &ioas->iopt.reserved_itree, 0,
+				    ULONG_MAX) {
+		struct vfio_iova_range range;
+
+		if (!span.is_hole)
+			continue;
+		range.start = span.start_hole;
+		range.end = span.last_hole;
+		if (avail >= struct_size(&cap_iovas, iova_ranges,
+					 cap_iovas.nr_iovas + 1) &&
+		    copy_to_user(&ucap_iovas->iova_ranges[cap_iovas.nr_iovas],
+				 &range, sizeof(range)))
+			return -EFAULT;
+		cap_iovas.nr_iovas++;
+	}
+	if (avail >= struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas) &&
+	    copy_to_user(ucap_iovas, &cap_iovas, sizeof(cap_iovas)))
+		return -EFAULT;
+	return struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas);
+}
+
+static int iommufd_fill_cap_dma_avail(struct iommufd_ioas *ioas,
+				      struct vfio_info_cap_header __user *cur,
+				      size_t avail)
+{
+	struct vfio_iommu_type1_info_dma_avail cap_dma = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL,
+			.version = 1,
+		},
+		/* iommufd has no limit, return the same value as VFIO. */
+		.avail = U16_MAX,
+	};
+
+	if (avail >= sizeof(cap_dma) &&
+	    copy_to_user(cur, &cap_dma, sizeof(cap_dma)))
+		return -EFAULT;
+	return sizeof(cap_dma);
+}
+
+static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
+				       void __user *arg)
+{
+	typedef int (*fill_cap_fn)(struct iommufd_ioas *ioas,
+				   struct vfio_info_cap_header __user *cur,
+				   size_t avail);
+	static const fill_cap_fn fill_fns[] = {
+		iommufd_fill_cap_dma_avail,
+		iommufd_fill_cap_iova,
+	};
+	size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+	struct vfio_info_cap_header __user *last_cap = NULL;
+	struct vfio_iommu_type1_info info;
+	struct iommufd_ioas *ioas;
+	size_t total_cap_size;
+	int rc;
+	int i;
+
+	if (copy_from_user(&info, arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+	minsz = min_t(size_t, info.argsz, sizeof(info));
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	info.flags = VFIO_IOMMU_INFO_PGSIZES;
+	info.iova_pgsizes = iommufd_get_pagesizes(ioas);
+	info.cap_offset = 0;
+
+	down_read(&ioas->iopt.iova_rwsem);
+	total_cap_size = sizeof(info);
+	for (i = 0; i != ARRAY_SIZE(fill_fns); i++) {
+		int cap_size;
+
+		if (info.argsz > total_cap_size)
+			cap_size = fill_fns[i](ioas, arg + total_cap_size,
+					       info.argsz - total_cap_size);
+		else
+			cap_size = fill_fns[i](ioas, NULL, 0);
+		if (cap_size < 0) {
+			rc = cap_size;
+			goto out_put;
+		}
+		if (last_cap && info.argsz >= total_cap_size &&
+		    put_user(total_cap_size, &last_cap->next)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		last_cap = arg + total_cap_size;
+		total_cap_size += cap_size;
+	}
+
+	/*
+	 * If the user did not provide enough space then only some caps are
+	 * returned and the argsz will be updated to the correct amount to get
+	 * all caps.
+	 */
+	if (info.argsz >= total_cap_size)
+		info.cap_offset = sizeof(info);
+	info.argsz = total_cap_size;
+	info.flags |= VFIO_IOMMU_INFO_CAPS;
+	if (copy_to_user(arg, &info, minsz)) {
+		rc = -EFAULT;
+		goto out_put;
+	}
+	rc = 0;
+
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		return VFIO_API_VERSION;
+	case VFIO_SET_IOMMU:
+		return iommufd_vfio_set_iommu(ictx, arg);
+	case VFIO_CHECK_EXTENSION:
+		return iommufd_vfio_check_extension(ictx, arg);
+	case VFIO_IOMMU_GET_INFO:
+		return iommufd_vfio_iommu_get_info(ictx, uarg);
+	case VFIO_IOMMU_MAP_DMA:
+		return iommufd_vfio_map_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_UNMAP_DMA:
+		return iommufd_vfio_unmap_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_DIRTY_PAGES:
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return -ENOIOCTLCMD;
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index fb9a4c275cca86..d57b044acca4cb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -58,6 +58,7 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
 				unsigned long iova, unsigned long length);
 int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 		      void *data, size_t len, unsigned int flags);
+int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, u32 *out_ioas_id);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -88,5 +89,11 @@ static inline int iommufd_access_rw(struct iommufd_access *access, unsigned long
 {
 	return -EOPNOTSUPP;
 }
+
+static inline int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx,
+					      u32 *out_ioas_id)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index a872a50d800a41..c7109d7e5e09a9 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -44,6 +44,7 @@ enum {
 	IOMMUFD_CMD_IOAS_MAP,
 	IOMMUFD_CMD_IOAS_UNMAP,
 	IOMMUFD_CMD_OPTION,
+	IOMMUFD_CMD_VFIO_IOAS,
 };
 
 /**
@@ -293,4 +294,39 @@ struct iommu_option {
 	__aligned_u64 val64;
 };
 #define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
+
+/**
+ * enum iommufd_vfio_ioas_op
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not affected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 #endif
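
[
A minimal sketch of driving the new ioctl from userspace, assuming "iommufd"
is an open /dev/iommu FD and "my_ioas_id" is a hypothetical IOAS previously
created with IOMMU_IOAS_ALLOC:

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	static void set_compat_ioas(int iommufd, __u32 my_ioas_id)
	{
		struct iommu_vfio_ioas cmd = {
			.size = sizeof(cmd),
			.op = IOMMU_VFIO_IOAS_GET,
		};

		/* Report whichever IOAS the VFIO compat layer currently uses */
		if (!ioctl(iommufd, IOMMU_VFIO_IOAS, &cmd))
			printf("compat ioas_id = %u\n", cmd.ioas_id);

		/* Point the compat layer at an IOAS we allocated ourselves */
		cmd.op = IOMMU_VFIO_IOAS_SET;
		cmd.ioas_id = my_ioas_id;
		if (ioctl(iommufd, IOMMU_VFIO_IOAS, &cmd))
			perror("IOMMU_VFIO_IOAS");
	}
]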
-- 
2.38.1



* [PATCH v4 15/17] iommufd: Add a selftest
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (13 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 14/17] iommufd: vfio container FD ioctl compatibility Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08  1:01   ` Jason Gunthorpe
                     ` (2 more replies)
  2022-11-08  0:49 ` [PATCH v4 16/17] iommufd: Add some fault injection points Jason Gunthorpe
                   ` (3 subsequent siblings)
  18 siblings, 3 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Cover the essential functionality of the iommufd with a directed
test. This aims to achieve reasonable functional coverage using the
in-kernel selftest framework.

It provides a mock kernel module for the iommu_domain that allows it to
run without any HW, and the mocking provides a way to directly validate
that the PFNs loaded into the iommu_domain are correct.

The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will operate at a 2K iommu page size. This allows exercising all
of the calculations to support this mismatch.
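
[
A rough illustration of the arithmetic this exercises, assuming 4K CPU pages
so that MOCK_IO_PAGE_SIZE is 2K: a single pinned CPU page then backs two
IOPTEs, and all length/alignment math has to run in sub-PAGE_SIZE units.

	/* Values assumed for illustration only: a pinned 4K CPU page at pfn */
	phys_addr_t iopte0 = pfn * PAGE_SIZE;			   /* first 2K IOPTE */
	phys_addr_t iopte1 = pfn * PAGE_SIZE + MOCK_IO_PAGE_SIZE; /* second 2K IOPTE */
]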

This allows achieving high coverage of the corner cases in the iopt_pages.

However, enabling all of this requires an unusually invasive config
option. The config option should not be enabled in a production kernel.

Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/Kconfig                 |   11 +
 drivers/iommu/iommufd/Makefile                |    2 +
 drivers/iommu/iommufd/device.c                |   38 +
 drivers/iommu/iommufd/ioas.c                  |    3 +
 drivers/iommu/iommufd/iommufd_private.h       |   35 +
 drivers/iommu/iommufd/iommufd_test.h          |   93 +
 drivers/iommu/iommufd/main.c                  |   14 +
 drivers/iommu/iommufd/pages.c                 |    8 +
 drivers/iommu/iommufd/selftest.c              |  853 +++++++++
 include/linux/iommufd.h                       |    3 +
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/iommu/.gitignore      |    3 +
 tools/testing/selftests/iommu/Makefile        |   12 +
 tools/testing/selftests/iommu/config          |    2 +
 tools/testing/selftests/iommu/iommufd.c       | 1627 +++++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |  580 ++++++
 tools/testing/selftests/iommu/iommufd_utils.h |  278 +++
 17 files changed, 3563 insertions(+)
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c
 create mode 100644 tools/testing/selftests/iommu/iommufd_fail_nth.c
 create mode 100644 tools/testing/selftests/iommu/iommufd_utils.h

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index 6d65d0f06f169f..399a2edeaef6de 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -10,3 +10,14 @@ config IOMMUFD
 	  it relates to managing IO page tables that point at user space memory.
 
 	  If you don't know what to do here, say N.
+
+if IOMMUFD
+config IOMMUFD_TEST
+	bool "IOMMU Userspace API Test support"
+	depends on RUNTIME_TESTING_MENU
+	depends on FAULT_INJECTION
+	default n
+	help
+	  This is dangerous, do not enable unless running
+	  tools/testing/selftests/iommu
+endif
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2fdff04000b326..8aeba81800c512 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,4 +8,6 @@ iommufd-y := \
 	pages.o \
 	vfio_compat.o
 
+iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
+
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index da365d0cf4cf19..8ca5a38b2a79b8 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -703,3 +703,41 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 	return rc;
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, IOMMUFD);
+
+#ifdef CONFIG_IOMMUFD_TEST
+/*
+ * Creating a real iommufd_device is too hard, bypass creating an iommufd_device
+ * and go directly to attaching a domain.
+ */
+struct iommufd_hw_pagetable *
+iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+			       struct iommufd_ioas *ioas,
+			       struct device *mock_dev)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	hwpt = iommufd_hw_pagetable_alloc(ictx, ioas, mock_dev);
+	if (IS_ERR(hwpt))
+		return hwpt;
+
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		goto out_hwpt;
+
+	refcount_inc(&hwpt->obj.users);
+	iommufd_object_finalize(ictx, &hwpt->obj);
+	return hwpt;
+
+out_hwpt:
+	iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
+	return ERR_PTR(rc);
+}
+
+void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
+				    struct iommufd_hw_pagetable *hwpt)
+{
+	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+	refcount_dec(&hwpt->obj.users);
+}
+#endif
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 64e6d0f73e39aa..9e5b8c16bc9c45 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -237,6 +237,9 @@ int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
 	unsigned long iova;
 	int rc;
 
+	iommufd_test_syz_conv_iova_id(ucmd, cmd->src_ioas_id, &cmd->src_iova,
+				      &cmd->flags);
+
 	if ((cmd->flags &
 	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
 	       IOMMU_IOAS_MAP_READABLE)))
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 5bb9788fc5aef1..7605c5a5a6091b 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -113,6 +113,9 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_ACCESS,
+#ifdef CONFIG_IOMMUFD_TEST
+	IOMMUFD_OBJ_SELFTEST,
+#endif
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -269,4 +272,36 @@ void iopt_remove_access(struct io_pagetable *iopt,
 			struct iommufd_access *access);
 void iommufd_access_destroy_object(struct iommufd_object *obj);
 
+#ifdef CONFIG_IOMMUFD_TEST
+struct iommufd_hw_pagetable *
+iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+			       struct iommufd_ioas *ioas,
+			       struct device *mock_dev);
+void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
+				    struct iommufd_hw_pagetable *hwpt);
+int iommufd_test(struct iommufd_ucmd *ucmd);
+void iommufd_selftest_destroy(struct iommufd_object *obj);
+extern size_t iommufd_test_memory_limit;
+void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
+				   unsigned int ioas_id, u64 *iova, u32 *flags);
+bool iommufd_should_fail(void);
+void __init iommufd_test_init(void);
+void iommufd_test_exit(void);
+#else
+static inline void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
+						 unsigned int ioas_id,
+						 u64 *iova, u32 *flags)
+{
+}
+static inline bool iommufd_should_fail(void)
+{
+	return false;
+}
+static inline void __init iommufd_test_init(void)
+{
+}
+static inline void iommufd_test_exit(void)
+{
+}
+#endif
 #endif
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
new file mode 100644
index 00000000000000..1d96a8f466fd29
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_TEST_H
+#define _UAPI_IOMMUFD_TEST_H
+
+#include <linux/types.h>
+#include <linux/iommufd.h>
+
+enum {
+	IOMMU_TEST_OP_ADD_RESERVED = 1,
+	IOMMU_TEST_OP_MOCK_DOMAIN,
+	IOMMU_TEST_OP_MD_CHECK_MAP,
+	IOMMU_TEST_OP_MD_CHECK_REFS,
+	IOMMU_TEST_OP_CREATE_ACCESS,
+	IOMMU_TEST_OP_DESTROY_ACCESS_PAGES,
+	IOMMU_TEST_OP_ACCESS_PAGES,
+	IOMMU_TEST_OP_ACCESS_RW,
+	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+};
+
+enum {
+	MOCK_APERTURE_START = 1UL << 24,
+	MOCK_APERTURE_LAST = (1UL << 31) - 1,
+};
+
+enum {
+	MOCK_FLAGS_ACCESS_WRITE = 1 << 0,
+	MOCK_FLAGS_ACCESS_SYZ = 1 << 16,
+};
+
+enum {
+	MOCK_ACCESS_RW_WRITE = 1 << 0,
+	MOCK_ACCESS_RW_SLOW_PATH = 1 << 2,
+};
+
+enum {
+	MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES = 1 << 0,
+};
+
+struct iommu_test_cmd {
+	__u32 size;
+	__u32 op;
+	__u32 id;
+	__u32 __reserved;
+	union {
+		struct {
+			__aligned_u64 start;
+			__aligned_u64 length;
+		} add_reserved;
+		struct {
+			__u32 out_device_id;
+			__u32 out_hwpt_id;
+		} mock_domain;
+		struct {
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} check_map;
+		struct {
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+			__u32 refs;
+		} check_refs;
+		struct {
+			__u32 out_access_fd;
+			__u32 flags;
+		} create_access;
+		struct {
+			__u32 access_pages_id;
+		} destroy_access_pages;
+		struct {
+			__u32 flags;
+			__u32 out_access_pages_id;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} access_pages;
+		struct {
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+			__u32 flags;
+		} access_rw;
+		struct {
+			__u32 limit;
+		} memory_limit;
+	};
+	__u32 last;
+};
+#define IOMMU_TEST_CMD _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE + 32)
+
+#endif
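
[
A minimal sketch of how a test program might drive the header above, assuming
"iommufd" is an open /dev/iommu FD and "ioas_id" an already allocated IOAS;
the real selftests wrap this in helpers in iommufd_utils.h:

	#include <sys/ioctl.h>
	#include "iommufd_test.h"	/* the header added above */

	static int mock_domain(int iommufd, __u32 ioas_id,
			       __u32 *device_id, __u32 *hwpt_id)
	{
		struct iommu_test_cmd cmd = {
			.size = sizeof(cmd),
			.op = IOMMU_TEST_OP_MOCK_DOMAIN,
			.id = ioas_id,
		};

		if (ioctl(iommufd, IOMMU_TEST_CMD, &cmd))
			return -1;
		/* Both IDs must later be passed to IOMMU_DESTROY */
		*device_id = cmd.mock_domain.out_device_id;
		*hwpt_id = cmd.mock_domain.out_hwpt_id;
		return 0;
	}
]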
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index f76b2c8dcde8e5..7de0f95f2ee68a 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -19,6 +19,7 @@
 #include <linux/iommufd.h>
 
 #include "iommufd_private.h"
+#include "iommufd_test.h"
 
 struct iommufd_object_ops {
 	void (*destroy)(struct iommufd_object *obj);
@@ -234,6 +235,9 @@ union ucmd_buffer {
 	struct iommu_ioas_iova_ranges iova_ranges;
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
+#ifdef CONFIG_IOMMUFD_TEST
+	struct iommu_test_cmd test;
+#endif
 };
 
 struct iommufd_ioctl_op {
@@ -270,6 +274,9 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 val64),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+#ifdef CONFIG_IOMMUFD_TEST
+	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
+#endif
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -370,6 +377,11 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_HW_PAGETABLE] = {
 		.destroy = iommufd_hw_pagetable_destroy,
 	},
+#ifdef CONFIG_IOMMUFD_TEST
+	[IOMMUFD_OBJ_SELFTEST] = {
+		.destroy = iommufd_selftest_destroy,
+	},
+#endif
 };
 
 static struct miscdevice iommu_misc_dev = {
@@ -387,11 +399,13 @@ static int __init iommufd_init(void)
 	ret = misc_register(&iommu_misc_dev);
 	if (ret)
 		return ret;
+	iommufd_test_init();
 	return 0;
 }
 
 static void __exit iommufd_exit(void)
 {
+	iommufd_test_exit();
 	misc_deregister(&iommu_misc_dev);
 }
 
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index d8f3f2a07e8378..c3783ea01d7996 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -56,7 +56,11 @@
 #include "io_pagetable.h"
 #include "double_span.h"
 
+#ifndef CONFIG_IOMMUFD_TEST
 #define TEMP_MEMORY_LIMIT 65536
+#else
+#define TEMP_MEMORY_LIMIT iommufd_test_memory_limit
+#endif
 #define BATCH_BACKUP_SIZE 32
 
 /*
@@ -1671,6 +1675,10 @@ int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
 	bool change_mm = current->mm != pages->source_mm;
 	int rc = 0;
 
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+	    (flags & __IOMMUFD_ACCESS_RW_SLOW_PATH))
+		change_mm = true;
+
 	if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
 		return -EPERM;
 
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
new file mode 100644
index 00000000000000..d31493f4fdf473
--- /dev/null
+++ b/drivers/iommu/iommufd/selftest.c
@@ -0,0 +1,853 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * Kernel side components to support tools/testing/selftests/iommu
+ */
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/xarray.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fault-inject.h>
+#include <uapi/linux/iommufd.h>
+
+#include "io_pagetable.h"
+#include "iommufd_private.h"
+#include "iommufd_test.h"
+
+static DECLARE_FAULT_ATTR(fail_iommufd);
+static struct dentry *dbgfs_root;
+
+size_t iommufd_test_memory_limit = 65536;
+
+enum {
+	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
+
+	/*
+	 * Like a real page table, alignment requires the low bits of the address
+	 * to be zero. xarray also requires the high bit to be zero, so we store
+	 * the pfns shifted. The upper bits are used for metadata.
+	 */
+	MOCK_PFN_MASK = ULONG_MAX / MOCK_IO_PAGE_SIZE,
+
+	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
+	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+};
+
+/*
+ * Syzkaller has trouble randomizing the correct iova to use since it is linked
+ * to the map ioctl's output, and it has no idea about that. So, simplify things.
+ * In syzkaller mode the 64 bit IOVA is converted into an nth area and offset
+ * value. This has a much smaller randomization space and syzkaller can hit it.
+ */
+static unsigned long iommufd_test_syz_conv_iova(struct io_pagetable *iopt,
+						u64 *iova)
+{
+	struct syz_layout {
+		__u32 nth_area;
+		__u32 offset;
+	};
+	struct syz_layout *syz = (void *)iova;
+	unsigned int nth = syz->nth_area;
+	struct iopt_area *area;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		if (nth == 0) {
+			up_read(&iopt->iova_rwsem);
+			return iopt_area_iova(area) + syz->offset;
+		}
+		nth--;
+	}
+	up_read(&iopt->iova_rwsem);
+
+	return 0;
+}
+
+void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
+				   unsigned int ioas_id, u64 *iova, u32 *flags)
+{
+	struct iommufd_ioas *ioas;
+
+	if (!(*flags & MOCK_FLAGS_ACCESS_SYZ))
+		return;
+	*flags &= ~(u32)MOCK_FLAGS_ACCESS_SYZ;
+
+	ioas = iommufd_get_ioas(ucmd, ioas_id);
+	if (IS_ERR(ioas))
+		return;
+	*iova = iommufd_test_syz_conv_iova(&ioas->iopt, iova);
+	iommufd_put_object(&ioas->obj);
+}
+
+struct mock_iommu_domain {
+	struct iommu_domain domain;
+	struct xarray pfns;
+};
+
+enum selftest_obj_type {
+	TYPE_IDEV,
+};
+
+struct selftest_obj {
+	struct iommufd_object obj;
+	enum selftest_obj_type type;
+
+	union {
+		struct {
+			struct iommufd_hw_pagetable *hwpt;
+			struct iommufd_ctx *ictx;
+			struct device mock_dev;
+		} idev;
+	};
+};
+
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+{
+	struct mock_iommu_domain *mock;
+
+	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
+		return NULL;
+
+	mock = kzalloc(sizeof(*mock), GFP_KERNEL);
+	if (!mock)
+		return NULL;
+	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
+	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
+	xa_init(&mock->pfns);
+	return &mock->domain;
+}
+
+static void mock_domain_free(struct iommu_domain *domain)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+
+	WARN_ON(!xa_empty(&mock->pfns));
+	kfree(mock);
+}
+
+static int mock_domain_map_pages(struct iommu_domain *domain,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t pgsize, size_t pgcount, int prot,
+				 gfp_t gfp, size_t *mapped)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = MOCK_PFN_START_IOVA;
+	unsigned long start_iova = iova;
+
+	/*
+	 * xarray does not reliably work with fault injection because it does a
+	 * retry allocation, so put our own failure point.
+	 */
+	if (iommufd_should_fail())
+		return -ENOENT;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			void *old;
+
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				flags = MOCK_PFN_LAST_IOVA;
+			old = xa_store(&mock->pfns, iova / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value((paddr / MOCK_IO_PAGE_SIZE) |
+						   flags),
+				       GFP_KERNEL);
+			if (xa_is_err(old)) {
+				for (; start_iova != iova;
+				     start_iova += MOCK_IO_PAGE_SIZE)
+					xa_erase(&mock->pfns,
+						 start_iova /
+							 MOCK_IO_PAGE_SIZE);
+				return xa_err(old);
+			}
+			WARN_ON(old);
+			iova += MOCK_IO_PAGE_SIZE;
+			paddr += MOCK_IO_PAGE_SIZE;
+			*mapped += MOCK_IO_PAGE_SIZE;
+			flags = 0;
+		}
+	}
+	return 0;
+}
+
+static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
+				      unsigned long iova, size_t pgsize,
+				      size_t pgcount,
+				      struct iommu_iotlb_gather *iotlb_gather)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	bool first = true;
+	size_t ret = 0;
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+			WARN_ON(!ent);
+			/*
+			 * iommufd generates unmaps that must be a strict
+			 * superset of the maps performed, so every starting
+			 * IOVA should have been an IOVA passed to map.
+			 *
+			 * The first IOVA must be present and must have been
+			 * the first IOVA passed to map_pages.
+			 */
+			if (first) {
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_START_IOVA));
+				first = false;
+			}
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_LAST_IOVA));
+
+			iova += MOCK_IO_PAGE_SIZE;
+			ret += MOCK_IO_PAGE_SIZE;
+		}
+	}
+	return ret;
+}
+
+static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
+					    dma_addr_t iova)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	WARN_ON(!ent);
+	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
+}
+
+static const struct iommu_ops mock_ops = {
+	.owner = THIS_MODULE,
+	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
+	.domain_alloc = mock_domain_alloc,
+	.default_domain_ops =
+		&(struct iommu_domain_ops){
+			.free = mock_domain_free,
+			.map_pages = mock_domain_map_pages,
+			.unmap_pages = mock_domain_unmap_pages,
+			.iova_to_phys = mock_domain_iova_to_phys,
+		},
+};
+
+static inline struct iommufd_hw_pagetable *
+get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id,
+		 struct mock_iommu_domain **mock)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, mockpt_id,
+				 IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+	hwpt = container_of(obj, struct iommufd_hw_pagetable, obj);
+	if (hwpt->domain->ops != mock_ops.default_domain_ops) {
+		iommufd_put_object(&hwpt->obj);
+		return ERR_PTR(-EINVAL);
+	}
+	*mock = container_of(hwpt->domain, struct mock_iommu_domain, domain);
+	return hwpt;
+}
+
+/* Create an hw_pagetable with the mock domain so we can test the domain ops */
+static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
+				    struct iommu_test_cmd *cmd)
+{
+	static struct bus_type mock_bus = { .iommu_ops = &mock_ops };
+	struct iommufd_hw_pagetable *hwpt;
+	struct selftest_obj *sobj;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj)) {
+		rc = PTR_ERR(sobj);
+		goto out_ioas;
+	}
+	sobj->idev.ictx = ucmd->ictx;
+	sobj->type = TYPE_IDEV;
+	sobj->idev.mock_dev.bus = &mock_bus;
+
+	hwpt = iommufd_device_selftest_attach(ucmd->ictx, ioas,
+					      &sobj->idev.mock_dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_sobj;
+	}
+	sobj->idev.hwpt = hwpt;
+
+	/* Userspace must destroy both of these IDs to destroy the object */
+	cmd->mock_domain.out_hwpt_id = hwpt->obj.id;
+	cmd->mock_domain.out_device_id = sobj->obj.id;
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	iommufd_put_object(&ioas->obj);
+	return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_sobj:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+out_ioas:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Add an additional reserved IOVA to the IOAS */
+static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd,
+				     unsigned int mockpt_id,
+				     unsigned long start, size_t length)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, mockpt_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	down_write(&ioas->iopt.iova_rwsem);
+	rc = iopt_reserve_iova(&ioas->iopt, start, start + length - 1, NULL);
+	up_write(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Check that every pfn under each iova matches the pfn under a user VA */
+static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd,
+				    unsigned int mockpt_id, unsigned long iova,
+				    size_t length, void __user *uptr)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc;
+
+	if (iova % MOCK_IO_PAGE_SIZE || length % MOCK_IO_PAGE_SIZE ||
+	    (uintptr_t)uptr % MOCK_IO_PAGE_SIZE)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	for (; length; length -= MOCK_IO_PAGE_SIZE) {
+		struct page *pages[1];
+		unsigned long pfn;
+		long npages;
+		void *ent;
+
+		npages = get_user_pages_fast((uintptr_t)uptr & PAGE_MASK, 1, 0,
+					     pages);
+		if (npages < 0) {
+			rc = npages;
+			goto out_put;
+		}
+		if (WARN_ON(npages != 1)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		pfn = page_to_pfn(pages[0]);
+		put_page(pages[0]);
+
+		ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+		if (!ent ||
+		    (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE !=
+			    pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {
+			rc = -EINVAL;
+			goto out_put;
+		}
+		iova += MOCK_IO_PAGE_SIZE;
+		uptr += MOCK_IO_PAGE_SIZE;
+	}
+	rc = 0;
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+/* Check that the page ref count matches, to look for missing pin/unpins */
+static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
+				      void __user *uptr, size_t length,
+				      unsigned int refs)
+{
+	if (length % PAGE_SIZE || (uintptr_t)uptr % PAGE_SIZE)
+		return -EINVAL;
+
+	for (; length; length -= PAGE_SIZE) {
+		struct page *pages[1];
+		long npages;
+
+		npages = get_user_pages_fast((uintptr_t)uptr, 1, 0, pages);
+		if (npages < 0)
+			return npages;
+		if (WARN_ON(npages != 1))
+			return -EFAULT;
+		if (!PageCompound(pages[0])) {
+			unsigned int count;
+
+			count = page_ref_count(pages[0]);
+			if (count / GUP_PIN_COUNTING_BIAS != refs) {
+				put_page(pages[0]);
+				return -EIO;
+			}
+		}
+		put_page(pages[0]);
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+struct selftest_access {
+	struct iommufd_access *access;
+	struct file *file;
+	struct mutex lock;
+	struct list_head items;
+	unsigned int next_id;
+	bool destroying;
+};
+
+struct selftest_access_item {
+	struct list_head items_elm;
+	unsigned long iova;
+	size_t length;
+	unsigned int id;
+};
+
+static const struct file_operations iommfd_test_staccess_fops;
+
+static struct selftest_access *iommufd_access_get(int fd)
+{
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADFD);
+
+	if (file->f_op != &iommfd_test_staccess_fops) {
+		fput(file);
+		return ERR_PTR(-EBADFD);
+	}
+	return file->private_data;
+}
+
+static void iommufd_test_access_unmap(void *data, unsigned long iova,
+				      unsigned long length)
+{
+	unsigned long iova_last = iova + length - 1;
+	struct selftest_access *staccess = data;
+	struct selftest_access_item *item;
+	struct selftest_access_item *tmp;
+
+	mutex_lock(&staccess->lock);
+	list_for_each_entry_safe(item, tmp, &staccess->items, items_elm) {
+		if (iova > item->iova + item->length - 1 ||
+		    iova_last < item->iova)
+			continue;
+		list_del(&item->items_elm);
+		iommufd_access_unpin_pages(staccess->access, item->iova,
+					   item->length);
+		kfree(item);
+	}
+	mutex_unlock(&staccess->lock);
+}
+
+static int iommufd_test_access_item_destroy(struct iommufd_ucmd *ucmd,
+					    unsigned int access_id,
+					    unsigned int item_id)
+{
+	struct selftest_access_item *item;
+	struct selftest_access *staccess;
+
+	staccess = iommufd_access_get(access_id);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	mutex_lock(&staccess->lock);
+	list_for_each_entry(item, &staccess->items, items_elm) {
+		if (item->id == item_id) {
+			list_del(&item->items_elm);
+			iommufd_access_unpin_pages(staccess->access, item->iova,
+						   item->length);
+			mutex_unlock(&staccess->lock);
+			kfree(item);
+			fput(staccess->file);
+			return 0;
+		}
+	}
+	mutex_unlock(&staccess->lock);
+	fput(staccess->file);
+	return -ENOENT;
+}
+
+static int iommufd_test_staccess_release(struct inode *inode,
+					 struct file *filep)
+{
+	struct selftest_access *staccess = filep->private_data;
+
+	if (staccess->access) {
+		iommufd_test_access_unmap(staccess, 0, ULONG_MAX);
+		iommufd_access_destroy(staccess->access);
+	}
+	mutex_destroy(&staccess->lock);
+	kfree(staccess);
+	return 0;
+}
+
+static const struct iommufd_access_ops selftest_access_ops_pin = {
+	.needs_pin_pages = 1,
+	.unmap = iommufd_test_access_unmap,
+};
+
+static const struct iommufd_access_ops selftest_access_ops = {
+	.unmap = iommufd_test_access_unmap,
+};
+
+static const struct file_operations iommfd_test_staccess_fops = {
+	.release = iommufd_test_staccess_release,
+};
+
+static struct selftest_access *iommufd_test_alloc_access(void)
+{
+	struct selftest_access *staccess;
+	struct file *filep;
+
+	staccess = kzalloc(sizeof(*staccess), GFP_KERNEL_ACCOUNT);
+	if (!staccess)
+		return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&staccess->items);
+	mutex_init(&staccess->lock);
+
+	filep = anon_inode_getfile("[iommufd_test_staccess]",
+				   &iommfd_test_staccess_fops, staccess,
+				   O_RDWR);
+	if (IS_ERR(filep)) {
+		kfree(staccess);
+		return ERR_CAST(filep);
+	}
+	staccess->file = filep;
+	return staccess;
+}
+
+static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
+				      unsigned int ioas_id, unsigned int flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_access *staccess;
+	struct iommufd_access *access;
+	int fdno;
+	int rc;
+
+	if (flags & ~MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES)
+		return -EOPNOTSUPP;
+
+	staccess = iommufd_test_alloc_access();
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	fdno = get_unused_fd_flags(O_CLOEXEC);
+	if (fdno < 0) {
+		rc = -ENOMEM;
+		goto out_free_staccess;
+	}
+
+	access = iommufd_access_create(
+		ucmd->ictx, ioas_id,
+		(flags & MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES) ?
+			&selftest_access_ops_pin :
+			&selftest_access_ops,
+		staccess);
+	if (IS_ERR(access)) {
+		rc = PTR_ERR(access);
+		goto out_put_fdno;
+	}
+	cmd->create_access.out_access_fd = fdno;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_destroy;
+
+	staccess->access = access;
+	fd_install(fdno, staccess->file);
+	return 0;
+
+out_destroy:
+	iommufd_access_destroy(access);
+out_put_fdno:
+	put_unused_fd(fdno);
+out_free_staccess:
+	fput(staccess->file);
+	return rc;
+}
+
+/* Check that the pages in a page array match the pages in the user VA */
+static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
+				    size_t npages)
+{
+	for (; npages; npages--) {
+		struct page *tmp_pages[1];
+		long rc;
+
+		rc = get_user_pages_fast((uintptr_t)uptr, 1, 0, tmp_pages);
+		if (rc < 0)
+			return rc;
+		if (WARN_ON(rc != 1))
+			return -EFAULT;
+		put_page(tmp_pages[0]);
+		if (tmp_pages[0] != *pages)
+			return -EBADE;
+		pages++;
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
+				     unsigned int access_id, unsigned long iova,
+				     size_t length, void __user *uptr,
+				     u32 flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_access_item *item;
+	struct selftest_access *staccess;
+	struct page **pages;
+	size_t npages;
+	int rc;
+
+	/* Prevent syzkaller from triggering a WARN_ON in kvzalloc() */
+	if (length > 16*1024*1024)
+		return -ENOMEM;
+
+	if (flags & ~(MOCK_FLAGS_ACCESS_WRITE | MOCK_FLAGS_ACCESS_SYZ))
+		return -EOPNOTSUPP;
+
+	staccess = iommufd_access_get(access_id);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	if (staccess->access->ops != &selftest_access_ops_pin) {
+		rc = -EOPNOTSUPP;
+		goto out_put;
+	}
+
+	if (flags & MOCK_FLAGS_ACCESS_SYZ)
+		iova = iommufd_test_syz_conv_iova(&staccess->access->ioas->iopt,
+					&cmd->access_pages.iova);
+
+	npages = (ALIGN(iova + length, PAGE_SIZE) -
+		  ALIGN_DOWN(iova, PAGE_SIZE)) /
+		 PAGE_SIZE;
+	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL_ACCOUNT);
+	if (!pages) {
+		rc = -ENOMEM;
+		goto out_put;
+	}
+
+	/*
+	 * Drivers will need to think very carefully about this locking. The
+	 * core code can do multiple unmaps instantaneously after
+	 * iommufd_access_pin_pages() and *all* the unmaps must not return until
+	 * the range is unpinned. This simple implementation puts a global lock
+	 * around the pin, which may not suit drivers that want this to be a
+	 * performance path. Drivers that get this wrong will trigger WARN_ON
+	 * races and cause EDEADLOCK failures to userspace.
+	 */
+	mutex_lock(&staccess->lock);
+	rc = iommufd_access_pin_pages(staccess->access, iova, length, pages,
+				      flags & MOCK_FLAGS_ACCESS_WRITE);
+	if (rc)
+		goto out_unlock;
+
+	/* For syzkaller allow uptr to be NULL to skip this check */
+	if (uptr) {
+		rc = iommufd_test_check_pages(
+			uptr - (iova - ALIGN_DOWN(iova, PAGE_SIZE)), pages,
+			npages);
+		if (rc)
+			goto out_unaccess;
+	}
+
+	item = kzalloc(sizeof(*item), GFP_KERNEL_ACCOUNT);
+	if (!item) {
+		rc = -ENOMEM;
+		goto out_unaccess;
+	}
+
+	item->iova = iova;
+	item->length = length;
+	item->id = staccess->next_id++;
+	list_add_tail(&item->items_elm, &staccess->items);
+
+	cmd->access_pages.out_access_pages_id = item->id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_free_item;
+	goto out_unlock;
+
+out_free_item:
+	list_del(&item->items_elm);
+	kfree(item);
+out_unaccess:
+	iommufd_access_unpin_pages(staccess->access, iova, length);
+out_unlock:
+	mutex_unlock(&staccess->lock);
+	kvfree(pages);
+out_put:
+	fput(staccess->file);
+	return rc;
+}
+
+static int iommufd_test_access_rw(struct iommufd_ucmd *ucmd,
+				  unsigned int access_id, unsigned long iova,
+				  size_t length, void __user *ubuf,
+				  unsigned int flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_access *staccess;
+	void *tmp;
+	int rc;
+
+	/* Prevent syzkaller from triggering a WARN_ON in kvzalloc() */
+	if (length > 16*1024*1024)
+		return -ENOMEM;
+
+	if (flags & ~(MOCK_ACCESS_RW_WRITE | MOCK_ACCESS_RW_SLOW_PATH |
+		      MOCK_FLAGS_ACCESS_SYZ))
+		return -EOPNOTSUPP;
+
+	staccess = iommufd_access_get(access_id);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	tmp = kvzalloc(length, GFP_KERNEL_ACCOUNT);
+	if (!tmp) {
+		rc = -ENOMEM;
+		goto out_put;
+	}
+
+	if (flags & MOCK_ACCESS_RW_WRITE) {
+		if (copy_from_user(tmp, ubuf, length)) {
+			rc = -EFAULT;
+			goto out_free;
+		}
+	}
+
+	if (flags & MOCK_FLAGS_ACCESS_SYZ)
+		iova = iommufd_test_syz_conv_iova(&staccess->access->ioas->iopt,
+					&cmd->access_rw.iova);
+
+	rc = iommufd_access_rw(staccess->access, iova, tmp, length, flags);
+	if (rc)
+		goto out_free;
+	if (!(flags & MOCK_ACCESS_RW_WRITE)) {
+		if (copy_to_user(ubuf, tmp, length)) {
+			rc = -EFAULT;
+			goto out_free;
+		}
+	}
+
+out_free:
+	kvfree(tmp);
+out_put:
+	fput(staccess->file);
+	return rc;
+}
+static_assert((unsigned int)MOCK_ACCESS_RW_WRITE == IOMMUFD_ACCESS_RW_WRITE);
+static_assert((unsigned int)MOCK_ACCESS_RW_SLOW_PATH ==
+	      __IOMMUFD_ACCESS_RW_SLOW_PATH);
+
+void iommufd_selftest_destroy(struct iommufd_object *obj)
+{
+	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
+
+	switch (sobj->type) {
+	case TYPE_IDEV:
+		iommufd_device_selftest_detach(sobj->idev.ictx,
+					       sobj->idev.hwpt);
+		break;
+	}
+}
+
+int iommufd_test(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+
+	switch (cmd->op) {
+	case IOMMU_TEST_OP_ADD_RESERVED:
+		return iommufd_test_add_reserved(ucmd, cmd->id,
+						 cmd->add_reserved.start,
+						 cmd->add_reserved.length);
+	case IOMMU_TEST_OP_MOCK_DOMAIN:
+		return iommufd_test_mock_domain(ucmd, cmd);
+	case IOMMU_TEST_OP_MD_CHECK_MAP:
+		return iommufd_test_md_check_pa(
+			ucmd, cmd->id, cmd->check_map.iova,
+			cmd->check_map.length,
+			u64_to_user_ptr(cmd->check_map.uptr));
+	case IOMMU_TEST_OP_MD_CHECK_REFS:
+		return iommufd_test_md_check_refs(
+			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
+			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_CREATE_ACCESS:
+		return iommufd_test_create_access(ucmd, cmd->id,
+						  cmd->create_access.flags);
+	case IOMMU_TEST_OP_ACCESS_PAGES:
+		return iommufd_test_access_pages(
+			ucmd, cmd->id, cmd->access_pages.iova,
+			cmd->access_pages.length,
+			u64_to_user_ptr(cmd->access_pages.uptr),
+			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_ACCESS_RW:
+		return iommufd_test_access_rw(
+			ucmd, cmd->id, cmd->access_rw.iova,
+			cmd->access_rw.length,
+			u64_to_user_ptr(cmd->access_rw.uptr),
+			cmd->access_rw.flags);
+	case IOMMU_TEST_OP_DESTROY_ACCESS_PAGES:
+		return iommufd_test_access_item_destroy(
+			ucmd, cmd->id, cmd->destroy_access_pages.access_pages_id);
+	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
+		/* Protect _batch_init(), cannot be less than elmsz */
+		if (cmd->memory_limit.limit <
+		    sizeof(unsigned long) + sizeof(u32))
+			return -EINVAL;
+		iommufd_test_memory_limit = cmd->memory_limit.limit;
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+bool iommufd_should_fail(void)
+{
+	return should_fail(&fail_iommufd, 1);
+}
+
+void __init iommufd_test_init(void)
+{
+	dbgfs_root =
+		fault_create_debugfs_attr("fail_iommufd", NULL, &fail_iommufd);
+}
+
+void iommufd_test_exit(void)
+{
+	debugfs_remove_recursive(dbgfs_root);
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index d57b044acca4cb..7a5d64a1dae482 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -38,6 +38,9 @@ enum {
 	IOMMUFD_ACCESS_RW_WRITE = 1 << 0,
 	/* Set if the caller is in a kthread then rw will use kthread_use_mm() */
 	IOMMUFD_ACCESS_RW_KTHREAD = 1 << 1,
+
+	/* Only for use by selftest */
+	__IOMMUFD_ACCESS_RW_SLOW_PATH = 1 << 2,
 };
 
 struct iommufd_access *
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index f07aef7c592c2e..d6680af7b2956e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -27,6 +27,7 @@ TARGETS += ftrace
 TARGETS += futex
 TARGETS += gpio
 TARGETS += intel_pstate
+TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
 TARGETS += kcmp
diff --git a/tools/testing/selftests/iommu/.gitignore b/tools/testing/selftests/iommu/.gitignore
new file mode 100644
index 00000000000000..7d0703049ebaf4
--- /dev/null
+++ b/tools/testing/selftests/iommu/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/iommufd
+/iommufd_fail_nth
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
new file mode 100644
index 00000000000000..7cb74d26f14171
--- /dev/null
+++ b/tools/testing/selftests/iommu/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+CFLAGS += -D_GNU_SOURCE
+
+TEST_GEN_PROGS :=
+TEST_GEN_PROGS += iommufd
+TEST_GEN_PROGS += iommufd_fail_nth
+
+include ../lib.mk
diff --git a/tools/testing/selftests/iommu/config b/tools/testing/selftests/iommu/config
new file mode 100644
index 00000000000000..6c4f901d6fed3c
--- /dev/null
+++ b/tools/testing/selftests/iommu/config
@@ -0,0 +1,2 @@
+CONFIG_IOMMUFD
+CONFIG_IOMMUFD_TEST
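
[
With CONFIG_IOMMUFD and CONFIG_IOMMUFD_TEST enabled as per the fragment
above, the tests are built and run with the standard kselftest invocation,
for example:

	make -C tools/testing/selftests TARGETS=iommu run_tests
]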
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
new file mode 100644
index 00000000000000..4ff3592b282b70
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -0,0 +1,1627 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/eventfd.h>
+
+#define __EXPORTED_HEADERS__
+#include <linux/vfio.h>
+
+#include "iommufd_utils.h"
+
+static void *buffer;
+
+static unsigned long PAGE_SIZE;
+static unsigned long HUGEPAGE_SIZE;
+
+#define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
+
+static unsigned long get_huge_page_size(void)
+{
+	char buf[80];
+	int ret;
+	int fd;
+
+	fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
+		  O_RDONLY);
+	if (fd < 0)
+		return 2 * 1024 * 1024;
+
+	ret = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0 || ret == sizeof(buf))
+		return 2 * 1024 * 1024;
+	buf[ret] = 0;
+	return strtoul(buf, NULL, 10);
+}
+
+static __attribute__((constructor)) void setup_sizes(void)
+{
+	void *vrc;
+	int rc;
+
+	PAGE_SIZE = sysconf(_SC_PAGE_SIZE);
+	HUGEPAGE_SIZE = get_huge_page_size();
+
+	BUFFER_SIZE = PAGE_SIZE * 16;
+	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
+	assert(!rc);
+	assert(buffer);
+	assert((uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+	vrc = mmap(buffer, BUFFER_SIZE, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
+	assert(vrc == buffer);
+}
+
+FIXTURE(iommufd)
+{
+	int fd;
+};
+
+FIXTURE_SETUP(iommufd)
+{
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(iommufd)
+{
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd, simple_close)
+{
+}
+
+TEST_F(iommufd, cmd_fail)
+{
+	struct iommu_destroy cmd = { .size = sizeof(cmd), .id = 0 };
+
+	/* object id is invalid */
+	EXPECT_ERRNO(ENOENT, _test_ioctl_destroy(self->fd, 0));
+	/* Bad pointer */
+	EXPECT_ERRNO(EFAULT, ioctl(self->fd, IOMMU_DESTROY, NULL));
+	/* Unknown ioctl */
+	EXPECT_ERRNO(ENOTTY,
+		     ioctl(self->fd, _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE - 1),
+			   &cmd));
+}
+
+TEST_F(iommufd, cmd_length)
+{
+#define TEST_LENGTH(_struct, _ioctl)                                     \
+	{                                                                \
+		struct {                                                 \
+			struct _struct cmd;                              \
+			uint8_t extra;                                   \
+		} cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
+			  .extra = UINT8_MAX };                          \
+		int old_errno;                                           \
+		int rc;                                                  \
+									 \
+		EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, _ioctl, &cmd)); \
+		cmd.cmd.size = sizeof(struct _struct) + 1;               \
+		EXPECT_ERRNO(E2BIG, ioctl(self->fd, _ioctl, &cmd));      \
+		cmd.cmd.size = sizeof(struct _struct);                   \
+		rc = ioctl(self->fd, _ioctl, &cmd);                      \
+		old_errno = errno;                                       \
+		cmd.cmd.size = sizeof(struct _struct) + 1;               \
+		cmd.extra = 0;                                           \
+		if (rc) {                                                \
+			EXPECT_ERRNO(old_errno,                          \
+				     ioctl(self->fd, _ioctl, &cmd));     \
+		} else {                                                 \
+			ASSERT_EQ(0, ioctl(self->fd, _ioctl, &cmd));     \
+		}                                                        \
+	}
+
+	TEST_LENGTH(iommu_destroy, IOMMU_DESTROY);
+	TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC);
+	TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES);
+	TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS);
+	TEST_LENGTH(iommu_ioas_map, IOMMU_IOAS_MAP);
+	TEST_LENGTH(iommu_ioas_copy, IOMMU_IOAS_COPY);
+	TEST_LENGTH(iommu_ioas_unmap, IOMMU_IOAS_UNMAP);
+	TEST_LENGTH(iommu_option, IOMMU_OPTION);
+	TEST_LENGTH(iommu_vfio_ioas, IOMMU_VFIO_IOAS);
+#undef TEST_LENGTH
+}
+
+TEST_F(iommufd, cmd_ex_fail)
+{
+	struct {
+		struct iommu_destroy cmd;
+		__u64 future;
+	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
+
+	/* object id is invalid and command is longer */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* future area is non-zero */
+	cmd.future = 1;
+	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Original command "works" */
+	cmd.cmd.size = sizeof(cmd.cmd);
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Short command fails */
+	cmd.cmd.size = sizeof(cmd.cmd) - 1;
+	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+}
+
+TEST_F(iommufd, global_options)
+{
+	struct iommu_option cmd = {
+		.size = sizeof(cmd),
+		.option_id = IOMMU_OPTION_RLIMIT_MODE,
+		.op = IOMMU_OPTION_OP_GET,
+		.val64 = 1,
+	};
+
+	cmd.option_id = IOMMU_OPTION_RLIMIT_MODE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	ASSERT_EQ(0, cmd.val64);
+
+	/* This requires root */
+	cmd.op = IOMMU_OPTION_OP_SET;
+	cmd.val64 = 1;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	cmd.val64 = 2;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	cmd.op = IOMMU_OPTION_OP_GET;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	ASSERT_EQ(1, cmd.val64);
+
+	cmd.op = IOMMU_OPTION_OP_SET;
+	cmd.val64 = 0;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	cmd.op = IOMMU_OPTION_OP_GET;
+	cmd.option_id = IOMMU_OPTION_HUGE_PAGES;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	cmd.op = IOMMU_OPTION_OP_SET;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_OPTION, &cmd));
+}
+
+FIXTURE(iommufd_ioas)
+{
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint64_t base_iova;
+};
+
+FIXTURE_VARIANT(iommufd_ioas)
+{
+	unsigned int mock_domains;
+	unsigned int memory_limit;
+};
+
+FIXTURE_SETUP(iommufd_ioas)
+{
+	unsigned int i;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	if (!variant->memory_limit) {
+		test_ioctl_set_default_memory_limit();
+	} else {
+		test_ioctl_set_temp_memory_limit(variant->memory_limit);
+	}
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		test_cmd_mock_domain(self->ioas_id, NULL, &self->domain_id);
+		self->base_iova = MOCK_APERTURE_START;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_ioas)
+{
+	test_ioctl_set_default_memory_limit();
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, no_domain)
+{
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain)
+{
+	.mock_domains = 1,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain)
+{
+	.mock_domains = 2,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_limit)
+{
+	.mock_domains = 1,
+	.memory_limit = 16,
+};
+
+TEST_F(iommufd_ioas, ioas_auto_destroy)
+{
+}
+
+TEST_F(iommufd_ioas, ioas_destroy)
+{
+	if (self->domain_id) {
+		/* IOAS cannot be freed while a domain is on it */
+		EXPECT_ERRNO(EBUSY,
+			     _test_ioctl_destroy(self->fd, self->ioas_id));
+	} else {
+		/* Can allocate and manually free an IOAS table */
+		test_ioctl_destroy(self->ioas_id);
+	}
+}
+
+TEST_F(iommufd_ioas, ioas_area_destroy)
+{
+	/* Adding an area does not change ability to destroy */
+	test_ioctl_ioas_map_fixed(buffer, PAGE_SIZE, self->base_iova);
+	if (self->domain_id)
+		EXPECT_ERRNO(EBUSY,
+			     _test_ioctl_destroy(self->fd, self->ioas_id));
+	else
+		test_ioctl_destroy(self->ioas_id);
+}
+
+TEST_F(iommufd_ioas, ioas_area_auto_destroy)
+{
+	int i;
+
+	/* Can allocate and automatically free an IOAS table with many areas */
+	for (i = 0; i != 10; i++) {
+		test_ioctl_ioas_map_fixed(buffer, PAGE_SIZE,
+					  self->base_iova + i * PAGE_SIZE);
+	}
+}
+
+TEST_F(iommufd_ioas, area)
+{
+	int i;
+
+	/* Unmap fails if nothing is mapped */
+	for (i = 0; i != 10; i++)
+		test_err_ioctl_ioas_unmap(ENOENT, i * PAGE_SIZE, PAGE_SIZE);
+
+	/* Unmap works */
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_map_fixed(buffer, PAGE_SIZE,
+					  self->base_iova + i * PAGE_SIZE);
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_unmap(self->base_iova + i * PAGE_SIZE,
+				      PAGE_SIZE);
+
+	/* Split fails */
+	test_ioctl_ioas_map_fixed(buffer, PAGE_SIZE * 2,
+				  self->base_iova + 16 * PAGE_SIZE);
+	test_err_ioctl_ioas_unmap(ENOENT, self->base_iova + 16 * PAGE_SIZE,
+				  PAGE_SIZE);
+	test_err_ioctl_ioas_unmap(ENOENT, self->base_iova + 17 * PAGE_SIZE,
+				  PAGE_SIZE);
+
+	/* Over map fails */
+	test_err_ioctl_ioas_map_fixed(EADDRINUSE, buffer, PAGE_SIZE * 2,
+				      self->base_iova + 16 * PAGE_SIZE);
+	test_err_ioctl_ioas_map_fixed(EADDRINUSE, buffer, PAGE_SIZE,
+				      self->base_iova + 16 * PAGE_SIZE);
+	test_err_ioctl_ioas_map_fixed(EADDRINUSE, buffer, PAGE_SIZE,
+				      self->base_iova + 17 * PAGE_SIZE);
+	test_err_ioctl_ioas_map_fixed(EADDRINUSE, buffer, PAGE_SIZE * 2,
+				      self->base_iova + 15 * PAGE_SIZE);
+	test_err_ioctl_ioas_map_fixed(EADDRINUSE, buffer, PAGE_SIZE * 3,
+				      self->base_iova + 15 * PAGE_SIZE);
+
+	/* unmap all works */
+	test_ioctl_ioas_unmap(0, UINT64_MAX);
+
+	/* Unmap all succeeds on an empty IOAS */
+	test_ioctl_ioas_unmap(0, UINT64_MAX);
+}
+
+TEST_F(iommufd_ioas, unmap_fully_contained_areas)
+{
+	uint64_t unmap_len;
+	int i;
+
+	/* Give no_domain some space to rewind base_iova */
+	self->base_iova += 4 * PAGE_SIZE;
+
+	for (i = 0; i != 4; i++)
+		test_ioctl_ioas_map_fixed(buffer, 8 * PAGE_SIZE,
+					  self->base_iova + i * 16 * PAGE_SIZE);
+
+	/* Unmapping an area that is not fully contained doesn't work */
+	test_err_ioctl_ioas_unmap(ENOENT, self->base_iova - 4 * PAGE_SIZE,
+				  8 * PAGE_SIZE);
+	test_err_ioctl_ioas_unmap(ENOENT,
+				  self->base_iova + 3 * 16 * PAGE_SIZE +
+					  8 * PAGE_SIZE - 4 * PAGE_SIZE,
+				  8 * PAGE_SIZE);
+
+	/* Unmap fully contained areas works */
+	ASSERT_EQ(0, _test_ioctl_ioas_unmap(self->fd, self->ioas_id,
+					    self->base_iova - 4 * PAGE_SIZE,
+					    3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE +
+						    4 * PAGE_SIZE,
+					    &unmap_len));
+	ASSERT_EQ(32 * PAGE_SIZE, unmap_len);
+}
+
+TEST_F(iommufd_ioas, area_auto_iova)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_iova_range ranges[1] = {};
+	struct iommu_ioas_allow_iovas allow_cmd = {
+		.size = sizeof(allow_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = 1,
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+	__u64 iovas[10];
+	int i;
+
+	/* Simple 4k pages */
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_map(buffer, PAGE_SIZE, &iovas[i]);
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_unmap(iovas[i], PAGE_SIZE);
+
+	/* Kernel automatically aligns IOVAs properly */
+	for (i = 0; i != 10; i++) {
+		size_t length = PAGE_SIZE * (i + 1);
+
+		if (self->domain_id) {
+			test_ioctl_ioas_map(buffer, length, &iovas[i]);
+		} else {
+			test_ioctl_ioas_map((void *)(1UL << 31), length,
+					    &iovas[i]);
+		}
+		EXPECT_EQ(0, iovas[i] % (1UL << (ffs(length) - 1)));
+	}
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_unmap(iovas[i], PAGE_SIZE * (i + 1));
+
+	/* Avoids a reserved region */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	for (i = 0; i != 10; i++) {
+		size_t length = PAGE_SIZE * (i + 1);
+
+		test_ioctl_ioas_map(buffer, length, &iovas[i]);
+		EXPECT_EQ(0, iovas[i] % (1UL << (ffs(length) - 1)));
+		EXPECT_EQ(false,
+			  iovas[i] > test_cmd.add_reserved.start &&
+				  iovas[i] <
+					  test_cmd.add_reserved.start +
+						  test_cmd.add_reserved.length);
+	}
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_unmap(iovas[i], PAGE_SIZE * (i + 1));
+
+	/* Allowed region intersects with a reserved region */
+	ranges[0].start = PAGE_SIZE;
+	ranges[0].last = PAGE_SIZE * 600;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+
+	/* Allocate from an allowed region */
+	if (self->domain_id) {
+		ranges[0].start = MOCK_APERTURE_START + PAGE_SIZE;
+		ranges[0].last = MOCK_APERTURE_START + PAGE_SIZE * 600 - 1;
+	} else {
+		ranges[0].start = PAGE_SIZE * 200;
+		ranges[0].last = PAGE_SIZE * 600 - 1;
+	}
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+	for (i = 0; i != 10; i++) {
+		size_t length = PAGE_SIZE * (i + 1);
+
+		test_ioctl_ioas_map(buffer, length, &iovas[i]);
+		EXPECT_EQ(0, iovas[i] % (1UL << (ffs(length) - 1)));
+		EXPECT_EQ(true, iovas[i] >= ranges[0].start);
+		EXPECT_EQ(true, iovas[i] <= ranges[0].last);
+		EXPECT_EQ(true, iovas[i] + length > ranges[0].start);
+		EXPECT_EQ(true, iovas[i] + length <= ranges[0].last + 1);
+	}
+	for (i = 0; i != 10; i++)
+		test_ioctl_ioas_unmap(iovas[i], PAGE_SIZE * (i + 1));
+}
+
+TEST_F(iommufd_ioas, area_allowed)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_iova_range ranges[1] = {};
+	struct iommu_ioas_allow_iovas allow_cmd = {
+		.size = sizeof(allow_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = 1,
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+
+	/* Reserved intersects an allowed */
+	allow_cmd.num_iovas = 1;
+	ranges[0].start = self->base_iova;
+	ranges[0].last = ranges[0].start + PAGE_SIZE * 600;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+	test_cmd.add_reserved.start = ranges[0].start + PAGE_SIZE;
+	test_cmd.add_reserved.length = PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			   &test_cmd));
+	allow_cmd.num_iovas = 0;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+
+	/* Allowed intersects a reserved */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	allow_cmd.num_iovas = 1;
+	ranges[0].start = self->base_iova;
+	ranges[0].last = ranges[0].start + PAGE_SIZE * 600;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+}
+
+TEST_F(iommufd_ioas, copy_area)
+{
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+
+	test_ioctl_ioas_map_fixed(buffer, PAGE_SIZE, self->base_iova);
+
+	/* Copy inside a single IOAS */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = self->base_iova + PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+
+	/* Copy between IOAS's */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = 0;
+	test_ioctl_ioas_alloc(&copy_cmd.dst_ioas_id);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+}
+
+TEST_F(iommufd_ioas, iova_ranges)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE },
+	};
+	struct iommu_iova_range *ranges = buffer;
+	struct iommu_ioas_iova_ranges ranges_cmd = {
+		.size = sizeof(ranges_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = BUFFER_SIZE / sizeof(*ranges),
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+
+	/* Range can be read */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, &ranges_cmd));
+	EXPECT_EQ(1, ranges_cmd.num_iovas);
+	if (!self->domain_id) {
+		EXPECT_EQ(0, ranges[0].start);
+		EXPECT_EQ(SIZE_MAX, ranges[0].last);
+		EXPECT_EQ(1, ranges_cmd.out_iova_alignment);
+	} else {
+		EXPECT_EQ(MOCK_APERTURE_START, ranges[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, ranges[0].last);
+		EXPECT_EQ(MOCK_PAGE_SIZE, ranges_cmd.out_iova_alignment);
+	}
+
+	/* Buffer too small */
+	memset(ranges, 0, BUFFER_SIZE);
+	ranges_cmd.num_iovas = 0;
+	EXPECT_ERRNO(EMSGSIZE,
+		     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, &ranges_cmd));
+	EXPECT_EQ(1, ranges_cmd.num_iovas);
+	EXPECT_EQ(0, ranges[0].start);
+	EXPECT_EQ(0, ranges[0].last);
+
+	/* 2 ranges */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	ranges_cmd.num_iovas = BUFFER_SIZE / sizeof(*ranges);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, &ranges_cmd));
+	if (!self->domain_id) {
+		EXPECT_EQ(2, ranges_cmd.num_iovas);
+		EXPECT_EQ(0, ranges[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, ranges[0].last);
+		EXPECT_EQ(PAGE_SIZE * 2, ranges[1].start);
+		EXPECT_EQ(SIZE_MAX, ranges[1].last);
+	} else {
+		EXPECT_EQ(1, ranges_cmd.num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, ranges[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, ranges[0].last);
+	}
+
+	/* Buffer too small */
+	memset(ranges, 0, BUFFER_SIZE);
+	ranges_cmd.num_iovas = 1;
+	if (!self->domain_id) {
+		EXPECT_ERRNO(EMSGSIZE, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES,
+					     &ranges_cmd));
+		EXPECT_EQ(2, ranges_cmd.num_iovas);
+		EXPECT_EQ(0, ranges[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, ranges[0].last);
+	} else {
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, &ranges_cmd));
+		EXPECT_EQ(1, ranges_cmd.num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, ranges[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, ranges[0].last);
+	}
+	EXPECT_EQ(0, ranges[1].start);
+	EXPECT_EQ(0, ranges[1].last);
+}
+
+TEST_F(iommufd_ioas, access_pin)
+{
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_test_cmd check_map_cmd = {
+		.size = sizeof(check_map_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_MAP,
+		.check_map = { .iova = MOCK_APERTURE_START,
+			       .length = BUFFER_SIZE,
+			       .uptr = (uintptr_t)buffer },
+	};
+	uint32_t access_pages_id;
+	unsigned int npages;
+
+	test_cmd_create_access(self->ioas_id, &access_cmd.id,
+			       MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
+
+	for (npages = 1; npages < BUFFER_SIZE / PAGE_SIZE; npages++) {
+		uint32_t mock_device_id;
+		uint32_t mock_hwpt_id;
+
+		access_cmd.access_pages.length = npages * PAGE_SIZE;
+
+		/* Single map/unmap */
+		test_ioctl_ioas_map_fixed(buffer, BUFFER_SIZE,
+					  MOCK_APERTURE_START);
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+				   &access_cmd));
+		test_cmd_destroy_access_pages(
+			access_cmd.id,
+			access_cmd.access_pages.out_access_pages_id);
+
+		/* Double user */
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+				   &access_cmd));
+		access_pages_id = access_cmd.access_pages.out_access_pages_id;
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+				   &access_cmd));
+		test_cmd_destroy_access_pages(
+			access_cmd.id,
+			access_cmd.access_pages.out_access_pages_id);
+		test_cmd_destroy_access_pages(access_cmd.id, access_pages_id);
+
+		/* Add/remove a domain with a user */
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+				   &access_cmd));
+		test_cmd_mock_domain(self->ioas_id, &mock_device_id,
+				     &mock_hwpt_id);
+		check_map_cmd.id = mock_hwpt_id;
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),
+				   &check_map_cmd));
+
+		test_ioctl_destroy(mock_device_id);
+		test_ioctl_destroy(mock_hwpt_id);
+		test_cmd_destroy_access_pages(
+			access_cmd.id,
+			access_cmd.access_pages.out_access_pages_id);
+
+		test_ioctl_ioas_unmap(MOCK_APERTURE_START, BUFFER_SIZE);
+	}
+	test_cmd_destroy_access(access_cmd.id);
+}
+
+TEST_F(iommufd_ioas, access_pin_unmap)
+{
+	struct iommu_test_cmd access_pages_cmd = {
+		.size = sizeof(access_pages_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+
+	test_cmd_create_access(self->ioas_id, &access_pages_cmd.id,
+			       MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
+	test_ioctl_ioas_map_fixed(buffer, BUFFER_SIZE, MOCK_APERTURE_START);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_pages_cmd));
+
+	/* Trigger the unmap op */
+	test_ioctl_ioas_unmap(MOCK_APERTURE_START, BUFFER_SIZE);
+
+	/* kernel removed the item for us */
+	test_err_destroy_access_pages(
+		ENOENT, access_pages_cmd.id,
+		access_pages_cmd.access_pages.out_access_pages_id);
+}
+
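+/*
+ * Sweep small reads and writes across a page boundary, then do one large
+ * multi-page read, checking the data against the mapped buffer each time.
+ */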
+static void check_access_rw(struct __test_metadata *_metadata, int fd,
+			    unsigned int access_id, uint64_t iova,
+			    unsigned int def_flags)
+{
+	uint16_t tmp[32];
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_RW,
+		.id = access_id,
+		.access_rw = { .uptr = (uintptr_t)tmp },
+	};
+	uint16_t *buffer16 = buffer;
+	unsigned int i;
+	void *tmp2;
+
+	for (i = 0; i != BUFFER_SIZE / sizeof(*buffer16); i++)
+		buffer16[i] = rand();
+
+	for (access_cmd.access_rw.iova = iova + PAGE_SIZE - 50;
+	     access_cmd.access_rw.iova < iova + PAGE_SIZE + 50;
+	     access_cmd.access_rw.iova++) {
+		for (access_cmd.access_rw.length = 1;
+		     access_cmd.access_rw.length < sizeof(tmp);
+		     access_cmd.access_rw.length++) {
+			access_cmd.access_rw.flags = def_flags;
+			ASSERT_EQ(0, ioctl(fd,
+					   _IOMMU_TEST_CMD(
+						   IOMMU_TEST_OP_ACCESS_RW),
+					   &access_cmd));
+			ASSERT_EQ(0,
+				  memcmp(buffer + (access_cmd.access_rw.iova -
+						   iova),
+					 tmp, access_cmd.access_rw.length));
+
+			for (i = 0; i != ARRAY_SIZE(tmp); i++)
+				tmp[i] = rand();
+			access_cmd.access_rw.flags = def_flags |
+						     MOCK_ACCESS_RW_WRITE;
+			ASSERT_EQ(0, ioctl(fd,
+					   _IOMMU_TEST_CMD(
+						   IOMMU_TEST_OP_ACCESS_RW),
+					   &access_cmd));
+			ASSERT_EQ(0,
+				  memcmp(buffer + (access_cmd.access_rw.iova -
+						   iova),
+					 tmp, access_cmd.access_rw.length));
+		}
+	}
+
+	/* Multi-page test */
+	tmp2 = malloc(BUFFER_SIZE);
+	ASSERT_NE(NULL, tmp2);
+	access_cmd.access_rw.iova = iova;
+	access_cmd.access_rw.length = BUFFER_SIZE;
+	access_cmd.access_rw.flags = def_flags;
+	access_cmd.access_rw.uptr = (uintptr_t)tmp2;
+	ASSERT_EQ(0, ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			   &access_cmd));
+	ASSERT_EQ(0, memcmp(buffer, tmp2, access_cmd.access_rw.length));
+	free(tmp2);
+}
+
+TEST_F(iommufd_ioas, access_rw)
+{
+	__u32 access_id;
+	__u64 iova;
+
+	test_cmd_create_access(self->ioas_id, &access_id, 0);
+	test_ioctl_ioas_map(buffer, BUFFER_SIZE, &iova);
+	check_access_rw(_metadata, self->fd, access_id, iova, 0);
+	check_access_rw(_metadata, self->fd, access_id, iova,
+			MOCK_ACCESS_RW_SLOW_PATH);
+	test_ioctl_ioas_unmap(iova, BUFFER_SIZE);
+	test_cmd_destroy_access(access_id);
+}
+
+TEST_F(iommufd_ioas, access_rw_unaligned)
+{
+	__u32 access_id;
+	__u64 iova;
+
+	test_cmd_create_access(self->ioas_id, &access_id, 0);
+
+	/* Unaligned pages */
+	iova = self->base_iova + MOCK_PAGE_SIZE;
+	test_ioctl_ioas_map_fixed(buffer, BUFFER_SIZE, iova);
+	check_access_rw(_metadata, self->fd, access_id, iova, 0);
+	test_ioctl_ioas_unmap(iova, BUFFER_SIZE);
+	test_cmd_destroy_access(access_id);
+}
+
+TEST_F(iommufd_ioas, fork_gone)
+{
+	__u32 access_id;
+	pid_t child;
+
+	test_cmd_create_access(self->ioas_id, &access_id, 0);
+
+	/* Create a mapping with a different mm */
+	child = fork();
+	if (!child) {
+		test_ioctl_ioas_map_fixed(buffer, BUFFER_SIZE,
+					  MOCK_APERTURE_START);
+		exit(0);
+	}
+	ASSERT_NE(-1, child);
+	ASSERT_EQ(child, waitpid(child, NULL, 0));
+
+	if (self->domain_id) {
+		/*
+		 * If a domain already existed then everything was pinned within
+		 * the fork, so this copies from one domain to another.
+		 */
+		test_cmd_mock_domain(self->ioas_id, NULL, NULL);
+		check_access_rw(_metadata, self->fd, access_id,
+				MOCK_APERTURE_START, 0);
+
+	} else {
+		/*
+		 * Otherwise we need to actually pin pages which can't happen
+		 * since the fork is gone.
+		 */
+		test_err_mock_domain(EFAULT, self->ioas_id, NULL, NULL);
+	}
+
+	test_cmd_destroy_access(access_id);
+}
+
+TEST_F(iommufd_ioas, fork_present)
+{
+	__u32 access_id;
+	int pipefds[2];
+	uint64_t tmp;
+	pid_t child;
+	int efd;
+
+	test_cmd_create_access(self->ioas_id, &access_id, 0);
+
+	ASSERT_EQ(0, pipe2(pipefds, O_CLOEXEC));
+	efd = eventfd(0, EFD_CLOEXEC);
+	ASSERT_NE(-1, efd);
+
+	/* Create a mapping with a different mm */
+	child = fork();
+	if (!child) {
+		__u64 iova;
+		uint64_t one = 1;
+
+		close(pipefds[1]);
+		test_ioctl_ioas_map_fixed(buffer, BUFFER_SIZE,
+					  MOCK_APERTURE_START);
+		if (write(efd, &one, sizeof(one)) != sizeof(one))
+			exit(100);
+		if (read(pipefds[0], &iova, 1) != 1)
+			exit(100);
+		exit(0);
+	}
+	close(pipefds[0]);
+	ASSERT_NE(-1, child);
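+	/* Wait for the child to signal that the mapping is in place */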
+	ASSERT_EQ(8, read(efd, &tmp, sizeof(tmp)));
+
+	/* Read pages from the remote process */
+	test_cmd_mock_domain(self->ioas_id, NULL, NULL);
+	check_access_rw(_metadata, self->fd, access_id, MOCK_APERTURE_START, 0);
+
+	ASSERT_EQ(0, close(pipefds[1]));
+	ASSERT_EQ(child, waitpid(child, NULL, 0));
+
+	test_cmd_destroy_access(access_id);
+}
+
+TEST_F(iommufd_ioas, ioas_option_huge_pages)
+{
+	struct iommu_option cmd = {
+		.size = sizeof(cmd),
+		.option_id = IOMMU_OPTION_HUGE_PAGES,
+		.op = IOMMU_OPTION_OP_GET,
+		.val64 = 3,
+		.object_id = self->ioas_id,
+	};
+
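+	/* Huge pages are enabled by default; only values 0 and 1 are accepted */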
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	ASSERT_EQ(1, cmd.val64);
+
+	cmd.op = IOMMU_OPTION_OP_SET;
+	cmd.val64 = 0;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	cmd.op = IOMMU_OPTION_OP_GET;
+	cmd.val64 = 3;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	ASSERT_EQ(0, cmd.val64);
+
+	cmd.op = IOMMU_OPTION_OP_SET;
+	cmd.val64 = 2;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	cmd.op = IOMMU_OPTION_OP_SET;
+	cmd.val64 = 1;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+}
+
+TEST_F(iommufd_ioas, ioas_iova_alloc)
+{
+	unsigned int length;
+	__u64 iova;
+
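+	/* With a mock domain attached, unaligned lengths fail with EINVAL */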
+	for (length = 1; length != PAGE_SIZE * 2; length++) {
+		if (variant->mock_domains && (length % MOCK_PAGE_SIZE)) {
+			test_err_ioctl_ioas_map(EINVAL, buffer, length, &iova);
+		} else {
+			test_ioctl_ioas_map(buffer, length, &iova);
+			test_ioctl_ioas_unmap(iova, length);
+		}
+	}
+}
+
+TEST_F(iommufd_ioas, ioas_align_change)
+{
+	struct iommu_option cmd = {
+		.size = sizeof(cmd),
+		.option_id = IOMMU_OPTION_HUGE_PAGES,
+		.op = IOMMU_OPTION_OP_SET,
+		.object_id = self->ioas_id,
+		/* 0 means everything must be aligned to PAGE_SIZE */
+		.val64 = 0,
+	};
+
+	/*
+	 * We cannot upgrade the alignment using OPTION_HUGE_PAGES when a domain
+	 * and map are present.
+	 */
+	if (variant->mock_domains)
+		return;
+
+	/*
+	 * We can upgrade to PAGE_SIZE alignment when things are aligned right
+	 */
+	test_ioctl_ioas_map_fixed(buffer, PAGE_SIZE, MOCK_APERTURE_START);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	/* Misalignment is rejected at map time */
+	test_err_ioctl_ioas_map_fixed(EINVAL, buffer + MOCK_PAGE_SIZE,
+				      PAGE_SIZE,
+				      MOCK_APERTURE_START + PAGE_SIZE);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	/* Reduce alignment */
+	cmd.val64 = 1;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	/* Confirm misalignment is rejected during alignment upgrade */
+	test_ioctl_ioas_map_fixed(buffer + MOCK_PAGE_SIZE, PAGE_SIZE,
+				  MOCK_APERTURE_START + PAGE_SIZE);
+	cmd.val64 = 0;
+	EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+	test_ioctl_ioas_unmap(MOCK_APERTURE_START + PAGE_SIZE, PAGE_SIZE);
+	test_ioctl_ioas_unmap(MOCK_APERTURE_START, PAGE_SIZE);
+}
+
+TEST_F(iommufd_ioas, copy_sweep)
+{
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.src_ioas_id = self->ioas_id,
+		.dst_iova = MOCK_APERTURE_START,
+		.length = MOCK_PAGE_SIZE,
+	};
+	unsigned int dst_ioas_id;
+	uint64_t last_iova;
+	uint64_t iova;
+
+	test_ioctl_ioas_alloc(&dst_ioas_id);
+	copy_cmd.dst_ioas_id = dst_ioas_id;
+
+	if (variant->mock_domains)
+		last_iova = MOCK_APERTURE_START + BUFFER_SIZE - 1;
+	else
+		last_iova = MOCK_APERTURE_START + BUFFER_SIZE - 2;
+
+	test_ioctl_ioas_map_fixed(buffer, last_iova - MOCK_APERTURE_START + 1,
+				  MOCK_APERTURE_START);
+
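+	/*
+	 * Walk src_iova through and past the mapped window; copies that fall
+	 * outside the mapping must fail with ENOENT.
+	 */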
+	for (iova = MOCK_APERTURE_START - PAGE_SIZE; iova <= last_iova;
+	     iova += 511) {
+		copy_cmd.src_iova = iova;
+		if (iova < MOCK_APERTURE_START ||
+		    iova + copy_cmd.length - 1 > last_iova) {
+			EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_IOAS_COPY,
+						   &copy_cmd));
+		} else {
+			ASSERT_EQ(0,
+				  ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+			test_ioctl_ioas_unmap_id(dst_ioas_id, copy_cmd.dst_iova,
+						 copy_cmd.length);
+		}
+	}
+
+	test_ioctl_destroy(dst_ioas_id);
+}
+
+FIXTURE(iommufd_mock_domain)
+{
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint32_t domain_ids[2];
+	int mmap_flags;
+	size_t mmap_buf_size;
+};
+
+FIXTURE_VARIANT(iommufd_mock_domain)
+{
+	unsigned int mock_domains;
+	bool hugepages;
+};
+
+FIXTURE_SETUP(iommufd_mock_domain)
+{
+	unsigned int i;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	ASSERT_GE(ARRAY_SIZE(self->domain_ids), variant->mock_domains);
+
+	for (i = 0; i != variant->mock_domains; i++)
+		test_cmd_mock_domain(self->ioas_id, NULL, &self->domain_ids[i]);
+	self->domain_id = self->domain_ids[0];
+
+	self->mmap_flags = MAP_SHARED | MAP_ANONYMOUS;
+	self->mmap_buf_size = PAGE_SIZE * 8;
+	if (variant->hugepages) {
+		/*
+		 * MAP_POPULATE will cause the kernel to fail mmap if THPs are
+		 * not available.
+		 */
+		self->mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
+		self->mmap_buf_size = HUGEPAGE_SIZE * 2;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_mock_domain)
+{
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain)
+{
+	.mock_domains = 1,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains)
+{
+	.mock_domains = 2,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_hugepage)
+{
+	.mock_domains = 1,
+	.hugepages = true,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains_hugepage)
+{
+	.mock_domains = 2,
+	.hugepages = true,
+};
+
+/* Have the kernel check that the user pages made it to the iommu_domain */
+#define check_mock_iova(_ptr, _iova, _length)                                \
+	({                                                                   \
+		struct iommu_test_cmd check_map_cmd = {                      \
+			.size = sizeof(check_map_cmd),                       \
+			.op = IOMMU_TEST_OP_MD_CHECK_MAP,                    \
+			.id = self->domain_id,                               \
+			.check_map = { .iova = _iova,                        \
+				       .length = _length,                    \
+				       .uptr = (uintptr_t)(_ptr) },          \
+		};                                                           \
+		ASSERT_EQ(0,                                                 \
+			  ioctl(self->fd,                                    \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP), \
+				&check_map_cmd));                            \
+		if (self->domain_ids[1]) {                                   \
+			check_map_cmd.id = self->domain_ids[1];              \
+			ASSERT_EQ(0,                                         \
+				  ioctl(self->fd,                            \
+					_IOMMU_TEST_CMD(                     \
+						IOMMU_TEST_OP_MD_CHECK_MAP), \
+					&check_map_cmd));                    \
+		}                                                            \
+	})
+
+TEST_F(iommufd_mock_domain, basic)
+{
+	size_t buf_size = self->mmap_buf_size;
+	uint8_t *buf;
+	__u64 iova;
+
+	/* Simple one page map */
+	test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova);
+	check_mock_iova(buffer, iova, PAGE_SIZE);
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+
+	/* EFAULT half way through mapping */
+	ASSERT_EQ(0, munmap(buf + buf_size / 2, buf_size / 2));
+	test_err_ioctl_ioas_map(EFAULT, buf, buf_size, &iova);
+
+	/* EFAULT on first page */
+	ASSERT_EQ(0, munmap(buf, buf_size / 2));
+	test_err_ioctl_ioas_map(EFAULT, buf, buf_size, &iova);
+}
+
+TEST_F(iommufd_mock_domain, all_aligns)
+{
+	size_t test_step = variant->hugepages ? (self->mmap_buf_size / 16) :
+						MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big region,
+	 * fewer for the hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			size_t length = end - start;
+			__u64 iova;
+
+			test_ioctl_ioas_map(buf + start, length, &iova);
+			check_mock_iova(buf + start, iova, length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			test_ioctl_ioas_unmap(iova, length);
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns_copy)
+{
+	size_t test_step = variant->hugepages ? self->mmap_buf_size / 16 :
+						MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big region,
+	 * fewer for the hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			size_t length = end - start;
+			unsigned int old_id;
+			uint32_t mock_device_id;
+			__u64 iova;
+
+			test_ioctl_ioas_map(buf + start, length, &iova);
+
+			/* Add and destroy a domain while the area exists */
+			old_id = self->domain_ids[1];
+			test_cmd_mock_domain(self->ioas_id, &mock_device_id,
+					     &self->domain_ids[1]);
+
+			check_mock_iova(buf + start, iova, length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			test_ioctl_destroy(mock_device_id);
+			test_ioctl_destroy(self->domain_ids[1]);
+			self->domain_ids[1] = old_id;
+
+			test_ioctl_ioas_unmap(iova, length);
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, user_copy)
+{
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.dst_iova = MOCK_APERTURE_START,
+		.length = BUFFER_SIZE,
+	};
+	unsigned int ioas_id;
+
+	/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
+	test_ioctl_ioas_alloc(&ioas_id);
+	test_ioctl_ioas_map_id(ioas_id, buffer, BUFFER_SIZE,
+			       &copy_cmd.src_iova);
+
+	test_cmd_create_access(ioas_id, &access_cmd.id,
+			       MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
+
+	access_cmd.access_pages.iova = copy_cmd.src_iova;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	copy_cmd.src_ioas_id = ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+	check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE);
+
+	test_cmd_destroy_access_pages(
+		access_cmd.id, access_cmd.access_pages.out_access_pages_id);
+	test_cmd_destroy_access(access_cmd.id);
+
+	test_ioctl_destroy(ioas_id);
+}
+
+/* VFIO compatibility IOCTLs */
+
+TEST_F(iommufd, simple_ioctls)
+{
+	ASSERT_EQ(VFIO_API_VERSION, ioctl(self->fd, VFIO_GET_API_VERSION));
+	ASSERT_EQ(1, ioctl(self->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(iommufd, unmap_cmd)
+{
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+	};
+
+	unmap_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.argsz = sizeof(unmap_cmd);
+	unmap_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.flags = 0;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+}
+
+TEST_F(iommufd, map_cmd)
+{
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+		.vaddr = (__u64)buffer,
+	};
+
+	map_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	map_cmd.argsz = sizeof(map_cmd);
+	map_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	/* Requires a domain to be attached */
+	map_cmd.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+}
+
+TEST_F(iommufd, info_cmd)
+{
+	struct vfio_iommu_type1_info info_cmd = {};
+
+	/* Invalid argsz */
+	info_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+
+	info_cmd.argsz = sizeof(info_cmd);
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+}
+
+TEST_F(iommufd, set_iommu_cmd)
+{
+	/* Requires a domain to be attached */
+	EXPECT_ERRNO(ENODEV,
+		     ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU));
+}
+
+TEST_F(iommufd, vfio_ioas)
+{
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_GET,
+	};
+	__u32 ioas_id;
+
+	/* ENODEV if there is no compat ioas */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Invalid id for set */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_SET;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Valid id for set */
+	test_ioctl_ioas_alloc(&ioas_id);
+	vfio_ioas_cmd.ioas_id = ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Same id comes back from get */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(ioas_id, vfio_ioas_cmd.ioas_id);
+
+	/* Clear works */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_CLEAR;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+}
+
+FIXTURE(vfio_compat_mock_domain)
+{
+	int fd;
+	uint32_t ioas_id;
+};
+
+FIXTURE_VARIANT(vfio_compat_mock_domain)
+{
+	unsigned int version;
+};
+
+FIXTURE_SETUP(vfio_compat_mock_domain)
+{
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_SET,
+	};
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	/* Create what VFIO would consider a group */
+	test_ioctl_ioas_alloc(&self->ioas_id);
+	test_cmd_mock_domain(self->ioas_id, NULL, NULL);
+
+	/* Attach it to the vfio compat */
+	vfio_ioas_cmd.ioas_id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_SET_IOMMU, variant->version));
+}
+
+FIXTURE_TEARDOWN(vfio_compat_mock_domain)
+{
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(vfio_compat_mock_domain, Ver1v2)
+{
+	.version = VFIO_TYPE1v2_IOMMU,
+};
+
+FIXTURE_VARIANT_ADD(vfio_compat_mock_domain, Ver1v0)
+{
+	.version = VFIO_TYPE1_IOMMU,
+};
+
+TEST_F(vfio_compat_mock_domain, simple_close)
+{
+}
+
+TEST_F(vfio_compat_mock_domain, option_huge_pages)
+{
+	struct iommu_option cmd = {
+		.size = sizeof(cmd),
+		.option_id = IOMMU_OPTION_HUGE_PAGES,
+		.op = IOMMU_OPTION_OP_GET,
+		.val64 = 3,
+		.object_id = self->ioas_id,
+	};
+
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+	if (variant->version == VFIO_TYPE1_IOMMU) {
+		ASSERT_EQ(0, cmd.val64);
+	} else {
+		ASSERT_EQ(1, cmd.val64);
+	}
+}
+
+/*
+ * Execute an ioctl command stored in buffer and check that the result does not
+ * overflow memory.
+ */
+static bool is_filled(const void *buf, uint8_t c, size_t len)
+{
+	const uint8_t *cbuf = buf;
+
+	for (; len; cbuf++, len--)
+		if (*cbuf != c)
+			return false;
+	return true;
+}
+
+#define ioctl_check_buf(fd, cmd)                                         \
+	({                                                               \
+		size_t _cmd_len = *(__u32 *)buffer;                      \
+									 \
+		memset(buffer + _cmd_len, 0xAA, BUFFER_SIZE - _cmd_len); \
+		ASSERT_EQ(0, ioctl(fd, cmd, buffer));                    \
+		ASSERT_EQ(true, is_filled(buffer + _cmd_len, 0xAA,       \
+					  BUFFER_SIZE - _cmd_len));      \
+	})
+
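+/* Walk the returned cap chain and validate each capability the kernel set */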
+static void check_vfio_info_cap_chain(struct __test_metadata *_metadata,
+				      struct vfio_iommu_type1_info *info_cmd)
+{
+	const struct vfio_info_cap_header *cap;
+
+	ASSERT_GE(info_cmd->argsz, info_cmd->cap_offset + sizeof(*cap));
+	cap = buffer + info_cmd->cap_offset;
+	while (true) {
+		size_t cap_size;
+
+		if (cap->next)
+			cap_size = (buffer + cap->next) - (void *)cap;
+		else
+			cap_size = (buffer + info_cmd->argsz) - (void *)cap;
+
+		switch (cap->id) {
+		case VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE: {
+			struct vfio_iommu_type1_info_cap_iova_range *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(1, data->nr_iovas);
+			EXPECT_EQ(MOCK_APERTURE_START,
+				  data->iova_ranges[0].start);
+			EXPECT_EQ(MOCK_APERTURE_LAST, data->iova_ranges[0].end);
+			break;
+		}
+		case VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL: {
+			struct vfio_iommu_type1_info_dma_avail *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(sizeof(*data), cap_size);
+			break;
+		}
+		default:
+			ASSERT_EQ(false, true);
+			break;
+		}
+		if (!cap->next)
+			break;
+
+		ASSERT_GE(info_cmd->argsz, cap->next + sizeof(*cap));
+		ASSERT_GE(buffer + cap->next, (void *)cap);
+		cap = buffer + cap->next;
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, get_info)
+{
+	struct vfio_iommu_type1_info *info_cmd = buffer;
+	unsigned int i;
+	size_t caplen;
+
+	/* Pre-cap ABI */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = offsetof(struct vfio_iommu_type1_info, cap_offset),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+
+	/* Read the cap chain size */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = sizeof(*info_cmd),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+	ASSERT_EQ(0, info_cmd->cap_offset);
+	ASSERT_LT(sizeof(*info_cmd), info_cmd->argsz);
+
+	/* Read the caps; the kernel should never create corrupted caps */
+	caplen = info_cmd->argsz;
+	for (i = sizeof(*info_cmd); i < caplen; i++) {
+		*info_cmd = (struct vfio_iommu_type1_info){
+			.argsz = i,
+		};
+		ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+		ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+			  info_cmd->flags);
+		if (!info_cmd->cap_offset)
+			continue;
+		check_vfio_info_cap_chain(_metadata, info_cmd);
+	}
+}
+
+static void shuffle_array(unsigned long *array, size_t nelms)
+{
+	unsigned int i;
+
+	/* Shuffle */
+	for (i = 0; i != nelms; i++) {
+		unsigned long tmp = array[i];
+		unsigned int other = rand() % (nelms - i);
+
+		array[i] = array[other];
+		array[other] = tmp;
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, map)
+{
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.argsz = sizeof(map_cmd),
+		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+		.vaddr = (uintptr_t)buffer,
+		.size = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.argsz = sizeof(unmap_cmd),
+		.size = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	unsigned long pages_iova[BUFFER_SIZE / PAGE_SIZE];
+	unsigned int i;
+
+	/* Simple map/unmap */
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+	ASSERT_EQ(BUFFER_SIZE, unmap_cmd.size);
+
+	/* UNMAP_FLAG_ALL requires 0 iova/size */
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+	unmap_cmd.flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.iova = 0;
+	unmap_cmd.size = 0;
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+	ASSERT_EQ(BUFFER_SIZE, unmap_cmd.size);
+
+	/* Small pages */
+	for (i = 0; i != ARRAY_SIZE(pages_iova); i++) {
+		map_cmd.iova = pages_iova[i] =
+			MOCK_APERTURE_START + i * PAGE_SIZE;
+		map_cmd.vaddr = (uintptr_t)buffer + i * PAGE_SIZE;
+		map_cmd.size = PAGE_SIZE;
+		ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+	}
+	shuffle_array(pages_iova, ARRAY_SIZE(pages_iova));
+
+	unmap_cmd.flags = 0;
+	unmap_cmd.size = PAGE_SIZE;
+	for (i = 0; i != ARRAY_SIZE(pages_iova); i++) {
+		unmap_cmd.iova = pages_iova[i];
+		ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, huge_map)
+{
+	size_t buf_size = HUGEPAGE_SIZE * 2;
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.argsz = sizeof(map_cmd),
+		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+		.size = buf_size,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.argsz = sizeof(unmap_cmd),
+	};
+	unsigned long pages_iova[16];
+	unsigned int i;
+	void *buf;
+
+	/* Test huge pages and splitting */
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+	map_cmd.vaddr = (uintptr_t)buf;
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	unmap_cmd.size = buf_size / ARRAY_SIZE(pages_iova);
+	for (i = 0; i != ARRAY_SIZE(pages_iova); i++)
+		pages_iova[i] = MOCK_APERTURE_START + (i * unmap_cmd.size);
+	shuffle_array(pages_iova, ARRAY_SIZE(pages_iova));
+
+	/* type1 mode can cut up larger mappings, type1v2 always fails */
+	for (i = 0; i != ARRAY_SIZE(pages_iova); i++) {
+		unmap_cmd.iova = pages_iova[i];
+		unmap_cmd.size = buf_size / ARRAY_SIZE(pages_iova);
+		if (variant->version == VFIO_TYPE1_IOMMU) {
+			ASSERT_EQ(0, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA,
+					   &unmap_cmd));
+		} else {
+			EXPECT_ERRNO(ENOENT,
+				     ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA,
+					   &unmap_cmd));
+		}
+	}
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
new file mode 100644
index 00000000000000..99eaa9f32e0b23
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -0,0 +1,580 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ *
+ * These tests are "kernel integrity" tests. They are looking for kernel
+ * WARN/OOPS/KASAN/etc splats triggered by kernel sanitizers & debugging
+ * features. They do not attempt to verify that the system calls are doing what
+ * they are supposed to do.
+ *
+ * The basic philosophy is to run a sequence of calls that will succeed and then
+ * sweep every failure injection point on that call chain to look for
+ * interesting things in error handling.
+ *
+ * If something is actually going wrong, this test is best run with:
+ *
+ *  echo 1 > /proc/sys/kernel/panic_on_warn
+ */
+#include <fcntl.h>
+#include <dirent.h>
+
+#define __EXPORTED_HEADERS__
+#include <linux/vfio.h>
+
+#include "iommufd_utils.h"
+
+static bool have_fault_injection;
+
+static int writeat(int dfd, const char *fn, const char *val)
+{
+	size_t val_len = strlen(val);
+	ssize_t res;
+	int fd;
+
+	fd = openat(dfd, fn, O_WRONLY);
+	if (fd == -1)
+		return -1;
+	res = write(fd, val, val_len);
+	assert(res == val_len);
+	close(fd);
+	return 0;
+}
+
+static __attribute__((constructor)) void setup_buffer(void)
+{
+	BUFFER_SIZE = 2 * 1024 * 1024;
+
+	buffer = mmap(0, BUFFER_SIZE, PROT_READ | PROT_WRITE,
+		      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+}
+
+/*
+ * This sets up fail_injection in a way that is useful for this test.
+ * It does not attempt to restore things back to how they were.
+ */
+static __attribute__((constructor)) void setup_fault_injection(void)
+{
+	DIR *debugfs = opendir("/sys/kernel/debug/");
+	struct dirent *dent;
+
+	if (!debugfs)
+		return;
+
+	/* Allow any allocation call to be fault injected */
+	if (writeat(dirfd(debugfs), "failslab/ignore-gfp-wait", "N"))
+		return;
+	writeat(dirfd(debugfs), "fail_page_alloc/ignore-gfp-wait", "N");
+	writeat(dirfd(debugfs), "fail_page_alloc/ignore-gfp-highmem", "N");
+
+	while ((dent = readdir(debugfs))) {
+		char fn[300];
+
+		if (strncmp(dent->d_name, "fail", 4) != 0)
+			continue;
+
+		/* We are looking for kernel splats, quiet down the log */
+		snprintf(fn, sizeof(fn), "%s/verbose", dent->d_name);
+		writeat(dirfd(debugfs), fn, "0");
+	}
+	closedir(debugfs);
+	have_fault_injection = true;
+}
+
+struct fail_nth_state {
+	int proc_fd;
+	unsigned int iteration;
+};
+
+static void fail_nth_first(struct __test_metadata *_metadata,
+			   struct fail_nth_state *nth_state)
+{
+	char buf[300];
+
+	snprintf(buf, sizeof(buf), "/proc/self/task/%u/fail-nth", gettid());
+	nth_state->proc_fd = open(buf, O_RDWR);
+	ASSERT_NE(-1, nth_state->proc_fd);
+}
+
+static bool fail_nth_next(struct __test_metadata *_metadata,
+			  struct fail_nth_state *nth_state,
+			  int test_result)
+{
+	static const char disable_nth[] = "0";
+	char buf[300];
+
+	/*
+	 * This is just an arbitrary limit based on the current kernel
+	 * situation. Changes in the kernel can dramatically change the number of
+	 * required fault injection sites, so if this hits it doesn't
+	 * necessarily mean a test failure, just that the limit has to be made
+	 * bigger.
+	 */
+	ASSERT_GT(400, nth_state->iteration);
+	if (nth_state->iteration != 0) {
+		ssize_t res;
+		ssize_t res2;
+
+		buf[0] = 0;
+		/*
+		 * Annoyingly, disabling the nth can also fail. This means
+		 * the test passed without triggering a failure.
+		 */
+		res = pread(nth_state->proc_fd, buf, sizeof(buf), 0);
+		if (res == -1 && errno == EFAULT) {
+			buf[0] = '1';
+			buf[1] = '\n';
+			res = 2;
+		}
+
+		res2 = pwrite(nth_state->proc_fd, disable_nth,
+			      ARRAY_SIZE(disable_nth) - 1, 0);
+		if (res2 == -1 && errno == EFAULT) {
+			res2 = pwrite(nth_state->proc_fd, disable_nth,
+				      ARRAY_SIZE(disable_nth) - 1, 0);
+			buf[0] = '1';
+			buf[1] = '\n';
+		}
+		ASSERT_EQ(ARRAY_SIZE(disable_nth) - 1, res2);
+
+		/* printf("  nth %u result=%d nth=%u\n", nth_state->iteration,
+		       test_result, atoi(buf)); */
+		fflush(stdout);
+		ASSERT_LT(1, res);
+		if (res != 2 || buf[0] != '0' || buf[1] != '\n')
+			return false;
+	} else {
+		/* printf("  nth %u result=%d\n", nth_state->iteration,
+		       test_result); */
+	}
+	nth_state->iteration++;
+	return true;
+}
+
+/*
+ * This is called during the test to start failure injection. It lets the test
+ * perform setup that has already been swept without injection, and thus
+ * reduces the number of required iterations.
+ */
+void __fail_nth_enable(struct __test_metadata *_metadata,
+		       struct fail_nth_state *nth_state)
+{
+	char buf[300];
+	size_t len;
+
+	if (!nth_state->iteration)
+		return;
+
+	len = snprintf(buf, sizeof(buf), "%u", nth_state->iteration);
+	ASSERT_EQ(len, pwrite(nth_state->proc_fd, buf, len, 0));
+}
+#define fail_nth_enable() __fail_nth_enable(_metadata, _nth_state)
+
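+/*
+ * Run the test body repeatedly, failing the 1st, 2nd, 3rd, ... injection point
+ * in turn, until a run completes without consuming the injected failure.
+ */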
+#define TEST_FAIL_NTH(fixture_name, name)                                           \
+	static int test_nth_##name(struct __test_metadata *_metadata,               \
+				   FIXTURE_DATA(fixture_name) *self,                \
+				   const FIXTURE_VARIANT(fixture_name)              \
+					   *variant,                                \
+				   struct fail_nth_state *_nth_state);              \
+	TEST_F(fixture_name, name)                                                  \
+	{                                                                           \
+		struct fail_nth_state nth_state = {};                               \
+		int test_result = 0;                                                \
+										    \
+		if (!have_fault_injection)                                          \
+			SKIP(return,                                                \
+				   "fault injection is not enabled in the kernel"); \
+		fail_nth_first(_metadata, &nth_state);                              \
+		ASSERT_EQ(0, test_nth_##name(_metadata, self, variant,              \
+					     &nth_state));                          \
+		while (fail_nth_next(_metadata, &nth_state, test_result)) {         \
+			fixture_name##_teardown(_metadata, self, variant);          \
+			fixture_name##_setup(_metadata, self, variant);             \
+			test_result = test_nth_##name(_metadata, self,              \
+						      variant, &nth_state);         \
+		};                                                                  \
+		ASSERT_EQ(0, test_result);                                          \
+	}                                                                           \
+	static int test_nth_##name(                                                 \
+		struct __test_metadata __attribute__((unused)) *_metadata,          \
+		FIXTURE_DATA(fixture_name) __attribute__((unused)) *self,           \
+		const FIXTURE_VARIANT(fixture_name) __attribute__((unused))         \
+			*variant,                                                   \
+		struct fail_nth_state *_nth_state)
+
+FIXTURE(basic_fail_nth)
+{
+	int fd;
+	uint32_t access_id;
+};
+
+FIXTURE_SETUP(basic_fail_nth)
+{
+	self->fd = -1;
+	self->access_id = 0;
+}
+
+FIXTURE_TEARDOWN(basic_fail_nth)
+{
+	int rc;
+
+	if (self->access_id) {
+		/* The access FD holds the iommufd open until it closes */
+		rc = _test_cmd_destroy_access(self->access_id);
+		assert(rc == 0);
+	}
+	teardown_iommufd(self->fd, _metadata);
+}
+
+/* Cover ioas.c */
+TEST_FAIL_NTH(basic_fail_nth, basic)
+{
+	struct iommu_iova_range ranges[10];
+	uint32_t ioas_id;
+	__u64 iova;
+
+	fail_nth_enable();
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	if (self->fd == -1)
+		return -1;
+
+	if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
+		return -1;
+
+	{
+		struct iommu_ioas_iova_ranges ranges_cmd = {
+			.size = sizeof(ranges_cmd),
+			.num_iovas = ARRAY_SIZE(ranges),
+			.ioas_id = ioas_id,
+			.allowed_iovas = (uintptr_t)ranges,
+		};
+		if (ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, &ranges_cmd))
+			return -1;
+	}
+
+	{
+		struct iommu_ioas_allow_iovas allow_cmd = {
+			.size = sizeof(allow_cmd),
+			.ioas_id = ioas_id,
+			.num_iovas = 1,
+			.allowed_iovas = (uintptr_t)ranges,
+		};
+
+		ranges[0].start = 16 * 1024;
+		ranges[0].last = BUFFER_SIZE + 16 * 1024 * 600 - 1;
+		if (ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd))
+			return -1;
+	}
+
+	if (_test_ioctl_ioas_map(self->fd, ioas_id, buffer, BUFFER_SIZE, &iova,
+				 IOMMU_IOAS_MAP_WRITEABLE |
+					 IOMMU_IOAS_MAP_READABLE))
+		return -1;
+
+	{
+		struct iommu_ioas_copy copy_cmd = {
+			.size = sizeof(copy_cmd),
+			.flags = IOMMU_IOAS_MAP_WRITEABLE |
+				 IOMMU_IOAS_MAP_READABLE,
+			.dst_ioas_id = ioas_id,
+			.src_ioas_id = ioas_id,
+			.src_iova = iova,
+			.length = sizeof(ranges),
+		};
+
+		if (ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd))
+			return -1;
+	}
+
+	if (_test_ioctl_ioas_unmap(self->fd, ioas_id, iova, BUFFER_SIZE,
+				   NULL))
+		return -1;
+	/* Failure path of no IOVA to unmap */
+	_test_ioctl_ioas_unmap(self->fd, ioas_id, iova, BUFFER_SIZE, NULL);
+	return 0;
+}
+
+/* iopt_area_fill_domains() and iopt_area_fill_domain() */
+TEST_FAIL_NTH(basic_fail_nth, map_domain)
+{
+	uint32_t ioas_id;
+	__u32 device_id;
+	__u32 hwpt_id;
+	__u64 iova;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	if (self->fd == -1)
+		return -1;
+
+	if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
+		return -1;
+
+	if (_test_ioctl_set_temp_memory_limit(self->fd, 32))
+		return -1;
+
+	fail_nth_enable();
+
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id, &hwpt_id))
+		return -1;
+
+	if (_test_ioctl_ioas_map(self->fd, ioas_id, buffer, 262144, &iova,
+				 IOMMU_IOAS_MAP_WRITEABLE |
+					 IOMMU_IOAS_MAP_READABLE))
+		return -1;
+
+	if (_test_ioctl_destroy(self->fd, device_id))
+		return -1;
+	if (_test_ioctl_destroy(self->fd, hwpt_id))
+		return -1;
+
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id, &hwpt_id))
+		return -1;
+	return 0;
+}
+
+TEST_FAIL_NTH(basic_fail_nth, map_two_domains)
+{
+	uint32_t ioas_id;
+	__u32 device_id2;
+	__u32 device_id;
+	__u32 hwpt_id2;
+	__u32 hwpt_id;
+	__u64 iova;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	if (self->fd == -1)
+		return -1;
+
+	if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
+		return -1;
+
+	if (_test_ioctl_set_temp_memory_limit(self->fd, 32))
+		return -1;
+
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id, &hwpt_id))
+		return -1;
+
+	fail_nth_enable();
+
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id2, &hwpt_id2))
+		return -1;
+
+	if (_test_ioctl_ioas_map(self->fd, ioas_id, buffer, 262144, &iova,
+				 IOMMU_IOAS_MAP_WRITEABLE |
+					 IOMMU_IOAS_MAP_READABLE))
+		return -1;
+
+	if (_test_ioctl_destroy(self->fd, device_id))
+		return -1;
+	if (_test_ioctl_destroy(self->fd, hwpt_id))
+		return -1;
+
+	if (_test_ioctl_destroy(self->fd, device_id2))
+		return -1;
+	if (_test_ioctl_destroy(self->fd, hwpt_id2))
+		return -1;
+
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id, &hwpt_id))
+		return -1;
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id2, &hwpt_id2))
+		return -1;
+	return 0;
+}
+
+TEST_FAIL_NTH(basic_fail_nth, access_rw)
+{
+	uint64_t tmp_big[4096];
+	uint32_t ioas_id;
+	uint16_t tmp[32];
+	__u64 iova;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	if (self->fd == -1)
+		return -1;
+
+	if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
+		return -1;
+
+	if (_test_ioctl_set_temp_memory_limit(self->fd, 32))
+		return -1;
+
+	if (_test_ioctl_ioas_map(self->fd, ioas_id, buffer, 262144, &iova,
+				 IOMMU_IOAS_MAP_WRITEABLE |
+					 IOMMU_IOAS_MAP_READABLE))
+		return -1;
+
+	fail_nth_enable();
+
+	if (_test_cmd_create_access(self->fd, ioas_id, &self->access_id, 0))
+		return -1;
+
+	{
+		struct iommu_test_cmd access_cmd = {
+			.size = sizeof(access_cmd),
+			.op = IOMMU_TEST_OP_ACCESS_RW,
+			.id = self->access_id,
+			.access_rw = { .iova = iova,
+				       .length = sizeof(tmp),
+				       .uptr = (uintptr_t)tmp },
+		};
+
+		// READ
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+
+		access_cmd.access_rw.flags = MOCK_ACCESS_RW_WRITE;
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+
+		access_cmd.access_rw.flags = MOCK_ACCESS_RW_SLOW_PATH;
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+		access_cmd.access_rw.flags = MOCK_ACCESS_RW_SLOW_PATH |
+					     MOCK_ACCESS_RW_WRITE;
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+	}
+
+	{
+		struct iommu_test_cmd access_cmd = {
+			.size = sizeof(access_cmd),
+			.op = IOMMU_TEST_OP_ACCESS_RW,
+			.id = self->access_id,
+			.access_rw = { .iova = iova,
+				       .flags = MOCK_ACCESS_RW_SLOW_PATH,
+				       .length = sizeof(tmp_big),
+				       .uptr = (uintptr_t)tmp_big },
+		};
+
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+	}
+	if (_test_cmd_destroy_access(self->access_id))
+		return -1;
+	self->access_id = 0;
+	return 0;
+}
+
+/* pages.c access functions */
+TEST_FAIL_NTH(basic_fail_nth, access_pin)
+{
+	uint32_t access_pages_id;
+	uint32_t ioas_id;
+	__u64 iova;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	if (self->fd == -1)
+		return -1;
+
+	if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
+		return -1;
+
+	if (_test_ioctl_set_temp_memory_limit(self->fd, 32))
+		return -1;
+
+	if (_test_ioctl_ioas_map(self->fd, ioas_id, buffer, BUFFER_SIZE, &iova,
+				 IOMMU_IOAS_MAP_WRITEABLE |
+					 IOMMU_IOAS_MAP_READABLE))
+		return -1;
+
+	if (_test_cmd_create_access(self->fd, ioas_id, &self->access_id,
+				    MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES))
+		return -1;
+
+	fail_nth_enable();
+
+	{
+		struct iommu_test_cmd access_cmd = {
+			.size = sizeof(access_cmd),
+			.op = IOMMU_TEST_OP_ACCESS_PAGES,
+			.id = self->access_id,
+			.access_pages = { .iova = iova,
+					  .length = BUFFER_SIZE,
+					  .uptr = (uintptr_t)buffer },
+		};
+
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+		access_pages_id = access_cmd.access_pages.out_access_pages_id;
+	}
+
+	if (_test_cmd_destroy_access_pages(self->fd, self->access_id,
+					   access_pages_id))
+		return -1;
+
+	if (_test_cmd_destroy_access(self->access_id))
+		return -1;
+	self->access_id = 0;
+	return 0;
+}
+
+/* iopt_pages_fill_xarray() */
+TEST_FAIL_NTH(basic_fail_nth, access_pin_domain)
+{
+	uint32_t access_pages_id;
+	uint32_t ioas_id;
+	__u32 device_id;
+	__u32 hwpt_id;
+	__u64 iova;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	if (self->fd == -1)
+		return -1;
+
+	if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
+		return -1;
+
+	if (_test_ioctl_set_temp_memory_limit(self->fd, 32))
+		return -1;
+
+	if (_test_cmd_mock_domain(self->fd, ioas_id, &device_id, &hwpt_id))
+		return -1;
+
+	if (_test_ioctl_ioas_map(self->fd, ioas_id, buffer, BUFFER_SIZE, &iova,
+				 IOMMU_IOAS_MAP_WRITEABLE |
+					 IOMMU_IOAS_MAP_READABLE))
+		return -1;
+
+	if (_test_cmd_create_access(self->fd, ioas_id, &self->access_id,
+				    MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES))
+		return -1;
+
+	fail_nth_enable();
+
+	{
+		struct iommu_test_cmd access_cmd = {
+			.size = sizeof(access_cmd),
+			.op = IOMMU_TEST_OP_ACCESS_PAGES,
+			.id = self->access_id,
+			.access_pages = { .iova = iova,
+					  .length = BUFFER_SIZE,
+					  .uptr = (uintptr_t)buffer },
+		};
+
+		if (ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_RW),
+			  &access_cmd))
+			return -1;
+		access_pages_id = access_cmd.access_pages.out_access_pages_id;
+	}
+
+	if (_test_cmd_destroy_access_pages(self->fd, self->access_id,
+					   access_pages_id))
+		return -1;
+
+	if (_test_cmd_destroy_access(self->access_id))
+		return -1;
+	self->access_id = 0;
+
+	if (_test_ioctl_destroy(self->fd, device_id))
+		return -1;
+	if (_test_ioctl_destroy(self->fd, hwpt_id))
+		return -1;
+	return 0;
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
new file mode 100644
index 00000000000000..0d1f46369c2a30
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -0,0 +1,278 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */
+#ifndef __SELFTEST_IOMMUFD_UTILS
+#define __SELFTEST_IOMMUFD_UTILS
+
+#include <unistd.h>
+#include <stddef.h>
+#include <sys/fcntl.h>
+#include <sys/ioctl.h>
+#include <stdint.h>
+#include <assert.h>
+
+#include "../kselftest_harness.h"
+#include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+
+/* Hack to make assertions more readable */
+#define _IOMMU_TEST_CMD(x) IOMMU_TEST_CMD
+
+static void *buffer;
+static unsigned long BUFFER_SIZE;
+
+/*
+ * Have the kernel check the refcount on pages. I don't know why a freshly
+ * mmap'd anon non-compound page starts out with a ref of 3
+ */
+#define check_refs(_ptr, _length, _refs)                                      \
+	({                                                                    \
+		struct iommu_test_cmd test_cmd = {                            \
+			.size = sizeof(test_cmd),                             \
+			.op = IOMMU_TEST_OP_MD_CHECK_REFS,                    \
+			.check_refs = { .length = _length,                    \
+					.uptr = (uintptr_t)(_ptr),            \
+					.refs = _refs },                      \
+		};                                                            \
+		ASSERT_EQ(0,                                                  \
+			  ioctl(self->fd,                                     \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS), \
+				&test_cmd));                                  \
+	})
+
+static int _test_cmd_mock_domain(int fd, unsigned int ioas_id, __u32 *device_id,
+				 __u32 *hwpt_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = ioas_id,
+		.mock_domain = {},
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret)
+		return ret;
+	if (device_id)
+		*device_id = cmd.mock_domain.out_device_id;
+	assert(cmd.id != 0);
+	if (hwpt_id)
+		*hwpt_id = cmd.mock_domain.out_hwpt_id;
+	return 0;
+}
+#define test_cmd_mock_domain(ioas_id, device_id, hwpt_id)                \
+	ASSERT_EQ(0, _test_cmd_mock_domain(self->fd, ioas_id, device_id, \
+					   hwpt_id))
+#define test_err_mock_domain(_errno, ioas_id, device_id, hwpt_id)     \
+	EXPECT_ERRNO(_errno, _test_cmd_mock_domain(self->fd, ioas_id, \
+						   device_id, hwpt_id))
+
+static int _test_cmd_create_access(int fd, unsigned int ioas_id,
+				   __u32 *access_id, unsigned int flags)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_CREATE_ACCESS,
+		.id = ioas_id,
+		.create_access = { .flags = flags },
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret)
+		return ret;
+	*access_id = cmd.create_access.out_access_fd;
+	return 0;
+}
+#define test_cmd_create_access(ioas_id, access_id, flags)                  \
+	ASSERT_EQ(0, _test_cmd_create_access(self->fd, ioas_id, access_id, \
+					     flags))
+
+static int _test_cmd_destroy_access(unsigned int access_id)
+{
+	return close(access_id);
+}
+#define test_cmd_destroy_access(access_id) \
+	ASSERT_EQ(0, _test_cmd_destroy_access(access_id))
+
+static int _test_cmd_destroy_access_pages(int fd, unsigned int access_id,
+					  unsigned int access_pages_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DESTROY_ACCESS_PAGES,
+		.id = access_id,
+		.destroy_access_pages = { .access_pages_id = access_pages_id },
+	};
+	return ioctl(fd, IOMMU_TEST_CMD, &cmd);
+}
+#define test_cmd_destroy_access_pages(access_id, access_pages_id)        \
+	ASSERT_EQ(0, _test_cmd_destroy_access_pages(self->fd, access_id, \
+						    access_pages_id))
+#define test_err_destroy_access_pages(_errno, access_id, access_pages_id) \
+	EXPECT_ERRNO(_errno, _test_cmd_destroy_access_pages(              \
+				     self->fd, access_id, access_pages_id))
+
+static int _test_ioctl_destroy(int fd, unsigned int id)
+{
+	struct iommu_destroy cmd = {
+		.size = sizeof(cmd),
+		.id = id,
+	};
+	return ioctl(fd, IOMMU_DESTROY, &cmd);
+}
+#define test_ioctl_destroy(id) ASSERT_EQ(0, _test_ioctl_destroy(self->fd, id))
+
+static int _test_ioctl_ioas_alloc(int fd, __u32 *id)
+{
+	struct iommu_ioas_alloc cmd = {
+		.size = sizeof(cmd),
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_IOAS_ALLOC, &cmd);
+	if (ret)
+		return ret;
+	*id = cmd.out_ioas_id;
+	return 0;
+}
+#define test_ioctl_ioas_alloc(id)                                   \
+	({                                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_alloc(self->fd, id)); \
+		ASSERT_NE(0, *(id));                                \
+	})
+
+static int _test_ioctl_ioas_map(int fd, unsigned int ioas_id, void *buffer,
+				size_t length, __u64 *iova, unsigned int flags)
+{
+	struct iommu_ioas_map cmd = {
+		.size = sizeof(cmd),
+		.flags = flags,
+		.ioas_id = ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = length,
+	};
+	int ret;
+
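+	/* With FIXED_IOVA the caller picks the IOVA, otherwise the kernel does */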
+	if (flags & IOMMU_IOAS_MAP_FIXED_IOVA)
+		cmd.iova = *iova;
+
+	ret = ioctl(fd, IOMMU_IOAS_MAP, &cmd);
+	*iova = cmd.iova;
+	return ret;
+}
+#define test_ioctl_ioas_map(buffer, length, iova_p)                        \
+	ASSERT_EQ(0, _test_ioctl_ioas_map(self->fd, self->ioas_id, buffer, \
+					  length, iova_p,                  \
+					  IOMMU_IOAS_MAP_WRITEABLE |       \
+						  IOMMU_IOAS_MAP_READABLE))
+
+#define test_err_ioctl_ioas_map(_errno, buffer, length, iova_p)            \
+	EXPECT_ERRNO(_errno,                                               \
+		     _test_ioctl_ioas_map(self->fd, self->ioas_id, buffer, \
+					  length, iova_p,                  \
+					  IOMMU_IOAS_MAP_WRITEABLE |       \
+						  IOMMU_IOAS_MAP_READABLE))
+
+#define test_ioctl_ioas_map_id(ioas_id, buffer, length, iova_p)              \
+	ASSERT_EQ(0, _test_ioctl_ioas_map(self->fd, ioas_id, buffer, length, \
+					  iova_p,                            \
+					  IOMMU_IOAS_MAP_WRITEABLE |         \
+						  IOMMU_IOAS_MAP_READABLE))
+
+#define test_ioctl_ioas_map_fixed(buffer, length, iova)                       \
+	({                                                                    \
+		__u64 __iova = iova;                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_map(                            \
+				     self->fd, self->ioas_id, buffer, length, \
+				     &__iova,                                 \
+				     IOMMU_IOAS_MAP_FIXED_IOVA |              \
+					     IOMMU_IOAS_MAP_WRITEABLE |       \
+					     IOMMU_IOAS_MAP_READABLE));       \
+	})
+
+#define test_err_ioctl_ioas_map_fixed(_errno, buffer, length, iova)           \
+	({                                                                    \
+		__u64 __iova = iova;                                          \
+		EXPECT_ERRNO(_errno,                                          \
+			     _test_ioctl_ioas_map(                            \
+				     self->fd, self->ioas_id, buffer, length, \
+				     &__iova,                                 \
+				     IOMMU_IOAS_MAP_FIXED_IOVA |              \
+					     IOMMU_IOAS_MAP_WRITEABLE |       \
+					     IOMMU_IOAS_MAP_READABLE));       \
+	})
+
+static int _test_ioctl_ioas_unmap(int fd, unsigned int ioas_id, uint64_t iova,
+				  size_t length, uint64_t *out_len)
+{
+	struct iommu_ioas_unmap cmd = {
+		.size = sizeof(cmd),
+		.ioas_id = ioas_id,
+		.iova = iova,
+		.length = length,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_IOAS_UNMAP, &cmd);
+	if (out_len)
+		*out_len = cmd.length;
+	return ret;
+}
+#define test_ioctl_ioas_unmap(iova, length)                                \
+	ASSERT_EQ(0, _test_ioctl_ioas_unmap(self->fd, self->ioas_id, iova, \
+					    length, NULL))
+
+#define test_ioctl_ioas_unmap_id(ioas_id, iova, length)                      \
+	ASSERT_EQ(0, _test_ioctl_ioas_unmap(self->fd, ioas_id, iova, length, \
+					    NULL))
+
+#define test_err_ioctl_ioas_unmap(_errno, iova, length)                      \
+	EXPECT_ERRNO(_errno, _test_ioctl_ioas_unmap(self->fd, self->ioas_id, \
+						    iova, length, NULL))
+
+static int _test_ioctl_set_temp_memory_limit(int fd, unsigned int limit)
+{
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = { .limit = limit },
+	};
+
+	return ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+		     &memlimit_cmd);
+}
+
+#define test_ioctl_set_temp_memory_limit(limit) \
+	ASSERT_EQ(0, _test_ioctl_set_temp_memory_limit(self->fd, limit))
+
+#define test_ioctl_set_default_memory_limit() \
+	test_ioctl_set_temp_memory_limit(65536)
+
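+/*
+ * Close the iommufd, then use a fresh one to verify that no pins remain on the
+ * test buffer's pages.
+ */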
+static void teardown_iommufd(int fd, struct __test_metadata *_metadata)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_REFS,
+		.check_refs = { .length = BUFFER_SIZE,
+				.uptr = (uintptr_t)buffer },
+	};
+
+	if (fd == -1)
+		return;
+
+	EXPECT_EQ(0, close(fd));
+
+	fd = open("/dev/iommu", O_RDWR);
+	EXPECT_NE(-1, fd);
+	EXPECT_EQ(0, ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),
+			   &test_cmd));
+	EXPECT_EQ(0, close(fd));
+}
+
+#define EXPECT_ERRNO(expected_errno, cmd)         \
+	({                                        \
+		ASSERT_EQ(-1, cmd);               \
+		EXPECT_EQ(expected_errno, errno); \
+	})
+
+#endif
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 16/17] iommufd: Add some fault injection points
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (14 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 15/17] iommufd: Add a selftest Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08  7:25   ` Nicolin Chen
  2022-11-08  0:49 ` [PATCH v4 17/17] iommufd: Add additional invariant assertions Jason Gunthorpe
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This increases the coverage the fail_nth test gets, as well as the
coverage obtained via syzkaller.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/main.c  |  3 +++
 drivers/iommu/iommufd/pages.c | 26 ++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 7de0f95f2ee68a..ab3fa05f38505d 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -102,6 +102,9 @@ struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
 {
 	struct iommufd_object *obj;
 
+	if (iommufd_should_fail())
+		return ERR_PTR(-ENOENT);
+
 	xa_lock(&ictx->objects);
 	obj = xa_load(&ictx->objects, id);
 	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index c3783ea01d7996..2ddcb0d4f71e04 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -80,6 +80,10 @@ static void *temp_kmalloc(size_t *size, void *backup, size_t backup_len)
 
 	if (*size < backup_len)
 		return backup;
+
+	if (!backup && iommufd_should_fail())
+		return NULL;
+
 	*size = min_t(size_t, *size, TEMP_MEMORY_LIMIT);
 	res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
 	if (res)
@@ -482,6 +486,7 @@ static int pages_to_xarray(struct xarray *xa, unsigned long start_index,
 			   unsigned long last_index, struct page **pages)
 {
 	struct page **end_pages = pages + (last_index - start_index) + 1;
+	struct page **half_pages = pages + (end_pages - pages) / 2;
 	XA_STATE(xas, xa, start_index);
 
 	do {
@@ -489,6 +494,15 @@ static int pages_to_xarray(struct xarray *xa, unsigned long start_index,
 
 		xas_lock(&xas);
 		while (pages != end_pages) {
+			/* xarray does not participate in fault injection */
+			if (pages == half_pages && iommufd_should_fail()) {
+				xas_set_err(&xas, -EINVAL);
+				xas_unlock(&xas);
+				/* aka xas_destroy() */
+				xas_nomem(&xas, GFP_KERNEL);
+				goto err_clear;
+			}
+
 			old = xas_store(&xas, xa_mk_value(page_to_pfn(*pages)));
 			if (xas_error(&xas))
 				break;
@@ -499,6 +513,7 @@ static int pages_to_xarray(struct xarray *xa, unsigned long start_index,
 		xas_unlock(&xas);
 	} while (xas_nomem(&xas, GFP_KERNEL));
 
+err_clear:
 	if (xas_error(&xas)) {
 		if (xas.xa_index != start_index)
 			clear_xarray(xa, start_index, xas.xa_index - 1);
@@ -662,6 +677,10 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
 	npages = min_t(unsigned long, last_index - start_index + 1,
 		       user->upages_len / sizeof(*user->upages));
 
+
+	if (iommufd_should_fail())
+		return -EFAULT;
+
 	uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
 	if (!remote_mm)
 		rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
@@ -806,6 +825,8 @@ static int pfn_reader_user_update_pinned(struct pfn_reader_user *user,
 		npages = pages->last_npinned - pages->npinned;
 		inc = false;
 	} else {
+		if (iommufd_should_fail())
+			return -ENOMEM;
 		npages = pages->npinned - pages->last_npinned;
 		inc = true;
 	}
@@ -1636,6 +1657,11 @@ static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
 		return iopt_pages_rw_slow(pages, index, index, offset, data,
 					  length, flags);
 
+	if (iommufd_should_fail()) {
+		rc = -EINVAL;
+		goto out_mmput;
+	}
+
 	mmap_read_lock(pages->source_mm);
 	rc = pin_user_pages_remote(
 		pages->source_mm, (uintptr_t)(pages->uptr + index * PAGE_SIZE),
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v4 17/17] iommufd: Add additional invariant assertions
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (15 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 16/17] iommufd: Add some fault injection points Jason Gunthorpe
@ 2022-11-08  0:49 ` Jason Gunthorpe
  2022-11-08  1:09 ` S390 testing for IOMMUFD Jason Gunthorpe
  2022-11-11 15:51 ` [PATCH v4 00/17] IOMMUFD Generic interface Shameerali Kolothum Thodi
  18 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  0:49 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

These are on performance paths, so they are guarded by CONFIG_IOMMUFD_TEST
and will not be hit during normal operation.

These are useful when running the test suite and syzkaller to find data
structure inconsistencies early.
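
A condensed illustration of the pattern the hunks below apply (sketch
only, not an actual hunk; bad_invariant stands in for each site's
condition):

	/* The check compiles away entirely when CONFIG_IOMMUFD_TEST=n */
	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && WARN_ON(bad_invariant))
		return -EINVAL;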

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/device.c       |  5 ++++
 drivers/iommu/iommufd/io_pagetable.c | 22 +++++++++++++++
 drivers/iommu/iommufd/io_pagetable.h |  3 +++
 drivers/iommu/iommufd/pages.c        | 40 ++++++++++++++++++++++++++--
 4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 8ca5a38b2a79b8..536a34d099968d 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -589,6 +589,11 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 	bool first = true;
 	int rc;
 
+	/* Driver didn't specify needs_pin_pages in its ops */
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+	    WARN_ON(access->iova_alignment != PAGE_SIZE))
+		return -EINVAL;
+
 	if (!length)
 		return -EINVAL;
 	if (check_add_overflow(iova, length - 1, &last_iova))
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index a7d16cd14ea200..432592fc026f4e 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -261,6 +261,11 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
 			(uintptr_t)elm->pages->uptr + elm->start_byte, length);
 		if (rc)
 			goto out_unlock;
+		if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+		    WARN_ON(iopt_check_iova(iopt, *dst_iova, length))) {
+			rc = -EINVAL;
+			goto out_unlock;
+		}
 	} else {
 		rc = iopt_check_iova(iopt, *dst_iova, length);
 		if (rc)
@@ -287,6 +292,8 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
 
 static void iopt_abort_area(struct iopt_area *area)
 {
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(area->pages);
 	if (area->iopt) {
 		down_write(&area->iopt->iova_rwsem);
 		interval_tree_remove(&area->node, &area->iopt->area_itree);
@@ -652,6 +659,9 @@ void iopt_destroy_table(struct io_pagetable *iopt)
 {
 	struct interval_tree_node *node;
 
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		iopt_remove_reserved_iova(iopt, NULL);
+
 	while ((node = interval_tree_iter_first(&iopt->allowed_itree, 0,
 						ULONG_MAX))) {
 		interval_tree_remove(node, &iopt->allowed_itree);
@@ -698,6 +708,8 @@ static void iopt_unfill_domain(struct io_pagetable *iopt,
 				continue;
 
 			mutex_lock(&pages->mutex);
+			if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+				WARN_ON(!area->storage_domain);
 			if (area->storage_domain == domain)
 				area->storage_domain = storage_domain;
 			mutex_unlock(&pages->mutex);
@@ -802,6 +814,16 @@ static int iopt_check_iova_alignment(struct io_pagetable *iopt,
 		    (iopt_area_length(area) & align_mask) ||
 		    (area->page_offset & align_mask))
 			return -EADDRINUSE;
+
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST)) {
+		struct iommufd_access *access;
+		unsigned long index;
+
+		xa_for_each(&iopt->access_list, index, access)
+			if (WARN_ON(access->iova_alignment >
+				    new_iova_alignment))
+				return -EADDRINUSE;
+	}
 	return 0;
 }
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 9a1c8a5ae3c883..3b85fa344f6be3 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -101,6 +101,9 @@ static inline size_t iopt_area_length(struct iopt_area *area)
 static inline unsigned long iopt_area_start_byte(struct iopt_area *area,
 						 unsigned long iova)
 {
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(iova < iopt_area_iova(area) ||
+			iova > iopt_area_last_iova(area));
 	return (iova - iopt_area_iova(area)) + area->page_offset +
 	       iopt_area_index(area) * PAGE_SIZE;
 }
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 2ddcb0d4f71e04..c535be1bedb9ba 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -101,12 +101,20 @@ static void *temp_kmalloc(size_t *size, void *backup, size_t backup_len)
 
 static void iopt_pages_add_npinned(struct iopt_pages *pages, size_t npages)
 {
-	pages->npinned += npages;
+	int rc;
+
+	rc = check_add_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
 }
 
 static void iopt_pages_sub_npinned(struct iopt_pages *pages, size_t npages)
 {
-	pages->npinned -= npages;
+	int rc;
+
+	rc = check_sub_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
 }
 
 static void iopt_pages_err_unpin(struct iopt_pages *pages,
@@ -128,6 +136,9 @@ static void iopt_pages_err_unpin(struct iopt_pages *pages,
 static unsigned long iopt_area_index_to_iova(struct iopt_area *area,
 					     unsigned long index)
 {
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
 	index -= iopt_area_index(area);
 	if (index == 0)
 		return iopt_area_iova(area);
@@ -137,6 +148,9 @@ static unsigned long iopt_area_index_to_iova(struct iopt_area *area,
 static unsigned long iopt_area_index_to_iova_last(struct iopt_area *area,
 						  unsigned long index)
 {
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
 	if (index == iopt_area_last_index(area))
 		return iopt_area_last_iova(area);
 	return iopt_area_iova(area) - area->page_offset +
@@ -240,6 +254,8 @@ static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
 	batch->pfns = temp_kmalloc(&size, backup, backup_len);
 	if (!batch->pfns)
 		return -ENOMEM;
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && WARN_ON(size < elmsz))
+		return -EINVAL;
 	batch->array_size = size / elmsz;
 	batch->npfns = (u32 *)(batch->pfns + batch->array_size);
 	batch_clear(batch);
@@ -367,6 +383,10 @@ static int batch_iommu_map_small(struct iommu_domain *domain,
 	unsigned long start_iova = iova;
 	int rc;
 
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(paddr % PAGE_SIZE || iova % PAGE_SIZE ||
+			size % PAGE_SIZE);
+
 	while (size) {
 		rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot);
 		if (rc)
@@ -652,6 +672,10 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
 	uintptr_t uptr;
 	long rc;
 
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+	    WARN_ON(last_index < start_index))
+		return -EINVAL;
+
 	if (!user->upages) {
 		/* All undone in pfn_reader_destroy() */
 		user->upages_len =
@@ -890,6 +914,10 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
 	struct iopt_area *area;
 	int rc;
 
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+	    WARN_ON(span->last_used < start_index))
+		return -EINVAL;
+
 	if (span->is_used == 1) {
 		batch_from_xarray(&pfns->batch, &pfns->pages->pinned_pfns,
 				  start_index, span->last_used);
@@ -942,6 +970,10 @@ static int pfn_reader_next(struct pfn_reader *pfns)
 	while (pfns->batch_end_index != pfns->last_index + 1) {
 		unsigned int npfns = pfns->batch.total_pfns;
 
+		if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+		    WARN_ON(interval_tree_double_span_iter_done(&pfns->span)))
+			return -EINVAL;
+
 		rc = pfn_reader_fill_span(pfns);
 		if (rc)
 			return rc;
@@ -1025,6 +1057,10 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
 {
 	int rc;
 
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+	    WARN_ON(last_index < start_index))
+		return -EINVAL;
+
 	rc = pfn_reader_init(pfns, pages, start_index, last_index);
 	if (rc)
 		return rc;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 15/17] iommufd: Add a selftest
  2022-11-08  0:49 ` [PATCH v4 15/17] iommufd: Add a selftest Jason Gunthorpe
@ 2022-11-08  1:01   ` Jason Gunthorpe
  2022-11-08  5:48   ` Nicolin Chen
  2022-11-09 23:51   ` Matthew Rosato
  2 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  1:01 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 08:49:08PM -0400, Jason Gunthorpe wrote:
> Cover the essential functionality of the iommufd with a directed
> test. This aims to achieve reasonable functional coverage using the
> in-kernel self test framework.
> 
> It provides a mock kernel module for the iommu_domain that allows it to
> run without any HW and the mocking provides a way to directly validate
> that the PFNs loaded into the iommu_domain are correct.
> 
> The mock also simulates the rare case of PAGE_SIZE > iommu page size as
> the mock will operate at a 2K iommu page size. This allows exercising all
> of the calculations to support this mismatch.
> 
> This allows achieving high coverage of the corner cases in the iopt_pages.
> 
> However, it is an unusually invasive config option to enable all of
> this. The config option should not be enabled in a production kernel.

This patch crossed 100k so it was bounced from the vger lists.

If anyone didn't get it and would like to see it lore has it:

https://lore.kernel.org/linux-iommu/15-v4-0de2f6c78ed0+9d1-iommufd_jgg@nvidia.com/

For the eventual v5 I will split the nth test into its own patch

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* S390 testing for IOMMUFD
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (16 preceding siblings ...)
  2022-11-08  0:49 ` [PATCH v4 17/17] iommufd: Add additional invariant assertions Jason Gunthorpe
@ 2022-11-08  1:09 ` Jason Gunthorpe
  2022-11-08 10:12   ` Christian Borntraeger
  2022-11-08 13:50   ` Matthew Rosato
  2022-11-11 15:51 ` [PATCH v4 00/17] IOMMUFD Generic interface Shameerali Kolothum Thodi
  18 siblings, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08  1:09 UTC (permalink / raw)
  To: Cornelia Huck, Eric Farman, Matthew Rosato, Niklas Schnelle,
	Tony Krowiak, Halil Pasic, Jason Herne, linux-s390
  Cc: iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Mon, Nov 07, 2022 at 08:48:53PM -0400, Jason Gunthorpe wrote:
> [
> This has been in linux-next for a little while now, and we've completed
> the syzkaller run. 1300 hours of CPU time have been invested since the
> last report with no improvement in coverage or new detections. syzkaller
> coverage reached 69%(75%), and review of the misses show substantial
> amounts are WARN_ON's and other debugging which are not expected to be
> covered.
> ]
> 
> iommufd is the user API to control the IOMMU subsystem as it relates to
> managing IO page tables that point at user space memory.

[chop cc list]

s390 mdev maintainers,

Can I ask your help to test this with the two S390 mdev drivers? Now
that gvt is passing and we've covered a lot of the QA ground, it is a
good time to run it.

Take the branch from here:

https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next

And build the kernel with 

CONFIG_VFIO_CONTAINER=n
CONFIG_IOMMUFD=y
CONFIG_IOMMUFD_VFIO_CONTAINER=y

And your existing stuff should work with iommufd providing the iommu
support to vfio. There will be a dmesg confirming this.

Let me know if there are any problems!

If I recall correctly, there was some desire from the S390 platform team
to start building on iommufd to create some vIOMMU acceleration for S390
guests; this is a necessary first step.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
@ 2022-11-08  3:45   ` Bagas Sanjaya
  2022-11-08 17:10   ` [PATCH v4 4/17] " Jason Gunthorpe
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 99+ messages in thread
From: Bagas Sanjaya @ 2022-11-08  3:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

[-- Attachment #1: Type: text/plain, Size: 399 bytes --]

On Mon, Nov 07, 2022 at 08:48:57PM -0400, Jason Gunthorpe wrote:
> From: Kevin Tian <kevin.tian@intel.com>
> 
> Add iommufd into the documentation tree, and supply initial documentation.
> Much of this is linked from code comments by kdoc.
> 

The documentation LGTM, thanks.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 15/17] iommufd: Add a selftest
  2022-11-08  0:49 ` [PATCH v4 15/17] iommufd: Add a selftest Jason Gunthorpe
  2022-11-08  1:01   ` Jason Gunthorpe
@ 2022-11-08  5:48   ` Nicolin Chen
  2022-11-08 13:27     ` Jason Gunthorpe
  2022-11-09 23:51   ` Matthew Rosato
  2 siblings, 1 reply; 99+ messages in thread
From: Nicolin Chen @ 2022-11-08  5:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Niklas Schnelle, Shameerali Kolothum Thodi,
	Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 08:49:08PM -0400, Jason Gunthorpe wrote:

> diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c

> +TEST_F(iommufd, cmd_length)
> +{
> +#define TEST_LENGTH(_struct, _ioctl)                                     \
> +	{                                                                \
> +		struct {                                                 \
> +			struct _struct cmd;                              \
> +			uint8_t extra;                                   \
> +		} cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
> +			  .extra = UINT8_MAX };                          \
> +		int old_errno;                                           \
> +		int rc;                                                  \
> +									 \
> +		EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, _ioctl, &cmd)); \

I guess it should be EINVAL corresponding to updated kernel code?

> +TEST_F(iommufd, cmd_ex_fail)
> +{
> +	struct {
> +		struct iommu_destroy cmd;
> +		__u64 future;
> +	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
> +
> +	/* object id is invalid and command is longer */
> +	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> +	/* future area is non-zero */
> +	cmd.future = 1;
> +	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> +	/* Original command "works" */
> +	cmd.cmd.size = sizeof(cmd.cmd);
> +	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> +	/* Short command fails */
> +	cmd.cmd.size = sizeof(cmd.cmd) - 1;
> +	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));

Ditto

> +TEST_HARNESS_MAIN
> diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c

> +static void fail_nth_first(struct __test_metadata *_metadata,
> +			   struct fail_nth_state *nth_state)
> +{
> +	char buf[300];
> +
> +	snprintf(buf, sizeof(buf), "/proc/self/task/%u/fail-nth", gettid());

Not sure what's missing, but I get a build error at gettid. Copying
a solution from the tools/perf/jvmti/jvmti_agent.c file, I can fix it with:
------------------------------
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 99eaa9f32e0b..7704b3a754d3 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -19,6 +19,7 @@
 
 #define __EXPORTED_HEADERS__
 #include <linux/vfio.h>
+#include <syscall.h> /* for gettid() */
 
 #include "iommufd_utils.h"
 
@@ -84,6 +85,13 @@ struct fail_nth_state {
        unsigned int iteration;
 };
 
+#ifndef HAVE_GETTID
+static inline pid_t gettid(void)
+{
+       return (pid_t)syscall(__NR_gettid);
+}
+#endif
+
 static void fail_nth_first(struct __test_metadata *_metadata,
                           struct fail_nth_state *nth_state)
 {
------------------------------

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 16/17] iommufd: Add some fault injection points
  2022-11-08  0:49 ` [PATCH v4 16/17] iommufd: Add some fault injection points Jason Gunthorpe
@ 2022-11-08  7:25   ` Nicolin Chen
  2022-11-08 12:37     ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Nicolin Chen @ 2022-11-08  7:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Niklas Schnelle, Shameerali Kolothum Thodi,
	Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 08:49:09PM -0400, Jason Gunthorpe wrote:
 
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c

> @@ -489,6 +494,15 @@ static int pages_to_xarray(struct xarray *xa, unsigned long start_index,
>  
>  		xas_lock(&xas);
>  		while (pages != end_pages) {
> +			/* xarray does not participate in fault injection */
> +			if (pages == half_pages && iommufd_should_fail()) {
> +				xas_set_err(&xas, -EINVAL);
> +				xas_unlock(&xas);
> +				/* aka xas_destroy() */
> +				xas_nomem(&xas, GFP_KERNEL);
> +				goto err_clear;

Coverity reports an "unchecked return value" at xas_nomem()...

> +err_clear:
>  	if (xas_error(&xas)) {

...yet, I think we should be fine since we do xas_error here?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08  1:09 ` S390 testing for IOMMUFD Jason Gunthorpe
@ 2022-11-08 10:12   ` Christian Borntraeger
  2022-11-08 14:04     ` Anthony Krowiak
  2022-11-08 13:50   ` Matthew Rosato
  1 sibling, 1 reply; 99+ messages in thread
From: Christian Borntraeger @ 2022-11-08 10:12 UTC (permalink / raw)
  To: Jason Gunthorpe, Cornelia Huck, Eric Farman, Matthew Rosato,
	Niklas Schnelle, Tony Krowiak, Halil Pasic, Jason Herne,
	linux-s390
  Cc: iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen



Am 08.11.22 um 02:09 schrieb Jason Gunthorpe:
> On Mon, Nov 07, 2022 at 08:48:53PM -0400, Jason Gunthorpe wrote:
>> [
>> This has been in linux-next for a little while now, and we've completed
>> the syzkaller run. 1300 hours of CPU time have been invested since the
>> last report with no improvement in coverage or new detections. syzkaller
>> coverage reached 69%(75%), and review of the misses show substantial
>> amounts are WARN_ON's and other debugging which are not expected to be
>> covered.
>> ]
>>
>> iommufd is the user API to control the IOMMU subsystem as it relates to
>> managing IO page tables that point at user space memory.
> 
> [chop cc list]
> 
> s390 mdev maintainers,
> 
> Can I ask your help to test this with the two S390 mdev drivers? Now
> that gvt is passing and we've covered alot of the QA ground it is a
> good time to run it.
> 
> Take the branch from here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next

> 
> And build the kernel with
> 
> CONFIG_VFIO_CONTAINER=n
> CONFIG_IOMMUFD=y
> CONFIG_IOMMUFD_VFIO_CONTAINER=y
> 
> And your existing stuff should work with iommufd providing the iommu
> support to vfio. There will be a dmesg confirming this.

Gave it a quick spin with vfio_ap:
[  401.679199] vfio_ap_mdev b01a7c33-9696-48b2-9a98-050e8e17c69a: Adding to iommu group 1
[  402.085386] iommufd: IOMMUFD is providing /dev/vfio/vfio, not VFIO.

Some tests seem to work, but others don't (running into timeouts). I need to look
into that (or ideally Tony will have a look); FWIW, tests.test_vfio_ap.VfioAPAssignMdevToGuestTest
fails for me.


The same kernel tree with defconfig (instead of CONFIG_IOMMUFD_VFIO_CONTAINER=y) works fine.
> 
> Let me know if there are any problems!
> 
> If I recall there was some desire from the S390 platform team to start
> building on iommufd to create some vIOMMU acceleration for S390
> guests, this is a necessary first step.
> 
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 16/17] iommufd: Add some fault injection points
  2022-11-08  7:25   ` Nicolin Chen
@ 2022-11-08 12:37     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 12:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Niklas Schnelle, Shameerali Kolothum Thodi,
	Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 11:25:26PM -0800, Nicolin Chen wrote:
> On Mon, Nov 07, 2022 at 08:49:09PM -0400, Jason Gunthorpe wrote:
>  
> > diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> 
> > @@ -489,6 +494,15 @@ static int pages_to_xarray(struct xarray *xa, unsigned long start_index,
> >  
> >  		xas_lock(&xas);
> >  		while (pages != end_pages) {
> > +			/* xarray does not participate in fault injection */
> > +			if (pages == half_pages && iommufd_should_fail()) {
> > +				xas_set_err(&xas, -EINVAL);
> > +				xas_unlock(&xas);
> > +				/* aka xas_destroy() */
> > +				xas_nomem(&xas, GFP_KERNEL);
> > +				goto err_clear;
> 
> Coverity reports an "unchecked return value" at xas_nomem()...
> 
> > +err_clear:
> >  	if (xas_error(&xas)) {
> 
> ...yet, I think we should be fine since we do xas_error here?

Yes, in this flow xas_nomem() always returns false and there is no
reason to check it. Just Coverity noise

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable
  2022-11-08  0:49 ` [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
@ 2022-11-08 13:27   ` Bagas Sanjaya
  2022-11-08 17:01     ` Jason Gunthorpe
  2022-11-14  7:46   ` Tian, Kevin
  1 sibling, 1 reply; 99+ messages in thread
From: Bagas Sanjaya @ 2022-11-08 13:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

[-- Attachment #1: Type: text/plain, Size: 13494 bytes --]

On Mon, Nov 07, 2022 at 08:49:03PM -0400, Jason Gunthorpe wrote:
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 2ad06b27a35fe5..a872a50d800a41 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -37,6 +37,13 @@
>  enum {
>  	IOMMUFD_CMD_BASE = 0x80,
>  	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
> +	IOMMUFD_CMD_IOAS_ALLOC,
> +	IOMMUFD_CMD_IOAS_ALLOW_IOVAS,
> +	IOMMUFD_CMD_IOAS_COPY,
> +	IOMMUFD_CMD_IOAS_IOVA_RANGES,
> +	IOMMUFD_CMD_IOAS_MAP,
> +	IOMMUFD_CMD_IOAS_UNMAP,
> +	IOMMUFD_CMD_OPTION,
>  };
>  
>  /**
> @@ -52,4 +59,238 @@ struct iommu_destroy {
>  };
>  #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
>  
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_iova_range
> + * @start: First IOVA
> + * @last: Inclusive last IOVA
> + *
> + * An interval in IOVA space.
> + */
> +struct iommu_iova_range {
> +	__aligned_u64 start;
> +	__aligned_u64 last;
> +};
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @num_iovas: Input/Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @allowed_iovas: Pointer to the output array of struct iommu_iova_range
> + * @out_iova_alignment: Minimum alignment required for mapping IOVA
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> + * is not allowed. num_iovas will be set to the total number of iovas and
> + * the allowed_iovas[] will be filled in as space permits.
> + *
> + * The allowed ranges are dependent on the HW path the DMA operation takes, and
> + * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> + * full range, and each attached device will narrow the ranges based on that
> + * device's HW restrictions. Detatching a device can widen the ranges. Userspace
> + * should query ranges after every attach/detatch to know what IOVAs are valid
> + * for mapping.
> + *
> + * On input num_iovas is the length of the allowed_iovas array. On output it is
> + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and set
> + * num_iovas to the required value if num_iovas is too small. In this case the
> + * caller should allocate a larger output array and re-issue the ioctl.
> + */
> +struct iommu_ioas_iova_ranges {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 num_iovas;
> +	__u32 __reserved;
> +	__aligned_u64 allowed_iovas;
> +	__aligned_u64 out_iova_alignment;
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> +
> +/**
> + * struct iommu_ioas_allow_iovas - ioctl(IOMMU_IOAS_ALLOW_IOVAS)
> + * @size: sizeof(struct iommu_ioas_allow_iovas)
> + * @ioas_id: IOAS ID to allow IOVAs from
> + * @num_iovas: Input/Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @allowed_iovas: Pointer to array of struct iommu_iova_range
> + *
> + * Ensure a range of IOVAs are always available for allocation. If this call
> + * succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges
> + * that are narrower than the ranges provided here. This call will fail if
> + * IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.
> + *
> + * When an IOAS is first created the IOVA_RANGES will be maximally sized, and as
> + * devices are attached the IOVA will narrow based on the device restrictions.
> + * When an allowed range is specified any narrowing will be refused, ie device
> + * attachment can fail if the device requires limiting within the allowed range.
> + *
> + * Automatic IOVA allocation is also impacted by this call. MAP will only
> + * allocate within the allowed IOVAs if they are present.
> + *
> + * This call replaces the entire allowed list with the given list.
> + */
> +struct iommu_ioas_allow_iovas {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 num_iovas;
> +	__u32 __reserved;
> +	__aligned_u64 allowed_iovas;
> +};
> +#define IOMMU_IOAS_ALLOW_IOVAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOW_IOVAS)
> +
> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
> + *                             IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> +	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> +	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> +	IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *        then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
> + * mapping will be established at iova, otherwise a suitable location based on
> + * the reserved and allowed lists will be automatically selected and returned in
> + * iova.
> + */
> +struct iommu_ioas_map {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 ioas_id;
> +	__u32 __reserved;
> +	__aligned_u64 user_va;
> +	__aligned_u64 length;
> +	__aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from
> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *            set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + *
> + * This may be used to efficiently clone a subset of an IOAS to another, or as a
> + * kind of 'cache' to speed up mapping. Copy has an effciency advantage over
> + * establishing equivalent new mappings, as internal resources are shared, and
> + * the kernel will pin the user memory only once.
> + */
> +struct iommu_ioas_copy {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dst_ioas_id;
> +	__u32 src_ioas_id;
> +	__aligned_u64 length;
> +	__aligned_u64 dst_iova;
> +	__aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> +
> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> + * @size: sizeof(struct iommu_ioas_unmap)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap, and return back the bytes unmapped
> + *
> + * Unmap an IOVA range. The iova/length must be a superset of a previously
> + * mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or
> + * truncating ranges is not allowed. The values 0 to U64_MAX will unmap
> + * everything.
> + */
> +struct iommu_ioas_unmap {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
> +
> +/**
> + * enum iommufd_option
> + * @IOMMU_OPTION_RLIMIT_MODE:
> + *    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
> + *    to invoke this. Value 0 (default) is user based accouting, 1 uses process
> + *    based accounting. Global option, object_id must be 0
> + * @IOMMU_OPTION_HUGE_PAGES:
> + *    Value 1 (default) allows contiguous pages to be combined when generating
> + *    iommu mappings. Value 0 disables combining, everything is mapped to
> + *    PAGE_SIZE. This can be useful for benchmarking.  This is a per-IOAS
> + *    option, the object_id must be the IOAS ID.
> + */
> +enum iommufd_option {
> +	IOMMU_OPTION_RLIMIT_MODE = 0,
> +	IOMMU_OPTION_HUGE_PAGES = 1,
> +};
> +
> +/**
> + * enum iommufd_option_ops
> + * @IOMMU_OPTION_OP_SET: Set the option's value
> + * @IOMMU_OPTION_OP_GET: Get the option's value
> + */
> +enum iommufd_option_ops {
> +	IOMMU_OPTION_OP_SET = 0,
> +	IOMMU_OPTION_OP_GET = 1,
> +};
> +
> +/**
> + * @size: sizeof(struct iommu_option)
> + * @option_id: One of enum iommufd_option
> + * @op: One of enum iommufd_option_ops
> + * @__reserved: Must be 0
> + * @object_id: ID of the object if required
> + * @val64: Option value to set or value returned on get
> + *
> + * Change a simple option value. This multiplexor allows controlling a options
> + * on objects. IOMMU_OPTION_OP_SET will load an option and IOMMU_OPTION_OP_GET
> + * will return the current value.
> + */
> +struct iommu_option {
> +	__u32 size;
> +	__u32 option_id;
> +	__u16 op;
> +	__u16 __reserved;
> +	__u32 object_id;
> +	__aligned_u64 val64;
> +};
> +#define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
>  #endif

I see kernel-doc warnings (missing short descriptions) when making
htmldocs, so I have applied the fixup:

---- >8 ----

From 5643b202ae9853c11434466c76aeaaa044e88b07 Mon Sep 17 00:00:00 2001
From: Bagas Sanjaya <bagasdotme@gmail.com>
Date: Tue, 8 Nov 2022 20:14:25 +0700
Subject: [PATCH] iommufd: Add missing ioctl short descriptions

Checking kernel-doc comments in iommufd.h header with scripts/kernel-doc
produces missing short description warnings:

include/uapi/linux/iommufd.h:80: warning: missing initial short description on line:
 * struct iommu_iova_range
include/uapi/linux/iommufd.h:250: warning: missing initial short description on line:
 * enum iommufd_option
include/uapi/linux/iommufd.h:267: warning: missing initial short description on line:
 * enum iommufd_option_ops
include/uapi/linux/iommufd.h:277: warning: Cannot understand  * @size: sizeof(struct iommu_option)
 on line 277 - I thought it was a doc line
include/uapi/linux/iommufd.h:299: warning: missing initial short description on line:
 * enum iommufd_vfio_ioas_op

One of them is reported by Stephen Rothwell when merging iommufd tree for
linux-next.

Describe these ioctl structs and enums to fix above warnings.

Link: https://lore.kernel.org/linux-next/20221108165125.3bf96028@canb.auug.org.au/
Fixes: 5a013b3b3b2f02 ("iommufd: IOCTLs for the io_pagetable")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
 include/uapi/linux/iommufd.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index c7109d7e5e09a9..66e9fb104bde5a 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -77,7 +77,7 @@ struct iommu_ioas_alloc {
 #define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
 
 /**
- * struct iommu_iova_range
+ * struct iommu_iova_range - ioctl(IOMMU_IOVA_RANGE)
  * @start: First IOVA
  * @last: Inclusive last IOVA
  *
@@ -247,7 +247,8 @@ struct iommu_ioas_unmap {
 #define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
 
 /**
- * enum iommufd_option
+ * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
+ *                       ioctl(IOMMU_OPTION_HUGE_PAGES)
  * @IOMMU_OPTION_RLIMIT_MODE:
  *    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
  *    to invoke this. Value 0 (default) is user based accouting, 1 uses process
@@ -264,7 +265,8 @@ enum iommufd_option {
 };
 
 /**
- * enum iommufd_option_ops
+ * enum iommufd_option_ops - ioctl(IOMMU_OPTION_OP_SET) and
+ *                           ioctl(IOMMU_OPTION_OP_GET)
  * @IOMMU_OPTION_OP_SET: Set the option's value
  * @IOMMU_OPTION_OP_GET: Get the option's value
  */
@@ -274,6 +276,7 @@ enum iommufd_option_ops {
 };
 
 /**
+ * struct iommu_option - iommu option multiplexer
  * @size: sizeof(struct iommu_option)
  * @option_id: One of enum iommufd_option
  * @op: One of enum iommufd_option_ops
@@ -296,7 +299,7 @@ struct iommu_option {
 #define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
 
 /**
- * enum iommufd_vfio_ioas_op
+ * enum iommufd_vfio_ioas_op - IOMMU_VFIO_IOAS_* ioctls
  * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
  * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
  * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility


Thanks.

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 15/17] iommufd: Add a selftest
  2022-11-08  5:48   ` Nicolin Chen
@ 2022-11-08 13:27     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 13:27 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Niklas Schnelle, Shameerali Kolothum Thodi,
	Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 09:48:32PM -0800, Nicolin Chen wrote:
> On Mon, Nov 07, 2022 at 08:49:08PM -0400, Jason Gunthorpe wrote:
> 
> > diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
> 
> > +TEST_F(iommufd, cmd_length)
> > +{
> > +#define TEST_LENGTH(_struct, _ioctl)                                     \
> > +	{                                                                \
> > +		struct {                                                 \
> > +			struct _struct cmd;                              \
> > +			uint8_t extra;                                   \
> > +		} cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
> > +			  .extra = UINT8_MAX };                          \
> > +		int old_errno;                                           \
> > +		int rc;                                                  \
> > +									 \
> > +		EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, _ioctl, &cmd)); \
> 
> I guess it should be EINVAL corresponding to updated kernel code?
> 
> > +TEST_F(iommufd, cmd_ex_fail)
> > +{
> > +	struct {
> > +		struct iommu_destroy cmd;
> > +		__u64 future;
> > +	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
> > +
> > +	/* object id is invalid and command is longer */
> > +	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> > +	/* future area is non-zero */
> > +	cmd.future = 1;
> > +	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> > +	/* Original command "works" */
> > +	cmd.cmd.size = sizeof(cmd.cmd);
> > +	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> > +	/* Short command fails */
> > +	cmd.cmd.size = sizeof(cmd.cmd) - 1;
> > +	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));
> 
> Ditto

Oops, yes, I fixed these
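
For illustration, the corrected expectations would be along these lines
(sketch only; the actual v5 hunks may differ):

	/* e.g. in cmd_ex_fail: a short command now fails with EINVAL */
	cmd.cmd.size = sizeof(cmd.cmd) - 1;
	EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_DESTROY, &cmd));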

> 
> > +TEST_HARNESS_MAIN
> > diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
> 
> > +static void fail_nth_first(struct __test_metadata *_metadata,
> > +			   struct fail_nth_state *nth_state)
> > +{
> > +	char buf[300];
> > +
> > +	snprintf(buf, sizeof(buf), "/proc/self/task/%u/fail-nth", gettid());
> 
> Not sure what's missing, I have a build error at gettid. Copying
> a solution from tools/perf/jvmti/jvmti_agent.c file, can fix with:

I think your glibc is probably old

> ------------------------------
> diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
> index 99eaa9f32e0b..7704b3a754d3 100644
> --- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
> +++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
> @@ -19,6 +19,7 @@
>  
>  #define __EXPORTED_HEADERS__
>  #include <linux/vfio.h>
> +#include <syscall.h> /* for gettid() */
>  
>  #include "iommufd_utils.h"
>  
> @@ -84,6 +85,13 @@ struct fail_nth_state {
>         unsigned int iteration;
>  };
>  
> +#ifndef HAVE_GETTID
> +static inline pid_t gettid(void)
> +{
> +       return (pid_t)syscall(__NR_gettid);
> +}
> +#endif

Ah, there is a lot of complicated makefile stuff to make this work,
and it only works for perf/bpf, not selftests.

It looks like there is no reason for this to need gettid; we don't use
threads. So this can just be getpid(), which is portable.
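
i.e. something along these lines in fail_nth_first() (sketch of the
intended change, not the actual v5 hunk):

	snprintf(buf, sizeof(buf), "/proc/self/task/%u/fail-nth",
		 (unsigned int)getpid());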

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08  1:09 ` S390 testing for IOMMUFD Jason Gunthorpe
  2022-11-08 10:12   ` Christian Borntraeger
@ 2022-11-08 13:50   ` Matthew Rosato
  2022-11-08 13:54     ` Jason Gunthorpe
  1 sibling, 1 reply; 99+ messages in thread
From: Matthew Rosato @ 2022-11-08 13:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Cornelia Huck, Eric Farman, Niklas Schnelle,
	Tony Krowiak, Halil Pasic, Jason Herne, linux-s390
  Cc: iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On 11/7/22 8:09 PM, Jason Gunthorpe wrote:
> On Mon, Nov 07, 2022 at 08:48:53PM -0400, Jason Gunthorpe wrote:
>> [
>> This has been in linux-next for a little while now, and we've completed
>> the syzkaller run. 1300 hours of CPU time have been invested since the
>> last report with no improvement in coverage or new detections. syzkaller
>> coverage reached 69%(75%), and review of the misses show substantial
>> amounts are WARN_ON's and other debugging which are not expected to be
>> covered.
>> ]
>>
>> iommufd is the user API to control the IOMMU subsystem as it relates to
>> managing IO page tables that point at user space memory.
> 
> [chop cc list]
> 
> s390 mdev maintainers,
> 
> Can I ask your help to test this with the two S390 mdev drivers? Now
> that gvt is passing and we've covered alot of the QA ground it is a
> good time to run it.
> 
> Take the branch from here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
> 
> And build the kernel with 
> 
> CONFIG_VFIO_CONTAINER=n
> CONFIG_IOMMUFD=y
> CONFIG_IOMMUFD_VFIO_CONTAINER=y
> 
> And your existing stuff should work with iommufd providing the iommu
> support to vfio. There will be a dmesg confirming this.
> 
> Let me know if there are any problems!

FWIW, vfio-pci via s390 is working fine so far, though I'll put it through more paces over the next few weeks and report if I find anything.

As far as mdev drivers...  

-ccw: Sounds like Eric is already aware there is an issue and is investigating (I see errors as well).

-ap: I see the exact same issue that Christian mentioned...  I'll talk to Tony & Jason about it.

> 
> If I recall there was some desire from the S390 platform team to start
> building on iommufd to create some vIOMMU acceleration for S390
> guests, this is a necessary first step.

There's probably something here for -ccw in the future, but you might be thinking of s390 vfio-pci, e.g. to implement the in-kernel handling of nested mappings on s390 -- yep, work is in progress here, not ready for sharing yet, but I have most recently been basing my work on top of the nesting series https://github.com/yiliu1765/iommufd/tree/iommufd-v6.0-rc3-nesting

Matt



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 13:50   ` Matthew Rosato
@ 2022-11-08 13:54     ` Jason Gunthorpe
  2022-11-08 14:19       ` Eric Farman
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 13:54 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Cornelia Huck, Eric Farman, Niklas Schnelle, Tony Krowiak,
	Halil Pasic, Jason Herne, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato wrote:

> FWIW, vfio-pci via s390 is working fine so far, though I'll put it
> through more paces over the next few weeks and report if I find
> anything.

OK great

> As far as mdev drivers...  
> 
> -ccw: Sounds like Eric is already aware there is an issue and is investigating (I see errors as well).
> 
> -ap: I see the exact same issue that Christian mentioned...  I'll talk to Tony & Jason about it.

A clue about what is going wrong might lead to a quick realization of
the problem?

I'm guessing it is something on the vfio side rather than the access
'pinning' part?

> > If I recall there was some desire from the S390 platform team to start
> > building on iommufd to create some vIOMMU acceleration for S390
> > guests, this is a necessary first step.
> 
> There's probably something here for -ccw in the future, but you
> might be thinking of s390 vfio-pci e.g. to implement the in-kernel
> handling of nested mappings on s390 -- yep, work in in progress
> here, not ready for sharing yet but I have been most recently basing
> my work on top of the nesting series
> https://github.com/yiliu1765/iommufd/tree/iommufd-v6.0-rc3-nesting

The relation is that if vfio-pci wants to do the above then
vfio-ccw/ap will have to run on iommufd when used in the same VM as
vfio-pci.

So we need it to work :)

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 10:12   ` Christian Borntraeger
@ 2022-11-08 14:04     ` Anthony Krowiak
  2022-11-09 14:49       ` Anthony Krowiak
  0 siblings, 1 reply; 99+ messages in thread
From: Anthony Krowiak @ 2022-11-08 14:04 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe, Cornelia Huck,
	Eric Farman, Matthew Rosato, Niklas Schnelle, Halil Pasic,
	Jason Herne, linux-s390
  Cc: iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen


On 11/8/22 5:12 AM, Christian Borntraeger wrote:
>
>
> Am 08.11.22 um 02:09 schrieb Jason Gunthorpe:
>> On Mon, Nov 07, 2022 at 08:48:53PM -0400, Jason Gunthorpe wrote:
>>> [
>>> This has been in linux-next for a little while now, and we've completed
>>> the syzkaller run. 1300 hours of CPU time have been invested since the
>>> last report with no improvement in coverage or new detections. 
>>> syzkaller
>>> coverage reached 69%(75%), and review of the misses show substantial
>>> amounts are WARN_ON's and other debugging which are not expected to be
>>> covered.
>>> ]
>>>
>>> iommufd is the user API to control the IOMMU subsystem as it relates to
>>> managing IO page tables that point at user space memory.
>>
>> [chop cc list]
>>
>> s390 mdev maintainers,
>>
>> Can I ask your help to test this with the two S390 mdev drivers? Now
>> that gvt is passing and we've covered alot of the QA ground it is a
>> good time to run it.
>>
>> Take the branch from here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next 
>>
>
>>
>> And build the kernel with
>>
>> CONFIG_VFIO_CONTAINER=n
>> CONFIG_IOMMUFD=y
>> CONFIG_IOMMUFD_VFIO_CONTAINER=y
>>
>> And your existing stuff should work with iommufd providing the iommu
>> support to vfio. There will be a dmesg confirming this.
>
> Gave it a quick spin with vfio_ap:
> [  401.679199] vfio_ap_mdev b01a7c33-9696-48b2-9a98-050e8e17c69a: 
> Adding to iommu group 1
> [  402.085386] iommufd: IOMMUFD is providing /dev/vfio/vfio, not VFIO.
>
> Some tests seem to work, but others dont (running into timeouts). I 
> need to look
> into that (or ideally Tony will have a look, FWIW 
> tests.test_vfio_ap.VfioAPAssignMdevToGuestTest
> fails for me.


I'm looking into it.


>
>
> The same kernel tree with defconfig (instead of 
> CONFIG_IOMMUFD_VFIO_CONTAINER=y) works fine.
>>
>> Let me know if there are any problems!
>>
>> If I recall there was some desire from the S390 platform team to start
>> building on iommufd to create some vIOMMU acceleration for S390
>> guests, this is a necessary first step.
>>
>> Thanks,
>> Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 13:54     ` Jason Gunthorpe
@ 2022-11-08 14:19       ` Eric Farman
  2022-11-08 14:37         ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Farman @ 2022-11-08 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: Cornelia Huck, Niklas Schnelle, Tony Krowiak, Halil Pasic,
	Jason Herne, linux-s390, iommu, Kevin Tian, Alex Williamson, kvm,
	Lu Baolu, Nicolin Chen

On Tue, 2022-11-08 at 09:54 -0400, Jason Gunthorpe wrote:
> On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato wrote:
> 
> > FWIW, vfio-pci via s390 is working fine so far, though I'll put it
> > through more paces over the next few weeks and report if I find
> > anything.
> 
> OK great
> 
> > As far as mdev drivers...  
> > 
> > -ccw: Sounds like Eric is already aware there is an issue and is
> > investigating (I see errors as well).

I -think- the problem for -ccw is that the new vfio_pin_pages requires
the input addresses to be page-aligned, and while most of ours are, the
first one in any given transaction may not be. We never bothered to
mask off the addresses since it was handled for us, and we needed to
keep the offsets anyway.

By happenstance, I had some code that does the masking on our side (for
an unrelated reason); I'll see if I can fit that on top and whether it
helps matters. After coffee.
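
A minimal sketch of the kind of masking meant here (illustrative only --
the names are hypothetical, not from vfio-ccw):

	/*
	 * Split an unaligned guest address into the page-aligned address
	 * the new vfio_pin_pages() expects plus the in-page offset the
	 * caller still needs to keep.
	 */
	static void ccw_split_gaddr(u64 gaddr, u64 *aligned_addr,
				    unsigned int *offset)
	{
		*aligned_addr = gaddr & PAGE_MASK;
		*offset = gaddr & ~PAGE_MASK;
	}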

Eric

> > 
> > -ap: I see the exact same issue that Christian mentioned...  I'll
> > talk to Tony & Jason about it.
> 
> A clue what is going wrong might get a quick realization on the
> problem?
> 
> I'm guessing something in the vfio side more than the access
> 'pinning'
> part?
> 
> > > If I recall there was some desire from the S390 platform team to
> > > start
> > > building on iommufd to create some vIOMMU acceleration for S390
> > > guests, this is a necessary first step.
> > 
> > There's probably something here for -ccw in the future, but you
> > might be thinking of s390 vfio-pci e.g. to implement the in-kernel
> > handling of nested mappings on s390 -- yep, work in in progress
> > here, not ready for sharing yet but I have been most recently
> > basing
> > my work on top of the nesting series
> > https://github.com/yiliu1765/iommufd/tree/iommufd-v6.0-rc3-nesting
> 
> The relation is that if vfio-pci wants to do the above then
> vfio-ccw/ap will have to run on iommufd when used in the same VM as
> vfio-pci.
> 
> So we need it to work :)
> 
> Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices
  2022-11-08  0:49 ` [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices Jason Gunthorpe
@ 2022-11-08 14:34   ` Yi Liu
  2022-11-08 17:57     ` Jason Gunthorpe
  2022-11-14  7:59   ` Tian, Kevin
  1 sibling, 1 reply; 99+ messages in thread
From: Yi Liu @ 2022-11-08 14:34 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Keqian Zhu

On 2022/11/8 08:49, Jason Gunthorpe wrote:
> Add the four functions external drivers need to connect physical DMA to
> the IOMMUFD:
> 
> iommufd_device_bind() / iommufd_device_unbind()
>    Register the device with iommufd and establish security isolation.
> 
> iommufd_device_attach() / iommufd_device_detach()
>    Connect a bound device to a page table
> 
> Binding a device creates a device object ID in the uAPI, however the
> generic API provides no IOCTLs to manipulate them.
> 
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommufd/Makefile          |   1 +
>   drivers/iommu/iommufd/device.c          | 402 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |   5 +
>   drivers/iommu/iommufd/main.c            |   3 +
>   include/linux/iommufd.h                 |  13 +
>   5 files changed, 424 insertions(+)
>   create mode 100644 drivers/iommu/iommufd/device.c
> 
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index e13e971aa28c60..ca28a135b9675f 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -1,5 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   iommufd-y := \
> +	device.o \
>   	hw_pagetable.o \
>   	io_pagetable.o \
>   	ioas.o \
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> new file mode 100644
> index 00000000000000..a3bf3c07d3f800
> --- /dev/null
> +++ b/drivers/iommu/iommufd/device.c
> @@ -0,0 +1,402 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#include <linux/iommufd.h>
> +#include <linux/slab.h>
> +#include <linux/iommu.h>
> +#include <linux/irqdomain.h>
> +
> +#include "iommufd_private.h"
> +
> +/*
> + * An iommufd_device object represents the binding relationship between a
> + * consuming driver and the iommufd. These objects are created/destroyed by
> + * external drivers, not by userspace.
> + */
> +struct iommufd_device {
> +	struct iommufd_object obj;
> +	struct iommufd_ctx *ictx;
> +	struct iommufd_hw_pagetable *hwpt;
> +	/* Head at iommufd_hw_pagetable::devices */
> +	struct list_head devices_item;
> +	/* always the physical device */
> +	struct device *dev;
> +	struct iommu_group *group;
> +	bool enforce_cache_coherency;
> +};
> +
> +void iommufd_device_destroy(struct iommufd_object *obj)
> +{
> +	struct iommufd_device *idev =
> +		container_of(obj, struct iommufd_device, obj);
> +
> +	iommu_device_release_dma_owner(idev->dev);
> +	iommu_group_put(idev->group);
> +	iommufd_ctx_put(idev->ictx);
> +}
> +
> +/**
> + * iommufd_device_bind - Bind a physical device to an iommu fd
> + * @ictx: iommufd file descriptor
> + * @dev: Pointer to a physical PCI device struct
> + * @id: Output ID number to return to userspace for this device
> + *
> + * A successful bind establishes an ownership over the device and returns
> + * struct iommufd_device pointer, otherwise returns error pointer.
> + *
> + * A driver using this API must set driver_managed_dma and must not touch
> + * the device until this routine succeeds and establishes ownership.
> + *
> + * Binding a PCI device places the entire RID under iommufd control.
> + *
> + * The caller must undo this with iommufd_unbind_device()

it should be iommufd_device_unbind() now.

> + */
> +struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
> +					   struct device *dev, u32 *id)
> +{
> +	struct iommufd_device *idev;
> +	struct iommu_group *group;
> +	int rc;
> +
> +	/*
> +	 * iommufd always sets IOMMU_CACHE because we offer no way for userspace
> +	 * to restore cache coherency.
> +	 */
> +	if (!device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY))
> +		return ERR_PTR(-EINVAL);
> +
> +	group = iommu_group_get(dev);
> +	if (!group)
> +		return ERR_PTR(-ENODEV);
> +
> +	rc = iommu_device_claim_dma_owner(dev, ictx);
> +	if (rc)
> +		goto out_group_put;
> +
> +	idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
> +	if (IS_ERR(idev)) {
> +		rc = PTR_ERR(idev);
> +		goto out_release_owner;
> +	}
> +	idev->ictx = ictx;
> +	iommufd_ctx_get(ictx);
> +	idev->dev = dev;
> +	idev->enforce_cache_coherency =
> +		device_iommu_capable(dev, IOMMU_CAP_ENFORCE_CACHE_COHERENCY);
> +	/* The calling driver is a user until iommufd_device_unbind() */
> +	refcount_inc(&idev->obj.users);
> +	/* group refcount moves into iommufd_device */
> +	idev->group = group;
> +
> +	/*
> +	 * If the caller fails after this success it must call
> +	 * iommufd_unbind_device() which is safe since we hold this refcount.
> +	 * This also means the device is a leaf in the graph and no other object
> +	 * can take a reference on it.
> +	 */
> +	iommufd_object_finalize(ictx, &idev->obj);
> +	*id = idev->obj.id;
> +	return idev;
> +
> +out_release_owner:
> +	iommu_device_release_dma_owner(dev);
> +out_group_put:
> +	iommu_group_put(group);
> +	return ERR_PTR(rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_device_bind, IOMMUFD);
> +
> +void iommufd_device_unbind(struct iommufd_device *idev)
> +{
> +	bool was_destroyed;
> +
> +	was_destroyed = iommufd_object_destroy_user(idev->ictx, &idev->obj);
> +	WARN_ON(!was_destroyed);
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
> +
> +static int iommufd_device_setup_msi(struct iommufd_device *idev,
> +				    struct iommufd_hw_pagetable *hwpt,
> +				    phys_addr_t sw_msi_start,
> +				    unsigned int flags)
> +{
> +	int rc;
> +
> +	/*
> +	 * IOMMU_CAP_INTR_REMAP means that the platform is isolating MSI, and it
> +	 * creates the MSI window by default in the iommu domain. Nothing
> +	 * further to do.
> +	 */
> +	if (device_iommu_capable(idev->dev, IOMMU_CAP_INTR_REMAP))
> +		return 0;
> +
> +	/*
> +	 * On ARM systems that set the global IRQ_DOMAIN_FLAG_MSI_REMAP every
> +	 * allocated iommu_domain will block interrupts by default and this
> +	 * special flow is needed to turn them back on. iommu_dma_prepare_msi()
> +	 * will install pages into our domain after request_irq() to make this
> +	 * work.
> +	 *
> +	 * FIXME: This is conceptually broken for iommufd since we want to allow
> +	 * userspace to change the domains, eg switch from an identity IOAS to a
> +	 * DMA IOAS. There is currently no way to create a MSI window that
> +	 * matches what the IRQ layer actually expects in a newly created
> +	 * domain.
> +	 */
> +	if (irq_domain_check_msi_remap()) {
> +		if (WARN_ON(!sw_msi_start))
> +			return -EPERM;
> +		/*
> +		 * iommu_get_msi_cookie() can only be called once per domain,
> +		 * it returns -EBUSY on later calls.
> +		 */
> +		if (hwpt->msi_cookie)
> +			return 0;
> +		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
> +		if (rc)
> +			return rc;
> +		hwpt->msi_cookie = true;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Otherwise the platform has a MSI window that is not isolated. For
> +	 * historical compat with VFIO allow a module parameter to ignore the
> +	 * insecurity.
> +	 */
> +	if (!(flags & IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT))
> +		return -EPERM;
> +	else
> +		dev_warn(
> +			idev->dev,
> +			"Device interrupts cannot be isolated by the IOMMU, this platform is insecure. Use an \"allow_unsafe_interrupts\" module parameter to override\n");
> +
> +	return 0;
> +}
> +
> +static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
> +					   struct iommu_group *group)
> +{
> +	struct iommufd_device *cur_dev;
> +
> +	list_for_each_entry(cur_dev, &hwpt->devices, devices_item)
> +		if (cur_dev->group == group)
> +			return true;
> +	return false;
> +}
> +
> +static int iommufd_device_do_attach(struct iommufd_device *idev,
> +				    struct iommufd_hw_pagetable *hwpt,
> +				    unsigned int flags)
> +{
> +	phys_addr_t sw_msi_start = 0;
> +	int rc;
> +
> +	mutex_lock(&hwpt->devices_lock);
> +
> +	/*
> +	 * Try to upgrade the domain we have, it is an iommu driver bug to
> +	 * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail
> +	 * enforce_cache_coherency when there are no devices attached to the
> +	 * domain.
> +	 */
> +	if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) {
> +		if (hwpt->domain->ops->enforce_cache_coherency)
> +			hwpt->enforce_cache_coherency =
> +				hwpt->domain->ops->enforce_cache_coherency(
> +					hwpt->domain);
> +		if (!hwpt->enforce_cache_coherency) {
> +			WARN_ON(list_empty(&hwpt->devices));
> +			rc = -EINVAL;
> +			goto out_unlock;
> +		}
> +	}
> +
> +	rc = iopt_table_enforce_group_resv_regions(&hwpt->ioas->iopt, idev->dev,
> +						   idev->group, &sw_msi_start);
> +	if (rc)
> +		goto out_unlock;
> +
> +	rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
> +	if (rc)
> +		goto out_iova;

aren't the above two operations needed only once per group? I remember you
did the two after iommu_attach_group().

> +	/*
> +	 * FIXME: Hack around missing a device-centric iommu api, only attach to
> +	 * the group once for the first device that is in the group.
> +	 */
> +	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> +		rc = iommu_attach_group(hwpt->domain, idev->group);
> +		if (rc)
> +			goto out_iova;
> +
> +		if (list_empty(&hwpt->devices)) {
> +			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
> +						   hwpt->domain);
> +			if (rc)
> +				goto out_detach;
> +		}
> +	}
> +
> +	idev->hwpt = hwpt;
> +	refcount_inc(&hwpt->obj.users);
> +	list_add(&idev->devices_item, &hwpt->devices);
> +	mutex_unlock(&hwpt->devices_lock);
> +	return 0;
> +
> +out_detach:
> +	iommu_detach_group(hwpt->domain, idev->group);
> +out_iova:
> +	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
> +out_unlock:
> +	mutex_unlock(&hwpt->devices_lock);
> +	return rc;
> +}
> +

-- 
Regards,
Yi Liu
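
For orientation, a minimal sketch of how an external driver would string
the four calls from the quoted commit message together (signatures follow
this version of the series, with the trailing attach flags passed as 0;
pt_id comes from userspace and error handling is trimmed):

	struct iommufd_device *idev;
	u32 dev_id, pt_id = ioas_id;	/* ioas_id chosen by userspace */
	int rc;

	idev = iommufd_device_bind(ictx, dev, &dev_id);
	if (IS_ERR(idev))
		return PTR_ERR(idev);

	rc = iommufd_device_attach(idev, &pt_id, 0);
	if (rc) {
		iommufd_device_unbind(idev);
		return rc;
	}

	/* ... device DMA now goes through the attached IOAS ... */

	iommufd_device_detach(idev);
	iommufd_device_unbind(idev);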

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 14:19       ` Eric Farman
@ 2022-11-08 14:37         ` Jason Gunthorpe
  2022-11-08 15:29           ` Eric Farman
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 14:37 UTC (permalink / raw)
  To: Eric Farman
  Cc: Matthew Rosato, Cornelia Huck, Niklas Schnelle, Tony Krowiak,
	Halil Pasic, Jason Herne, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, Nov 08, 2022 at 09:19:17AM -0500, Eric Farman wrote:
> On Tue, 2022-11-08 at 09:54 -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato wrote:
> > 
> > > FWIW, vfio-pci via s390 is working fine so far, though I'll put it
> > > through more paces over the next few weeks and report if I find
> > > anything.
> > 
> > OK great
> > 
> > > As far as mdev drivers...  
> > > 
> > > -ccw: Sounds like Eric is already aware there is an issue and is
> > > investigating (I see errors as well).
> 
> I -think- the problem for -ccw is that the new vfio_pin_pages requires
> the input addresses to be page-aligned, and while most of ours are, the
> first one in any given transaction may not be. We never bothered to
> mask off the addresses since it was handled for us, and we needed to
> keep the offsets anyway.
> 
> By happenstance, I had some code that would do the masking ourselves
> (for an unrelated reason); I'll see if I can get that fit on top and if
> it helps matters. After coffee.

Oh, yes, that makes a lot of sense.

Ah, if that is how VFIO worked we could match it like below:

 EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, IOMMUFD);
 
 static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter,
-                                       bool first)
+                                       bool first, unsigned long first_iova)
 {
-       if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE)
+       unsigned long start_offset = first ? (first_iova % PAGE_SIZE) : 0;
+
+       if ((iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) !=
+           start_offset)
                return false;
 
        if (!iopt_area_contig_done(iter) &&
@@ -607,7 +610,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
                        iopt_area_iova_to_index(area, iter.cur_iova);
 
                if (area->prevent_access ||
-                   !iopt_area_contig_is_aligned(&iter, first)) {
+                   !iopt_area_contig_is_aligned(&iter, first, iova)) {
                        rc = -EINVAL;
                        goto err_remove;
                }

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 14:37         ` Jason Gunthorpe
@ 2022-11-08 15:29           ` Eric Farman
  2022-11-08 19:18             ` Matthew Rosato
  2022-11-08 19:34             ` Jason Gunthorpe
  0 siblings, 2 replies; 99+ messages in thread
From: Eric Farman @ 2022-11-08 15:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, Cornelia Huck, Niklas Schnelle, Tony Krowiak,
	Halil Pasic, Jason Herne, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, 2022-11-08 at 10:37 -0400, Jason Gunthorpe wrote:
> On Tue, Nov 08, 2022 at 09:19:17AM -0500, Eric Farman wrote:
> > On Tue, 2022-11-08 at 09:54 -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato wrote:
> > > 
> > > > FWIW, vfio-pci via s390 is working fine so far, though I'll put
> > > > it
> > > > through more paces over the next few weeks and report if I find
> > > > anything.
> > > 
> > > OK great
> > > 
> > > > As far as mdev drivers...  
> > > > 
> > > > -ccw: Sounds like Eric is already aware there is an issue and
> > > > is
> > > > investigating (I see errors as well).
> > 
> > I -think- the problem for -ccw is that the new vfio_pin_pages
> > requires
> > the input addresses to be page-aligned, and while most of ours are,
> > the
> > first one in any given transaction may not be. We never bothered to
> > mask off the addresses since it was handled for us, and we needed
> > to
> > keep the offsets anyway.
> > 
> > By happenstance, I had some code that would do the masking
> > ourselves
> > (for an unrelated reason); I'll see if I can get that fit on top
> > and if
> > it helps matters. After coffee.
> 
> Oh, yes, that makes a lot of sense.
> 
> Ah, if that is how VFIO worked we could match it like below:

That's a start. The pin appears to have worked, but the unpin fails at
the bottom of iommufd_access_unpin_pages:

WARN_ON(!iopt_area_contig_done(&iter));

> 
>  EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, IOMMUFD);
>  
>  static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter,
> -                                       bool first)
> +                                       bool first, unsigned long first_iova)
>  {
> -       if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE)
> +       unsigned long start_offset = first ? (first_iova % PAGE_SIZE) : 0;
> +
> +       if ((iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) !=
> +           start_offset)
>                 return false;
>  
>         if (!iopt_area_contig_done(iter) &&
> @@ -607,7 +610,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
>                         iopt_area_iova_to_index(area, iter.cur_iova);
>  
>                 if (area->prevent_access ||
> -                   !iopt_area_contig_is_aligned(&iter, first)) {
> +                   !iopt_area_contig_is_aligned(&iter, first, iova)) {
>                         rc = -EINVAL;
>                         goto err_remove;
>                 }
> 
> Jason


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable
  2022-11-08 13:27   ` Bagas Sanjaya
@ 2022-11-08 17:01     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 17:01 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Tue, Nov 08, 2022 at 08:27:13PM +0700, Bagas Sanjaya wrote:
> I see kernel-doc warnings (missing short descriptions) when making
> htmldocs, so I have applied the fixup:
> 
> ---- >8 ----
> 
> From 5643b202ae9853c11434466c76aeaaa044e88b07 Mon Sep 17 00:00:00 2001
> From: Bagas Sanjaya <bagasdotme@gmail.com>
> Date: Tue, 8 Nov 2022 20:14:25 +0700
> Subject: [PATCH] iommufd: Add missing ioctl short descriptions
> 
> Checking kernel-doc comments in iommufd.h header with scripts/kernel-doc
> produces missing short description warnings:
> 
> include/uapi/linux/iommufd.h:80: warning: missing initial short description on line:
>  * struct iommu_iova_range
> include/uapi/linux/iommufd.h:250: warning: missing initial short description on line:
>  * enum iommufd_option
> include/uapi/linux/iommufd.h:267: warning: missing initial short description on line:
>  * enum iommufd_option_ops
> include/uapi/linux/iommufd.h:277: warning: Cannot understand  * @size: sizeof(struct iommu_option)
>  on line 277 - I thought it was a doc line
> include/uapi/linux/iommufd.h:299: warning: missing initial short description on line:
>  * enum iommufd_vfio_ioas_op
>
> One of them is reported by Stephen Rothwell when merging iommufd tree for
> linux-next.

Thanks, I was about to try to figure out what that was. I squashed it in.

Jason
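
For reference, the shape of such a fixup is simply adding the initial
short description kernel-doc expects, e.g. (the wording and field
descriptions here are illustrative, not necessarily the exact text of
the squashed patch):

	/**
	 * struct iommu_iova_range - An interval in IOVA space
	 * @start: First IOVA
	 * @last: Inclusive last IOVA
	 */
	struct iommu_iova_range {
		__aligned_u64 start;
		__aligned_u64 last;
	};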

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 4/17] iommufd: Document overview of iommufd
  2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
  2022-11-08  3:45   ` Bagas Sanjaya
@ 2022-11-08 17:10   ` Jason Gunthorpe
  2022-11-11  5:59     ` Tian, Kevin
  2022-11-10  9:30   ` [PATCH v4 04/17] " Bagas Sanjaya
  2022-11-14 20:50   ` Eric Auger
  3 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 17:10 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

> +IOMMUFD User API
> +================
> +
> +.. kernel-doc:: include/uapi/linux/iommufd.h

I noticed this isn't working

It needs this patch:
  https://lore.kernel.org/r/0-v1-c80e152ce63b+12-kdoc_export_ns_jgg@nvidia.com

And also some updating to capture kdocs for all the exported symbols:

diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index 64a135f3055adc..ffc5f4bc65492e 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -186,6 +186,9 @@ explicitly imposing the group semantics in its uAPI as VFIO does.
 .. kernel-doc:: drivers/iommu/iommufd/device.c
    :export:
 
+.. kernel-doc:: drivers/iommu/iommufd/main.c
+   :export:
+
 VFIO and IOMMUFD
 ----------------
 
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index dc3058e063d8de..8f4a0e11c51bae 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -107,6 +107,14 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_bind, IOMMUFD);
 
+/**
+ * iommufd_device_unbind - Undo iommufd_device_bind()
+ * @idev: Device returned by iommufd_device_bind()
+ *
+ * Release the device from iommufd control. The DMA ownership will return back
+ * to unowned with blocked DMA. This invalidates the iommufd_device pointer,
+ * other APIs that consume it must not be called concurrently.
+ */
 void iommufd_device_unbind(struct iommufd_device *idev)
 {
 	bool was_destroyed;
@@ -372,6 +380,13 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
 
+/**
+ * iommufd_device_detach - Disconnect a device from an iommu_domain
+ * @idev: device to detach
+ *
+ * Undoes iommufd_device_attach(). This disconnects the idev from the previously
+ * attached pt_id. The device returns back to a blocked DMA translation.
+ */
 void iommufd_device_detach(struct iommufd_device *idev)
 {
 	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
@@ -412,6 +427,19 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
 	refcount_dec(&access->ioas->obj.users);
 }
 
+/**
+ * iommufd_access_create - Create an iommufd_access
+ * @ictx: iommufd file descriptor
+ * @ioas_id: ID for a IOMMUFD_OBJ_IOAS
+ * @ops: Driver's ops to associate with the access
+ * @data: Opaque data to pass into ops functions
+ *
+ * An iommufd_access allows a driver to read/write to the IOAS without using
+ * DMA. The underlying CPU memory can be accessed using the
+ * iommufd_access_pin_pages() or iommufd_access_rw() functions.
+ *
+ * The provided ops are required to use iommufd_access_pin_pages().
+ */
 struct iommufd_access *
 iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
 		      const struct iommufd_access_ops *ops, void *data)
@@ -461,6 +489,12 @@ iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_access_create, IOMMUFD);
 
+/**
+ * iommufd_access_destroy - Destroy an iommufd_access
+ * @access: The access to destroy
+ *
+ * The caller must stop using the access before destroying it.
+ */
 void iommufd_access_destroy(struct iommufd_access *access)
 {
 	bool was_destroyed;

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices
  2022-11-08 14:34   ` Yi Liu
@ 2022-11-08 17:57     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 17:57 UTC (permalink / raw)
  To: Yi Liu
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Keqian Zhu

On Tue, Nov 08, 2022 at 10:34:05PM +0800, Yi Liu wrote:
> > +/**
> > + * iommufd_device_bind - Bind a physical device to an iommu fd
> > + * @ictx: iommufd file descriptor
> > + * @dev: Pointer to a physical PCI device struct
> > + * @id: Output ID number to return to userspace for this device
> > + *
> > + * A successful bind establishes an ownership over the device and returns
> > + * struct iommufd_device pointer, otherwise returns error pointer.
> > + *
> > + * A driver using this API must set driver_managed_dma and must not touch
> > + * the device until this routine succeeds and establishes ownership.
> > + *
> > + * Binding a PCI device places the entire RID under iommufd control.
> > + *
> > + * The caller must undo this with iommufd_unbind_device()
> 
> it should be iommufd_device_unbind() now.

Done

> > +static int iommufd_device_do_attach(struct iommufd_device *idev,
> > +				    struct iommufd_hw_pagetable *hwpt,
> > +				    unsigned int flags)
> > +{
> > +	phys_addr_t sw_msi_start = 0;
> > +	int rc;
> > +
> > +	mutex_lock(&hwpt->devices_lock);
> > +
> > +	/*
> > +	 * Try to upgrade the domain we have, it is an iommu driver bug to
> > +	 * report IOMMU_CAP_ENFORCE_CACHE_COHERENCY but fail
> > +	 * enforce_cache_coherency when there are no devices attached to the
> > +	 * domain.
> > +	 */
> > +	if (idev->enforce_cache_coherency && !hwpt->enforce_cache_coherency) {
> > +		if (hwpt->domain->ops->enforce_cache_coherency)
> > +			hwpt->enforce_cache_coherency =
> > +				hwpt->domain->ops->enforce_cache_coherency(
> > +					hwpt->domain);
> > +		if (!hwpt->enforce_cache_coherency) {
> > +			WARN_ON(list_empty(&hwpt->devices));
> > +			rc = -EINVAL;
> > +			goto out_unlock;
> > +		}
> > +	}
> > +
> > +	rc = iopt_table_enforce_group_resv_regions(&hwpt->ioas->iopt, idev->dev,
> > +						   idev->group, &sw_msi_start);
> > +	if (rc)
> > +		goto out_unlock;
> > +
> > +	rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
> > +	if (rc)
> > +		goto out_iova;
> 
> aren't the above two operations needed only once per group? I remember you
> did the two after iommu_attach_group().

No, with the new attach logic per-device is simpler.

iopt_table_enforce_group_resv_regions() tags all the reserved ranges
with:

		rc = iopt_reserve_iova(iopt, resv->start,
				       resv->length - 1 + resv->start,
				       device);

So they are all undone as each device detaches

And iommufd_device_setup_msi() keeps track of what has happened to the
domain via:

		if (hwpt->msi_cookie)
			return 0;
		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
		if (rc)
			return rc;
		hwpt->msi_cookie = true;

So it is OK to call it multiple times

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 15:29           ` Eric Farman
@ 2022-11-08 19:18             ` Matthew Rosato
  2022-11-08 20:04               ` Jason Gunthorpe
  2022-11-08 19:34             ` Jason Gunthorpe
  1 sibling, 1 reply; 99+ messages in thread
From: Matthew Rosato @ 2022-11-08 19:18 UTC (permalink / raw)
  To: Eric Farman, Jason Gunthorpe, Tony Krowiak, Jason Herne,
	Christian Borntraeger
  Cc: Cornelia Huck, Niklas Schnelle, Halil Pasic, linux-s390, iommu,
	Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On 11/8/22 10:29 AM, Eric Farman wrote:
> On Tue, 2022-11-08 at 10:37 -0400, Jason Gunthorpe wrote:
>> On Tue, Nov 08, 2022 at 09:19:17AM -0500, Eric Farman wrote:
>>> On Tue, 2022-11-08 at 09:54 -0400, Jason Gunthorpe wrote:
>>>> On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato wrote:
>>>>
>>>>> FWIW, vfio-pci via s390 is working fine so far, though I'll put
>>>>> it
>>>>> through more paces over the next few weeks and report if I find
>>>>> anything.
>>>>
>>>> OK great
>>>>
>>>>> As far as mdev drivers...  
>>>>>
>>>>> -ccw: Sounds like Eric is already aware there is an issue and
>>>>> is
>>>>> investigating (I see errors as well).
>>>
>>> I -think- the problem for -ccw is that the new vfio_pin_pages
>>> requires
>>> the input addresses to be page-aligned, and while most of ours are,
>>> the
>>> first one in any given transaction may not be. We never bothered to
>>> mask off the addresses since it was handled for us, and we needed
>>> to
>>> keep the offsets anyway.
>>>
>>> By happenstance, I had some code that would do the masking
>>> ourselves
>>> (for an unrelated reason); I'll see if I can get that fit on top
>>> and if
>>> it helps matters. After coffee.
>>
>> Oh, yes, that makes a lot of sense.
>>
>> Ah, if that is how VFIO worked we could match it like below:
> 
> That's a start. The pin appears to have worked, but the unpin fails at
> the bottom of iommufd_access_unpin_pages:
> 
> WARN_ON(!iopt_area_contig_done(&iter));
> 

Update on why -ap is failing -- I see vfio_pin_pages requests from vfio_ap_irq_enable that are failing on -EINVAL -- input is not page-aligned, just like what vfio-ccw was hitting.

I just tried a quick hack to force these to page-aligned requests and with that the vfio-ap tests I'm running start passing again.  So I think a proper fix in the iommufd code for this will also fix vfio-ap (we will test of course)

>>
>>  EXPORT_SYMBOL_NS_GPL(iommufd_access_unpin_pages, IOMMUFD);
>>  
>>  static bool iopt_area_contig_is_aligned(struct iopt_area_contig_iter *iter,
>> -                                       bool first)
>> +                                       bool first, unsigned long first_iova)
>>  {
>> -       if (iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE)
>> +       unsigned long start_offset = first ? (first_iova % PAGE_SIZE) : 0;
>> +
>> +       if ((iopt_area_start_byte(iter->area, iter->cur_iova) % PAGE_SIZE) !=
>> +           start_offset)
>>                 return false;
>>  
>>         if (!iopt_area_contig_done(iter) &&
>> @@ -607,7 +610,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
>>                         iopt_area_iova_to_index(area, iter.cur_iova);
>>  
>>                 if (area->prevent_access ||
>> -                   !iopt_area_contig_is_aligned(&iter, first)) {
>> +                   !iopt_area_contig_is_aligned(&iter, first, iova)) {
>>                         rc = -EINVAL;
>>                         goto err_remove;
>>                 }
>>
>> Jason
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 15:29           ` Eric Farman
  2022-11-08 19:18             ` Matthew Rosato
@ 2022-11-08 19:34             ` Jason Gunthorpe
  2022-11-08 20:07               ` Eric Farman
  2022-11-09  0:41               ` [PATCH 1/2] vfio/ccw: Convert to use vfio_dma_rw() kernel test robot
  1 sibling, 2 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 19:34 UTC (permalink / raw)
  To: Eric Farman
  Cc: Matthew Rosato, Cornelia Huck, Niklas Schnelle, Tony Krowiak,
	Halil Pasic, Jason Herne, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

[-- Attachment #1: Type: text/plain, Size: 2056 bytes --]

On Tue, Nov 08, 2022 at 10:29:33AM -0500, Eric Farman wrote:
> On Tue, 2022-11-08 at 10:37 -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 08, 2022 at 09:19:17AM -0500, Eric Farman wrote:
> > > On Tue, 2022-11-08 at 09:54 -0400, Jason Gunthorpe wrote:
> > > > On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato wrote:
> > > > 
> > > > > FWIW, vfio-pci via s390 is working fine so far, though I'll put
> > > > > it
> > > > > through more paces over the next few weeks and report if I find
> > > > > anything.
> > > > 
> > > > OK great
> > > > 
> > > > > As far as mdev drivers...  
> > > > > 
> > > > > -ccw: Sounds like Eric is already aware there is an issue and
> > > > > is
> > > > > investigating (I see errors as well).
> > > 
> > > I -think- the problem for -ccw is that the new vfio_pin_pages
> > > requires
> > > the input addresses to be page-aligned, and while most of ours are,
> > > the
> > > first one in any given transaction may not be. We never bothered to
> > > mask off the addresses since it was handled for us, and we needed
> > > to
> > > keep the offsets anyway.
> > > 
> > > By happenstance, I had some code that would do the masking
> > > ourselves
> > > (for an unrelated reason); I'll see if I can get that fit on top
> > > and if
> > > it helps matters. After coffee.
> > 
> > > Oh, yes, that makes a lot of sense.
> > 
> > Ah, if that is how VFIO worked we could match it like below:
> 
> That's a start. The pin appears to have worked, but the unpin fails at
> the bottom of iommufd_access_unpin_pages:
> 
> WARN_ON(!iopt_area_contig_done(&iter));

This seems like a different bug, probably a ccw driver bug. The
WARN_ON is designed to detect cases where the driver is unpinning an
IOVA range that is not exactly what it pinned. The pin side already
does this validation, so if it fails it means pin/unpin did not have
identical iova ranges. Some debugging prints should confirm this.
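
For example, a throwaway pair of prints in the mdev driver around its
pin/unpin calls would show whether the two ranges really match
(illustrative only, not a suggested permanent change):

	dev_info(vdev->dev, "pin   iova=0x%llx npage=%d\n",
		 (unsigned long long)iova, npage);
	ret = vfio_pin_pages(vdev, iova, npage, prot, pages);

	/* ... later, on the teardown path ... */
	dev_info(vdev->dev, "unpin iova=0x%llx npage=%d\n",
		 (unsigned long long)iova, npage);
	vfio_unpin_pages(vdev, iova, npage);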

I looked at CCW and came up with the following two things, can you
look at them and finish them off? It will probably help.

Thanks,
Jason

[-- Attachment #2: 0001-vfio-ccw-Convert-to-use-vfio_dma_rw.patch --]
[-- Type: text/x-diff, Size: 2878 bytes --]

From b6884847ece19733065fa246c3bbea63cec474c3 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@nvidia.com>
Date: Tue, 8 Nov 2022 15:21:04 -0400
Subject: [PATCH 1/2] vfio/ccw: Convert to use vfio_dma_rw()

Do not open code a slow version of vfio_dma_rw() as copy_from_iova().
The core code provides this function now, just call it directly.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/s390/cio/vfio_ccw_cp.c | 57 ++++------------------------------
 1 file changed, 6 insertions(+), 51 deletions(-)

diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index 7b02e97f4b2914..f5f6eff005b99f 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -228,51 +228,6 @@ static void convert_ccw0_to_ccw1(struct ccw1 *source, unsigned long len)
 	}
 }
 
-/*
- * Within the domain (@mdev), copy @n bytes from a guest physical
- * address (@iova) to a host physical address (@to).
- */
-static long copy_from_iova(struct vfio_device *vdev, void *to, u64 iova,
-			   unsigned long n)
-{
-	struct page_array pa = {0};
-	int i, ret;
-	unsigned long l, m;
-
-	ret = page_array_alloc(&pa, iova, n);
-	if (ret < 0)
-		return ret;
-
-	ret = page_array_pin(&pa, vdev);
-	if (ret < 0) {
-		page_array_unpin_free(&pa, vdev);
-		return ret;
-	}
-
-	l = n;
-	for (i = 0; i < pa.pa_nr; i++) {
-		void *from = kmap_local_page(pa.pa_page[i]);
-
-		m = PAGE_SIZE;
-		if (i == 0) {
-			from += iova & (PAGE_SIZE - 1);
-			m -= iova & (PAGE_SIZE - 1);
-		}
-
-		m = min(l, m);
-		memcpy(to + (n - l), from, m);
-		kunmap_local(from);
-
-		l -= m;
-		if (l == 0)
-			break;
-	}
-
-	page_array_unpin_free(&pa, vdev);
-
-	return l;
-}
-
 /*
  * Helpers to operate ccwchain.
  */
@@ -471,10 +426,10 @@ static int ccwchain_handle_ccw(u32 cda, struct channel_program *cp)
 	int len, ret;
 
 	/* Copy 2K (the most we support today) of possible CCWs */
-	len = copy_from_iova(vdev, cp->guest_cp, cda,
-			     CCWCHAIN_LEN_MAX * sizeof(struct ccw1));
-	if (len)
-		return len;
+	ret = vfio_dma_rw(vdev, cda, cp->guest_cp,
+			  CWCHAIN_LEN_MAX * sizeof(struct ccw1));
+	if (ret)
+		return ret;
 
 	/* Convert any Format-0 CCWs to Format-1 */
 	if (!cp->orb.cmd.fmt)
@@ -572,7 +527,7 @@ static int ccwchain_fetch_direct(struct ccwchain *chain,
 	if (ccw_is_idal(ccw)) {
 		/* Read first IDAW to see if it's 4K-aligned or not. */
 		/* All subsequent IDAws will be 4K-aligned. */
-		ret = copy_from_iova(vdev, &iova, ccw->cda, sizeof(iova));
+		ret = vfio_dma_rw(vdev, ccw->cda, &iova, sizeof(iova));
 		if (ret)
 			return ret;
 	} else {
@@ -601,7 +556,7 @@ static int ccwchain_fetch_direct(struct ccwchain *chain,
 
 	if (ccw_is_idal(ccw)) {
 		/* Copy guest IDAL into host IDAL */
-		ret = copy_from_iova(vdev, idaws, ccw->cda, idal_len);
+		ret = vfio_dma_rw(vdev, ccw->cda, idaws, idal_len);
 		if (ret)
 			goto out_unpin;
 
-- 
2.38.1


[-- Attachment #3: 0002-vfio-ccw-Fix-error-unwinding-around-page_array_unpin.patch --]
[-- Type: text/x-diff, Size: 2653 bytes --]

From 7cd2cccf37db91d18da9d041826f0460a56fc95c Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@nvidia.com>
Date: Tue, 8 Nov 2022 15:31:08 -0400
Subject: [PATCH 2/2] vfio/ccw: Fix error unwinding around
 page_array_unpin_free()

We should only call page_array_unpin() if page_array_pin() has succeeded.
If page_array_pin() fails then it undoes all its changes internally.

Split free and unpin into two functions and only call unpin in the one case
where everything has succeeded.

Add missing pa_nr = idaw_nr assignment

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/s390/cio/vfio_ccw_cp.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index f5f6eff005b99f..4eab1b2fb32dd2 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -164,9 +164,8 @@ static int page_array_pin(struct page_array *pa, struct vfio_device *vdev)
 }
 
 /* Unpin the pages before releasing the memory. */
-static void page_array_unpin_free(struct page_array *pa, struct vfio_device *vdev)
+static void page_array_free(struct page_array *pa)
 {
-	page_array_unpin(pa, vdev, pa->pa_nr);
 	kfree(pa->pa_iova);
 }
 
@@ -558,7 +557,7 @@ static int ccwchain_fetch_direct(struct ccwchain *chain,
 		/* Copy guest IDAL into host IDAL */
 		ret = vfio_dma_rw(vdev, ccw->cda, idaws, idal_len);
 		if (ret)
-			goto out_unpin;
+			goto out_free_pa;
 
 		/*
 		 * Copy guest IDAWs into page_array, in case the memory they
@@ -566,6 +565,7 @@ static int ccwchain_fetch_direct(struct ccwchain *chain,
 		 */
 		for (i = 0; i < idaw_nr; i++)
 			pa->pa_iova[i] = idaws[i];
+		pa->pa_nr = idaw_nr;
 	} else {
 		/*
 		 * No action is required here; the iova addresses in page_array
@@ -577,7 +577,7 @@ static int ccwchain_fetch_direct(struct ccwchain *chain,
 	if (ccw_does_data_transfer(ccw)) {
 		ret = page_array_pin(pa, vdev);
 		if (ret < 0)
-			goto out_unpin;
+			goto out_free_pa;
 	} else {
 		pa->pa_nr = 0;
 	}
@@ -590,8 +590,8 @@ static int ccwchain_fetch_direct(struct ccwchain *chain,
 
 	return 0;
 
-out_unpin:
-	page_array_unpin_free(pa, vdev);
+out_free_pa:
+	page_array_free(pa);
 out_free_idaws:
 	kfree(idaws);
 out_init:
@@ -697,7 +697,8 @@ void cp_free(struct channel_program *cp)
 	cp->initialized = false;
 	list_for_each_entry_safe(chain, temp, &cp->ccwchain_list, next) {
 		for (i = 0; i < chain->ch_len; i++) {
-			page_array_unpin_free(chain->ch_pa + i, vdev);
+			page_array_unpin(pa, vdev, pa->pa_nr);
+			page_array_free(chain->ch_pa + i);
 			ccwchain_cda_free(chain, i);
 		}
 		ccwchain_free(chain);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 19:18             ` Matthew Rosato
@ 2022-11-08 20:04               ` Jason Gunthorpe
  2022-11-08 20:17                 ` Eric Farman
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 20:04 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Eric Farman, Tony Krowiak, Jason Herne, Christian Borntraeger,
	Cornelia Huck, Niklas Schnelle, Halil Pasic, linux-s390, iommu,
	Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, Nov 08, 2022 at 02:18:12PM -0500, Matthew Rosato wrote:

> Update on why -ap is failing -- I see vfio_pin_pages requests from
> vfio_ap_irq_enable that are failing on -EINVAL -- input is not
> page-aligned, just like what vfio-ccw was hitting.
> 
> I just tried a quick hack to force these to page-aligned requests
> and with that the vfio-ap tests I'm running start passing again.  So
> I think a proper fix in the iommufd code for this will also fix
> vfio-ap (we will test of course)

Right, so my first fix isn't the right thing. The APIs are mismatched
too much. The length gets all messed up in the process.

So how about this? (drop the prior attempt)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index d835a77aaf26d9..b590ca3c186396 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1906,8 +1906,13 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 
 		if (iova > ULONG_MAX)
 			return -EINVAL;
+		/*
+		 * VFIO ignores the sub page offset, npages is from the start of
+		 * a PAGE_SIZE chunk of IOVA.
+		 */
 		ret = iommufd_access_pin_pages(
-			device->iommufd_access, iova, npage * PAGE_SIZE, pages,
+			device->iommufd_access, ALIGN_DOWN(iova, PAGE_SIZE),
+			npage * PAGE_SIZE, pages,
 			(prot & IOMMU_WRITE) ? IOMMUFD_ACCESS_RW_WRITE : 0);
 		if (ret)
 			return ret;
@@ -1937,7 +1942,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
 	if (device->iommufd_access) {
 		if (WARN_ON(iova > ULONG_MAX))
 			return;
-		iommufd_access_unpin_pages(device->iommufd_access, iova,
+		iommufd_access_unpin_pages(device->iommufd_access,
+					   ALIGN_DOWN(iova, PAGE_SIZE),
 					   npage * PAGE_SIZE);
 		return;
 	}

Thanks,
Jason
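
Worked through with made-up numbers (assuming 4K pages), the adjustment
looks like this from the caller's point of view:

	dma_addr_t iova = 0x10234;			/* unaligned address from the mdev driver */
	int npage = 2;

	/* what iommufd_access_pin_pages() now receives: */
	dma_addr_t base = ALIGN_DOWN(iova, PAGE_SIZE);	/* 0x10000 */
	size_t length = npage * PAGE_SIZE;		/* 0x2000, so pages cover [0x10000, 0x12000) */

	/* the sub-page offset stays the caller's business: */
	unsigned long offset = iova & ~PAGE_MASK;	/* 0x234, applied within pages[0] */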

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 19:34             ` Jason Gunthorpe
@ 2022-11-08 20:07               ` Eric Farman
  2022-11-08 20:10                 ` Jason Gunthorpe
  2022-11-09  0:41               ` [PATCH 1/2] vfio/ccw: Convert to use vfio_dma_rw() kernel test robot
  1 sibling, 1 reply; 99+ messages in thread
From: Eric Farman @ 2022-11-08 20:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, Cornelia Huck, Niklas Schnelle, Tony Krowiak,
	Halil Pasic, Jason Herne, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, 2022-11-08 at 15:34 -0400, Jason Gunthorpe wrote:
> On Tue, Nov 08, 2022 at 10:29:33AM -0500, Eric Farman wrote:
> > On Tue, 2022-11-08 at 10:37 -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 08, 2022 at 09:19:17AM -0500, Eric Farman wrote:
> > > > On Tue, 2022-11-08 at 09:54 -0400, Jason Gunthorpe wrote:
> > > > > On Tue, Nov 08, 2022 at 08:50:53AM -0500, Matthew Rosato
> > > > > wrote:
> > > > > 
> > > > > > FWIW, vfio-pci via s390 is working fine so far, though I'll
> > > > > > put
> > > > > > it
> > > > > > through more paces over the next few weeks and report if I
> > > > > > find
> > > > > > anything.
> > > > > 
> > > > > OK great
> > > > > 
> > > > > > As far as mdev drivers...  
> > > > > > 
> > > > > > -ccw: Sounds like Eric is already aware there is an issue
> > > > > > and
> > > > > > is
> > > > > > investigating (I see errors as well).
> > > > 
> > > > I -think- the problem for -ccw is that the new vfio_pin_pages
> > > > requires
> > > > the input addresses to be page-aligned, and while most of ours
> > > > are,
> > > > the
> > > > first one in any given transaction may not be. We never
> > > > bothered to
> > > > mask off the addresses since it was handled for us, and we
> > > > needed
> > > > to
> > > > keep the offsets anyway.
> > > > 
> > > > By happenstance, I had some code that would do the masking
> > > > ourselves
> > > > (for an unrelated reason); I'll see if I can get that fit on
> > > > top
> > > > and if
> > > > it helps matters. After coffee.
> > > 
> > > Oh, yes, that makes a lot of sense.
> > > 
> > > Ah, if that is how VFIO worked we could match it like below:
> > 
> > That's a start. The pin appears to have worked, but the unpin fails
> > at
> > the bottom of iommufd_access_unpin_pages:
> > 
> > WARN_ON(!iopt_area_contig_done(&iter));
> 
> This seems like a different bug, probably a ccw driver bug. The
> WARN_ON is designed to detect cases where the driver is unpinning an
> IOVA range that is not exactly what it pinned. The pin side already
> does this validation, so if it fails it means pin/unpin did not have
> identical iova ranges. Some debugging prints should confirm this.
> 
> I looked at CCW and came up with the following two things, can you
> look at them and finish them off? It will probably help.

I happen to already have patch 1 in a series I've been working on in
parallel with the private/parent split. I haven't forgotten it. :)

Patch 2 doesn't address the above symptoms, but a lot of that code is
getting reworked by the aforementioned series so I didn't spend a lot
of time studying your suggestion. And as I type this I see you just
sent a new patch, let me go try that...

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 20:07               ` Eric Farman
@ 2022-11-08 20:10                 ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-08 20:10 UTC (permalink / raw)
  To: Eric Farman
  Cc: Matthew Rosato, Cornelia Huck, Niklas Schnelle, Tony Krowiak,
	Halil Pasic, Jason Herne, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, Nov 08, 2022 at 03:07:19PM -0500, Eric Farman wrote:
> Patch 2 doesn't address the above symptoms, but a lot of that code is
> getting reworked by the aforementioned series so I didn't spend a lot
> of time studying your suggestion. And as I type this I see you just
> sent a new patch, let me go try that...

Patch 2 is following the assumption that the WARN_ON is triggered by a
failure path that isn't working right, eg trying to unpin a 0 IOVA
because the failure flow is wrong. Removing all the calls to unpin on
failure paths makes that impossible.

Jason
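
The principle in miniature (condensed from patch 2, not the full ccw
code): once page_array_pin() has failed it has already undone its own
pins, so the error path must only free, never unpin:

	ret = page_array_pin(pa, vdev);
	if (ret < 0)
		goto out_free_pa;	/* free only; unpinning here would hit
					 * IOVAs that were never pinned */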

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 20:04               ` Jason Gunthorpe
@ 2022-11-08 20:17                 ` Eric Farman
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Farman @ 2022-11-08 20:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: Tony Krowiak, Jason Herne, Christian Borntraeger, Cornelia Huck,
	Niklas Schnelle, Halil Pasic, linux-s390, iommu, Kevin Tian,
	Alex Williamson, kvm, Lu Baolu, Nicolin Chen

On Tue, 2022-11-08 at 16:04 -0400, Jason Gunthorpe wrote:
> On Tue, Nov 08, 2022 at 02:18:12PM -0500, Matthew Rosato wrote:
> 
> > Update on why -ap is failing -- I see vfio_pin_pages requests from
> > vfio_ap_irq_enable that are failing on -EINVAL -- input is not
> > page-aligned, just like what vfio-ccw was hitting.
> > 
> > I just tried a quick hack to force these to page-aligned requests
> > and with that the vfio-ap tests I'm running start passing again. 
> > So
> > I think a proper fix in the iommufd code for this will also fix
> > vfio-ap (we will test of course)
> 
> Right, so my first fix isn't the right thing. The APIs are mismatched
> too much. The length gets all messed up in the process.
> 
> So how about this? (drop the prior attempt)

That seems to get the sniff tests for both -ccw and -ap working. I'll
keep playing with it for -ccw; Tony and Jason can do more validation on
the -ap side.

> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index d835a77aaf26d9..b590ca3c186396 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1906,8 +1906,13 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
>  
>                 if (iova > ULONG_MAX)
>                         return -EINVAL;
> +               /*
> +                * VFIO ignores the sub page offset, npages is from the start of
> +                * a PAGE_SIZE chunk of IOVA.
> +                */
>                 ret = iommufd_access_pin_pages(
> -                       device->iommufd_access, iova, npage * PAGE_SIZE, pages,
> +                       device->iommufd_access, ALIGN_DOWN(iova, PAGE_SIZE),
> +                       npage * PAGE_SIZE, pages,
>                         (prot & IOMMU_WRITE) ? IOMMUFD_ACCESS_RW_WRITE : 0);
>                 if (ret)
>                         return ret;
> @@ -1937,7 +1942,8 @@ void vfio_unpin_pages(struct vfio_device *device, dma_addr_t iova, int npage)
>         if (device->iommufd_access) {
>                 if (WARN_ON(iova > ULONG_MAX))
>                         return;
> -               iommufd_access_unpin_pages(device->iommufd_access, iova,
> +               iommufd_access_unpin_pages(device->iommufd_access,
> +                                          ALIGN_DOWN(iova, PAGE_SIZE),
>                                            npage * PAGE_SIZE);
>                 return;
>         }
> 
> Thanks,
> Jason


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 1/2] vfio/ccw: Convert to use vfio_dma_rw()
  2022-11-08 19:34             ` Jason Gunthorpe
  2022-11-08 20:07               ` Eric Farman
@ 2022-11-09  0:41               ` kernel test robot
  1 sibling, 0 replies; 99+ messages in thread
From: kernel test robot @ 2022-11-09  0:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Eric Farman
  Cc: oe-kbuild-all, Matthew Rosato, Cornelia Huck, Niklas Schnelle,
	Tony Krowiak, Halil Pasic, Jason Herne, linux-s390, iommu,
	Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen

[-- Attachment #1: Type: text/plain, Size: 5157 bytes --]

Hi Jason,

I love your patch! Yet something to improve:

[auto build test ERROR on s390/features]
[also build test ERROR on linus/master v6.1-rc4 next-20221108]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jason-Gunthorpe/vfio-ccw-Convert-to-use-vfio_dma_rw/20221109-033618
base:   https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features
patch link:    https://lore.kernel.org/r/Y2qvYJRsv%2BmO8FSM%40nvidia.com
patch subject: [PATCH 1/2] vfio/ccw: Convert to use vfio_dma_rw()
config: s390-allmodconfig
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2a5b55a0bcea8b07741dd246a4ea7f6d853286dd
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Jason-Gunthorpe/vfio-ccw-Convert-to-use-vfio_dma_rw/20221109-033618
        git checkout 2a5b55a0bcea8b07741dd246a4ea7f6d853286dd
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/s390/cio/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/s390/cio/vfio_ccw_cp.c: In function 'ccwchain_handle_ccw':
>> drivers/s390/cio/vfio_ccw_cp.c:429:27: error: 'CWCHAIN_LEN_MAX' undeclared (first use in this function); did you mean 'CCWCHAIN_LEN_MAX'?
     429 |                           CWCHAIN_LEN_MAX * sizeof(struct ccw1));
         |                           ^~~~~~~~~~~~~~~
         |                           CCWCHAIN_LEN_MAX
   drivers/s390/cio/vfio_ccw_cp.c:429:27: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/s390/cio/vfio_ccw_cp.c:428:15: error: too few arguments to function 'vfio_dma_rw'
     428 |         ret = vfio_dma_rw(vdev, cda, cp->guest_cp,
         |               ^~~~~~~~~~~
   In file included from drivers/s390/cio/vfio_ccw_cp.c:16:
   include/linux/vfio.h:212:5: note: declared here
     212 | int vfio_dma_rw(struct vfio_device *device, dma_addr_t iova,
         |     ^~~~~~~~~~~
   drivers/s390/cio/vfio_ccw_cp.c: In function 'ccwchain_fetch_direct':
   drivers/s390/cio/vfio_ccw_cp.c:529:23: error: too few arguments to function 'vfio_dma_rw'
     529 |                 ret = vfio_dma_rw(vdev, ccw->cda, &iova, sizeof(iova));
         |                       ^~~~~~~~~~~
   include/linux/vfio.h:212:5: note: declared here
     212 | int vfio_dma_rw(struct vfio_device *device, dma_addr_t iova,
         |     ^~~~~~~~~~~
   drivers/s390/cio/vfio_ccw_cp.c:558:23: error: too few arguments to function 'vfio_dma_rw'
     558 |                 ret = vfio_dma_rw(vdev, ccw->cda, idaws, idal_len);
         |                       ^~~~~~~~~~~
   include/linux/vfio.h:212:5: note: declared here
     212 | int vfio_dma_rw(struct vfio_device *device, dma_addr_t iova,
         |     ^~~~~~~~~~~
   drivers/s390/cio/vfio_ccw_cp.c: In function 'cp_free':
>> drivers/s390/cio/vfio_ccw_cp.c:700:42: error: 'pa' undeclared (first use in this function)
     700 |                         page_array_unpin(pa, vdev, pa->pa_nr);
         |                                          ^~


vim +429 drivers/s390/cio/vfio_ccw_cp.c

   416	
   417	static int ccwchain_loop_tic(struct ccwchain *chain,
   418				     struct channel_program *cp);
   419	
   420	static int ccwchain_handle_ccw(u32 cda, struct channel_program *cp)
   421	{
   422		struct vfio_device *vdev =
   423			&container_of(cp, struct vfio_ccw_private, cp)->vdev;
   424		struct ccwchain *chain;
   425		int len, ret;
   426	
   427		/* Copy 2K (the most we support today) of possible CCWs */
 > 428		ret = vfio_dma_rw(vdev, cda, cp->guest_cp,
 > 429				  CWCHAIN_LEN_MAX * sizeof(struct ccw1));
   430		if (ret)
   431			return ret;
   432	
   433		/* Convert any Format-0 CCWs to Format-1 */
   434		if (!cp->orb.cmd.fmt)
   435			convert_ccw0_to_ccw1(cp->guest_cp, CCWCHAIN_LEN_MAX);
   436	
   437		/* Count the CCWs in the current chain */
   438		len = ccwchain_calc_length(cda, cp);
   439		if (len < 0)
   440			return len;
   441	
   442		/* Need alloc a new chain for this one. */
   443		chain = ccwchain_alloc(cp, len);
   444		if (!chain)
   445			return -ENOMEM;
   446		chain->ch_iova = cda;
   447	
   448		/* Copy the actual CCWs into the new chain */
   449		memcpy(chain->ch_ccw, cp->guest_cp, len * sizeof(struct ccw1));
   450	
   451		/* Loop for tics on this new chain. */
   452		ret = ccwchain_loop_tic(chain, cp);
   453	
   454		if (ret)
   455			ccwchain_free(chain);
   456	
   457		return ret;
   458	}
   459	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
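
Taken together, the robot's findings point at fixes along these lines
(assuming the vfio_dma_rw() prototype in include/linux/vfio.h at the
time takes a final bool selecting write vs. read, so a guest read passes
false):

	/* Copy 2K (the most we support today) of possible CCWs */
	ret = vfio_dma_rw(vdev, cda, cp->guest_cp,
			  CCWCHAIN_LEN_MAX * sizeof(struct ccw1), false);

	/* and in cp_free(), name the page_array explicitly */
	struct page_array *pa = chain->ch_pa + i;

	page_array_unpin(pa, vdev, pa->pa_nr);
	page_array_free(pa);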

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 121019 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/s390 6.1.0-rc2 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="s390-linux-gcc (GCC) 12.1.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=120100
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=23800
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=23800
CONFIG_LLD_VERSION=0
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=123
CONFIG_CONSTRUCTORS=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_HAVE_KERNEL_UNCOMPRESSED=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
# CONFIG_KERNEL_UNCOMPRESSED is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_WATCH_QUEUE=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_INJECTION=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_SIM=y
CONFIG_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_DEBUGFS=y
# end of IRQ subsystem

CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_TIME_KUNIT_TEST=m
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
CONFIG_USERMODE_DRIVER=y
# CONFIG_BPF_PRELOAD is not set
CONFIG_BPF_LSM=y
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_SCHED_CORE=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_PSI=y
CONFIG_PSI_DEFAULT_DISABLED=y
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_FORCE_TASKS_RCU=y
CONFIG_TASKS_RCU=y
CONFIG_FORCE_TASKS_RUDE_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_FORCE_TASKS_TRACE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
CONFIG_TASKS_TRACE_RCU_READ_MB=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_IKHEADERS=m
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_PRINTK_INDEX=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC12_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_CGROUP_FAVOR_DYNMODS=y
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_MISC=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
CONFIG_BOOT_CONFIG=y
CONFIG_BOOT_CONFIG_EMBED=y
CONFIG_BOOT_CONFIG_EMBED_FILE=""
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_KCMP=y
CONFIG_RSEQ=y
CONFIG_DEBUG_RSEQ=y
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y
CONFIG_PC104=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_DEBUG_PERF_USE_VMALLOC=y
# end of Kernel Performance Events And Counters

CONFIG_SYSTEM_DATA_VERIFICATION=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# end of General setup

CONFIG_MMU=y
CONFIG_CPU_BIG_ENDIAN=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_PGSTE=y
CONFIG_AUDIT_ARCH=y
CONFIG_NO_IOPORT_MAP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_KASAN_SHADOW_OFFSET=0x1C000000000000
CONFIG_S390=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_HAVE_LIVEPATCH=y
CONFIG_LIVEPATCH=y

#
# Processor type and features
#
CONFIG_HAVE_MARCH_Z10_FEATURES=y
CONFIG_HAVE_MARCH_Z196_FEATURES=y
# CONFIG_MARCH_Z10 is not set
CONFIG_MARCH_Z196=y
# CONFIG_MARCH_ZEC12 is not set
# CONFIG_MARCH_Z13 is not set
# CONFIG_MARCH_Z14 is not set
# CONFIG_MARCH_Z15 is not set
CONFIG_MARCH_Z196_TUNE=y
CONFIG_TUNE_DEFAULT=y
# CONFIG_TUNE_Z10 is not set
# CONFIG_TUNE_Z196 is not set
# CONFIG_TUNE_ZEC12 is not set
# CONFIG_TUNE_Z13 is not set
# CONFIG_TUNE_Z14 is not set
# CONFIG_TUNE_Z15 is not set
# CONFIG_TUNE_Z16 is not set
CONFIG_64BIT=y
CONFIG_COMMAND_LINE_SIZE=4096
CONFIG_COMPAT=y
CONFIG_SMP=y
CONFIG_NR_CPUS=64
CONFIG_HOTPLUG_CPU=y
CONFIG_NUMA=y
CONFIG_NODES_SHIFT=1
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_BOOK=y
CONFIG_SCHED_DRAWER=y
CONFIG_SCHED_TOPOLOGY=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
CONFIG_KEXEC_SIG=y
CONFIG_KERNEL_NOBP=y
CONFIG_EXPOLINE=y
CONFIG_EXPOLINE_EXTERN=y
# CONFIG_EXPOLINE_OFF is not set
# CONFIG_EXPOLINE_AUTO is not set
CONFIG_EXPOLINE_FULL=y
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
# end of Processor type and features

#
# Memory setup
#
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_MAX_PHYSMEM_BITS=46
# end of Memory setup

#
# I/O subsystem
#
CONFIG_QDIO=m
CONFIG_CHSC_SCH=m
CONFIG_SCM_BUS=y
CONFIG_EADM_SCH=m
CONFIG_VFIO_CCW=m
CONFIG_VFIO_AP=m
# end of I/O subsystem

#
# Dump support
#
CONFIG_CRASH_DUMP=y
# end of Dump support

CONFIG_CCW=y
CONFIG_HAVE_PNETID=m

#
# Virtualization
#
CONFIG_PROTECTED_VIRTUALIZATION_GUEST=y
CONFIG_PFAULT=y
CONFIG_CMM=m
CONFIG_CMM_IUCV=y
CONFIG_APPLDATA_BASE=y
CONFIG_APPLDATA_MEM=m
CONFIG_APPLDATA_OS=m
CONFIG_APPLDATA_NET_SUM=m
CONFIG_S390_HYPFS_FS=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_KVM_ASYNC_PF_SYNC=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_HAVE_KVM_INVALID_WAKEUPS=y
CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_S390_UCONTROL=y
CONFIG_S390_GUEST=y
# end of Virtualization

CONFIG_S390_MODULES_SANITY_TEST_HELPERS=y

#
# Selftests
#
CONFIG_S390_UNWIND_SELFTEST=m
CONFIG_S390_KPROBES_SANITY_TEST=m
CONFIG_S390_MODULES_SANITY_TEST=m
# end of Selftests

#
# General architecture-dependent options
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_STATIC_KEYS_SELFTEST=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_ARCH_32BIT_USTAT_F_TINODE=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_TABLE_FREE=y
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_MMU_GATHER_NO_GATHER=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP_CACHE_DEBUG=y
CONFIG_LTO_NONE=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE=y
CONFIG_ARCH_HAS_SCALED_CPUTIME=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ALTERNATE_USER_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_CLONE_BACKWARDS2=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_OLD_SIGACTION=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT=y
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_LOCK_EVENT_COUNTS=y
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_ARCH_HAS_VDSO_DATA=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y

#
# GCOV-based kernel profiling
#
CONFIG_GCOV_KERNEL=y
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# CONFIG_GCOV_PROFILE_ALL is not set
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_GCC_PLUGINS=y
CONFIG_GCC_PLUGIN_LATENT_ENTROPY=y
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULE_SIG_FORMAT=y
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODULE_UNLOAD_TAINT_TRACKING=y
CONFIG_MODVERSIONS=y
CONFIG_ASM_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_MODULE_SIG=y
CONFIG_MODULE_SIG_FORCE=y
CONFIG_MODULE_SIG_ALL=y
CONFIG_MODULE_SIG_SHA1=y
# CONFIG_MODULE_SIG_SHA224 is not set
# CONFIG_MODULE_SIG_SHA256 is not set
# CONFIG_MODULE_SIG_SHA384 is not set
# CONFIG_MODULE_SIG_SHA512 is not set
CONFIG_MODULE_SIG_HASH="sha1"
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y
CONFIG_MODPROBE_PATH="/sbin/modprobe"
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
CONFIG_BLK_RQ_ALLOC_TIME=y
CONFIG_BLK_CGROUP_RWSTAT=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_ICQ=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_INTEGRITY_T10=m
CONFIG_BLK_DEV_ZONED=y
CONFIG_BLK_DEV_THROTTLING=y
CONFIG_BLK_DEV_THROTTLING_LOW=y
CONFIG_BLK_WBT=y
CONFIG_BLK_WBT_MQ=y
CONFIG_BLK_CGROUP_IOLATENCY=y
CONFIG_BLK_CGROUP_FC_APPID=y
CONFIG_BLK_CGROUP_IOCOST=y
CONFIG_BLK_CGROUP_IOPRIO=y
CONFIG_BLK_DEBUG_FS=y
CONFIG_BLK_DEBUG_FS_ZONED=y
CONFIG_BLK_SED_OPAL=y
CONFIG_BLK_INLINE_ENCRYPTION=y
CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
CONFIG_ACORN_PARTITION_CUMANA=y
CONFIG_ACORN_PARTITION_EESOX=y
CONFIG_ACORN_PARTITION_ICS=y
CONFIG_ACORN_PARTITION_ADFS=y
CONFIG_ACORN_PARTITION_POWERTEC=y
CONFIG_ACORN_PARTITION_RISCIX=y
CONFIG_AIX_PARTITION=y
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
CONFIG_ATARI_PARTITION=y
CONFIG_IBM_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
CONFIG_LDM_DEBUG=y
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
CONFIG_CMDLINE_PARTITION=y
# end of Partition Types

CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y
CONFIG_BLK_MQ_STACKING=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=m
CONFIG_IOSCHED_BFQ=m
CONFIG_BFQ_GROUP_IOSCHED=y
CONFIG_BFQ_CGROUP_DEBUG=y
# end of IO Schedulers

CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PADATA=y
CONFIG_ASN1=y
CONFIG_ARCH_INLINE_SPIN_TRYLOCK=y
CONFIG_ARCH_INLINE_SPIN_TRYLOCK_BH=y
CONFIG_ARCH_INLINE_SPIN_LOCK=y
CONFIG_ARCH_INLINE_SPIN_LOCK_BH=y
CONFIG_ARCH_INLINE_SPIN_LOCK_IRQ=y
CONFIG_ARCH_INLINE_SPIN_LOCK_IRQSAVE=y
CONFIG_ARCH_INLINE_SPIN_UNLOCK=y
CONFIG_ARCH_INLINE_SPIN_UNLOCK_BH=y
CONFIG_ARCH_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE=y
CONFIG_ARCH_INLINE_READ_TRYLOCK=y
CONFIG_ARCH_INLINE_READ_LOCK=y
CONFIG_ARCH_INLINE_READ_LOCK_BH=y
CONFIG_ARCH_INLINE_READ_LOCK_IRQ=y
CONFIG_ARCH_INLINE_READ_LOCK_IRQSAVE=y
CONFIG_ARCH_INLINE_READ_UNLOCK=y
CONFIG_ARCH_INLINE_READ_UNLOCK_BH=y
CONFIG_ARCH_INLINE_READ_UNLOCK_IRQ=y
CONFIG_ARCH_INLINE_READ_UNLOCK_IRQRESTORE=y
CONFIG_ARCH_INLINE_WRITE_TRYLOCK=y
CONFIG_ARCH_INLINE_WRITE_LOCK=y
CONFIG_ARCH_INLINE_WRITE_LOCK_BH=y
CONFIG_ARCH_INLINE_WRITE_LOCK_IRQ=y
CONFIG_ARCH_INLINE_WRITE_LOCK_IRQSAVE=y
CONFIG_ARCH_INLINE_WRITE_UNLOCK=y
CONFIG_ARCH_INLINE_WRITE_UNLOCK_BH=y
CONFIG_ARCH_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_STATE=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=m
CONFIG_BINFMT_MISC=m
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_ZPOOL=y
CONFIG_SWAP=y
CONFIG_ZSWAP=y
CONFIG_ZSWAP_DEFAULT_ON=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="lzo"
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD=y
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT="zbud"
CONFIG_ZBUD=y
CONFIG_Z3FOLD=m
CONFIG_ZSMALLOC=m
CONFIG_ZSMALLOC_STAT=y

#
# SLAB allocator options
#
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
CONFIG_SLAB_FREELIST_RANDOM=y
CONFIG_SLAB_FREELIST_HARDENED=y
CONFIG_SLUB_STATS=y
CONFIG_SLUB_CPU_PARTIAL=y
# end of SLAB allocator options

CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
CONFIG_COMPAT_BRK=y
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK_PHYS_MAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_EXCLUSIVE_SYSTEM_RAM=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_MEMORY_BALLOON=y
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
CONFIG_PAGE_REPORTING=y
CONFIG_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_READ_ONLY_THP_FOR_FS=y
CONFIG_FRONTSWAP=y
CONFIG_CMA=y
CONFIG_CMA_DEBUG=y
CONFIG_CMA_DEBUGFS=y
CONFIG_CMA_SYSFS=y
CONFIG_CMA_AREAS=19
CONFIG_MEM_SOFT_DIRTY=y
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
CONFIG_PAGE_IDLE_FLAG=y
CONFIG_IDLE_PAGE_TRACKING=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ZONE_DMA=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PERCPU_STATS=y
CONFIG_GUP_TEST=y
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_ANON_VMA_NAME=y
CONFIG_USERFAULTFD=y
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_LRU_GEN_STATS=y

#
# Data Access Monitoring
#
CONFIG_DAMON=y
CONFIG_DAMON_VADDR=y
CONFIG_DAMON_PADDR=y
CONFIG_DAMON_SYSFS=y
CONFIG_DAMON_DBGFS=y
CONFIG_DAMON_RECLAIM=y
CONFIG_DAMON_LRU_SORT=y
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_NET_EGRESS=y
CONFIG_NET_REDIRECT=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=m
CONFIG_PACKET_DIAG=m
CONFIG_UNIX=m
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
CONFIG_UNIX_DIAG=m
CONFIG_TLS=m
CONFIG_TLS_DEVICE=y
CONFIG_TLS_TOE=y
CONFIG_XFRM=y
CONFIG_XFRM_OFFLOAD=y
CONFIG_XFRM_ALGO=m
CONFIG_XFRM_USER=m
CONFIG_XFRM_INTERFACE=m
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_AH=m
CONFIG_XFRM_ESP=m
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_XFRM_ESPINTCP=y
CONFIG_IUCV=y
CONFIG_AFIUCV=m
CONFIG_XDP_SOCKETS=y
CONFIG_XDP_SOCKETS_DIAG=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_CLASSID=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IP_TUNNEL=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE_COMMON=y
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
CONFIG_NET_IPVTI=m
CONFIG_NET_UDP_TUNNEL=m
CONFIG_NET_FOU=m
CONFIG_NET_FOU_IP_TUNNELS=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_ESP_OFFLOAD=m
CONFIG_INET_ESPINTCP=y
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_INET_UDP_DIAG=m
CONFIG_INET_RAW_DIAG=m
CONFIG_INET_DIAG_DESTROY=y
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=m
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_NV=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
CONFIG_TCP_CONG_YEAH=m
CONFIG_TCP_CONG_ILLINOIS=m
CONFIG_TCP_CONG_DCTCP=m
CONFIG_TCP_CONG_CDG=m
CONFIG_TCP_CONG_BBR=m
CONFIG_DEFAULT_RENO=y
CONFIG_DEFAULT_TCP_CONG="reno"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_ESP_OFFLOAD=m
CONFIG_INET6_ESPINTCP=y
CONFIG_INET6_IPCOMP=m
CONFIG_IPV6_MIP6=m
CONFIG_IPV6_ILA=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_IPV6_VTI=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_SIT_6RD=y
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
CONFIG_IPV6_GRE=m
CONFIG_IPV6_FOU=m
CONFIG_IPV6_FOU_TUNNEL=m
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
CONFIG_IPV6_SEG6_LWTUNNEL=y
CONFIG_IPV6_SEG6_HMAC=y
CONFIG_IPV6_RPL_LWTUNNEL=y
CONFIG_IPV6_IOAM6_LWTUNNEL=y
CONFIG_NETLABEL=y
CONFIG_MPTCP=y
CONFIG_INET_MPTCP_DIAG=m
CONFIG_MPTCP_KUNIT_TEST=m
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
CONFIG_NETWORK_PHY_TIMESTAMPING=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=m

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_EGRESS=y
CONFIG_NETFILTER_SKIP_EGRESS=y
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_FAMILY_BRIDGE=y
CONFIG_NETFILTER_FAMILY_ARP=y
CONFIG_NETFILTER_NETLINK_HOOK=m
CONFIG_NETFILTER_NETLINK_ACCT=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NETFILTER_NETLINK_OSF=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_LOG_SYSLOG=m
CONFIG_NETFILTER_CONNCOUNT=m
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
CONFIG_NF_CONNTRACK_TIMESTAMP=y
CONFIG_NF_CONNTRACK_LABELS=y
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_GRE=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_BROADCAST=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_SNMP=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
CONFIG_NF_CT_NETLINK=m
CONFIG_NF_CT_NETLINK_TIMEOUT=m
CONFIG_NF_CT_NETLINK_HELPER=m
CONFIG_NETFILTER_NETLINK_GLUE_CT=y
CONFIG_NF_NAT=m
CONFIG_NF_NAT_AMANDA=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_SIP=m
CONFIG_NF_NAT_TFTP=m
CONFIG_NF_NAT_REDIRECT=y
CONFIG_NF_NAT_MASQUERADE=y
CONFIG_NETFILTER_SYNPROXY=m
CONFIG_NF_TABLES=m
CONFIG_NF_TABLES_INET=y
CONFIG_NF_TABLES_NETDEV=y
CONFIG_NFT_NUMGEN=m
CONFIG_NFT_CT=m
CONFIG_NFT_FLOW_OFFLOAD=m
CONFIG_NFT_CONNLIMIT=m
CONFIG_NFT_LOG=m
CONFIG_NFT_LIMIT=m
CONFIG_NFT_MASQ=m
CONFIG_NFT_REDIR=m
CONFIG_NFT_NAT=m
CONFIG_NFT_TUNNEL=m
# CONFIG_NFT_OBJREF is not set
CONFIG_NFT_QUEUE=m
CONFIG_NFT_QUOTA=m
CONFIG_NFT_REJECT=m
CONFIG_NFT_REJECT_INET=m
CONFIG_NFT_COMPAT=m
CONFIG_NFT_HASH=m
CONFIG_NFT_FIB=m
CONFIG_NFT_FIB_INET=m
CONFIG_NFT_XFRM=m
CONFIG_NFT_SOCKET=m
CONFIG_NFT_OSF=m
CONFIG_NFT_TPROXY=m
CONFIG_NFT_SYNPROXY=m
CONFIG_NF_DUP_NETDEV=m
CONFIG_NFT_DUP_NETDEV=m
CONFIG_NFT_FWD_NETDEV=m
CONFIG_NFT_FIB_NETDEV=m
CONFIG_NFT_REJECT_NETDEV=m
CONFIG_NF_FLOW_TABLE_INET=m
CONFIG_NF_FLOW_TABLE=m
CONFIG_NF_FLOW_TABLE_PROCFS=y
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XTABLES_COMPAT=y

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m
CONFIG_NETFILTER_XT_CONNMARK=m
CONFIG_NETFILTER_XT_SET=m

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_AUDIT=m
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_CT=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_HL=m
CONFIG_NETFILTER_XT_TARGET_HMARK=m
CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m
CONFIG_NETFILTER_XT_TARGET_LED=m
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_NAT=m
CONFIG_NETFILTER_XT_TARGET_NETMAP=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
CONFIG_NETFILTER_XT_TARGET_REDIRECT=m
CONFIG_NETFILTER_XT_TARGET_MASQUERADE=m
CONFIG_NETFILTER_XT_TARGET_TEE=m
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=m
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NETFILTER_XT_MATCH_CLUSTER=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLABEL=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_CPU=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DEVGROUP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ECN=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_HL=m
CONFIG_NETFILTER_XT_MATCH_IPCOMP=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_IPVS=m
CONFIG_NETFILTER_XT_MATCH_L2TP=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_NFACCT=m
CONFIG_NETFILTER_XT_MATCH_OSF=m
CONFIG_NETFILTER_XT_MATCH_OWNER=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
# end of Core Netfilter Configuration

CONFIG_IP_SET=m
CONFIG_IP_SET_MAX=256
CONFIG_IP_SET_BITMAP_IP=m
CONFIG_IP_SET_BITMAP_IPMAC=m
CONFIG_IP_SET_BITMAP_PORT=m
CONFIG_IP_SET_HASH_IP=m
CONFIG_IP_SET_HASH_IPMARK=m
CONFIG_IP_SET_HASH_IPPORT=m
CONFIG_IP_SET_HASH_IPPORTIP=m
CONFIG_IP_SET_HASH_IPPORTNET=m
CONFIG_IP_SET_HASH_IPMAC=m
CONFIG_IP_SET_HASH_MAC=m
CONFIG_IP_SET_HASH_NETPORTNET=m
CONFIG_IP_SET_HASH_NET=m
CONFIG_IP_SET_HASH_NETNET=m
CONFIG_IP_SET_HASH_NETPORT=m
CONFIG_IP_SET_HASH_NETIFACE=m
CONFIG_IP_SET_LIST_SET=m
CONFIG_IP_VS=m
CONFIG_IP_VS_IPV6=y
CONFIG_IP_VS_DEBUG=y
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y
CONFIG_IP_VS_PROTO_SCTP=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_FO=m
CONFIG_IP_VS_OVF=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_MH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m
CONFIG_IP_VS_TWOS=m

#
# IPVS SH scheduler
#
CONFIG_IP_VS_SH_TAB_BITS=8

#
# IPVS MH scheduler
#
CONFIG_IP_VS_MH_TAB_INDEX=12

#
# IPVS application helper
#
CONFIG_IP_VS_FTP=m
CONFIG_IP_VS_NFCT=y
CONFIG_IP_VS_PE_SIP=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_SOCKET_IPV4=m
CONFIG_NF_TPROXY_IPV4=m
CONFIG_NF_TABLES_IPV4=y
CONFIG_NFT_REJECT_IPV4=m
CONFIG_NFT_DUP_IPV4=m
CONFIG_NFT_FIB_IPV4=m
CONFIG_NF_TABLES_ARP=y
CONFIG_NF_DUP_IPV4=m
CONFIG_NF_LOG_ARP=m
CONFIG_NF_LOG_IPV4=m
CONFIG_NF_REJECT_IPV4=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_PPTP=m
CONFIG_NF_NAT_H323=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_RPFILTER=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_SYNPROXY=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_SECURITY=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m
# end of IP: Netfilter Configuration

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_SOCKET_IPV6=m
CONFIG_NF_TPROXY_IPV6=m
CONFIG_NF_TABLES_IPV6=y
CONFIG_NFT_REJECT_IPV6=m
CONFIG_NFT_DUP_IPV6=m
CONFIG_NFT_FIB_IPV6=m
CONFIG_NF_DUP_IPV6=m
CONFIG_NF_REJECT_IPV6=m
CONFIG_NF_LOG_IPV6=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RPFILTER=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_MATCH_SRH=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_TARGET_SYNPROXY=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_RAW=m
CONFIG_IP6_NF_SECURITY=m
CONFIG_IP6_NF_NAT=m
CONFIG_IP6_NF_TARGET_MASQUERADE=m
CONFIG_IP6_NF_TARGET_NPT=m
# end of IPv6: Netfilter Configuration

CONFIG_NF_DEFRAG_IPV6=m
CONFIG_NF_TABLES_BRIDGE=m
CONFIG_NFT_BRIDGE_META=m
CONFIG_NFT_BRIDGE_REJECT=m
CONFIG_NF_CONNTRACK_BRIDGE=m
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
CONFIG_BRIDGE_EBT_IP6=m
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_NFLOG=m
CONFIG_BPFILTER=y
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m

#
# DCCP CCIDs Configuration
#
CONFIG_IP_DCCP_CCID2_DEBUG=y
CONFIG_IP_DCCP_CCID3=y
CONFIG_IP_DCCP_CCID3_DEBUG=y
CONFIG_IP_DCCP_TFRC_LIB=y
CONFIG_IP_DCCP_TFRC_DEBUG=y
# end of DCCP CCIDs Configuration

#
# DCCP Kernel Hacking
#
CONFIG_IP_DCCP_DEBUG=y
# end of DCCP Kernel Hacking

CONFIG_IP_SCTP=m
CONFIG_SCTP_DBG_OBJCNT=y
CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_SHA1 is not set
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_NONE is not set
CONFIG_SCTP_COOKIE_HMAC_MD5=y
CONFIG_SCTP_COOKIE_HMAC_SHA1=y
CONFIG_INET_SCTP_DIAG=m
CONFIG_RDS=m
CONFIG_RDS_TCP=m
CONFIG_RDS_DEBUG=y
CONFIG_TIPC=m
CONFIG_TIPC_MEDIA_UDP=y
CONFIG_TIPC_CRYPTO=y
CONFIG_TIPC_DIAG=m
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
CONFIG_ATM_CLIP_NO_ICMP=y
CONFIG_ATM_LANE=m
CONFIG_ATM_MPOA=m
CONFIG_ATM_BR2684=m
CONFIG_ATM_BR2684_IPFILTER=y
CONFIG_L2TP=m
CONFIG_L2TP_DEBUGFS=m
CONFIG_L2TP_V3=y
CONFIG_L2TP_IP=m
CONFIG_L2TP_ETH=m
CONFIG_STP=m
CONFIG_GARP=m
CONFIG_MRP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_BRIDGE_MRP=y
CONFIG_BRIDGE_CFM=y
CONFIG_NET_DSA=m
CONFIG_NET_DSA_TAG_AR9331=m
CONFIG_NET_DSA_TAG_BRCM_COMMON=m
CONFIG_NET_DSA_TAG_BRCM=m
CONFIG_NET_DSA_TAG_BRCM_LEGACY=m
CONFIG_NET_DSA_TAG_BRCM_PREPEND=m
CONFIG_NET_DSA_TAG_HELLCREEK=m
CONFIG_NET_DSA_TAG_GSWIP=m
CONFIG_NET_DSA_TAG_DSA_COMMON=m
CONFIG_NET_DSA_TAG_DSA=m
CONFIG_NET_DSA_TAG_EDSA=m
CONFIG_NET_DSA_TAG_MTK=m
CONFIG_NET_DSA_TAG_KSZ=m
CONFIG_NET_DSA_TAG_OCELOT=m
CONFIG_NET_DSA_TAG_OCELOT_8021Q=m
CONFIG_NET_DSA_TAG_QCA=m
CONFIG_NET_DSA_TAG_RTL4_A=m
CONFIG_NET_DSA_TAG_RTL8_4=m
CONFIG_NET_DSA_TAG_RZN1_A5PSW=m
CONFIG_NET_DSA_TAG_LAN9303=m
CONFIG_NET_DSA_TAG_SJA1105=m
CONFIG_NET_DSA_TAG_TRAILER=m
CONFIG_NET_DSA_TAG_XRS700X=m
CONFIG_VLAN_8021Q=m
CONFIG_VLAN_8021Q_GVRP=y
CONFIG_VLAN_8021Q_MVRP=y
CONFIG_LLC=m
CONFIG_LLC2=m
CONFIG_ATALK=m
CONFIG_DEV_APPLETALK=m
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_X25=m
CONFIG_LAPB=m
CONFIG_PHONET=m
CONFIG_6LOWPAN=m
CONFIG_6LOWPAN_DEBUGFS=y
CONFIG_6LOWPAN_NHC=m
CONFIG_6LOWPAN_NHC_DEST=m
CONFIG_6LOWPAN_NHC_FRAGMENT=m
CONFIG_6LOWPAN_NHC_HOP=m
CONFIG_6LOWPAN_NHC_IPV6=m
CONFIG_6LOWPAN_NHC_MOBILITY=m
CONFIG_6LOWPAN_NHC_ROUTING=m
CONFIG_6LOWPAN_NHC_UDP=m
CONFIG_6LOWPAN_GHC_EXT_HDR_HOP=m
CONFIG_6LOWPAN_GHC_UDP=m
CONFIG_6LOWPAN_GHC_ICMPV6=m
CONFIG_6LOWPAN_GHC_EXT_HDR_DEST=m
CONFIG_6LOWPAN_GHC_EXT_HDR_FRAG=m
CONFIG_6LOWPAN_GHC_EXT_HDR_ROUTE=m
CONFIG_IEEE802154=m
CONFIG_IEEE802154_NL802154_EXPERIMENTAL=y
CONFIG_IEEE802154_SOCKET=m
CONFIG_IEEE802154_6LOWPAN=m
CONFIG_MAC802154=m
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFB=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_CBS=m
CONFIG_NET_SCH_ETF=m
CONFIG_NET_SCH_TAPRIO=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
CONFIG_NET_SCH_SKBPRIO=m
CONFIG_NET_SCH_CHOKE=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_CODEL=m
CONFIG_NET_SCH_FQ_CODEL=m
CONFIG_NET_SCH_CAKE=m
CONFIG_NET_SCH_FQ=m
CONFIG_NET_SCH_HHF=m
CONFIG_NET_SCH_PIE=m
CONFIG_NET_SCH_FQ_PIE=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_PLUG=m
CONFIG_NET_SCH_ETS=m
CONFIG_NET_SCH_DEFAULT=y
# CONFIG_DEFAULT_FQ is not set
# CONFIG_DEFAULT_CODEL is not set
# CONFIG_DEFAULT_FQ_CODEL is not set
# CONFIG_DEFAULT_FQ_PIE is not set
# CONFIG_DEFAULT_SFQ is not set
CONFIG_DEFAULT_PFIFO_FAST=y
CONFIG_DEFAULT_NET_SCH="pfifo_fast"

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_FLOW=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_FLOWER=m
CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_EMATCH_CANID=m
CONFIG_NET_EMATCH_IPSET=m
CONFIG_NET_EMATCH_IPT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_SAMPLE=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
CONFIG_NET_ACT_SKBEDIT=m
CONFIG_NET_ACT_CSUM=m
CONFIG_NET_ACT_MPLS=m
CONFIG_NET_ACT_VLAN=m
CONFIG_NET_ACT_BPF=m
CONFIG_NET_ACT_CONNMARK=m
CONFIG_NET_ACT_CTINFO=m
CONFIG_NET_ACT_SKBMOD=m
CONFIG_NET_ACT_IFE=m
CONFIG_NET_ACT_TUNNEL_KEY=m
CONFIG_NET_ACT_CT=m
CONFIG_NET_ACT_GATE=m
CONFIG_NET_IFE_SKBMARK=m
CONFIG_NET_IFE_SKBPRIO=m
CONFIG_NET_IFE_SKBTCINDEX=m
CONFIG_NET_TC_SKB_EXT=y
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=m
CONFIG_BATMAN_ADV=m
CONFIG_BATMAN_ADV_BATMAN_V=y
CONFIG_BATMAN_ADV_BLA=y
CONFIG_BATMAN_ADV_DAT=y
CONFIG_BATMAN_ADV_NC=y
CONFIG_BATMAN_ADV_MCAST=y
CONFIG_BATMAN_ADV_DEBUG=y
CONFIG_BATMAN_ADV_TRACING=y
CONFIG_OPENVSWITCH=m
CONFIG_OPENVSWITCH_GRE=m
CONFIG_OPENVSWITCH_VXLAN=m
CONFIG_OPENVSWITCH_GENEVE=m
CONFIG_VSOCKETS=m
CONFIG_VSOCKETS_DIAG=m
CONFIG_VSOCKETS_LOOPBACK=m
CONFIG_VIRTIO_VSOCKETS=m
CONFIG_VIRTIO_VSOCKETS_COMMON=m
CONFIG_NETLINK_DIAG=m
CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=m
CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m
CONFIG_NET_NSH=m
CONFIG_HSR=m
CONFIG_NET_SWITCHDEV=y
CONFIG_NET_L3_MASTER_DEV=y
CONFIG_QRTR=m
CONFIG_QRTR_SMD=m
CONFIG_QRTR_TUN=m
CONFIG_QRTR_MHI=m
CONFIG_NET_NCSI=y
CONFIG_NCSI_OEM_CMD_GET_MAC=y
CONFIG_NCSI_OEM_CMD_KEEP_PHY=y
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
CONFIG_NET_DROP_MONITOR=m
# end of Network testing
# end of Networking options

CONFIG_CAN=m
CONFIG_CAN_RAW=m
CONFIG_CAN_BCM=m
CONFIG_CAN_GW=m
CONFIG_CAN_J1939=m
CONFIG_CAN_ISOTP=m
CONFIG_AF_RXRPC=m
CONFIG_AF_RXRPC_IPV6=y
CONFIG_AF_RXRPC_INJECT_LOSS=y
CONFIG_AF_RXRPC_DEBUG=y
CONFIG_RXKAD=y
CONFIG_AF_KCM=m
CONFIG_STREAM_PARSER=y
CONFIG_MCTP=y
CONFIG_MCTP_FLOWS=y
CONFIG_FIB_RULES=y
CONFIG_RFKILL=m
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
CONFIG_RFKILL_GPIO=m
CONFIG_NET_9P=m
CONFIG_NET_9P_FD=m
CONFIG_NET_9P_VIRTIO=m
CONFIG_NET_9P_DEBUG=y
CONFIG_CAIF=m
CONFIG_CAIF_DEBUG=y
CONFIG_CAIF_NETDEV=m
CONFIG_CAIF_USB=m
CONFIG_CEPH_LIB=m
CONFIG_CEPH_LIB_PRETTYDEBUG=y
CONFIG_CEPH_LIB_USE_DNS_RESOLVER=y
CONFIG_NFC=m
CONFIG_NFC_DIGITAL=m
CONFIG_NFC_NCI=m
CONFIG_NFC_NCI_UART=m
CONFIG_NFC_HCI=m
CONFIG_NFC_SHDLC=y

#
# Near Field Communication (NFC) devices
#
CONFIG_NFC_SIM=m
CONFIG_NFC_VIRTUAL_NCI=m
CONFIG_NFC_FDP=m
CONFIG_NFC_FDP_I2C=m
CONFIG_NFC_PN544=m
CONFIG_NFC_PN544_I2C=m
CONFIG_NFC_PN533=m
CONFIG_NFC_PN533_I2C=m
CONFIG_NFC_PN532_UART=m
CONFIG_NFC_MICROREAD=m
CONFIG_NFC_MICROREAD_I2C=m
CONFIG_NFC_MRVL=m
CONFIG_NFC_MRVL_UART=m
CONFIG_NFC_MRVL_I2C=m
CONFIG_NFC_ST21NFCA=m
CONFIG_NFC_ST21NFCA_I2C=m
CONFIG_NFC_ST_NCI=m
CONFIG_NFC_ST_NCI_I2C=m
CONFIG_NFC_NXP_NCI=m
CONFIG_NFC_NXP_NCI_I2C=m
CONFIG_NFC_S3FWRN5=m
CONFIG_NFC_S3FWRN5_I2C=m
CONFIG_NFC_S3FWRN82_UART=m
# end of Near Field Communication (NFC) devices

CONFIG_PSAMPLE=m
CONFIG_NET_IFE=m
CONFIG_LWTUNNEL=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_SOCK_VALIDATE_XMIT=y
CONFIG_NET_SELFTESTS=m
CONFIG_NET_SOCK_MSG=y
CONFIG_NET_DEVLINK=y
CONFIG_PAGE_POOL=y
CONFIG_PAGE_POOL_STATS=y
CONFIG_FAILOVER=m
CONFIG_ETHTOOL_NETLINK=y
CONFIG_NETDEV_ADDR_LIST_TEST=m

#
# Device Drivers
#
CONFIG_HAVE_PCI=y
# CONFIG_PCI is not set
CONFIG_PCCARD=m
CONFIG_PCMCIA=m
CONFIG_PCMCIA_LOAD_CIS=y

#
# PC-card bridges
#

#
# Generic Driver Options
#
CONFIG_AUXILIARY_BUS=y
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_DEVTMPFS_SAFE=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_FW_LOADER_SYSFS=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y
CONFIG_FW_LOADER_COMPRESS=y
CONFIG_FW_LOADER_COMPRESS_XZ=y
CONFIG_FW_LOADER_COMPRESS_ZSTD=y
CONFIG_FW_UPLOAD=y
# end of Firmware loader

CONFIG_WANT_DEV_COREDUMP=y
CONFIG_ALLOW_DEV_COREDUMP=y
CONFIG_DEV_COREDUMP=y
CONFIG_DEBUG_DRIVER=y
CONFIG_DEBUG_DEVRES=y
CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
CONFIG_TEST_ASYNC_DRIVER_PROBE=m
CONFIG_SYS_HYPERVISOR=y
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=m
CONFIG_REGMAP_SPMI=m
CONFIG_REGMAP_MMIO=m
CONFIG_REGMAP_IRQ=y
CONFIG_REGMAP_I3C=m
CONFIG_DMA_SHARED_BUFFER=y
CONFIG_DMA_FENCE_TRACE=y
# end of Generic Driver Options

#
# Bus devices
#
CONFIG_MHI_BUS=m
CONFIG_MHI_BUS_DEBUG=y
CONFIG_MHI_BUS_EP=m
# end of Bus devices

CONFIG_CONNECTOR=m

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

CONFIG_FIRMWARE_MEMMAP=y
CONFIG_GOOGLE_FIRMWARE=y

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

CONFIG_GNSS=m
CONFIG_GNSS_SERIAL=m
CONFIG_GNSS_MTK_SERIAL=m
CONFIG_GNSS_SIRF_SERIAL=m
CONFIG_GNSS_UBX_SERIAL=m
CONFIG_MTD=m
CONFIG_MTD_TESTS=m

#
# Partition parsers
#
CONFIG_MTD_AR7_PARTS=m
CONFIG_MTD_CMDLINE_PARTS=m
CONFIG_MTD_OF_PARTS=m
CONFIG_MTD_REDBOOT_PARTS=m
CONFIG_MTD_REDBOOT_DIRECTORY_BLOCK=-1
CONFIG_MTD_REDBOOT_PARTS_UNALLOCATED=y
CONFIG_MTD_REDBOOT_PARTS_READONLY=y
# end of Partition parsers

#
# User Modules And Translation Layers
#
CONFIG_MTD_BLKDEVS=m
CONFIG_MTD_BLOCK=m
CONFIG_MTD_BLOCK_RO=m

#
# Note that in some cases UBI block is preferred. See MTD_UBI_BLOCK.
#
CONFIG_FTL=m
CONFIG_NFTL=m
CONFIG_NFTL_RW=y
CONFIG_INFTL=m
CONFIG_RFD_FTL=m
CONFIG_SSFDC=m
CONFIG_SM_FTL=m
CONFIG_MTD_OOPS=m
CONFIG_MTD_PSTORE=m
CONFIG_MTD_SWAP=m
CONFIG_MTD_PARTITIONED_MASTER=y

#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=m
CONFIG_MTD_JEDECPROBE=m
CONFIG_MTD_GEN_PROBE=m
CONFIG_MTD_CFI_ADV_OPTIONS=y
CONFIG_MTD_CFI_NOSWAP=y
# CONFIG_MTD_CFI_BE_BYTE_SWAP is not set
# CONFIG_MTD_CFI_LE_BYTE_SWAP is not set
CONFIG_MTD_CFI_GEOMETRY=y
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
CONFIG_MTD_MAP_BANK_WIDTH_8=y
CONFIG_MTD_MAP_BANK_WIDTH_16=y
CONFIG_MTD_MAP_BANK_WIDTH_32=y
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
CONFIG_MTD_CFI_I4=y
CONFIG_MTD_CFI_I8=y
CONFIG_MTD_OTP=y
CONFIG_MTD_CFI_INTELEXT=m
CONFIG_MTD_CFI_AMDSTD=m
CONFIG_MTD_CFI_STAA=m
CONFIG_MTD_CFI_UTIL=m
CONFIG_MTD_RAM=m
CONFIG_MTD_ROM=m
CONFIG_MTD_ABSENT=m
# end of RAM/ROM/Flash chip drivers

#
# NAND
#
CONFIG_MTD_NAND_CORE=m
CONFIG_MTD_RAW_NAND=m

#
# Raw/parallel NAND flash controllers
#

#
# Misc
#
CONFIG_MTD_NAND_NANDSIM=m

#
# ECC engine support
#
CONFIG_MTD_NAND_ECC=y
CONFIG_MTD_NAND_ECC_SW_HAMMING=y
CONFIG_MTD_NAND_ECC_SW_HAMMING_SMC=y
CONFIG_MTD_NAND_ECC_SW_BCH=y
# end of ECC engine support
# end of NAND

#
# LPDDR & LPDDR2 PCM memory drivers
#
CONFIG_MTD_LPDDR=m
CONFIG_MTD_QINFO_PROBE=m
# end of LPDDR & LPDDR2 PCM memory drivers

CONFIG_MTD_UBI=m
CONFIG_MTD_UBI_WL_THRESHOLD=4096
CONFIG_MTD_UBI_BEB_LIMIT=20
CONFIG_MTD_UBI_FASTMAP=y
CONFIG_MTD_UBI_GLUEBI=m
CONFIG_MTD_UBI_BLOCK=y
CONFIG_DTC=y
CONFIG_OF=y
CONFIG_OF_UNITTEST=y
CONFIG_OF_FLATTREE=y
CONFIG_OF_EARLY_FLATTREE=y
CONFIG_OF_KOBJ=y
CONFIG_OF_DYNAMIC=y
CONFIG_OF_IRQ=y
CONFIG_OF_RESERVED_MEM=y
CONFIG_OF_RESOLVE=y
CONFIG_OF_OVERLAY=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_NULL_BLK=m
CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION=y
CONFIG_CDROM=m
CONFIG_ZRAM=m
CONFIG_ZRAM_DEF_COMP_LZORLE=y
# CONFIG_ZRAM_DEF_COMP_ZSTD is not set
# CONFIG_ZRAM_DEF_COMP_LZ4 is not set
# CONFIG_ZRAM_DEF_COMP_LZO is not set
# CONFIG_ZRAM_DEF_COMP_LZ4HC is not set
# CONFIG_ZRAM_DEF_COMP_842 is not set
CONFIG_ZRAM_DEF_COMP="lzo-rle"
CONFIG_ZRAM_WRITEBACK=y
CONFIG_ZRAM_MEMORY_TRACKING=y
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLK_DEV_DRBD=m
CONFIG_DRBD_FAULT_INJECTION=y
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
CONFIG_CDROM_PKTCDVD_WCACHE=y
CONFIG_ATA_OVER_ETH=m

#
# S/390 block device drivers
#
CONFIG_DCSSBLK=m
CONFIG_DASD=m
CONFIG_DASD_PROFILE=y
CONFIG_DASD_ECKD=m
CONFIG_DASD_FBA=m
CONFIG_DASD_DIAG=m
CONFIG_DASD_EER=y
CONFIG_SCM_BLOCK=m
CONFIG_VIRTIO_BLK=m
CONFIG_BLK_DEV_RBD=m
CONFIG_BLK_DEV_UBLK=m

#
# NVME Support
#
CONFIG_NVME_COMMON=m
CONFIG_NVME_CORE=m
CONFIG_NVME_MULTIPATH=y
CONFIG_NVME_VERBOSE_ERRORS=y
CONFIG_NVME_FABRICS=m
CONFIG_NVME_FC=m
CONFIG_NVME_TCP=m
CONFIG_NVME_AUTH=y
CONFIG_NVME_TARGET=m
CONFIG_NVME_TARGET_PASSTHRU=y
CONFIG_NVME_TARGET_LOOP=m
CONFIG_NVME_TARGET_FC=m
CONFIG_NVME_TARGET_FCLOOP=m
CONFIG_NVME_TARGET_TCP=m
CONFIG_NVME_TARGET_AUTH=y
# end of NVME Support

#
# Misc devices
#
CONFIG_SENSORS_LIS3LV02D=m
CONFIG_AD525X_DPOT=m
CONFIG_AD525X_DPOT_I2C=m
CONFIG_DUMMY_IRQ=m
CONFIG_ICS932S401=m
CONFIG_ENCLOSURE_SERVICES=m
CONFIG_APDS9802ALS=m
CONFIG_ISL29003=m
CONFIG_ISL29020=m
CONFIG_SENSORS_TSL2550=m
CONFIG_SENSORS_BH1770=m
CONFIG_SENSORS_APDS990X=m
CONFIG_HMC6352=m
CONFIG_DS1682=m
CONFIG_OPEN_DICE=m
CONFIG_C2PORT=m

#
# EEPROM support
#
CONFIG_EEPROM_AT24=m
CONFIG_EEPROM_LEGACY=m
CONFIG_EEPROM_MAX6875=m
CONFIG_EEPROM_93CX6=m
CONFIG_EEPROM_IDT_89HPESX=m
CONFIG_EEPROM_EE1004=m
# end of EEPROM support

#
# Texas Instruments shared transport line discipline
#
CONFIG_TI_ST=m
# end of Texas Instruments shared transport line discipline

CONFIG_SENSORS_LIS3_I2C=m

#
# Altera FPGA firmware download module (requires I2C)
#
CONFIG_ALTERA_STAPL=m
CONFIG_ECHO=m
CONFIG_UACCE=m
CONFIG_PVPANIC=y
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=m
CONFIG_RAID_ATTRS=m
CONFIG_SCSI_COMMON=m
CONFIG_SCSI=m
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=m
CONFIG_CHR_DEV_ST=m
CONFIG_BLK_DEV_SR=m
CONFIG_CHR_DEV_SG=m
CONFIG_BLK_DEV_BSG=y
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_ENCLOSURE=m
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_SCSI_SRP_ATTRS=m
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
CONFIG_SCSI_FDOMAIN=m
CONFIG_SCSI_DEBUG=m
CONFIG_ZFCP=m
CONFIG_SCSI_VIRTIO=m
CONFIG_SCSI_LOWLEVEL_PCMCIA=y
CONFIG_PCMCIA_AHA152X=m
CONFIG_PCMCIA_FDOMAIN=m
CONFIG_PCMCIA_QLOGIC=m
CONFIG_PCMCIA_SYM53C500=m
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=m
CONFIG_SCSI_DH_HP_SW=m
CONFIG_SCSI_DH_EMC=m
CONFIG_SCSI_DH_ALUA=m
# end of SCSI device support

CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_MD_CLUSTER=m
CONFIG_BCACHE=m
CONFIG_BCACHE_DEBUG=y
CONFIG_BCACHE_CLOSURES_DEBUG=y
CONFIG_BCACHE_ASYNC_REGISTRATION=y
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=m
CONFIG_DM_DEBUG=y
CONFIG_DM_BUFIO=m
CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING=y
CONFIG_DM_DEBUG_BLOCK_STACK_TRACING=y
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
CONFIG_DM_UNSTRIPED=m
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_THIN_PROVISIONING=m
CONFIG_DM_CACHE=m
CONFIG_DM_CACHE_SMQ=m
CONFIG_DM_WRITECACHE=m
CONFIG_DM_EBS=m
CONFIG_DM_ERA=m
CONFIG_DM_CLONE=m
CONFIG_DM_MIRROR=m
CONFIG_DM_LOG_USERSPACE=m
CONFIG_DM_RAID=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
CONFIG_DM_MULTIPATH_HST=m
CONFIG_DM_MULTIPATH_IOA=m
CONFIG_DM_DELAY=m
CONFIG_DM_DUST=m
CONFIG_DM_UEVENT=y
CONFIG_DM_FLAKEY=m
CONFIG_DM_VERITY=m
CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG=y
CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING=y
CONFIG_DM_VERITY_FEC=y
CONFIG_DM_SWITCH=m
CONFIG_DM_LOG_WRITES=m
CONFIG_DM_INTEGRITY=m
CONFIG_DM_ZONED=m
CONFIG_DM_AUDIT=y
CONFIG_TARGET_CORE=m
CONFIG_TCM_IBLOCK=m
CONFIG_TCM_FILEIO=m
CONFIG_TCM_PSCSI=m
CONFIG_TCM_USER2=m
CONFIG_LOOPBACK_TARGET=m
CONFIG_TCM_FC=m
CONFIG_ISCSI_TARGET=m
CONFIG_NETDEVICES=y
CONFIG_MII=m
CONFIG_NET_CORE=y
CONFIG_BONDING=m
CONFIG_DUMMY=m
CONFIG_WIREGUARD=m
CONFIG_WIREGUARD_DEBUG=y
CONFIG_EQUALIZER=m
CONFIG_IFB=m
CONFIG_NET_TEAM=m
CONFIG_NET_TEAM_MODE_BROADCAST=m
CONFIG_NET_TEAM_MODE_ROUNDROBIN=m
CONFIG_NET_TEAM_MODE_RANDOM=m
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=m
CONFIG_NET_TEAM_MODE_LOADBALANCE=m
CONFIG_MACVLAN=m
CONFIG_MACVTAP=m
CONFIG_IPVLAN_L3S=y
CONFIG_IPVLAN=m
CONFIG_IPVTAP=m
CONFIG_VXLAN=m
CONFIG_GENEVE=m
CONFIG_BAREUDP=m
CONFIG_GTP=m
CONFIG_AMT=m
CONFIG_MACSEC=m
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=m
CONFIG_TAP=m
CONFIG_TUN_VNET_CROSS_LE=y
CONFIG_VETH=m
CONFIG_VIRTIO_NET=m
CONFIG_NLMON=m
CONFIG_NET_VRF=m
CONFIG_VSOCKMON=m
CONFIG_MHI_NET=m
CONFIG_ARCNET=m
CONFIG_ARCNET_1201=m
CONFIG_ARCNET_1051=m
CONFIG_ARCNET_RAW=m
CONFIG_ARCNET_CAP=m
CONFIG_ARCNET_COM90xx=m
CONFIG_ARCNET_COM90xxIO=m
CONFIG_ARCNET_RIM_I=m
CONFIG_ARCNET_COM20020=m
CONFIG_ARCNET_COM20020_CS=m
CONFIG_ATM_DRIVERS=y
CONFIG_ATM_DUMMY=m
CONFIG_ATM_TCP=m
CONFIG_CAIF_DRIVERS=y
CONFIG_CAIF_TTY=m
CONFIG_CAIF_VIRTIO=m

#
# Distributed Switch Architecture drivers
#
CONFIG_B53=m
CONFIG_B53_MDIO_DRIVER=m
CONFIG_B53_SERDES=m
CONFIG_NET_DSA_LOOP=m
CONFIG_NET_DSA_MT7530=m
CONFIG_NET_DSA_MV88E6060=m
CONFIG_NET_DSA_MICROCHIP_KSZ_COMMON=m
CONFIG_NET_DSA_MICROCHIP_KSZ9477_I2C=m
CONFIG_NET_DSA_MICROCHIP_KSZ8863_SMI=m
CONFIG_NET_DSA_MV88E6XXX=m
CONFIG_NET_DSA_MV88E6XXX_PTP=y
CONFIG_NET_DSA_AR9331=m
CONFIG_NET_DSA_QCA8K=m
CONFIG_NET_DSA_XRS700X=m
CONFIG_NET_DSA_XRS700X_I2C=m
CONFIG_NET_DSA_XRS700X_MDIO=m
CONFIG_NET_DSA_REALTEK=m
CONFIG_NET_DSA_REALTEK_MDIO=m
CONFIG_NET_DSA_REALTEK_SMI=m
CONFIG_NET_DSA_REALTEK_RTL8365MB=m
CONFIG_NET_DSA_REALTEK_RTL8366RB=m
CONFIG_NET_DSA_SMSC_LAN9303=m
CONFIG_NET_DSA_SMSC_LAN9303_I2C=m
CONFIG_NET_DSA_SMSC_LAN9303_MDIO=m
# end of Distributed Switch Architecture drivers

CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_3C589=m
CONFIG_NET_VENDOR_ALACRITECH=y
CONFIG_ALTERA_TSE=m
CONFIG_NET_VENDOR_AMAZON=y
CONFIG_NET_VENDOR_AMD=y
CONFIG_PCMCIA_NMCLAN=m
CONFIG_NET_VENDOR_AQUANTIA=y
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ASIX=y
CONFIG_NET_VENDOR_CAVIUM=y
CONFIG_NET_VENDOR_CORTINA=y
CONFIG_NET_VENDOR_DAVICOM=y
CONFIG_NET_VENDOR_ENGLEDER=y
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_FUJITSU=y
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_NET_VENDOR_FUNGIBLE=y
CONFIG_NET_VENDOR_GOOGLE=y
CONFIG_NET_VENDOR_HUAWEI=y
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_NET_VENDOR_WANGXUN=y
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_MARVELL=y
CONFIG_PRESTERA=m
CONFIG_NET_VENDOR_MELLANOX=y
CONFIG_MLXSW_CORE=m
CONFIG_MLXSW_CORE_THERMAL=y
CONFIG_MLXSW_I2C=m
CONFIG_MLXSW_MINIMAL=m
CONFIG_MLXFW=m
CONFIG_NET_VENDOR_MICROCHIP=y
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MICROSOFT=y
CONFIG_NET_VENDOR_NI=y
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NET_VENDOR_NETRONOME=y
CONFIG_NET_VENDOR_8390=y
CONFIG_PCMCIA_AXNET=m
CONFIG_PCMCIA_PCNET=m
CONFIG_NET_VENDOR_PENSANDO=y
CONFIG_NET_VENDOR_QUALCOMM=y
CONFIG_QCA7000=m
CONFIG_QCA7000_UART=m
CONFIG_RMNET=m
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
CONFIG_NET_VENDOR_SOLARFLARE=y
CONFIG_NET_VENDOR_SMSC=y
CONFIG_PCMCIA_SMC91C92=m
CONFIG_NET_VENDOR_SOCIONEXT=y
CONFIG_NET_VENDOR_SYNOPSYS=y
CONFIG_NET_VENDOR_VERTEXCOM=y
CONFIG_NET_VENDOR_VIA=y
CONFIG_NET_VENDOR_XILINX=y
CONFIG_NET_VENDOR_XIRCOM=y
CONFIG_PCMCIA_XIRC2PS=m
CONFIG_PHYLINK=m
CONFIG_PHYLIB=m
CONFIG_SWPHY=y
CONFIG_LED_TRIGGER_PHY=y
CONFIG_FIXED_PHY=m
CONFIG_SFP=m

#
# MII PHY device drivers
#
CONFIG_AMD_PHY=m
CONFIG_ADIN_PHY=m
CONFIG_ADIN1100_PHY=m
CONFIG_AQUANTIA_PHY=m
CONFIG_AX88796B_PHY=m
CONFIG_BROADCOM_PHY=m
CONFIG_BCM54140_PHY=m
CONFIG_BCM7XXX_PHY=m
CONFIG_BCM84881_PHY=m
CONFIG_BCM87XX_PHY=m
CONFIG_BCM_NET_PHYLIB=m
CONFIG_BCM_NET_PHYPTP=m
CONFIG_CICADA_PHY=m
CONFIG_CORTINA_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_ICPLUS_PHY=m
CONFIG_LXT_PHY=m
CONFIG_INTEL_XWAY_PHY=m
CONFIG_LSI_ET1011C_PHY=m
CONFIG_MARVELL_PHY=m
CONFIG_MARVELL_10G_PHY=m
CONFIG_MARVELL_88X2222_PHY=m
CONFIG_MAXLINEAR_GPHY=m
CONFIG_MEDIATEK_GE_PHY=m
CONFIG_MICREL_PHY=m
CONFIG_MICROCHIP_PHY=m
CONFIG_MICROCHIP_T1_PHY=m
CONFIG_MICROSEMI_PHY=m
CONFIG_MOTORCOMM_PHY=m
CONFIG_NATIONAL_PHY=m
CONFIG_NXP_C45_TJA11XX_PHY=m
CONFIG_AT803X_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_REALTEK_PHY=m
CONFIG_RENESAS_PHY=m
CONFIG_ROCKCHIP_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_STE10XP=m
CONFIG_TERANETICS_PHY=m
CONFIG_DP83822_PHY=m
CONFIG_DP83TC811_PHY=m
CONFIG_DP83848_PHY=m
CONFIG_DP83867_PHY=m
CONFIG_DP83869_PHY=m
CONFIG_DP83TD510_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_XILINX_GMII2RGMII=m
CONFIG_PSE_CONTROLLER=y
CONFIG_PSE_REGULATOR=m
CONFIG_CAN_DEV=m
CONFIG_CAN_VCAN=m
CONFIG_CAN_VXCAN=m
CONFIG_CAN_NETLINK=y
CONFIG_CAN_CALC_BITTIMING=y
CONFIG_CAN_RX_OFFLOAD=y
CONFIG_CAN_CAN327=m
CONFIG_CAN_SLCAN=m
CONFIG_CAN_M_CAN=m
CONFIG_CAN_DEBUG_DEVICES=y

#
# MCTP Device Drivers
#
CONFIG_MCTP_SERIAL=m
CONFIG_MCTP_TRANSPORT_I2C=m
# end of MCTP Device Drivers

CONFIG_MDIO_DEVICE=m
CONFIG_MDIO_BUS=m
CONFIG_FWNODE_MDIO=m
CONFIG_OF_MDIO=m
CONFIG_MDIO_DEVRES=m
CONFIG_MDIO_BITBANG=m
CONFIG_MDIO_GPIO=m
CONFIG_MDIO_I2C=m

#
# MDIO Multiplexers
#
CONFIG_MDIO_BUS_MUX=m
CONFIG_MDIO_BUS_MUX_MULTIPLEXER=m

#
# PCS device drivers
#
CONFIG_PCS_ALTERA_TSE=m
# end of PCS device drivers

CONFIG_PPP=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_FILTER=y
CONFIG_PPP_MPPE=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPPOATM=m
CONFIG_PPPOE=m
CONFIG_PPTP=m
CONFIG_PPPOL2TP=m
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_SLIP=m
CONFIG_SLHC=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLIP_SMART=y
CONFIG_SLIP_MODE_SLIP6=y

#
# S/390 network device drivers
#
CONFIG_LCS=m
CONFIG_CTCM=m
CONFIG_NETIUCV=m
CONFIG_SMSGIUCV=m
CONFIG_SMSGIUCV_EVENT=m
CONFIG_QETH=m
CONFIG_QETH_L2=m
CONFIG_QETH_L3=m
CONFIG_QETH_OSX=y
CONFIG_CCWGROUP=m
# end of S/390 network device drivers

#
# Host-side USB support is needed for USB Network Adapter support
#
CONFIG_WAN=y
CONFIG_HDLC=m
CONFIG_HDLC_RAW=m
CONFIG_HDLC_RAW_ETH=m
CONFIG_HDLC_CISCO=m
CONFIG_HDLC_FR=m
CONFIG_HDLC_PPP=m
CONFIG_HDLC_X25=m
CONFIG_LAPBETHER=m
CONFIG_IEEE802154_DRIVERS=m
CONFIG_IEEE802154_FAKELB=m
CONFIG_IEEE802154_HWSIM=m

#
# Wireless WAN
#
CONFIG_WWAN=m
CONFIG_WWAN_DEBUGFS=y
CONFIG_WWAN_HWSIM=m
CONFIG_MHI_WWAN_CTRL=m
CONFIG_MHI_WWAN_MBIM=m
CONFIG_RPMSG_WWAN_CTRL=m
# end of Wireless WAN

CONFIG_NETDEVSIM=m
CONFIG_NET_FAILOVER=m

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_LEDS=m
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_SPARSEKMAP=m
CONFIG_INPUT_MATRIXKMAP=m
CONFIG_INPUT_VIVALDIFMAP=m

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=m
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=m
CONFIG_INPUT_EVBUG=m

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ADC=m
CONFIG_KEYBOARD_ADP5588=m
CONFIG_KEYBOARD_ADP5589=m
CONFIG_KEYBOARD_ATKBD=m
CONFIG_KEYBOARD_QT1050=m
CONFIG_KEYBOARD_QT1070=m
CONFIG_KEYBOARD_QT2160=m
CONFIG_KEYBOARD_DLINK_DIR685=m
CONFIG_KEYBOARD_LKKBD=m
CONFIG_KEYBOARD_GPIO=m
CONFIG_KEYBOARD_GPIO_POLLED=m
CONFIG_KEYBOARD_TCA6416=m
CONFIG_KEYBOARD_TCA8418=m
CONFIG_KEYBOARD_MATRIX=m
CONFIG_KEYBOARD_LM8323=m
CONFIG_KEYBOARD_LM8333=m
CONFIG_KEYBOARD_MAX7359=m
CONFIG_KEYBOARD_MCS=m
CONFIG_KEYBOARD_MPR121=m
CONFIG_KEYBOARD_NEWTON=m
CONFIG_KEYBOARD_PINEPHONE=m
CONFIG_KEYBOARD_STOWAWAY=m
CONFIG_KEYBOARD_SUNKBD=m
CONFIG_KEYBOARD_TM2_TOUCHKEY=m
CONFIG_KEYBOARD_XTKBD=m
CONFIG_KEYBOARD_CAP11XX=m
CONFIG_KEYBOARD_CYPRESS_SF=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=m
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_ELANTECH_SMBUS=y
CONFIG_MOUSE_PS2_SENTELIC=y
CONFIG_MOUSE_PS2_TOUCHKIT=y
CONFIG_MOUSE_PS2_FOCALTECH=y
CONFIG_MOUSE_PS2_SMBUS=y
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_CYAPA=m
CONFIG_MOUSE_ELAN_I2C=m
CONFIG_MOUSE_ELAN_I2C_I2C=y
CONFIG_MOUSE_ELAN_I2C_SMBUS=y
CONFIG_MOUSE_VSXXXAA=m
CONFIG_MOUSE_GPIO=m
CONFIG_MOUSE_SYNAPTICS_I2C=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADC=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_232=m
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
CONFIG_JOYSTICK_ZHENHUA=m
CONFIG_JOYSTICK_AS5011=m
CONFIG_JOYSTICK_JOYDUMP=m
CONFIG_JOYSTICK_QWIIC=m
CONFIG_JOYSTICK_FSIA6B=m
CONFIG_INPUT_TABLET=y
CONFIG_TABLET_SERIAL_WACOM4=m
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_AD7879=m
CONFIG_TOUCHSCREEN_AD7879_I2C=m
CONFIG_TOUCHSCREEN_ADC=m
CONFIG_TOUCHSCREEN_AR1021_I2C=m
CONFIG_TOUCHSCREEN_ATMEL_MXT=m
CONFIG_TOUCHSCREEN_AUO_PIXCIR=m
CONFIG_TOUCHSCREEN_BU21013=m
CONFIG_TOUCHSCREEN_BU21029=m
CONFIG_TOUCHSCREEN_CHIPONE_ICN8318=m
CONFIG_TOUCHSCREEN_CY8CTMA140=m
CONFIG_TOUCHSCREEN_CY8CTMG110=m
CONFIG_TOUCHSCREEN_CYTTSP_CORE=m
CONFIG_TOUCHSCREEN_CYTTSP_I2C=m
CONFIG_TOUCHSCREEN_CYTTSP4_CORE=m
CONFIG_TOUCHSCREEN_CYTTSP4_I2C=m
CONFIG_TOUCHSCREEN_DYNAPRO=m
CONFIG_TOUCHSCREEN_HAMPSHIRE=m
CONFIG_TOUCHSCREEN_EETI=m
CONFIG_TOUCHSCREEN_EGALAX=m
CONFIG_TOUCHSCREEN_EGALAX_SERIAL=m
CONFIG_TOUCHSCREEN_EXC3000=m
CONFIG_TOUCHSCREEN_FUJITSU=m
CONFIG_TOUCHSCREEN_GOODIX=m
CONFIG_TOUCHSCREEN_HIDEEP=m
CONFIG_TOUCHSCREEN_HYCON_HY46XX=m
CONFIG_TOUCHSCREEN_ILI210X=m
CONFIG_TOUCHSCREEN_ILITEK=m
CONFIG_TOUCHSCREEN_S6SY761=m
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_EKTF2127=m
CONFIG_TOUCHSCREEN_ELAN=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_WACOM_W8001=m
CONFIG_TOUCHSCREEN_WACOM_I2C=m
CONFIG_TOUCHSCREEN_MAX11801=m
CONFIG_TOUCHSCREEN_MCS5000=m
CONFIG_TOUCHSCREEN_MMS114=m
CONFIG_TOUCHSCREEN_MELFAS_MIP4=m
CONFIG_TOUCHSCREEN_MSG2638=m
CONFIG_TOUCHSCREEN_MTOUCH=m
CONFIG_TOUCHSCREEN_IMAGIS=m
CONFIG_TOUCHSCREEN_INEXIO=m
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_TOUCHSCREEN_PENMOUNT=m
CONFIG_TOUCHSCREEN_EDT_FT5X06=m
CONFIG_TOUCHSCREEN_TOUCHRIGHT=m
CONFIG_TOUCHSCREEN_TOUCHWIN=m
CONFIG_TOUCHSCREEN_PIXCIR=m
CONFIG_TOUCHSCREEN_WDT87XX_I2C=m
CONFIG_TOUCHSCREEN_TOUCHIT213=m
CONFIG_TOUCHSCREEN_TSC_SERIO=m
CONFIG_TOUCHSCREEN_TSC200X_CORE=m
CONFIG_TOUCHSCREEN_TSC2004=m
CONFIG_TOUCHSCREEN_TSC2007=m
CONFIG_TOUCHSCREEN_TSC2007_IIO=y
CONFIG_TOUCHSCREEN_RM_TS=m
CONFIG_TOUCHSCREEN_SILEAD=m
CONFIG_TOUCHSCREEN_SIS_I2C=m
CONFIG_TOUCHSCREEN_ST1232=m
CONFIG_TOUCHSCREEN_STMFTS=m
CONFIG_TOUCHSCREEN_SX8654=m
CONFIG_TOUCHSCREEN_TPS6507X=m
CONFIG_TOUCHSCREEN_ZET6223=m
CONFIG_TOUCHSCREEN_ZFORCE=m
CONFIG_TOUCHSCREEN_COLIBRI_VF50=m
CONFIG_TOUCHSCREEN_ROHM_BU21023=m
CONFIG_TOUCHSCREEN_IQS5XX=m
CONFIG_TOUCHSCREEN_ZINITIX=m
CONFIG_INPUT_MISC=y
CONFIG_INPUT_AD714X=m
CONFIG_INPUT_AD714X_I2C=m
CONFIG_INPUT_ATMEL_CAPTOUCH=m
CONFIG_INPUT_BMA150=m
CONFIG_INPUT_E3X0_BUTTON=m
CONFIG_INPUT_MMA8450=m
CONFIG_INPUT_GPIO_BEEPER=m
CONFIG_INPUT_GPIO_DECODER=m
CONFIG_INPUT_GPIO_VIBRA=m
CONFIG_INPUT_KXTJ9=m
CONFIG_INPUT_REGULATOR_HAPTIC=m
CONFIG_INPUT_UINPUT=m
CONFIG_INPUT_PCF8574=m
CONFIG_INPUT_PWM_BEEPER=m
CONFIG_INPUT_PWM_VIBRA=m
CONFIG_INPUT_GPIO_ROTARY_ENCODER=m
CONFIG_INPUT_DA7280_HAPTICS=m
CONFIG_INPUT_ADXL34X=m
CONFIG_INPUT_ADXL34X_I2C=m
CONFIG_INPUT_IBM_PANEL=m
CONFIG_INPUT_IQS269A=m
CONFIG_INPUT_IQS626A=m
CONFIG_INPUT_IQS7222=m
CONFIG_INPUT_CMA3000=m
CONFIG_INPUT_CMA3000_I2C=m
CONFIG_INPUT_DRV260X_HAPTICS=m
CONFIG_INPUT_DRV2665_HAPTICS=m
CONFIG_INPUT_DRV2667_HAPTICS=m
CONFIG_RMI4_CORE=m
CONFIG_RMI4_I2C=m
CONFIG_RMI4_SMB=m
CONFIG_RMI4_F03=y
CONFIG_RMI4_F03_SERIO=m
CONFIG_RMI4_2D_SENSOR=y
CONFIG_RMI4_F11=y
CONFIG_RMI4_F12=y
CONFIG_RMI4_F30=y
CONFIG_RMI4_F34=y
CONFIG_RMI4_F3A=y
CONFIG_RMI4_F55=y

#
# Hardware I/O ports
#
CONFIG_SERIO=m
CONFIG_SERIO_SERPORT=m
CONFIG_SERIO_LIBPS2=m
CONFIG_SERIO_RAW=m
CONFIG_SERIO_PS2MULT=m
CONFIG_SERIO_GPIO_PS2=m
CONFIG_USERIO=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_LDISC_AUTOLOAD=y
CONFIG_N_GSM=m
CONFIG_NULL_TTY=m
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IUCV=y
CONFIG_RPMSG_TTY=m
CONFIG_SERIAL_DEV_BUS=m
CONFIG_TTY_PRINTK=m
CONFIG_TTY_PRINTK_LEVEL=6
CONFIG_VIRTIO_CONSOLE=m
CONFIG_IPMB_DEVICE_INTERFACE=m
CONFIG_HW_RANDOM=m
CONFIG_HW_RANDOM_VIRTIO=m
CONFIG_HW_RANDOM_S390=m

#
# PCMCIA character devices
#
CONFIG_SYNCLINK_CS=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_SCR24X=m
CONFIG_IPWIRELESS=m
# end of PCMCIA character devices

CONFIG_DEVMEM=y
CONFIG_HANGCHECK_TIMER=m

#
# S/390 character device drivers
#
CONFIG_TN3270=m
CONFIG_TN3270_TTY=m
CONFIG_TN3270_FS=m
CONFIG_TN3215=y
CONFIG_TN3215_CONSOLE=y
CONFIG_CCW_CONSOLE=y
CONFIG_SCLP_TTY=y
CONFIG_SCLP_CONSOLE=y
CONFIG_SCLP_VT220_TTY=y
CONFIG_SCLP_VT220_CONSOLE=y
CONFIG_HMC_DRV=m
CONFIG_SCLP_OFB=y
CONFIG_S390_UV_UAPI=m
CONFIG_S390_TAPE=m

#
# S/390 tape hardware support
#
CONFIG_S390_TAPE_34XX=m
CONFIG_S390_TAPE_3590=m
CONFIG_VMLOGRDR=m
CONFIG_VMCP=y
CONFIG_VMCP_CMA_SIZE=4
CONFIG_MONREADER=m
CONFIG_MONWRITER=m
CONFIG_S390_VMUR=m
CONFIG_XILLYBUS_CLASS=m
CONFIG_XILLYBUS=m
CONFIG_XILLYBUS_OF=m
CONFIG_RANDOM_TRUST_CPU=y
CONFIG_RANDOM_TRUST_BOOTLOADER=y
# end of Character devices

#
# I2C support
#
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_MUX=m

#
# Multiplexer I2C Chip support
#
CONFIG_I2C_ARB_GPIO_CHALLENGE=m
CONFIG_I2C_MUX_GPIO=m
CONFIG_I2C_MUX_GPMUX=m
CONFIG_I2C_MUX_LTC4306=m
CONFIG_I2C_MUX_PCA9541=m
CONFIG_I2C_MUX_PCA954x=m
CONFIG_I2C_MUX_PINCTRL=m
CONFIG_I2C_DEMUX_PINCTRL=m
CONFIG_I2C_MUX_MLXCPLD=m
# end of Multiplexer I2C Chip support

CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=m
CONFIG_I2C_STUB=m
CONFIG_I2C_SLAVE=y
CONFIG_I2C_SLAVE_EEPROM=m
CONFIG_I2C_SLAVE_TESTUNIT=m
CONFIG_I2C_DEBUG_CORE=y
CONFIG_I2C_DEBUG_ALGO=y
# end of I2C support

CONFIG_I3C=m
CONFIG_SPMI=m
CONFIG_HSI=m
CONFIG_HSI_BOARDINFO=y

#
# HSI controllers
#

#
# HSI clients
#
CONFIG_HSI_CHAR=m
CONFIG_PPS=m
CONFIG_PPS_DEBUG=y

#
# PPS clients support
#
CONFIG_PPS_CLIENT_KTIMER=m
CONFIG_PPS_CLIENT_LDISC=m
CONFIG_PPS_CLIENT_GPIO=m

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=m
CONFIG_PTP_1588_CLOCK_OPTIONAL=m
CONFIG_DP83640_PHY=m
CONFIG_PTP_1588_CLOCK_IDT82P33=m
CONFIG_PTP_1588_CLOCK_IDTCM=m
# end of PTP clock support

CONFIG_PINCTRL=y
CONFIG_PINMUX=y
CONFIG_PINCONF=y
CONFIG_GENERIC_PINCONF=y
CONFIG_DEBUG_PINCTRL=y
CONFIG_PINCTRL_CY8C95X0=m
CONFIG_PINCTRL_MCP23S08_I2C=m
CONFIG_PINCTRL_MCP23S08=m

#
# Renesas pinctrl drivers
#
# end of Renesas pinctrl drivers

CONFIG_GPIOLIB=y
CONFIG_GPIOLIB_FASTPATH_LIMIT=512
CONFIG_GPIOLIB_IRQCHIP=y
CONFIG_DEBUG_GPIO=y
CONFIG_GPIO_SYSFS=y
CONFIG_GPIO_CDEV=y
CONFIG_GPIO_CDEV_V1=y
CONFIG_GPIO_MAX730X=m

#
# I2C GPIO expanders
#
CONFIG_GPIO_MAX7300=m
CONFIG_GPIO_MAX732X=m
CONFIG_GPIO_PCA953X=m
CONFIG_GPIO_PCA953X_IRQ=y
CONFIG_GPIO_PCA9570=m
CONFIG_GPIO_PCF857X=m
CONFIG_GPIO_TPIC2810=m
# end of I2C GPIO expanders

#
# MFD GPIO expanders
#
# end of MFD GPIO expanders

#
# Virtual GPIO drivers
#
CONFIG_GPIO_AGGREGATOR=m
CONFIG_GPIO_MOCKUP=m
CONFIG_GPIO_VIRTIO=m
CONFIG_GPIO_SIM=m
# end of Virtual GPIO drivers

CONFIG_POWER_RESET=y
CONFIG_POWER_RESET_REGULATOR=y
CONFIG_POWER_RESET_RESTART=y
CONFIG_REBOOT_MODE=m
CONFIG_NVMEM_REBOOT_MODE=m
CONFIG_POWER_SUPPLY=y
CONFIG_POWER_SUPPLY_DEBUG=y
CONFIG_GENERIC_ADC_BATTERY=m
CONFIG_IP5XXX_POWER=m
CONFIG_TEST_POWER=m
CONFIG_CHARGER_ADP5061=m
CONFIG_BATTERY_CW2015=m
CONFIG_BATTERY_DS2782=m
CONFIG_BATTERY_SAMSUNG_SDI=y
CONFIG_BATTERY_SBS=m
CONFIG_CHARGER_SBS=m
CONFIG_MANAGER_SBS=m
CONFIG_BATTERY_BQ27XXX=m
CONFIG_BATTERY_BQ27XXX_I2C=m
CONFIG_BATTERY_BQ27XXX_DT_UPDATES_NVM=y
CONFIG_BATTERY_MAX17040=m
CONFIG_BATTERY_MAX17042=m
CONFIG_CHARGER_MAX8903=m
CONFIG_CHARGER_LP8727=m
CONFIG_CHARGER_GPIO=m
CONFIG_CHARGER_MANAGER=m
CONFIG_CHARGER_LT3651=m
CONFIG_CHARGER_LTC4162L=m
CONFIG_CHARGER_DETECTOR_MAX14656=m
CONFIG_CHARGER_MAX77976=m
CONFIG_CHARGER_BQ2415X=m
CONFIG_CHARGER_BQ24190=m
CONFIG_CHARGER_BQ24257=m
CONFIG_CHARGER_BQ24735=m
CONFIG_CHARGER_BQ2515X=m
CONFIG_CHARGER_BQ25890=m
CONFIG_CHARGER_BQ25980=m
CONFIG_CHARGER_BQ256XX=m
CONFIG_CHARGER_SMB347=m
CONFIG_BATTERY_GAUGE_LTC2941=m
CONFIG_BATTERY_RT5033=m
CONFIG_CHARGER_RT9455=m
CONFIG_CHARGER_UCS1002=m
CONFIG_CHARGER_BD99954=m
CONFIG_BATTERY_UG3105=m
CONFIG_THERMAL=y
CONFIG_THERMAL_NETLINK=y
CONFIG_THERMAL_STATISTICS=y
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_OF=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
CONFIG_THERMAL_GOV_FAIR_SHARE=y
CONFIG_THERMAL_GOV_STEP_WISE=y
CONFIG_THERMAL_GOV_BANG_BANG=y
CONFIG_THERMAL_GOV_USER_SPACE=y
CONFIG_CPU_THERMAL=y
CONFIG_DEVFREQ_THERMAL=y
CONFIG_THERMAL_EMULATION=y
CONFIG_GENERIC_ADC_THERMAL=m
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_NOWAYOUT=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
CONFIG_WATCHDOG_SYSFS=y
CONFIG_WATCHDOG_HRTIMER_PRETIMEOUT=y

#
# Watchdog Pretimeout Governors
#
CONFIG_WATCHDOG_PRETIMEOUT_GOV=y
CONFIG_WATCHDOG_PRETIMEOUT_GOV_SEL=m
CONFIG_WATCHDOG_PRETIMEOUT_GOV_NOOP=m
CONFIG_WATCHDOG_PRETIMEOUT_GOV_PANIC=m
# CONFIG_WATCHDOG_PRETIMEOUT_DEFAULT_GOV_NOOP is not set
CONFIG_WATCHDOG_PRETIMEOUT_DEFAULT_GOV_PANIC=y

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=m
CONFIG_SOFT_WATCHDOG_PRETIMEOUT=y
CONFIG_ZIIRAVE_WATCHDOG=m
CONFIG_MEN_A21_WDT=m
CONFIG_DIAG288_WATCHDOG=m
CONFIG_REGULATOR=y
CONFIG_REGULATOR_DEBUG=y
CONFIG_REGULATOR_FIXED_VOLTAGE=m
CONFIG_REGULATOR_VIRTUAL_CONSUMER=m
CONFIG_REGULATOR_USERSPACE_CONSUMER=m
CONFIG_REGULATOR_88PG86X=m
CONFIG_REGULATOR_ACT8865=m
CONFIG_REGULATOR_AD5398=m
CONFIG_REGULATOR_DA9121=m
CONFIG_REGULATOR_DA9210=m
CONFIG_REGULATOR_DA9211=m
CONFIG_REGULATOR_FAN53555=m
CONFIG_REGULATOR_FAN53880=m
CONFIG_REGULATOR_GPIO=m
CONFIG_REGULATOR_ISL9305=m
CONFIG_REGULATOR_ISL6271A=m
CONFIG_REGULATOR_LP3971=m
CONFIG_REGULATOR_LP3972=m
CONFIG_REGULATOR_LP872X=m
CONFIG_REGULATOR_LP8755=m
CONFIG_REGULATOR_LTC3589=m
CONFIG_REGULATOR_LTC3676=m
CONFIG_REGULATOR_MAX1586=m
CONFIG_REGULATOR_MAX8649=m
CONFIG_REGULATOR_MAX8660=m
CONFIG_REGULATOR_MAX8893=m
CONFIG_REGULATOR_MAX8952=m
CONFIG_REGULATOR_MAX8973=m
CONFIG_REGULATOR_MAX20086=m
CONFIG_REGULATOR_MAX77826=m
CONFIG_REGULATOR_MCP16502=m
CONFIG_REGULATOR_MP5416=m
CONFIG_REGULATOR_MP8859=m
CONFIG_REGULATOR_MP886X=m
CONFIG_REGULATOR_MPQ7920=m
CONFIG_REGULATOR_MT6311=m
CONFIG_REGULATOR_MT6315=m
CONFIG_REGULATOR_PCA9450=m
CONFIG_REGULATOR_PF8X00=m
CONFIG_REGULATOR_PFUZE100=m
CONFIG_REGULATOR_PV88060=m
CONFIG_REGULATOR_PV88080=m
CONFIG_REGULATOR_PV88090=m
CONFIG_REGULATOR_PWM=m
CONFIG_REGULATOR_QCOM_SPMI=m
CONFIG_REGULATOR_QCOM_USB_VBUS=m
CONFIG_REGULATOR_RT4801=m
CONFIG_REGULATOR_RT5190A=m
CONFIG_REGULATOR_RT5759=m
CONFIG_REGULATOR_RT6160=m
CONFIG_REGULATOR_RT6245=m
CONFIG_REGULATOR_RTQ2134=m
CONFIG_REGULATOR_RTMV20=m
CONFIG_REGULATOR_RTQ6752=m
CONFIG_REGULATOR_SLG51000=m
CONFIG_REGULATOR_SY8106A=m
CONFIG_REGULATOR_SY8824X=m
CONFIG_REGULATOR_SY8827N=m
CONFIG_REGULATOR_TPS51632=m
CONFIG_REGULATOR_TPS62360=m
CONFIG_REGULATOR_TPS6286X=m
CONFIG_REGULATOR_TPS65023=m
CONFIG_REGULATOR_TPS6507X=m
CONFIG_REGULATOR_TPS65132=m
CONFIG_REGULATOR_VCTRL=m
CONFIG_REGULATOR_QCOM_LABIBB=m
CONFIG_RC_CORE=m
CONFIG_LIRC=y
CONFIG_RC_MAP=m
CONFIG_RC_DECODERS=y
CONFIG_IR_IMON_DECODER=m
CONFIG_IR_JVC_DECODER=m
CONFIG_IR_MCE_KBD_DECODER=m
CONFIG_IR_NEC_DECODER=m
CONFIG_IR_RC5_DECODER=m
CONFIG_IR_RC6_DECODER=m
CONFIG_IR_RCMM_DECODER=m
CONFIG_IR_SANYO_DECODER=m
CONFIG_IR_SHARP_DECODER=m
CONFIG_IR_SONY_DECODER=m
CONFIG_IR_XMP_DECODER=m
CONFIG_RC_DEVICES=y
CONFIG_IR_GPIO_CIR=m
CONFIG_IR_GPIO_TX=m
CONFIG_IR_PWM_TX=m
CONFIG_IR_SERIAL=m
CONFIG_IR_SERIAL_TRANSMITTER=y
CONFIG_RC_LOOPBACK=m
CONFIG_CEC_CORE=m

#
# CEC support
#
CONFIG_MEDIA_CEC_RC=y
CONFIG_MEDIA_CEC_SUPPORT=y
CONFIG_CEC_CH7322=m
# end of CEC support

#
# Graphics support
#

#
# Console display driver support
#
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
# end of Console display driver support
# end of Graphics support

#
# HID support
#
CONFIG_HID=m
CONFIG_HID_BATTERY_STRENGTH=y
CONFIG_HIDRAW=y
CONFIG_UHID=m
CONFIG_HID_GENERIC=m

#
# Special HID drivers
#
CONFIG_HID_A4TECH=m
CONFIG_HID_ACRUX=m
CONFIG_HID_ACRUX_FF=y
CONFIG_HID_APPLE=m
CONFIG_HID_AUREAL=m
CONFIG_HID_BELKIN=m
CONFIG_HID_CHERRY=m
CONFIG_HID_COUGAR=m
CONFIG_HID_MACALLY=m
CONFIG_HID_CMEDIA=m
CONFIG_HID_CYPRESS=m
CONFIG_HID_DRAGONRISE=m
CONFIG_DRAGONRISE_FF=y
CONFIG_HID_EMS_FF=m
CONFIG_HID_ELECOM=m
CONFIG_HID_EZKEY=m
CONFIG_HID_GEMBIRD=m
CONFIG_HID_GFRM=m
CONFIG_HID_GLORIOUS=m
CONFIG_HID_VIVALDI_COMMON=m
CONFIG_HID_VIVALDI=m
CONFIG_HID_KEYTOUCH=m
CONFIG_HID_KYE=m
CONFIG_HID_WALTOP=m
CONFIG_HID_VIEWSONIC=m
CONFIG_HID_VRC2=m
CONFIG_HID_XIAOMI=m
CONFIG_HID_GYRATION=m
CONFIG_HID_ICADE=m
CONFIG_HID_ITE=m
CONFIG_HID_JABRA=m
CONFIG_HID_TWINHAN=m
CONFIG_HID_KENSINGTON=m
CONFIG_HID_LCPOWER=m
CONFIG_HID_LED=m
CONFIG_HID_LENOVO=m
CONFIG_HID_MAGICMOUSE=m
CONFIG_HID_MALTRON=m
CONFIG_HID_MAYFLASH=m
CONFIG_HID_REDRAGON=m
CONFIG_HID_MICROSOFT=m
CONFIG_HID_MONTEREY=m
CONFIG_HID_MULTITOUCH=m
CONFIG_HID_NINTENDO=m
CONFIG_NINTENDO_FF=y
CONFIG_HID_NTI=m
CONFIG_HID_ORTEK=m
CONFIG_HID_PANTHERLORD=m
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=m
CONFIG_HID_PICOLCD=m
CONFIG_HID_PICOLCD_LEDS=y
CONFIG_HID_PICOLCD_CIR=y
CONFIG_HID_PLANTRONICS=m
CONFIG_HID_PLAYSTATION=m
CONFIG_PLAYSTATION_FF=y
CONFIG_HID_PXRC=m
CONFIG_HID_RAZER=m
CONFIG_HID_PRIMAX=m
CONFIG_HID_SAITEK=m
CONFIG_HID_SEMITEK=m
CONFIG_HID_SPEEDLINK=m
CONFIG_HID_STEAM=m
CONFIG_HID_STEELSERIES=m
CONFIG_HID_SUNPLUS=m
CONFIG_HID_RMI=m
CONFIG_HID_GREENASIA=m
CONFIG_GREENASIA_FF=y
CONFIG_HID_SMARTJOYPLUS=m
CONFIG_SMARTJOYPLUS_FF=y
CONFIG_HID_TIVO=m
CONFIG_HID_TOPSEED=m
CONFIG_HID_TOPRE=m
CONFIG_HID_THINGM=m
CONFIG_HID_UDRAW_PS3=m
CONFIG_HID_WIIMOTE=m
CONFIG_HID_XINMO=m
CONFIG_HID_ZEROPLUS=m
CONFIG_ZEROPLUS_FF=y
CONFIG_HID_ZYDACRON=m
CONFIG_HID_ALPS=m
# end of Special HID drivers

#
# I2C HID support
#
CONFIG_I2C_HID_OF=m
CONFIG_I2C_HID_OF_ELAN=m
CONFIG_I2C_HID_OF_GOODIX=m
# end of I2C HID support

CONFIG_I2C_HID_CORE=m
# end of HID support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_SCSI_UFSHCD=m
CONFIG_SCSI_UFS_BSG=y
CONFIG_SCSI_UFS_CRYPTO=y
CONFIG_SCSI_UFS_HPB=y
CONFIG_SCSI_UFS_FAULT_INJECTION=y
CONFIG_MEMSTICK=m
CONFIG_MEMSTICK_DEBUG=y

#
# MemoryStick drivers
#
CONFIG_MEMSTICK_UNSAFE_RESUME=y
CONFIG_MSPRO_BLOCK=m
CONFIG_MS_BLOCK=m

#
# MemoryStick Host Controller Drivers
#
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=m
CONFIG_LEDS_CLASS_FLASH=m
CONFIG_LEDS_CLASS_MULTICOLOR=m
CONFIG_LEDS_BRIGHTNESS_HW_CHANGED=y

#
# LED drivers
#
CONFIG_LEDS_AN30259A=m
CONFIG_LEDS_AW2013=m
CONFIG_LEDS_LM3530=m
CONFIG_LEDS_LM3532=m
CONFIG_LEDS_LM3642=m
CONFIG_LEDS_LM3692X=m
CONFIG_LEDS_PCA9532=m
CONFIG_LEDS_PCA9532_GPIO=y
CONFIG_LEDS_GPIO=m
CONFIG_LEDS_LP3944=m
CONFIG_LEDS_LP3952=m
CONFIG_LEDS_LP50XX=m
CONFIG_LEDS_LP55XX_COMMON=m
CONFIG_LEDS_LP5521=m
CONFIG_LEDS_LP5523=m
CONFIG_LEDS_LP5562=m
CONFIG_LEDS_LP8501=m
CONFIG_LEDS_LP8860=m
CONFIG_LEDS_PCA955X=m
CONFIG_LEDS_PCA955X_GPIO=y
CONFIG_LEDS_PCA963X=m
CONFIG_LEDS_PWM=m
CONFIG_LEDS_REGULATOR=m
CONFIG_LEDS_BD2802=m
CONFIG_LEDS_LT3593=m
CONFIG_LEDS_TCA6507=m
CONFIG_LEDS_TLC591XX=m
CONFIG_LEDS_LM355x=m
CONFIG_LEDS_IS31FL319X=m
CONFIG_LEDS_IS31FL32XX=m

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
CONFIG_LEDS_BLINKM=m
CONFIG_LEDS_MLXREG=m
CONFIG_LEDS_USER=m
CONFIG_LEDS_TI_LMU_COMMON=m
CONFIG_LEDS_LM3697=m

#
# Flash and Torch LED drivers
#
CONFIG_LEDS_AAT1290=m
CONFIG_LEDS_AS3645A=m
CONFIG_LEDS_KTD2692=m
CONFIG_LEDS_LM3601X=m
CONFIG_LEDS_RT4505=m
CONFIG_LEDS_RT8515=m
CONFIG_LEDS_SGM3140=m

#
# RGB LED drivers
#
CONFIG_LEDS_PWM_MULTICOLOR=m
CONFIG_LEDS_QCOM_LPG=m

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
CONFIG_LEDS_TRIGGER_ONESHOT=m
CONFIG_LEDS_TRIGGER_MTD=y
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
CONFIG_LEDS_TRIGGER_BACKLIGHT=m
CONFIG_LEDS_TRIGGER_CPU=y
CONFIG_LEDS_TRIGGER_ACTIVITY=m
CONFIG_LEDS_TRIGGER_GPIO=m
CONFIG_LEDS_TRIGGER_DEFAULT_ON=m

#
# iptables trigger is under Netfilter config (LED target)
#
CONFIG_LEDS_TRIGGER_TRANSIENT=m
CONFIG_LEDS_TRIGGER_CAMERA=m
CONFIG_LEDS_TRIGGER_PANIC=y
CONFIG_LEDS_TRIGGER_NETDEV=m
CONFIG_LEDS_TRIGGER_PATTERN=m
CONFIG_LEDS_TRIGGER_AUDIO=m
CONFIG_LEDS_TRIGGER_TTY=m

#
# Simple LED drivers
#
CONFIG_ACCESSIBILITY=y

#
# Speakup console speech
#
CONFIG_SPEAKUP=m
CONFIG_SPEAKUP_SYNTH_ACNTSA=m
CONFIG_SPEAKUP_SYNTH_APOLLO=m
CONFIG_SPEAKUP_SYNTH_AUDPTR=m
CONFIG_SPEAKUP_SYNTH_BNS=m
CONFIG_SPEAKUP_SYNTH_DECTLK=m
CONFIG_SPEAKUP_SYNTH_DECEXT=m
CONFIG_SPEAKUP_SYNTH_LTLK=m
CONFIG_SPEAKUP_SYNTH_SOFT=m
CONFIG_SPEAKUP_SYNTH_SPKOUT=m
CONFIG_SPEAKUP_SYNTH_TXPRT=m
CONFIG_SPEAKUP_SYNTH_DUMMY=m
# end of Speakup console speech

CONFIG_DMADEVICES=y
CONFIG_DMADEVICES_DEBUG=y
CONFIG_DMADEVICES_VDEBUG=y

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=m
CONFIG_DMA_OF=y
CONFIG_FSL_EDMA=m
CONFIG_INTEL_IDMA64=m
CONFIG_QCOM_HIDMA=m

#
# DMA Clients
#
CONFIG_ASYNC_TX_DMA=y
CONFIG_DMATEST=m
CONFIG_DMA_ENGINE_RAID=y

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
CONFIG_SW_SYNC=y
CONFIG_UDMABUF=y
CONFIG_DMABUF_MOVE_NOTIFY=y
CONFIG_DMABUF_DEBUG=y
CONFIG_DMABUF_SELFTESTS=m
CONFIG_DMABUF_HEAPS=y
CONFIG_DMABUF_SYSFS_STATS=y
CONFIG_DMABUF_HEAPS_SYSTEM=y
CONFIG_DMABUF_HEAPS_CMA=y
# end of DMABUF options

CONFIG_AUXDISPLAY=y
CONFIG_CHARLCD=m
CONFIG_HD44780_COMMON=m
CONFIG_HD44780=m
CONFIG_LCD2S=m
CONFIG_PANEL_CHANGE_MESSAGE=y
CONFIG_PANEL_BOOT_MESSAGE=""
# CONFIG_CHARLCD_BL_OFF is not set
# CONFIG_CHARLCD_BL_ON is not set
CONFIG_CHARLCD_BL_FLASH=y
CONFIG_UIO=m
CONFIG_UIO_PDRV_GENIRQ=m
CONFIG_UIO_DMEM_GENIRQ=m
CONFIG_VFIO=m
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO_NOIOMMU=y
CONFIG_VFIO_MDEV=m
CONFIG_IRQ_BYPASS_MANAGER=m
CONFIG_VIRT_DRIVERS=y
CONFIG_VIRTIO_ANCHOR=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_VDPA=m
CONFIG_VIRTIO_BALLOON=m
CONFIG_VIRTIO_INPUT=m
CONFIG_VDPA=m
CONFIG_VDPA_SIM=m
CONFIG_VDPA_SIM_NET=m
CONFIG_VDPA_SIM_BLOCK=m
CONFIG_VDPA_USER=m
CONFIG_VHOST_IOTLB=m
CONFIG_VHOST_RING=m
CONFIG_VHOST=m
CONFIG_VHOST_MENU=y
CONFIG_VHOST_NET=m
CONFIG_VHOST_SCSI=m
CONFIG_VHOST_VSOCK=m
CONFIG_VHOST_VDPA=m
CONFIG_VHOST_CROSS_ENDIAN_LEGACY=y

#
# Microsoft Hyper-V guest support
#
# end of Microsoft Hyper-V guest support

CONFIG_GREYBUS=m
CONFIG_COMEDI=m
CONFIG_COMEDI_DEBUG=y
CONFIG_COMEDI_DEFAULT_BUF_SIZE_KB=2048
CONFIG_COMEDI_DEFAULT_BUF_MAXSIZE_KB=20480
CONFIG_COMEDI_MISC_DRIVERS=y
CONFIG_COMEDI_BOND=m
CONFIG_COMEDI_TEST=m
CONFIG_COMEDI_PARPORT=m
CONFIG_COMEDI_ISA_DRIVERS=y
CONFIG_COMEDI_PCL711=m
CONFIG_COMEDI_PCL724=m
CONFIG_COMEDI_PCL726=m
CONFIG_COMEDI_PCL730=m
CONFIG_COMEDI_PCL812=m
CONFIG_COMEDI_PCL816=m
CONFIG_COMEDI_PCL818=m
CONFIG_COMEDI_PCM3724=m
CONFIG_COMEDI_AMPLC_DIO200_ISA=m
CONFIG_COMEDI_AMPLC_PC236_ISA=m
CONFIG_COMEDI_AMPLC_PC263_ISA=m
CONFIG_COMEDI_RTI800=m
CONFIG_COMEDI_RTI802=m
CONFIG_COMEDI_DAC02=m
CONFIG_COMEDI_DAS16M1=m
CONFIG_COMEDI_DAS08_ISA=m
CONFIG_COMEDI_DAS16=m
CONFIG_COMEDI_DAS800=m
CONFIG_COMEDI_DAS1800=m
CONFIG_COMEDI_DAS6402=m
CONFIG_COMEDI_DT2801=m
CONFIG_COMEDI_DT2811=m
CONFIG_COMEDI_DT2814=m
CONFIG_COMEDI_DT2815=m
CONFIG_COMEDI_DT2817=m
CONFIG_COMEDI_DT282X=m
CONFIG_COMEDI_DMM32AT=m
CONFIG_COMEDI_FL512=m
CONFIG_COMEDI_AIO_AIO12_8=m
CONFIG_COMEDI_AIO_IIRO_16=m
CONFIG_COMEDI_C6XDIGIO=m
CONFIG_COMEDI_MPC624=m
CONFIG_COMEDI_ADQ12B=m
CONFIG_COMEDI_NI_AT_A2150=m
CONFIG_COMEDI_NI_AT_AO=m
CONFIG_COMEDI_NI_ATMIO=m
CONFIG_COMEDI_NI_ATMIO16D=m
CONFIG_COMEDI_NI_LABPC_ISA=m
CONFIG_COMEDI_PCMAD=m
CONFIG_COMEDI_PCMDA12=m
CONFIG_COMEDI_PCMMIO=m
CONFIG_COMEDI_PCMUIO=m
CONFIG_COMEDI_MULTIQ3=m
CONFIG_COMEDI_S526=m
CONFIG_COMEDI_PCMCIA_DRIVERS=m
CONFIG_COMEDI_CB_DAS16_CS=m
CONFIG_COMEDI_DAS08_CS=m
CONFIG_COMEDI_NI_DAQ_700_CS=m
CONFIG_COMEDI_NI_DAQ_DIO24_CS=m
CONFIG_COMEDI_NI_LABPC_CS=m
CONFIG_COMEDI_NI_MIO_CS=m
CONFIG_COMEDI_QUATECH_DAQP_CS=m
CONFIG_COMEDI_8254=m
CONFIG_COMEDI_8255=m
CONFIG_COMEDI_8255_SA=m
CONFIG_COMEDI_KCOMEDILIB=m
CONFIG_COMEDI_AMPLC_DIO200=m
CONFIG_COMEDI_AMPLC_PC236=m
CONFIG_COMEDI_DAS08=m
CONFIG_COMEDI_NI_LABPC=m
CONFIG_COMEDI_NI_TIO=m
CONFIG_COMEDI_NI_ROUTING=m
CONFIG_COMEDI_TESTS=m
CONFIG_COMEDI_TESTS_EXAMPLE=m
CONFIG_COMEDI_TESTS_NI_ROUTES=m
CONFIG_STAGING=y

#
# IIO staging drivers
#

#
# Accelerometers
#
# end of Accelerometers

#
# Analog to digital converters
#
# end of Analog to digital converters

#
# Analog digital bi-direction converters
#
CONFIG_ADT7316=m
CONFIG_ADT7316_I2C=m
# end of Analog digital bi-direction converters

#
# Direct Digital Synthesis
#
# end of Direct Digital Synthesis

#
# Network Analyzer, Impedance Converters
#
CONFIG_AD5933=m
# end of Network Analyzer, Impedance Converters

#
# Active energy metering IC
#
CONFIG_ADE7854=m
CONFIG_ADE7854_I2C=m
# end of Active energy metering IC

#
# Resolver to digital converters
#
# end of Resolver to digital converters
# end of IIO staging drivers

CONFIG_STAGING_MEDIA=y
CONFIG_MOST_COMPONENTS=m
CONFIG_MOST_NET=m
CONFIG_MOST_I2C=m
CONFIG_GREYBUS_BOOTROM=m
CONFIG_GREYBUS_HID=m
CONFIG_GREYBUS_LIGHT=m
CONFIG_GREYBUS_LOG=m
CONFIG_GREYBUS_LOOPBACK=m
CONFIG_GREYBUS_POWER=m
CONFIG_GREYBUS_RAW=m
CONFIG_GREYBUS_VIBRATOR=m
CONFIG_GREYBUS_BRIDGED_PHY=m
CONFIG_GREYBUS_GPIO=m
CONFIG_GREYBUS_I2C=m
CONFIG_GREYBUS_PWM=m
CONFIG_GREYBUS_UART=m
CONFIG_FIELDBUS_DEV=m
CONFIG_HMS_ANYBUSS_BUS=m
CONFIG_HMS_PROFINET=m
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y
CONFIG_COMMON_CLK_MAX9485=m
CONFIG_COMMON_CLK_SI5341=m
CONFIG_COMMON_CLK_SI5351=m
CONFIG_COMMON_CLK_SI514=m
CONFIG_COMMON_CLK_SI544=m
CONFIG_COMMON_CLK_SI570=m
CONFIG_COMMON_CLK_CDCE706=m
CONFIG_COMMON_CLK_CDCE925=m
CONFIG_COMMON_CLK_CS2000_CP=m
CONFIG_COMMON_CLK_PWM=m
CONFIG_COMMON_CLK_RS9_PCIE=m
CONFIG_COMMON_CLK_VC5=m
CONFIG_COMMON_CLK_VC7=m
CONFIG_COMMON_CLK_FIXED_MMIO=y
CONFIG_CLK_KUNIT_TEST=m
CONFIG_CLK_GATE_KUNIT_TEST=m
CONFIG_HWSPINLOCK=y

#
# Clock Source drivers
#
CONFIG_TIMER_OF=y
CONFIG_TIMER_PROBE=y
CONFIG_MICROCHIP_PIT64B=y
# end of Clock Source drivers

CONFIG_MAILBOX=y
CONFIG_IOMMU_IOVA=m
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
# end of Generic IOMMU Pagetable Support

CONFIG_IOMMU_DEBUGFS=y
CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
# CONFIG_IOMMU_DEFAULT_DMA_LAZY is not set
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_OF_IOMMU=y
CONFIG_S390_CCW_IOMMU=y
CONFIG_S390_AP_IOMMU=y

#
# Remoteproc drivers
#
CONFIG_REMOTEPROC=y
CONFIG_REMOTEPROC_CDEV=y
# end of Remoteproc drivers

#
# Rpmsg drivers
#
CONFIG_RPMSG=m
CONFIG_RPMSG_CHAR=m
CONFIG_RPMSG_CTRL=m
CONFIG_RPMSG_NS=m
CONFIG_RPMSG_VIRTIO=m
# end of Rpmsg drivers

CONFIG_SOUNDWIRE=m

#
# SoundWire Devices
#

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# fujitsu SoC drivers
#
# end of fujitsu SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

CONFIG_SOC_TI=y

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND=m
CONFIG_DEVFREQ_GOV_PERFORMANCE=m
CONFIG_DEVFREQ_GOV_POWERSAVE=m
CONFIG_DEVFREQ_GOV_USERSPACE=m
CONFIG_DEVFREQ_GOV_PASSIVE=m

#
# DEVFREQ Drivers
#
CONFIG_PM_DEVFREQ_EVENT=y
CONFIG_EXTCON=y

#
# Extcon Device Drivers
#
CONFIG_EXTCON_ADC_JACK=m
CONFIG_EXTCON_FSA9480=m
CONFIG_EXTCON_GPIO=m
CONFIG_EXTCON_MAX3355=m
CONFIG_EXTCON_PTN5150=m
CONFIG_EXTCON_RT8973A=m
CONFIG_EXTCON_SM5502=m
CONFIG_EXTCON_USB_GPIO=m
CONFIG_MEMORY=y
CONFIG_IIO=m
CONFIG_IIO_BUFFER=y
CONFIG_IIO_BUFFER_CB=m
CONFIG_IIO_BUFFER_DMA=m
CONFIG_IIO_BUFFER_DMAENGINE=m
CONFIG_IIO_BUFFER_HW_CONSUMER=m
CONFIG_IIO_KFIFO_BUF=m
CONFIG_IIO_TRIGGERED_BUFFER=m
CONFIG_IIO_CONFIGFS=m
CONFIG_IIO_TRIGGER=y
CONFIG_IIO_CONSUMERS_PER_TRIGGER=2
CONFIG_IIO_SW_DEVICE=m
CONFIG_IIO_SW_TRIGGER=m
CONFIG_IIO_TRIGGERED_EVENT=m

#
# Accelerometers
#
CONFIG_ADXL313=m
CONFIG_ADXL313_I2C=m
CONFIG_ADXL355=m
CONFIG_ADXL355_I2C=m
CONFIG_ADXL367=m
CONFIG_ADXL367_I2C=m
CONFIG_ADXL372=m
CONFIG_ADXL372_I2C=m
CONFIG_BMA400=m
CONFIG_BMA400_I2C=m
CONFIG_BMC150_ACCEL=m
CONFIG_BMC150_ACCEL_I2C=m
CONFIG_DA280=m
CONFIG_DA311=m
CONFIG_DMARD06=m
CONFIG_DMARD09=m
CONFIG_DMARD10=m
CONFIG_FXLS8962AF=m
CONFIG_FXLS8962AF_I2C=m
CONFIG_IIO_ST_ACCEL_3AXIS=m
CONFIG_IIO_ST_ACCEL_I2C_3AXIS=m
CONFIG_KXSD9=m
CONFIG_KXSD9_I2C=m
CONFIG_KXCJK1013=m
CONFIG_MC3230=m
CONFIG_MMA7455=m
CONFIG_MMA7455_I2C=m
CONFIG_MMA7660=m
CONFIG_MMA8452=m
CONFIG_MMA9551_CORE=m
CONFIG_MMA9551=m
CONFIG_MMA9553=m
CONFIG_MSA311=m
CONFIG_MXC4005=m
CONFIG_MXC6255=m
CONFIG_STK8312=m
CONFIG_STK8BA50=m
# end of Accelerometers

#
# Analog to digital converters
#
CONFIG_AD7091R5=m
CONFIG_AD7291=m
CONFIG_AD799X=m
CONFIG_ENVELOPE_DETECTOR=m
CONFIG_HX711=m
CONFIG_INA2XX_ADC=m
CONFIG_LTC2471=m
CONFIG_LTC2485=m
CONFIG_LTC2497=m
CONFIG_MAX1363=m
CONFIG_MAX9611=m
CONFIG_MCP3422=m
CONFIG_NAU7802=m
CONFIG_QCOM_VADC_COMMON=m
CONFIG_QCOM_SPMI_IADC=m
CONFIG_QCOM_SPMI_VADC=m
CONFIG_QCOM_SPMI_ADC5=m
CONFIG_RICHTEK_RTQ6056=m
CONFIG_SD_ADC_MODULATOR=m
CONFIG_TI_ADC081C=m
CONFIG_TI_ADS1015=m
# end of Analog to digital converters

#
# Analog to digital and digital to analog converters
#
# end of Analog to digital and digital to analog converters

#
# Analog Front Ends
#
CONFIG_IIO_RESCALE=m
# end of Analog Front Ends

#
# Amplifiers
#
CONFIG_HMC425=m
# end of Amplifiers

#
# Capacitance to digital converters
#
CONFIG_AD7150=m
CONFIG_AD7746=m
# end of Capacitance to digital converters

#
# Chemical Sensors
#
CONFIG_ATLAS_PH_SENSOR=m
CONFIG_ATLAS_EZO_SENSOR=m
CONFIG_BME680=m
CONFIG_BME680_I2C=m
CONFIG_CCS811=m
CONFIG_IAQCORE=m
CONFIG_PMS7003=m
CONFIG_SCD30_CORE=m
CONFIG_SCD30_I2C=m
CONFIG_SCD30_SERIAL=m
CONFIG_SCD4X=m
CONFIG_SENSIRION_SGP30=m
CONFIG_SENSIRION_SGP40=m
CONFIG_SPS30=m
CONFIG_SPS30_I2C=m
CONFIG_SPS30_SERIAL=m
CONFIG_SENSEAIR_SUNRISE_CO2=m
CONFIG_VZ89X=m
# end of Chemical Sensors

#
# Hid Sensor IIO Common
#
# end of Hid Sensor IIO Common

CONFIG_IIO_MS_SENSORS_I2C=m

#
# IIO SCMI Sensors
#
# end of IIO SCMI Sensors

#
# SSP Sensor Common
#
# end of SSP Sensor Common

CONFIG_IIO_ST_SENSORS_I2C=m
CONFIG_IIO_ST_SENSORS_CORE=m

#
# Digital to analog converters
#
CONFIG_AD5064=m
CONFIG_AD5380=m
CONFIG_AD5446=m
CONFIG_AD5592R_BASE=m
CONFIG_AD5593R=m
CONFIG_AD5686=m
CONFIG_AD5696_I2C=m
CONFIG_DPOT_DAC=m
CONFIG_DS4424=m
CONFIG_M62332=m
CONFIG_MAX517=m
CONFIG_MAX5821=m
CONFIG_MCP4725=m
CONFIG_TI_DAC5571=m
# end of Digital to analog converters

#
# IIO dummy driver
#
CONFIG_IIO_DUMMY_EVGEN=m
CONFIG_IIO_SIMPLE_DUMMY=m
CONFIG_IIO_SIMPLE_DUMMY_EVENTS=y
CONFIG_IIO_SIMPLE_DUMMY_BUFFER=y
# end of IIO dummy driver

#
# Filters
#
# end of Filters

#
# Frequency Synthesizers DDS/PLL
#

#
# Clock Generator/Distribution
#
# end of Clock Generator/Distribution

#
# Phase-Locked Loop (PLL) frequency synthesizers
#
# end of Phase-Locked Loop (PLL) frequency synthesizers
# end of Frequency Synthesizers DDS/PLL

#
# Digital gyroscope sensors
#
CONFIG_BMG160=m
CONFIG_BMG160_I2C=m
CONFIG_FXAS21002C=m
CONFIG_FXAS21002C_I2C=m
CONFIG_MPU3050=m
CONFIG_MPU3050_I2C=m
CONFIG_IIO_ST_GYRO_3AXIS=m
CONFIG_IIO_ST_GYRO_I2C_3AXIS=m
CONFIG_ITG3200=m
# end of Digital gyroscope sensors

#
# Health Sensors
#

#
# Heart Rate Monitors
#
CONFIG_AFE4404=m
CONFIG_MAX30100=m
CONFIG_MAX30102=m
# end of Heart Rate Monitors
# end of Health Sensors

#
# Humidity sensors
#
CONFIG_AM2315=m
CONFIG_DHT11=m
CONFIG_HDC100X=m
CONFIG_HDC2010=m
CONFIG_HTS221=m
CONFIG_HTS221_I2C=m
CONFIG_HTU21=m
CONFIG_SI7005=m
CONFIG_SI7020=m
# end of Humidity sensors

#
# Inertial measurement units
#
CONFIG_BMI160=m
CONFIG_BMI160_I2C=m
CONFIG_BOSCH_BNO055=m
CONFIG_BOSCH_BNO055_SERIAL=m
CONFIG_BOSCH_BNO055_I2C=m
CONFIG_FXOS8700=m
CONFIG_FXOS8700_I2C=m
CONFIG_KMX61=m
CONFIG_INV_ICM42600=m
CONFIG_INV_ICM42600_I2C=m
CONFIG_INV_MPU6050_IIO=m
CONFIG_INV_MPU6050_I2C=m
CONFIG_IIO_ST_LSM6DSX=m
CONFIG_IIO_ST_LSM6DSX_I2C=m
CONFIG_IIO_ST_LSM6DSX_I3C=m
CONFIG_IIO_ST_LSM9DS0=m
CONFIG_IIO_ST_LSM9DS0_I2C=m
# end of Inertial measurement units

#
# Light sensors
#
CONFIG_ADJD_S311=m
CONFIG_ADUX1020=m
CONFIG_AL3010=m
CONFIG_AL3320A=m
CONFIG_APDS9300=m
CONFIG_APDS9960=m
CONFIG_AS73211=m
CONFIG_BH1750=m
CONFIG_BH1780=m
CONFIG_CM32181=m
CONFIG_CM3232=m
CONFIG_CM3323=m
CONFIG_CM3605=m
CONFIG_CM36651=m
CONFIG_GP2AP002=m
CONFIG_GP2AP020A00F=m
CONFIG_SENSORS_ISL29018=m
CONFIG_SENSORS_ISL29028=m
CONFIG_ISL29125=m
CONFIG_JSA1212=m
CONFIG_RPR0521=m
CONFIG_LTR501=m
CONFIG_LTRF216A=m
CONFIG_LV0104CS=m
CONFIG_MAX44000=m
CONFIG_MAX44009=m
CONFIG_NOA1305=m
CONFIG_OPT3001=m
CONFIG_PA12203001=m
CONFIG_SI1133=m
CONFIG_SI1145=m
CONFIG_STK3310=m
CONFIG_ST_UVIS25=m
CONFIG_ST_UVIS25_I2C=m
CONFIG_TCS3414=m
CONFIG_TCS3472=m
CONFIG_SENSORS_TSL2563=m
CONFIG_TSL2583=m
CONFIG_TSL2591=m
CONFIG_TSL2772=m
CONFIG_TSL4531=m
CONFIG_US5182D=m
CONFIG_VCNL4000=m
CONFIG_VCNL4035=m
CONFIG_VEML6030=m
CONFIG_VEML6070=m
CONFIG_VL6180=m
CONFIG_ZOPT2201=m
# end of Light sensors

#
# Magnetometer sensors
#
CONFIG_AK8974=m
CONFIG_AK8975=m
CONFIG_AK09911=m
CONFIG_BMC150_MAGN=m
CONFIG_BMC150_MAGN_I2C=m
CONFIG_MAG3110=m
CONFIG_MMC35240=m
CONFIG_IIO_ST_MAGN_3AXIS=m
CONFIG_IIO_ST_MAGN_I2C_3AXIS=m
CONFIG_SENSORS_HMC5843=m
CONFIG_SENSORS_HMC5843_I2C=m
CONFIG_SENSORS_RM3100=m
CONFIG_SENSORS_RM3100_I2C=m
CONFIG_YAMAHA_YAS530=m
# end of Magnetometer sensors

#
# Multiplexers
#
CONFIG_IIO_MUX=m
# end of Multiplexers

#
# Inclinometer sensors
#
# end of Inclinometer sensors

CONFIG_IIO_RESCALE_KUNIT_TEST=m
CONFIG_IIO_FORMAT_KUNIT_TEST=m

#
# Triggers - standalone
#
CONFIG_IIO_HRTIMER_TRIGGER=m
CONFIG_IIO_INTERRUPT_TRIGGER=m
CONFIG_IIO_TIGHTLOOP_TRIGGER=m
CONFIG_IIO_SYSFS_TRIGGER=m
# end of Triggers - standalone

#
# Linear and angular position sensors
#
# end of Linear and angular position sensors

#
# Digital potentiometers
#
CONFIG_AD5110=m
CONFIG_AD5272=m
CONFIG_DS1803=m
CONFIG_MAX5432=m
CONFIG_MCP4018=m
CONFIG_MCP4531=m
CONFIG_TPL0102=m
# end of Digital potentiometers

#
# Digital potentiostats
#
CONFIG_LMP91000=m
# end of Digital potentiostats

#
# Pressure sensors
#
CONFIG_ABP060MG=m
CONFIG_BMP280=m
CONFIG_BMP280_I2C=m
CONFIG_DLHL60D=m
CONFIG_DPS310=m
CONFIG_HP03=m
CONFIG_ICP10100=m
CONFIG_MPL115=m
CONFIG_MPL115_I2C=m
CONFIG_MPL3115=m
CONFIG_MS5611=m
CONFIG_MS5611_I2C=m
CONFIG_MS5637=m
CONFIG_IIO_ST_PRESS=m
CONFIG_IIO_ST_PRESS_I2C=m
CONFIG_T5403=m
CONFIG_HP206C=m
CONFIG_ZPA2326=m
CONFIG_ZPA2326_I2C=m
# end of Pressure sensors

#
# Lightning sensors
#
# end of Lightning sensors

#
# Proximity and distance sensors
#
CONFIG_ISL29501=m
CONFIG_LIDAR_LITE_V2=m
CONFIG_MB1232=m
CONFIG_PING=m
CONFIG_RFD77402=m
CONFIG_SRF04=m
CONFIG_SX_COMMON=m
CONFIG_SX9310=m
CONFIG_SX9324=m
CONFIG_SX9360=m
CONFIG_SX9500=m
CONFIG_SRF08=m
CONFIG_VCNL3020=m
CONFIG_VL53L0X_I2C=m
# end of Proximity and distance sensors

#
# Resolver to digital converters
#
# end of Resolver to digital converters

#
# Temperature sensors
#
CONFIG_MLX90614=m
CONFIG_MLX90632=m
CONFIG_TMP006=m
CONFIG_TMP007=m
CONFIG_TMP117=m
CONFIG_TSYS01=m
CONFIG_TSYS02D=m
# end of Temperature sensors

CONFIG_PWM=y
CONFIG_PWM_SYSFS=y
CONFIG_PWM_DEBUG=y
CONFIG_PWM_ATMEL_TCB=m
CONFIG_PWM_CLK=m
CONFIG_PWM_PCA9685=m

#
# IRQ chip support
#
CONFIG_IRQCHIP=y
CONFIG_AL_FIC=y
# end of IRQ chip support

CONFIG_RESET_CONTROLLER=y
CONFIG_RESET_TI_TPS380X=m

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
CONFIG_PHY_CAN_TRANSCEIVER=m

#
# PHY drivers for Broadcom platforms
#
# end of PHY drivers for Broadcom platforms
# end of PHY Subsystem

CONFIG_POWERCAP=y
CONFIG_DTPM=y

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y

#
# Android
#
CONFIG_ANDROID_BINDER_IPC=y
CONFIG_ANDROID_BINDERFS=y
CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"
CONFIG_ANDROID_BINDER_IPC_SELFTEST=y
# end of Android

CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_KMEM=m
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y
CONFIG_NVMEM_SPMI_SDAM=m
CONFIG_NVMEM_U_BOOT_ENV=m

#
# HW tracing support
#
CONFIG_STM=m
CONFIG_STM_PROTO_BASIC=m
CONFIG_STM_PROTO_SYS_T=m
CONFIG_STM_DUMMY=m
CONFIG_STM_SOURCE_CONSOLE=m
CONFIG_STM_SOURCE_HEARTBEAT=m
CONFIG_STM_SOURCE_FTRACE=m
# end of HW tracing support

CONFIG_FPGA=m
CONFIG_ALTERA_PR_IP_CORE=m
CONFIG_FPGA_BRIDGE=m
CONFIG_FPGA_REGION=m
CONFIG_OF_FPGA_REGION=m
CONFIG_FSI=m
CONFIG_FSI_NEW_DEV_NODE=y
CONFIG_FSI_MASTER_GPIO=m
CONFIG_FSI_MASTER_HUB=m
CONFIG_FSI_SCOM=m
CONFIG_MULTIPLEXER=m

#
# Multiplexer drivers
#
CONFIG_MUX_ADG792A=m
CONFIG_MUX_GPIO=m
CONFIG_MUX_MMIO=m
# end of Multiplexer drivers

CONFIG_PM_OPP=y
CONFIG_SIOX=m
CONFIG_SIOX_BUS_GPIO=m
CONFIG_SLIMBUS=m
CONFIG_INTERCONNECT=y
CONFIG_COUNTER=m
CONFIG_INTERRUPT_CNT=m
CONFIG_MOST=m
CONFIG_MOST_CDEV=m
CONFIG_PECI=m
CONFIG_PECI_CPU=m
CONFIG_HTE=y
# end of Device Drivers

#
# File systems
#
CONFIG_VALIDATE_FS_PARSER=y
CONFIG_FS_IOMAP=y
CONFIG_EXT2_FS=m
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=m
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_EXT4_DEBUG=y
CONFIG_EXT4_KUNIT_TESTS=m
CONFIG_JBD2=m
CONFIG_JBD2_DEBUG=y
CONFIG_FS_MBCACHE=m
CONFIG_REISERFS_FS=m
CONFIG_REISERFS_CHECK=y
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
CONFIG_JFS_DEBUG=y
CONFIG_JFS_STATISTICS=y
CONFIG_XFS_FS=m
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_ONLINE_SCRUB=y
CONFIG_XFS_ONLINE_REPAIR=y
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y
CONFIG_GFS2_FS=m
CONFIG_GFS2_FS_LOCKING_DLM=y
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m
CONFIG_OCFS2_FS_STATS=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
CONFIG_OCFS2_DEBUG_FS=y
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_ASSERT=y
CONFIG_BTRFS_FS_REF_VERIFY=y
CONFIG_NILFS2_FS=m
CONFIG_F2FS_FS=m
CONFIG_F2FS_STAT_FS=y
CONFIG_F2FS_FS_XATTR=y
CONFIG_F2FS_FS_POSIX_ACL=y
CONFIG_F2FS_FS_SECURITY=y
CONFIG_F2FS_CHECK_FS=y
CONFIG_F2FS_FAULT_INJECTION=y
CONFIG_F2FS_FS_COMPRESSION=y
CONFIG_F2FS_FS_LZO=y
CONFIG_F2FS_FS_LZORLE=y
CONFIG_F2FS_FS_LZ4=y
CONFIG_F2FS_FS_LZ4HC=y
CONFIG_F2FS_FS_ZSTD=y
CONFIG_F2FS_IOSTAT=y
CONFIG_F2FS_UNFAIR_RWSEM=y
CONFIG_ZONEFS_FS=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_LIMITED=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=m
CONFIG_FS_ENCRYPTION_INLINE_CRYPT=y
CONFIG_FS_VERITY=y
CONFIG_FS_VERITY_DEBUG=y
CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QUOTA_DEBUG=y
CONFIG_QUOTA_TREE=m
CONFIG_QFMT_V1=m
CONFIG_QFMT_V2=m
CONFIG_QUOTACTL=y
CONFIG_AUTOFS4_FS=m
CONFIG_AUTOFS_FS=m
CONFIG_FUSE_FS=m
CONFIG_CUSE=m
CONFIG_VIRTIO_FS=m
CONFIG_FUSE_DAX=y
CONFIG_OVERLAY_FS=m
CONFIG_OVERLAY_FS_REDIRECT_DIR=y
CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y
CONFIG_OVERLAY_FS_INDEX=y
CONFIG_OVERLAY_FS_XINO_AUTO=y
CONFIG_OVERLAY_FS_METACOPY=y

#
# Caches
#
CONFIG_NETFS_SUPPORT=m
CONFIG_NETFS_STATS=y
CONFIG_FSCACHE=m
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_DEBUG=y
CONFIG_CACHEFILES=m
CONFIG_CACHEFILES_DEBUG=y
CONFIG_CACHEFILES_ERROR_INJECTION=y
CONFIG_CACHEFILES_ONDEMAND=y
# end of Caches

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_FAT_DEFAULT_UTF8=y
CONFIG_FAT_KUNIT_TEST=m
CONFIG_EXFAT_FS=m
CONFIG_EXFAT_DEFAULT_IOCHARSET="utf8"
CONFIG_NTFS_FS=m
CONFIG_NTFS_DEBUG=y
CONFIG_NTFS_RW=y
CONFIG_NTFS3_FS=m
CONFIG_NTFS3_64BIT_CLUSTER=y
CONFIG_NTFS3_LZX_XPRESS=y
CONFIG_NTFS3_FS_POSIX_ACL=y
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_VMCORE_DEVICE_DUMP=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_TMPFS_INODE64=y
CONFIG_ARCH_SUPPORTS_HUGETLBFS=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y
CONFIG_MEMFD_CREATE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
CONFIG_ORANGEFS_FS=m
CONFIG_ADFS_FS=m
CONFIG_ADFS_FS_RW=y
CONFIG_AFFS_FS=m
CONFIG_ECRYPT_FS=m
CONFIG_ECRYPT_FS_MESSAGING=y
CONFIG_HFS_FS=m
CONFIG_HFSPLUS_FS=m
CONFIG_BEFS_FS=m
CONFIG_BEFS_DEBUG=y
CONFIG_BFS_FS=m
CONFIG_EFS_FS=m
CONFIG_JFFS2_FS=m
CONFIG_JFFS2_FS_DEBUG=0
CONFIG_JFFS2_FS_WRITEBUFFER=y
CONFIG_JFFS2_FS_WBUF_VERIFY=y
CONFIG_JFFS2_SUMMARY=y
CONFIG_JFFS2_FS_XATTR=y
CONFIG_JFFS2_FS_POSIX_ACL=y
CONFIG_JFFS2_FS_SECURITY=y
CONFIG_JFFS2_COMPRESSION_OPTIONS=y
CONFIG_JFFS2_ZLIB=y
CONFIG_JFFS2_LZO=y
CONFIG_JFFS2_RTIME=y
CONFIG_JFFS2_RUBIN=y
# CONFIG_JFFS2_CMODE_NONE is not set
CONFIG_JFFS2_CMODE_PRIORITY=y
# CONFIG_JFFS2_CMODE_SIZE is not set
# CONFIG_JFFS2_CMODE_FAVOURLZO is not set
CONFIG_UBIFS_FS=m
CONFIG_UBIFS_FS_ADVANCED_COMPR=y
CONFIG_UBIFS_FS_LZO=y
CONFIG_UBIFS_FS_ZLIB=y
CONFIG_UBIFS_FS_ZSTD=y
CONFIG_UBIFS_ATIME_SUPPORT=y
CONFIG_UBIFS_FS_XATTR=y
CONFIG_UBIFS_FS_SECURITY=y
CONFIG_UBIFS_FS_AUTHENTICATION=y
CONFIG_CRAMFS=m
CONFIG_CRAMFS_BLOCKDEV=y
CONFIG_CRAMFS_MTD=y
CONFIG_SQUASHFS=m
CONFIG_SQUASHFS_FILE_CACHE=y
# CONFIG_SQUASHFS_FILE_DIRECT is not set
CONFIG_SQUASHFS_DECOMP_SINGLE=y
# CONFIG_SQUASHFS_DECOMP_MULTI is not set
# CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU is not set
CONFIG_SQUASHFS_XATTR=y
CONFIG_SQUASHFS_ZLIB=y
CONFIG_SQUASHFS_LZ4=y
CONFIG_SQUASHFS_LZO=y
CONFIG_SQUASHFS_XZ=y
CONFIG_SQUASHFS_ZSTD=y
CONFIG_SQUASHFS_4K_DEVBLK_SIZE=y
CONFIG_SQUASHFS_EMBEDDED=y
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
CONFIG_VXFS_FS=m
CONFIG_MINIX_FS=m
CONFIG_MINIX_FS_NATIVE_ENDIAN=y
CONFIG_OMFS_FS=m
CONFIG_HPFS_FS=m
CONFIG_QNX4FS_FS=m
CONFIG_QNX6FS_FS=m
CONFIG_QNX6FS_DEBUG=y
CONFIG_ROMFS_FS=m
CONFIG_ROMFS_BACKED_BY_BLOCK=y
# CONFIG_ROMFS_BACKED_BY_MTD is not set
# CONFIG_ROMFS_BACKED_BY_BOTH is not set
CONFIG_ROMFS_ON_BLOCK=y
CONFIG_PSTORE=m
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_DEFLATE_COMPRESS=m
CONFIG_PSTORE_LZO_COMPRESS=m
CONFIG_PSTORE_LZ4_COMPRESS=m
CONFIG_PSTORE_LZ4HC_COMPRESS=m
CONFIG_PSTORE_842_COMPRESS=y
CONFIG_PSTORE_ZSTD_COMPRESS=y
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
# CONFIG_PSTORE_LZO_COMPRESS_DEFAULT is not set
# CONFIG_PSTORE_LZ4_COMPRESS_DEFAULT is not set
# CONFIG_PSTORE_LZ4HC_COMPRESS_DEFAULT is not set
# CONFIG_PSTORE_842_COMPRESS_DEFAULT is not set
# CONFIG_PSTORE_ZSTD_COMPRESS_DEFAULT is not set
CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
CONFIG_PSTORE_CONSOLE=y
CONFIG_PSTORE_PMSG=y
CONFIG_PSTORE_FTRACE=y
CONFIG_PSTORE_ZONE=m
CONFIG_PSTORE_BLK=m
CONFIG_PSTORE_BLK_BLKDEV=""
CONFIG_PSTORE_BLK_KMSG_SIZE=64
CONFIG_PSTORE_BLK_MAX_REASON=2
CONFIG_PSTORE_BLK_PMSG_SIZE=64
CONFIG_PSTORE_BLK_CONSOLE_SIZE=64
CONFIG_PSTORE_BLK_FTRACE_SIZE=64
CONFIG_SYSV_FS=m
CONFIG_UFS_FS=m
CONFIG_UFS_FS_WRITE=y
CONFIG_UFS_DEBUG=y
CONFIG_EROFS_FS=m
CONFIG_EROFS_FS_DEBUG=y
CONFIG_EROFS_FS_XATTR=y
CONFIG_EROFS_FS_POSIX_ACL=y
CONFIG_EROFS_FS_SECURITY=y
CONFIG_EROFS_FS_ZIP=y
CONFIG_EROFS_FS_ZIP_LZMA=y
CONFIG_EROFS_FS_ONDEMAND=y
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V2=m
CONFIG_NFS_V3=m
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=m
CONFIG_NFS_SWAP=y
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_PNFS_FILE_LAYOUT=m
CONFIG_PNFS_BLOCK=m
CONFIG_PNFS_FLEXFILE_LAYOUT=m
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
CONFIG_NFS_V4_1_MIGRATION=y
CONFIG_NFS_V4_SECURITY_LABEL=y
CONFIG_NFS_FSCACHE=y
CONFIG_NFS_USE_LEGACY_DNS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
CONFIG_NFS_V4_2_READ_PLUS=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_PNFS=y
CONFIG_NFSD_BLOCKLAYOUT=y
CONFIG_NFSD_SCSILAYOUT=y
CONFIG_NFSD_FLEXFILELAYOUT=y
CONFIG_NFSD_V4_2_INTER_SSC=y
CONFIG_NFSD_V4_SECURITY_LABEL=y
CONFIG_GRACE_PERIOD=m
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_NFS_V4_2_SSC_HELPER=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_SUNRPC_SWAP=y
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES=y
CONFIG_SUNRPC_DEBUG=y
CONFIG_CEPH_FS=m
CONFIG_CEPH_FSCACHE=y
CONFIG_CEPH_FS_POSIX_ACL=y
CONFIG_CEPH_FS_SECURITY_LABEL=y
CONFIG_CIFS=m
CONFIG_CIFS_STATS2=y
CONFIG_CIFS_ALLOW_INSECURE_LEGACY=y
CONFIG_CIFS_UPCALL=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_DEBUG=y
CONFIG_CIFS_DEBUG2=y
CONFIG_CIFS_DEBUG_DUMP_KEYS=y
CONFIG_CIFS_DFS_UPCALL=y
CONFIG_CIFS_SWN_UPCALL=y
CONFIG_CIFS_FSCACHE=y
CONFIG_SMB_SERVER=m
CONFIG_SMB_SERVER_CHECK_CAP_NET_ADMIN=y
CONFIG_SMB_SERVER_KERBEROS5=y
CONFIG_SMBFS_COMMON=m
CONFIG_CODA_FS=m
CONFIG_AFS_FS=m
CONFIG_AFS_DEBUG=y
CONFIG_AFS_FSCACHE=y
CONFIG_AFS_DEBUG_CURSOR=y
CONFIG_9P_FS=m
CONFIG_9P_FSCACHE=y
CONFIG_9P_FS_POSIX_ACL=y
CONFIG_9P_FS_SECURITY=y
CONFIG_NLS=m
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_MAC_ROMAN=m
CONFIG_NLS_MAC_CELTIC=m
CONFIG_NLS_MAC_CENTEURO=m
CONFIG_NLS_MAC_CROATIAN=m
CONFIG_NLS_MAC_CYRILLIC=m
CONFIG_NLS_MAC_GAELIC=m
CONFIG_NLS_MAC_GREEK=m
CONFIG_NLS_MAC_ICELAND=m
CONFIG_NLS_MAC_INUIT=m
CONFIG_NLS_MAC_ROMANIAN=m
CONFIG_NLS_MAC_TURKISH=m
CONFIG_NLS_UTF8=m
CONFIG_DLM=m
CONFIG_DLM_DEPRECATED_API=y
CONFIG_DLM_DEBUG=y
CONFIG_UNICODE=m
CONFIG_UNICODE_NORMALIZATION_SELFTEST=m
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_REQUEST_CACHE=y
CONFIG_PERSISTENT_KEYRINGS=y
CONFIG_TRUSTED_KEYS=m

#
# No trust source selected!
#
CONFIG_ENCRYPTED_KEYS=y
CONFIG_USER_DECRYPTED_DATA=y
CONFIG_KEY_DH_OPERATIONS=y
CONFIG_KEY_NOTIFICATIONS=y
CONFIG_SECURITY_DMESG_RESTRICT=y
CONFIG_SECURITY=y
CONFIG_SECURITY_WRITABLE_HOOKS=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_PATH=y
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
CONFIG_HARDENED_USERCOPY=y
CONFIG_FORTIFY_SOURCE=y
CONFIG_STATIC_USERMODEHELPER=y
CONFIG_STATIC_USERMODEHELPER_PATH="/sbin/usermode-helper"
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=0
CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS=9
CONFIG_SECURITY_SELINUX_SID2STR_CACHE_SIZE=256
CONFIG_SECURITY_SMACK=y
CONFIG_SECURITY_SMACK_BRINGUP=y
CONFIG_SECURITY_SMACK_NETFILTER=y
CONFIG_SECURITY_SMACK_APPEND_SIGNALS=y
CONFIG_SECURITY_TOMOYO=y
CONFIG_SECURITY_TOMOYO_MAX_ACCEPT_ENTRY=2048
CONFIG_SECURITY_TOMOYO_MAX_AUDIT_LOG=1024
CONFIG_SECURITY_TOMOYO_OMIT_USERSPACE_LOADER=y
CONFIG_SECURITY_TOMOYO_INSECURE_BUILTIN_SETTING=y
CONFIG_SECURITY_APPARMOR=y
CONFIG_SECURITY_APPARMOR_DEBUG=y
CONFIG_SECURITY_APPARMOR_DEBUG_ASSERTS=y
CONFIG_SECURITY_APPARMOR_DEBUG_MESSAGES=y
CONFIG_SECURITY_APPARMOR_INTROSPECT_POLICY=y
CONFIG_SECURITY_APPARMOR_HASH=y
CONFIG_SECURITY_APPARMOR_HASH_DEFAULT=y
CONFIG_SECURITY_APPARMOR_EXPORT_BINARY=y
CONFIG_SECURITY_APPARMOR_PARANOID_LOAD=y
CONFIG_SECURITY_LOADPIN=y
CONFIG_SECURITY_LOADPIN_ENFORCE=y
CONFIG_SECURITY_YAMA=y
CONFIG_SECURITY_SAFESETID=y
CONFIG_SECURITY_LOCKDOWN_LSM=y
CONFIG_SECURITY_LOCKDOWN_LSM_EARLY=y
CONFIG_LOCK_DOWN_KERNEL_FORCE_NONE=y
# CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY is not set
# CONFIG_LOCK_DOWN_KERNEL_FORCE_CONFIDENTIALITY is not set
CONFIG_SECURITY_LANDLOCK=y
CONFIG_INTEGRITY=y
CONFIG_INTEGRITY_SIGNATURE=y
CONFIG_INTEGRITY_ASYMMETRIC_KEYS=y
CONFIG_INTEGRITY_TRUSTED_KEYRING=y
CONFIG_INTEGRITY_PLATFORM_KEYRING=y
CONFIG_LOAD_IPL_KEYS=y
CONFIG_INTEGRITY_AUDIT=y
CONFIG_IMA=y
CONFIG_IMA_MEASURE_PCR_IDX=10
CONFIG_IMA_LSM_RULES=y
CONFIG_IMA_NG_TEMPLATE=y
# CONFIG_IMA_SIG_TEMPLATE is not set
CONFIG_IMA_DEFAULT_TEMPLATE="ima-ng"
CONFIG_IMA_DEFAULT_HASH_SHA1=y
# CONFIG_IMA_DEFAULT_HASH_SHA256 is not set
# CONFIG_IMA_DEFAULT_HASH_SHA512 is not set
CONFIG_IMA_DEFAULT_HASH="sha1"
CONFIG_IMA_WRITE_POLICY=y
CONFIG_IMA_READ_POLICY=y
CONFIG_IMA_APPRAISE=y
CONFIG_IMA_ARCH_POLICY=y
CONFIG_IMA_APPRAISE_BUILD_POLICY=y
CONFIG_IMA_APPRAISE_REQUIRE_FIRMWARE_SIGS=y
CONFIG_IMA_APPRAISE_REQUIRE_KEXEC_SIGS=y
CONFIG_IMA_APPRAISE_REQUIRE_MODULE_SIGS=y
CONFIG_IMA_APPRAISE_REQUIRE_POLICY_SIGS=y
CONFIG_IMA_APPRAISE_BOOTPARAM=y
CONFIG_IMA_APPRAISE_MODSIG=y
CONFIG_IMA_TRUSTED_KEYRING=y
CONFIG_IMA_KEYRINGS_PERMIT_SIGNED_BY_BUILTIN_OR_SECONDARY=y
CONFIG_IMA_BLACKLIST_KEYRING=y
CONFIG_IMA_LOAD_X509=y
CONFIG_IMA_X509_PATH="/etc/keys/x509_ima.der"
CONFIG_IMA_APPRAISE_SIGNED_INIT=y
CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS=y
CONFIG_IMA_QUEUE_EARLY_BOOT_KEYS=y
CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT=y
CONFIG_IMA_DISABLE_HTABLE=y
CONFIG_EVM=y
CONFIG_EVM_ATTR_FSUUID=y
CONFIG_EVM_EXTRA_SMACK_XATTRS=y
CONFIG_EVM_ADD_XATTRS=y
CONFIG_EVM_LOAD_X509=y
CONFIG_EVM_X509_PATH="/etc/keys/x509_evm.der"
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_SMACK is not set
# CONFIG_DEFAULT_SECURITY_TOMOYO is not set
# CONFIG_DEFAULT_SECURITY_APPARMOR is not set
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor,bpf"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_BARE=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
# CONFIG_INIT_STACK_NONE is not set
CONFIG_INIT_STACK_ALL_PATTERN=y
# CONFIG_INIT_STACK_ALL_ZERO is not set
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
CONFIG_INIT_ON_FREE_DEFAULT_ON=y
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
CONFIG_ZERO_CALL_USED_REGS=y
# end of Memory initialization

# CONFIG_RANDSTRUCT_NONE is not set
CONFIG_RANDSTRUCT_FULL=y
# CONFIG_RANDSTRUCT_PERFORMANCE is not set
CONFIG_RANDSTRUCT=y
CONFIG_GCC_PLUGIN_RANDSTRUCT=y
# end of Kernel hardening options
# end of Security options

CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_ASYNC_PQ=m
CONFIG_ASYNC_RAID6_RECOV=m
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_KPP=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_USER=m
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_NULL2=y
CONFIG_CRYPTO_PCRYPT=m
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_AUTHENC=m
CONFIG_CRYPTO_TEST=m
CONFIG_CRYPTO_ENGINE=m
# end of Crypto core or helper

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
CONFIG_CRYPTO_DH=y
CONFIG_CRYPTO_DH_RFC7919_GROUPS=y
CONFIG_CRYPTO_ECC=m
CONFIG_CRYPTO_ECDH=m
CONFIG_CRYPTO_ECDSA=m
CONFIG_CRYPTO_ECRDSA=m
CONFIG_CRYPTO_SM2=m
CONFIG_CRYPTO_CURVE25519=m
# end of Public-key cryptography

#
# Block ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_TI=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARIA=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_BLOWFISH_COMMON=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAST_COMMON=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_SEED=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_SM4=m
CONFIG_CRYPTO_SM4_GENERIC=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
# end of Block ciphers

#
# Length-preserving ciphers and modes
#
CONFIG_CRYPTO_ADIANTUM=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_CHACHA20=m
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CFB=m
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_HCTR2=m
CONFIG_CRYPTO_KEYWRAP=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_OFB=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_XCTR=m
CONFIG_CRYPTO_XTS=m
CONFIG_CRYPTO_NHPOLY1305=m
# end of Length-preserving ciphers and modes

#
# AEAD (authenticated encryption with associated data) ciphers
#
CONFIG_CRYPTO_AEGIS128=m
CONFIG_CRYPTO_CHACHA20POLY1305=m
CONFIG_CRYPTO_CCM=m
CONFIG_CRYPTO_GCM=m
CONFIG_CRYPTO_SEQIV=m
CONFIG_CRYPTO_ECHAINIV=m
CONFIG_CRYPTO_ESSIV=m
# end of AEAD (authenticated encryption with associated data) ciphers

#
# Hashes, digests, and MACs
#
CONFIG_CRYPTO_BLAKE2B=m
CONFIG_CRYPTO_CMAC=m
CONFIG_CRYPTO_GHASH=m
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_POLYVAL=m
CONFIG_CRYPTO_POLY1305=m
CONFIG_CRYPTO_RMD160=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_SHA3=m
CONFIG_CRYPTO_SM3=m
CONFIG_CRYPTO_SM3_GENERIC=m
CONFIG_CRYPTO_STREEBOG=m
CONFIG_CRYPTO_VMAC=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_XCBC=m
CONFIG_CRYPTO_XXHASH=m
# end of Hashes, digests, and MACs

#
# CRCs (cyclic redundancy checks)
#
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_CRC32=m
CONFIG_CRYPTO_CRCT10DIF=m
CONFIG_CRYPTO_CRC64_ROCKSOFT=m
# end of CRCs (cyclic redundancy checks)

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_LZO=y
CONFIG_CRYPTO_842=m
CONFIG_CRYPTO_LZ4=m
CONFIG_CRYPTO_LZ4HC=m
CONFIG_CRYPTO_ZSTD=m
# end of Compression

#
# Random number generation
#
CONFIG_CRYPTO_ANSI_CPRNG=m
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
CONFIG_CRYPTO_DRBG_HASH=y
CONFIG_CRYPTO_DRBG_CTR=y
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_KDF800108_CTR=y
# end of Random number generation

#
# Userspace interface
#
CONFIG_CRYPTO_USER_API=m
CONFIG_CRYPTO_USER_API_HASH=m
CONFIG_CRYPTO_USER_API_SKCIPHER=m
CONFIG_CRYPTO_USER_API_RNG=m
CONFIG_CRYPTO_USER_API_RNG_CAVP=y
CONFIG_CRYPTO_USER_API_AEAD=m
CONFIG_CRYPTO_USER_API_ENABLE_OBSOLETE=y
CONFIG_CRYPTO_STATS=y
# end of Userspace interface

CONFIG_CRYPTO_HASH_INFO=y

#
# Accelerated Cryptographic Algorithms for CPU (s390)
#
CONFIG_CRYPTO_CRC32_S390=m
CONFIG_CRYPTO_SHA512_S390=m
CONFIG_CRYPTO_SHA1_S390=m
CONFIG_CRYPTO_SHA256_S390=m
CONFIG_CRYPTO_SHA3_256_S390=m
CONFIG_CRYPTO_SHA3_512_S390=m
CONFIG_CRYPTO_GHASH_S390=m
CONFIG_CRYPTO_AES_S390=m
CONFIG_CRYPTO_DES_S390=m
CONFIG_CRYPTO_CHACHA_S390=m
# end of Accelerated Cryptographic Algorithms for CPU (s390)

CONFIG_CRYPTO_HW=y
CONFIG_ZCRYPT=m
CONFIG_ZCRYPT_DEBUG=y
CONFIG_ZCRYPT_MULTIDEVNODES=y
CONFIG_PKEY=m
CONFIG_CRYPTO_PAES_S390=m
CONFIG_S390_PRNG=m
CONFIG_CRYPTO_DEV_ATMEL_I2C=m
CONFIG_CRYPTO_DEV_ATMEL_ECC=m
CONFIG_CRYPTO_DEV_ATMEL_SHA204A=m
CONFIG_CRYPTO_DEV_VIRTIO=m
CONFIG_CRYPTO_DEV_CCREE=m
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_X509_CERTIFICATE_PARSER=y
CONFIG_PKCS8_PRIVATE_KEY_PARSER=m
CONFIG_PKCS7_MESSAGE_PARSER=y
CONFIG_PKCS7_TEST_KEY=m
CONFIG_SIGNED_PE_FILE_VERIFICATION=y
CONFIG_FIPS_SIGNATURE_SELFTEST=y

#
# Certificates for signature checking
#
CONFIG_MODULE_SIG_KEY="certs/signing_key.pem"
CONFIG_MODULE_SIG_KEY_TYPE_RSA=y
# CONFIG_MODULE_SIG_KEY_TYPE_ECDSA is not set
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_SYSTEM_TRUSTED_KEYS=""
CONFIG_SYSTEM_EXTRA_CERTIFICATE=y
CONFIG_SYSTEM_EXTRA_CERTIFICATE_SIZE=4096
CONFIG_SECONDARY_TRUSTED_KEYRING=y
CONFIG_SYSTEM_BLACKLIST_KEYRING=y
CONFIG_SYSTEM_BLACKLIST_HASH_LIST=""
CONFIG_SYSTEM_REVOCATION_LIST=y
CONFIG_SYSTEM_REVOCATION_KEYS=""
CONFIG_SYSTEM_BLACKLIST_AUTH_UPDATE=y
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_RAID6_PQ_BENCHMARK=y
CONFIG_LINEAR_RANGES=y
CONFIG_PACKING=y
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_CORDIC=m
CONFIG_PRIME_NUMBERS=m
CONFIG_RATIONAL=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_UTILS=y
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_ARC4=m
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
CONFIG_CRYPTO_ARCH_HAVE_LIB_CHACHA=m
CONFIG_CRYPTO_LIB_CHACHA_GENERIC=m
CONFIG_CRYPTO_LIB_CHACHA=m
CONFIG_CRYPTO_LIB_CURVE25519_GENERIC=m
CONFIG_CRYPTO_LIB_CURVE25519=m
CONFIG_CRYPTO_LIB_DES=m
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=1
CONFIG_CRYPTO_LIB_POLY1305_GENERIC=m
CONFIG_CRYPTO_LIB_POLY1305=m
CONFIG_CRYPTO_LIB_CHACHA20POLY1305=m
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=m
CONFIG_CRC64_ROCKSOFT=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC32_SELFTEST=m
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
CONFIG_CRC64=m
CONFIG_CRC4=m
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_CRC8=y
CONFIG_XXHASH=y
CONFIG_RANDOM32_SELFTEST=y
CONFIG_842_COMPRESS=m
CONFIG_842_DECOMPRESS=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_ZLIB_DFLTCC=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_COMPRESS=m
CONFIG_LZ4HC_COMPRESS=m
CONFIG_LZ4_DECOMPRESS=y
CONFIG_ZSTD_COMMON=y
CONFIG_ZSTD_COMPRESS=m
CONFIG_ZSTD_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_MICROLZMA=y
CONFIG_XZ_DEC_BCJ=y
CONFIG_XZ_DEC_TEST=m
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_REED_SOLOMON=m
CONFIG_REED_SOLOMON_DEC8=y
CONFIG_REED_SOLOMON_ENC16=y
CONFIG_REED_SOLOMON_DEC16=y
CONFIG_BCH=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DMA_DECLARE_COHERENT=y
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED=y
CONFIG_SWIOTLB=y
CONFIG_DMA_RESTRICTED_POOL=y
CONFIG_DMA_CMA=y
CONFIG_DMA_PERNUMA_CMA=y

#
# Default contiguous memory area size:
#
CONFIG_CMA_SIZE_MBYTES=16
CONFIG_CMA_SIZE_SEL_MBYTES=y
# CONFIG_CMA_SIZE_SEL_PERCENTAGE is not set
# CONFIG_CMA_SIZE_SEL_MIN is not set
# CONFIG_CMA_SIZE_SEL_MAX is not set
CONFIG_CMA_ALIGNMENT=8
CONFIG_DMA_API_DEBUG=y
CONFIG_DMA_API_DEBUG_SG=y
CONFIG_DMA_MAP_BENCHMARK=y
CONFIG_SGL_ALLOC=y
CONFIG_CPUMASK_OFFSTACK=y
CONFIG_FORCE_NR_CPUS=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
CONFIG_GLOB_SELFTEST=m
CONFIG_NLATTR=y
CONFIG_LRU_CACHE=m
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=y
CONFIG_SIGNATURE=y
CONFIG_LIBFDT=y
CONFIG_OID_REGISTRY=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_SG_POOL=y
CONFIG_ARCH_STACKWALK=y
CONFIG_STACKDEPOT=y
CONFIG_STACKDEPOT_ALWAYS_INIT=y
CONFIG_REF_TRACKER=y
CONFIG_SBITMAP=y
# end of Library routines

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_PRINTK_CALLER=y
CONFIG_STACKTRACE_BUILD_ID=y
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Compile-time checks and compiler options
#
CONFIG_AS_HAS_NON_CONST_LEB128=y
CONFIG_DEBUG_INFO_NONE=y
# CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT is not set
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
CONFIG_FRAME_WARN=8192
CONFIG_STRIP_ASM_SYMS=y
CONFIG_READABLE_ASM=y
CONFIG_HEADERS_INSTALL=y
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_VMLINUX_MAP=y
CONFIG_DEBUG_FORCE_WEAK_PER_CPU=y
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
CONFIG_UBSAN=y
# CONFIG_UBSAN_TRAP is not set
CONFIG_CC_HAS_UBSAN_BOUNDS=y
CONFIG_UBSAN_BOUNDS=y
CONFIG_UBSAN_ONLY_BOUNDS=y
CONFIG_UBSAN_SHIFT=y
CONFIG_UBSAN_DIV_ZERO=y
CONFIG_UBSAN_UNREACHABLE=y
CONFIG_UBSAN_BOOL=y
CONFIG_UBSAN_ENUM=y
# CONFIG_UBSAN_ALIGNMENT is not set
CONFIG_UBSAN_SANITIZE_ALL=y
CONFIG_TEST_UBSAN=m
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
CONFIG_NET_DEV_REFCNT_TRACKER=y
CONFIG_NET_NS_REFCNT_TRACKER=y
CONFIG_DEBUG_NET=y
# end of Networking Debugging

#
# Memory Debugging
#
CONFIG_PAGE_EXTENSION=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_PAGE_OWNER=y
CONFIG_PAGE_POISONING=y
CONFIG_DEBUG_PAGE_REF=y
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_ARCH_HAS_DEBUG_WX=y
CONFIG_DEBUG_WX=y
CONFIG_GENERIC_PTDUMP=y
CONFIG_PTDUMP_CORE=y
CONFIG_PTDUMP_DEBUGFS=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_SELFTEST=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_WORK=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_SHRINKER_DEBUG=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE=16000
CONFIG_DEBUG_KMEMLEAK_TEST=m
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_SCHED_STACK_END_CHECK=y
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_MAPLE_TREE=y
CONFIG_DEBUG_VM_RB=y
CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_OUTLINE=y
# CONFIG_KASAN_INLINE is not set
CONFIG_KASAN_STACK=y
CONFIG_KASAN_VMALLOC=y
CONFIG_KASAN_KUNIT_TEST=m
CONFIG_KASAN_MODULE_TEST=m
CONFIG_HAVE_ARCH_KFENCE=y
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=100
CONFIG_KFENCE_NUM_OBJECTS=255
CONFIG_KFENCE_DEFERRABLE=y
CONFIG_KFENCE_STATIC_KEYS=y
CONFIG_KFENCE_STRESS_TEST_FAULTS=0
CONFIG_KFENCE_KUNIT_TEST=m
# end of Memory Debugging

CONFIG_DEBUG_SHIRQ=y

#
# Debug Oops, Lockups and Hangs
#
CONFIG_PANIC_ON_OOPS=y
CONFIG_PANIC_ON_OOPS_VALUE=1
CONFIG_PANIC_TIMEOUT=0
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
CONFIG_WQ_WATCHDOG=y
CONFIG_TEST_LOCKUP=m
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# end of Scheduler Debugging

CONFIG_DEBUG_TIMEKEEPING=y

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_LOCK_TORTURE_TEST=m
CONFIG_WW_MUTEX_SELFTEST=m
CONFIG_SCF_TORTURE_TEST=m
CONFIG_CSD_LOCK_WAIT_DEBUG=y
# end of Lock Debugging (spinlocks, mutexes, etc...)

CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_IRQFLAGS=y
CONFIG_STACKTRACE=y
CONFIG_WARN_ALL_UNSEEDED_RANDOM=y
CONFIG_DEBUG_KOBJECT=y
CONFIG_DEBUG_KOBJECT_RELEASE=y

#
# Debug kernel data structures
#
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_PLIST=y
CONFIG_DEBUG_SG=y
CONFIG_DEBUG_NOTIFIERS=y
CONFIG_BUG_ON_DATA_CORRUPTION=y
CONFIG_DEBUG_MAPLE_TREE=y
# end of Debug kernel data structures

CONFIG_DEBUG_CREDENTIALS=y

#
# RCU Debugging
#
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_LIST=y
CONFIG_TORTURE_TEST=m
CONFIG_RCU_SCALE_TEST=m
CONFIG_RCU_TORTURE_TEST=m
CONFIG_RCU_REF_SCALE_TEST=m
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
CONFIG_RCU_TRACE=y
CONFIG_RCU_EQS_DEBUG=y
# end of RCU Debugging

CONFIG_DEBUG_WQ_FORCE_RR_CPU=y
CONFIG_CPU_HOTPLUG_STATE_CONTROL=y
CONFIG_LATENCYTOP=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_NOP_MCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_BOOTTIME_TRACING=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_STACK_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_HWLAT_TRACER=y
CONFIG_OSNOISE_TRACER=y
CONFIG_TIMERLAT_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
CONFIG_KPROBE_EVENTS_ON_NOTRACE=y
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
CONFIG_TRACING_MAP=y
CONFIG_SYNTH_EVENTS=y
CONFIG_HIST_TRIGGERS=y
CONFIG_TRACE_EVENT_INJECT=y
CONFIG_TRACEPOINT_BENCHMARK=y
CONFIG_RING_BUFFER_BENCHMARK=m
CONFIG_TRACE_EVAL_MAP_FILE=y
CONFIG_FTRACE_RECORD_RECURSION=y
CONFIG_FTRACE_RECORD_RECURSION_SIZE=128
CONFIG_RING_BUFFER_RECORD_RECURSION=y
CONFIG_GCOV_PROFILE_FTRACE=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_EVENT_TRACE_STARTUP_TEST=y
CONFIG_EVENT_TRACE_TEST_SYSCALLS=y
CONFIG_RING_BUFFER_STARTUP_TEST=y
CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS=y
CONFIG_PREEMPTIRQ_DELAY_TEST=m
CONFIG_SYNTH_EVENT_GEN_TEST=m
CONFIG_KPROBE_EVENT_GEN_TEST=m
CONFIG_HIST_TRIGGERS_DEBUG=y
CONFIG_DA_MON_EVENTS=y
CONFIG_DA_MON_EVENTS_ID=y
CONFIG_RV=y
CONFIG_RV_MON_WWNR=y
CONFIG_RV_REACTORS=y
CONFIG_RV_REACT_PRINTK=y
CONFIG_RV_REACT_PANIC=y
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
CONFIG_IO_STRICT_DEVMEM=y

#
# s390 Debugging
#
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_ENTRY=y
CONFIG_CIO_INJECT=y
# end of s390 Debugging

#
# Kernel Testing and Coverage
#
CONFIG_KUNIT=m
CONFIG_KUNIT_DEBUGFS=y
CONFIG_KUNIT_TEST=m
CONFIG_KUNIT_EXAMPLE_TEST=m
CONFIG_KUNIT_ALL_TESTS=m
CONFIG_KUNIT_DEFAULT_ENABLED=y
CONFIG_NOTIFIER_ERROR_INJECTION=m
CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT=m
CONFIG_NETDEV_NOTIFIER_ERROR_INJECT=m
CONFIG_FUNCTION_ERROR_INJECTION=y
CONFIG_FAULT_INJECTION=y
CONFIG_FAILSLAB=y
CONFIG_FAIL_PAGE_ALLOC=y
CONFIG_FAULT_INJECTION_USERCOPY=y
CONFIG_FAIL_MAKE_REQUEST=y
CONFIG_FAIL_IO_TIMEOUT=y
CONFIG_FAIL_FUTEX=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_FAIL_FUNCTION=y
CONFIG_FAIL_SUNRPC=y
CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
CONFIG_KCOV=y
CONFIG_KCOV_ENABLE_COMPARISONS=y
CONFIG_KCOV_INSTRUMENT_ALL=y
CONFIG_KCOV_IRQ_AREA_SIZE=0x40000
CONFIG_RUNTIME_TESTING_MENU=y
CONFIG_LKDTM=m
CONFIG_CPUMASK_KUNIT_TEST=m
CONFIG_TEST_LIST_SORT=m
CONFIG_TEST_MIN_HEAP=m
CONFIG_TEST_SORT=m
CONFIG_TEST_DIV64=m
CONFIG_KPROBES_SANITY_TEST=m
CONFIG_BACKTRACE_SELF_TEST=m
CONFIG_TEST_REF_TRACKER=m
CONFIG_RBTREE_TEST=m
CONFIG_REED_SOLOMON_TEST=m
CONFIG_INTERVAL_TREE_TEST=m
CONFIG_PERCPU_TEST=m
CONFIG_ATOMIC64_SELFTEST=m
CONFIG_ASYNC_RAID6_TEST=m
CONFIG_TEST_HEXDUMP=m
CONFIG_STRING_SELFTEST=m
CONFIG_TEST_STRING_HELPERS=m
# CONFIG_TEST_STRSCPY is not set
CONFIG_TEST_KSTRTOX=m
CONFIG_TEST_PRINTF=m
CONFIG_TEST_SCANF=m
CONFIG_TEST_BITMAP=m
CONFIG_TEST_UUID=m
CONFIG_TEST_XARRAY=m
CONFIG_TEST_RHASHTABLE=m
# CONFIG_TEST_SIPHASH is not set
CONFIG_TEST_IDA=m
CONFIG_TEST_LKM=m
CONFIG_TEST_BITOPS=m
CONFIG_TEST_VMALLOC=m
CONFIG_TEST_USER_COPY=m
CONFIG_TEST_BPF=m
CONFIG_TEST_BLACKHOLE_DEV=m
CONFIG_FIND_BIT_BENCHMARK=m
CONFIG_TEST_FIRMWARE=m
CONFIG_TEST_SYSCTL=m
CONFIG_BITFIELD_KUNIT=m
CONFIG_HASH_KUNIT_TEST=m
CONFIG_RESOURCE_KUNIT_TEST=m
CONFIG_SYSCTL_KUNIT_TEST=m
CONFIG_LIST_KUNIT_TEST=m
CONFIG_LINEAR_RANGES_TEST=m
CONFIG_CMDLINE_KUNIT_TEST=m
CONFIG_BITS_TEST=m
CONFIG_SLUB_KUNIT_TEST=m
CONFIG_RATIONAL_KUNIT_TEST=m
CONFIG_MEMCPY_KUNIT_TEST=m
CONFIG_IS_SIGNED_TYPE_KUNIT_TEST=m
CONFIG_OVERFLOW_KUNIT_TEST=m
CONFIG_STACKINIT_KUNIT_TEST=m
CONFIG_FORTIFY_KUNIT_TEST=m
CONFIG_TEST_UDELAY=m
CONFIG_TEST_STATIC_KEYS=m
CONFIG_TEST_DYNAMIC_DEBUG=m
CONFIG_TEST_KMOD=m
CONFIG_TEST_MEMCAT_P=m
CONFIG_TEST_LIVEPATCH=m
CONFIG_TEST_MEMINIT=m
CONFIG_TEST_FREE_PAGES=m
# end of Kernel Testing and Coverage

#
# Rust hacking
#
# end of Rust hacking
# end of Kernel hacking

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-08 14:04     ` Anthony Krowiak
@ 2022-11-09 14:49       ` Anthony Krowiak
  2022-11-09 16:12         ` Jason Gunthorpe
  2022-11-09 19:09         ` Anthony Krowiak
  0 siblings, 2 replies; 99+ messages in thread
From: Anthony Krowiak @ 2022-11-09 14:49 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe, Cornelia Huck,
	Eric Farman, Matthew Rosato, Niklas Schnelle, Halil Pasic,
	Jason Herne, linux-s390
  Cc: iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen


On 11/8/22 9:04 AM, Anthony Krowiak wrote:
>
> On 11/8/22 5:12 AM, Christian Borntraeger wrote:
>>
>>
>> Am 08.11.22 um 02:09 schrieb Jason Gunthorpe:
>>> On Mon, Nov 07, 2022 at 08:48:53PM -0400, Jason Gunthorpe wrote:
>>>> [
>>>> This has been in linux-next for a little while now, and we've 
>>>> completed
>>>> the syzkaller run. 1300 hours of CPU time have been invested since the
>>>> last report with no improvement in coverage or new detections. 
>>>> syzkaller
>>>> coverage reached 69%(75%), and review of the misses show substantial
>>>> amounts are WARN_ON's and other debugging which are not expected to be
>>>> covered.
>>>> ]
>>>>
>>>> iommufd is the user API to control the IOMMU subsystem as it 
>>>> relates to
>>>> managing IO page tables that point at user space memory.
>>>
>>> [chop cc list]
>>>
>>> s390 mdev maintainers,
>>>
>>> Can I ask your help to test this with the two S390 mdev drivers? Now
>>> that gvt is passing and we've covered alot of the QA ground it is a
>>> good time to run it.
>>>
>>> Take the branch from here:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next 
>>>
>>
>>>
>>> And build the kernel with
>>>
>>> CONFIG_VFIO_CONTAINER=n
>>> CONFIG_IOMMUFD=y
>>> CONFIG_IOMMUFD_VFIO_CONTAINER=y
>>>
>>> And your existing stuff should work with iommufd providing the iommu
>>> support to vfio. There will be a dmesg confirming this.
>>
>> Gave it a quick spin with vfio_ap:
>> [  401.679199] vfio_ap_mdev b01a7c33-9696-48b2-9a98-050e8e17c69a: 
>> Adding to iommu group 1
>> [  402.085386] iommufd: IOMMUFD is providing /dev/vfio/vfio, not VFIO.
>>
>> Some tests seem to work, but others dont (running into timeouts). I 
>> need to look
>> into that (or ideally Tony will have a look, FWIW 
>> tests.test_vfio_ap.VfioAPAssignMdevToGuestTest
>> fails for me.
>
>
> I'm looking into it.


I cloned the 
https://lore.kernel.org/kvm/Y2q3nFXwOk9jul5u@nvidia.com/T/#m76a9c609c5ccd1494c05c6f598f9c8e75b7c9888 
repo and ran the vfio_ap test cases. The tests ran without encountering 
the errors related to the vfio_pin_pages() function, but I did see two 
tests fail attempting to run crypto tests on the guest. I also saw a 
WARN_ON stack trace in the dmesg output indicating a timeout occurred 
trying to verify the completion of a queue reset. The reset problem has 
reared its ugly head in our CI, so this may be a good thing as it will 
allow me to debug why it's happening.


>
>
>>
>>
>> The same kernel tree with defconfig (instead of 
>> CONFIG_IOMMUFD_VFIO_CONTAINER=y) works fine.
>>>
>>> Let me know if there are any problems!
>>>
>>> If I recall there was some desire from the S390 platform team to start
>>> building on iommufd to create some vIOMMU acceleration for S390
>>> guests, this is a necessary first step.
>>>
>>> Thanks,
>>> Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-09 14:49       ` Anthony Krowiak
@ 2022-11-09 16:12         ` Jason Gunthorpe
  2022-11-09 19:13           ` Anthony Krowiak
  2022-11-09 19:09         ` Anthony Krowiak
  1 sibling, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-09 16:12 UTC (permalink / raw)
  To: Anthony Krowiak
  Cc: Christian Borntraeger, Cornelia Huck, Eric Farman,
	Matthew Rosato, Niklas Schnelle, Halil Pasic, Jason Herne,
	linux-s390, iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu,
	Nicolin Chen

On Wed, Nov 09, 2022 at 09:49:01AM -0500, Anthony Krowiak wrote:
> I cloned the https://lore.kernel.org/kvm/Y2q3nFXwOk9jul5u@nvidia.com/T/#m76a9c609c5ccd1494c05c6f598f9c8e75b7c9888
> repo and ran the vfio_ap test cases. The tests ran without encountering the
> errors related to the vfio_pin_pages() function

I updated the git repos with this change now

> but I did see two tests fail attempting to run crypto tests on the
> guest. I also saw a WARN_ON stack trace in the dmesg output
> indicating a timeout occurred trying to verify the completion of a
> queue reset. The reset problem has reared its ugly head in our CI,
> so this may be a good thing as it will allow me to debug why its
> happening.

Please let me know if you think this is iommufd-related; from your
description it sounds like an existing latent bug?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-09 14:49       ` Anthony Krowiak
  2022-11-09 16:12         ` Jason Gunthorpe
@ 2022-11-09 19:09         ` Anthony Krowiak
  1 sibling, 0 replies; 99+ messages in thread
From: Anthony Krowiak @ 2022-11-09 19:09 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe, Cornelia Huck,
	Eric Farman, Matthew Rosato, Niklas Schnelle, Halil Pasic,
	Jason Herne, linux-s390
  Cc: iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu, Nicolin Chen


On 11/9/22 9:49 AM, Anthony Krowiak wrote:
>
> On 11/8/22 9:04 AM, Anthony Krowiak wrote:
>>
>> On 11/8/22 5:12 AM, Christian Borntraeger wrote:
>>>
>>>
>>> Am 08.11.22 um 02:09 schrieb Jason Gunthorpe:
>>>> On Mon, Nov 07, 2022 at 08:48:53PM -0400, Jason Gunthorpe wrote:
>>>>> [
>>>>> This has been in linux-next for a little while now, and we've 
>>>>> completed
>>>>> the syzkaller run. 1300 hours of CPU time have been invested since 
>>>>> the
>>>>> last report with no improvement in coverage or new detections. 
>>>>> syzkaller
>>>>> coverage reached 69%(75%), and review of the misses show substantial
>>>>> amounts are WARN_ON's and other debugging which are not expected 
>>>>> to be
>>>>> covered.
>>>>> ]
>>>>>
>>>>> iommufd is the user API to control the IOMMU subsystem as it 
>>>>> relates to
>>>>> managing IO page tables that point at user space memory.
>>>>
>>>> [chop cc list]
>>>>
>>>> s390 mdev maintainers,
>>>>
>>>> Can I ask your help to test this with the two S390 mdev drivers? Now
>>>> that gvt is passing and we've covered alot of the QA ground it is a
>>>> good time to run it.
>>>>
>>>> Take the branch from here:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next 
>>>>
>>>
>>>>
>>>> And build the kernel with
>>>>
>>>> CONFIG_VFIO_CONTAINER=n
>>>> CONFIG_IOMMUFD=y
>>>> CONFIG_IOMMUFD_VFIO_CONTAINER=y
>>>>
>>>> And your existing stuff should work with iommufd providing the iommu
>>>> support to vfio. There will be a dmesg confirming this.
>>>
>>> Gave it a quick spin with vfio_ap:
>>> [  401.679199] vfio_ap_mdev b01a7c33-9696-48b2-9a98-050e8e17c69a: 
>>> Adding to iommu group 1
>>> [  402.085386] iommufd: IOMMUFD is providing /dev/vfio/vfio, not VFIO.
>>>
>>> Some tests seem to work, but others dont (running into timeouts). I 
>>> need to look
>>> into that (or ideally Tony will have a look, FWIW 
>>> tests.test_vfio_ap.VfioAPAssignMdevToGuestTest
>>> fails for me.
>>
>>
>> I'm looking into it.
>
>
> I cloned the 
> https://lore.kernel.org/kvm/Y2q3nFXwOk9jul5u@nvidia.com/T/#m76a9c609c5ccd1494c05c6f598f9c8e75b7c9888 
> repo and ran the vfio_ap test cases. The tests ran without 
> encountering the errors related to the vfio_pin_pages() function, but 
> I did see two tests fail attempting to run crypto tests on the guest. 
> I also saw a WARN_ON stack trace in the dmesg output indicating a 
> timeout occurred trying to verify the completion of a queue reset. The 
> reset problem has reared its ugly head in our CI, so this may be a 
> good thing as it will allow me to debug why its happening.


The problems I encountered were due to using a set of regression tests 
that were not vanilla. They contained some changes I made to try to 
improve performance of the tests. After restoring the vanilla regression 
tests, I was able to successfully execute the tests without any problems 
or errors with Jason's vfio_ap_(un)pin_pages patch installed, so that 
looks good to me.


>
>
>>
>>
>>>
>>>
>>> The same kernel tree with defconfig (instead of 
>>> CONFIG_IOMMUFD_VFIO_CONTAINER=y) works fine.
>>>>
>>>> Let me know if there are any problems!
>>>>
>>>> If I recall there was some desire from the S390 platform team to start
>>>> building on iommufd to create some vIOMMU acceleration for S390
>>>> guests, this is a necessary first step.
>>>>
>>>> Thanks,
>>>> Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-09 16:12         ` Jason Gunthorpe
@ 2022-11-09 19:13           ` Anthony Krowiak
  2022-11-09 20:43             ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Anthony Krowiak @ 2022-11-09 19:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Cornelia Huck, Eric Farman,
	Matthew Rosato, Niklas Schnelle, Halil Pasic, Jason Herne,
	linux-s390, iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu,
	Nicolin Chen


On 11/9/22 11:12 AM, Jason Gunthorpe wrote:
> On Wed, Nov 09, 2022 at 09:49:01AM -0500, Anthony Krowiak wrote:
>> I cloned the https://lore.kernel.org/kvm/Y2q3nFXwOk9jul5u@nvidia.com/T/#m76a9c609c5ccd1494c05c6f598f9c8e75b7c9888
>> repo and ran the vfio_ap test cases. The tests ran without encountering the
>> errors related to the vfio_pin_pages() function
> I updated the git repos with this change now
>
>> but I did see two tests fail attempting to run crypto tests on the
>> guest. I also saw a WARN_ON stack trace in the dmesg output
>> indicating a timeout occurred trying to verify the completion of a
>> queue reset. The reset problem has reared its ugly head in our CI,
>> so this may be a good thing as it will allow me to debug why its
>> happening.
> Please let me know if you think this is iommufd related, from your
> description it sounds like an existing latent bug?


Just in case you missed my response to my previous email, the problems I 
was seeing were due to using a set of regression tests that I patched to
try to improve the tests' performance. When I ran the vanilla tests they
ran successfully without any problems with your patch. I will continue 
testing but as of now, it looks good to me.


>
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: S390 testing for IOMMUFD
  2022-11-09 19:13           ` Anthony Krowiak
@ 2022-11-09 20:43             ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-09 20:43 UTC (permalink / raw)
  To: Anthony Krowiak
  Cc: Christian Borntraeger, Cornelia Huck, Eric Farman,
	Matthew Rosato, Niklas Schnelle, Halil Pasic, Jason Herne,
	linux-s390, iommu, Kevin Tian, Alex Williamson, kvm, Lu Baolu,
	Nicolin Chen

On Wed, Nov 09, 2022 at 02:13:22PM -0500, Anthony Krowiak wrote:
> 
> On 11/9/22 11:12 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 09, 2022 at 09:49:01AM -0500, Anthony Krowiak wrote:
> > > I cloned the https://lore.kernel.org/kvm/Y2q3nFXwOk9jul5u@nvidia.com/T/#m76a9c609c5ccd1494c05c6f598f9c8e75b7c9888
> > > repo and ran the vfio_ap test cases. The tests ran without encountering the
> > > errors related to the vfio_pin_pages() function
> > I updated the git repos with this change now
> > 
> > > but I did see two tests fail attempting to run crypto tests on the
> > > guest. I also saw a WARN_ON stack trace in the dmesg output
> > > indicating a timeout occurred trying to verify the completion of a
> > > queue reset. The reset problem has reared its ugly head in our CI,
> > > so this may be a good thing as it will allow me to debug why its
> > > happening.
> > Please let me know if you think this is iommufd related, from your
> > description it sounds like an existing latent bug?
> 
> 
> Just in case you missed my response to my previous email, the problems I was
> seeing were due to using a set regression tests that I patched to try to
> improve the tests performance. When I ran the vanilla tests they ran
> successfully without any problems with your patch. I will continue testing
> but as of now, it looks good to me.

Great, can all of you provide some Tested-by's for the various things
you've run?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 15/17] iommufd: Add a selftest
  2022-11-08  0:49 ` [PATCH v4 15/17] iommufd: Add a selftest Jason Gunthorpe
  2022-11-08  1:01   ` Jason Gunthorpe
  2022-11-08  5:48   ` Nicolin Chen
@ 2022-11-09 23:51   ` Matthew Rosato
  2 siblings, 0 replies; 99+ messages in thread
From: Matthew Rosato @ 2022-11-09 23:51 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu,
	Keqian Zhu

On 11/7/22 7:49 PM, Jason Gunthorpe wrote:
> Cover the essential functionality of the iommufd with a directed
> test. This aims to achieve reasonable functional coverage using the
> in-kernel self test framework.
> 
> It provides a mock kernel module for the iommu_domain that allows it to
> run without any HW and the mocking provides a way to directly validate
> that the PFNs loaded into the iommu_domain are correct.
> 
> The mock also simulates the rare case of PAGE_SIZE > iommu page size as
> the mock will operate at a 2K iommu page size. This allows exercising all
> of the calculations to support this mismatch.
> 
> This allows achieving high coverage of the corner cases in the iopt_pages.
> 
> However, it is an unusually invasive config option to enable all of
> this. The config option should not be enabled in a production kernel.
> 
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Ran the selftests on s390 in both an LPAR (z16) and a QEMU KVM guest using 1M hugepages; all tests are passing.

Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390
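
The PAGE_SIZE vs. 2 KiB mock IOMMU page size mismatch described in the
quoted patch boils down to a simple index scaling; a toy illustration
(the macro and variable names here are made up, not taken from the
selftest itself):

#include <stdio.h>

/*
 * With 4 KiB CPU pages and a 2 KiB mock IOMMU page size, every pinned
 * CPU page is covered by two IOPTEs, so all index math has to scale by
 * the ratio of the two sizes.
 */
#define TOY_CPU_PAGE_SIZE      4096UL
#define TOY_MOCK_IO_PAGE_SIZE  2048UL

int main(void)
{
        unsigned long ratio = TOY_CPU_PAGE_SIZE / TOY_MOCK_IO_PAGE_SIZE;
        unsigned long page_index = 7;   /* index of a pinned CPU page */

        printf("IOPTEs per CPU page: %lu\n", ratio);
        printf("CPU page %lu maps mock IOPTEs %lu..%lu\n", page_index,
               page_index * ratio, page_index * ratio + ratio - 1);
        return 0;
}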

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
  2022-11-08  3:45   ` Bagas Sanjaya
  2022-11-08 17:10   ` [PATCH v4 4/17] " Jason Gunthorpe
@ 2022-11-10  9:30   ` Bagas Sanjaya
  2022-11-10 14:49     ` Jonathan Corbet
  2022-11-14 20:50   ` Eric Auger
  3 siblings, 1 reply; 99+ messages in thread
From: Bagas Sanjaya @ 2022-11-10  9:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 08:48:57PM -0400, Jason Gunthorpe wrote:
> From: Kevin Tian <kevin.tian@intel.com>
> 
> Add iommufd into the documentation tree, and supply initial documentation.
> Much of this is linked from code comments by kdoc.
> 

The patch also exposes htmldocs warnings, as Stephen Rothwell has
reported on linux-next [1], due to the copyright comments being mistaken
for kernel-doc comments, so I have applied the fixup:

---- >8 ----

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 536a34d099968d..76b3761a89423e 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include <linux/iommufd.h>
 #include <linux/slab.h>
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 1eeb326f74f005..fc4c80ec0511f4 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/* Copyright (C) 2021 Intel Corporation
+/*
+ * Copyright (C) 2021 Intel Corporation
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  *
  * iommufd provides control over the IOMMU HW objects created by IOMMU kernel

Thanks.

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10  9:30   ` [PATCH v4 04/17] " Bagas Sanjaya
@ 2022-11-10 14:49     ` Jonathan Corbet
  2022-11-10 14:54       ` Jason Gunthorpe
  2022-11-11  1:46       ` Bagas Sanjaya
  0 siblings, 2 replies; 99+ messages in thread
From: Jonathan Corbet @ 2022-11-10 14:49 UTC (permalink / raw)
  To: Bagas Sanjaya, Jason Gunthorpe
  Cc: bpf, David Woodhouse, iommu, Joerg Roedel, Kevin Tian, linux-doc,
	linux-kselftest, llvm, Nathan Chancellor, Nick Desaulniers,
	Miguel Ojeda, Robin Murphy, Shuah Khan, Suravee Suthikulpanit,
	Tom Rix, Will Deacon, Alex Williamson, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, Eric Farman, Jason Wang, Jean-Philippe Brucker,
	Joao Martins, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu,
	Keqian Zhu

Bagas Sanjaya <bagasdotme@gmail.com> writes:

> On Mon, Nov 07, 2022 at 08:48:57PM -0400, Jason Gunthorpe wrote:
>> From: Kevin Tian <kevin.tian@intel.com>
>> 
>> Add iommufd into the documentation tree, and supply initial documentation.
>> Much of this is linked from code comments by kdoc.
>> 
>
> The patch also exposes htmldocs warnings as Stephen Rothwell has
> reported on linux-next [1] due to the copyright comments mistaken for
> kernel-doc comments, so I have applied the fixup:
>
> ---- >8 ----
>
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 536a34d099968d..76b3761a89423e 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> +/*
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES

Um ... this makes no sense at all.  If kernel-doc thought that was a
kernel-doc comment, the problem is there, not here.

<looks>

So the report you're referring to is

  https://lore.kernel.org/linux-next/20221110182938.40ce2651@canb.auug.org.au/

?  If so, this change will not fix the problem.  That error:

> drivers/iommu/iommufd/device.c:1: warning: no structured comments found
> drivers/iommu/iommufd/main.c:1: warning: no structured comments found

is caused by using .. kernel-doc:: directives to extract documentation
from files where none exists - thus "no structured comments found".
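
(As background, a "structured comment" here means a kernel-doc block
shaped roughly like the sketch below; the function is made up purely for
illustration and is not from the series:)

/**
 * iommufd_doc_example() - One-line summary that kernel-doc extracts
 * @arg: Description of the parameter
 *
 * Only comments opened with the double-star marker above are treated as
 * structured documentation by scripts/kernel-doc; single-star copyright
 * banners are skipped, so pointing a kernel-doc directive at a file with
 * no double-star comments yields "no structured comments found".
 */
int iommufd_doc_example(int arg)
{
        return arg;
}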

The *real* problem, methinks, is that the directives are added in patch 4
of the series, but the documentation doesn't show up until later.  So
the real fix would be to simply move this patch down.  Or just not worry
about it, since it all works out in the end and nobody will be bisecting
a docs build.

Bagas, you are *again* misadvising people.  Please stop doing that!

Thanks,

jon

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 14:49     ` Jonathan Corbet
@ 2022-11-10 14:54       ` Jason Gunthorpe
  2022-11-10 15:10         ` Jonathan Corbet
  2022-11-11  1:46       ` Bagas Sanjaya
  1 sibling, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-10 14:54 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Nov 10, 2022 at 07:49:14AM -0700, Jonathan Corbet wrote:

> The *real* problem, methinks, is that the directives are added in patch 4
> of the series, but the documentation doesn't show up until later.  So
> the real fix would be to simply move this patch down.  Or just not worry
> about it, since it all works out in the end and nobody will be bisecting
> a docs build.

That is half the problem, the other is this:

https://lore.kernel.org/r/0-v1-c80e152ce63b+12-kdoc_export_ns_jgg@nvidia.com

Since even after the whole series the EXPORT_NS functions don't parse
properly. I'm going to put this patch before the doc patch and ignore
the bisection problem.
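
(For reference, the construct in question is the namespaced export macro;
a minimal sketch of the pattern the script has to recognize, with a
made-up function name and the namespace used purely as an example:)

#include <linux/export.h>

/**
 * iommufd_ns_example() - Example of a function behind a namespaced export
 * @val: value to pass through
 *
 * Functions exported with the _NS variants are not matched by the plain
 * EXPORT_SYMBOL handling in scripts/kernel-doc; as I understand it, that
 * is what the patch linked above addresses.
 */
int iommufd_ns_example(int val)
{
        return val;
}
EXPORT_SYMBOL_NS_GPL(iommufd_ns_example, IOMMUFD);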

I'd like someone to say they are happy with the perl :)

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 14:54       ` Jason Gunthorpe
@ 2022-11-10 15:10         ` Jonathan Corbet
  2022-11-10 15:23           ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Jonathan Corbet @ 2022-11-10 15:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Thu, Nov 10, 2022 at 07:49:14AM -0700, Jonathan Corbet wrote:
>
>> The *real* problem, methinks, is that the directives are added in patch 4
>> of the series, but the documentation doesn't show up until later.  So
>> the real fix would be to simply move this patch down.  Or just not worry
>> about it, since it all works out in the end and nobody will be bisecting
>> a docs build.
>
> That is half the problem, the other is this:
>
> https://lore.kernel.org/r/0-v1-c80e152ce63b+12-kdoc_export_ns_jgg@nvidia.com
>
> Since even after the whole series the EXPORT_NS functions don't parse
> properly. I'm going to put this patch before the doc patch and ignore
> the bisection problem.
>
> I'd like someone to say they are happy with the perl :)

I'm not happy with *any* perl! :)

I've been sitting on that patch because I was under the impression
another version was coming - was that wrong?

Thanks,

jon

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 15:10         ` Jonathan Corbet
@ 2022-11-10 15:23           ` Jason Gunthorpe
  2022-11-10 15:28             ` Jonathan Corbet
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-10 15:23 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Nov 10, 2022 at 08:10:19AM -0700, Jonathan Corbet wrote:
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > On Thu, Nov 10, 2022 at 07:49:14AM -0700, Jonathan Corbet wrote:
> >
> >> The *real* problem, methinks, is that the directives are added in patch 4
> >> of the series, but the documentation doesn't show up until later.  So
> >> the real fix would be to simply move this patch down.  Or just not worry
> >> about it, since it all works out in the end and nobody will be bisecting
> >> a docs build.
> >
> > That is half the problem, the other is this:
> >
> > https://lore.kernel.org/r/0-v1-c80e152ce63b+12-kdoc_export_ns_jgg@nvidia.com
> >
> > Since even after the whole series the EXPORT_NS functions don't parse
> > properly. I'm going to put this patch before the doc patch and ignore
> > the bisection problem.
> >
> > I'd like someone to say they are happy with the perl :)
> 
> I'm not happy with *any* perl! :)
> 
> I've been sitting on that patch because I was under the impression
> another version was coming - was that wrong?

I can resend it with the single regex if that is the preference - it
is not quite as exacting as the first version. I just have to test it,
is all.

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 15:23           ` Jason Gunthorpe
@ 2022-11-10 15:28             ` Jonathan Corbet
  2022-11-10 15:29               ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Jonathan Corbet @ 2022-11-10 15:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Jason Gunthorpe <jgg@nvidia.com> writes:

> On Thu, Nov 10, 2022 at 08:10:19AM -0700, Jonathan Corbet wrote:
>> Jason Gunthorpe <jgg@nvidia.com> writes:
>> 
>> > On Thu, Nov 10, 2022 at 07:49:14AM -0700, Jonathan Corbet wrote:
>> >
>> >> The *real* problem, methinks, is that the directives are added in patch 4
>> >> of the series, but the documentation doesn't show up until later.  So
>> >> the real fix would be to simply move this patch down.  Or just not worry
>> >> about it, since it all works out in the end and nobody will be bisecting
>> >> a docs build.
>> >
>> > That is half the problem, the other is this:
>> >
>> > https://lore.kernel.org/r/0-v1-c80e152ce63b+12-kdoc_export_ns_jgg@nvidia.com
>> >
>> > Since even after the whole series the EXPORT_NS functions don't parse
>> > properly. I'm going to put this patch before the doc patch and ignore
>> > the bisection problem.
>> >
>> > I'd like someone to say they are happy with the perl :)
>> 
>> I'm not happy with *any* perl! :)
>> 
>> I've been sitting on that patch because I was under the impression
>> another version was coming - was that wrong?
>
> I can resend it with the single regex if that is the preference - it
> is not quite as exacting as the first version. I have to test it is
> all.

Single is nicer but it's not worth a great deal of angst; nothing we do
is going to turn kernel-doc into a thing of beauty :)

Thanks,

jon

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 15:28             ` Jonathan Corbet
@ 2022-11-10 15:29               ` Jason Gunthorpe
  2022-11-10 15:52                 ` Jonathan Corbet
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-10 15:29 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Nov 10, 2022 at 08:28:44AM -0700, Jonathan Corbet wrote:
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > On Thu, Nov 10, 2022 at 08:10:19AM -0700, Jonathan Corbet wrote:
> >> Jason Gunthorpe <jgg@nvidia.com> writes:
> >> 
> >> > On Thu, Nov 10, 2022 at 07:49:14AM -0700, Jonathan Corbet wrote:
> >> >
> >> >> The *real* problem, methinks, is that the directives are added in patch 4
> >> >> of the series, but the documentation doesn't show up until later.  So
> >> >> the real fix would be to simply move this patch down.  Or just not worry
> >> >> about it, since it all works out in the end and nobody will be bisecting
> >> >> a docs build.
> >> >
> >> > That is half the problem, the other is this:
> >> >
> >> > https://lore.kernel.org/r/0-v1-c80e152ce63b+12-kdoc_export_ns_jgg@nvidia.com
> >> >
> >> > Since even after the whole series the EXPORT_NS functions don't parse
> >> > properly. I'm going to put this patch before the doc patch and ignore
> >> > the bisection problem.
> >> >
> >> > I'd like someone to say they are happy with the perl :)
> >> 
> >> I'm not happy with *any* perl! :)
> >> 
> >> I've been sitting on that patch because I was under the impression
> >> another version was coming - was that wrong?
> >
> > I can resend it with the single regex if that is the preference - it
> > is not quite as exacting as the first version. I have to test it is
> > all.
> 
> Single is nicer but it's not worth a great deal of angst; nothing we do
> is going to turn kernel-doc into a thing of beauty :)

I will leave it be then because it is a bit tricky to tell if the new
regex breaks anything, and the first three attempts to create it
didn't work at all...

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 15:29               ` Jason Gunthorpe
@ 2022-11-10 15:52                 ` Jonathan Corbet
  2022-11-10 16:54                   ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Jonathan Corbet @ 2022-11-10 15:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Jason Gunthorpe <jgg@nvidia.com> writes:

>> Single is nicer but it's not worth a great deal of angst; nothing we do
>> is going to turn kernel-doc into a thing of beauty :)
>
> I will leave it be then because it is a bit tricky to tell if the new
> regex breaks anything, and the first three attempts to create it
> didn't work at all...

That's fine.  If you want to keep it as part of your series feel free to
add:

Acked-by: Jonathan Corbet <corbet@lwn.net>

Otherwise I can carry it through the docs tree.

Thanks,

jon

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 15:52                 ` Jonathan Corbet
@ 2022-11-10 16:54                   ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-10 16:54 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Bagas Sanjaya, bpf, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Thu, Nov 10, 2022 at 08:52:16AM -0700, Jonathan Corbet wrote:
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> >> Single is nicer but it's not worth a great deal of angst; nothing we do
> >> is going to turn kernel-doc into a thing of beauty :)
> >
> > I will leave it be then because it is a bit tricky to tell if the new
> > regex breaks anything, and the first three attempts to create it
> > didn't work at all...
> 
> That's fine.  If you want to keep it as part of your series feel free to
> add:
> 
> Acked-by: Jonathan Corbet <corbet@lwn.net>

Thanks, I'll keep it together since nothing else needs this right now

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-10 14:49     ` Jonathan Corbet
  2022-11-10 14:54       ` Jason Gunthorpe
@ 2022-11-11  1:46       ` Bagas Sanjaya
  1 sibling, 0 replies; 99+ messages in thread
From: Bagas Sanjaya @ 2022-11-11  1:46 UTC (permalink / raw)
  To: Jonathan Corbet, Jason Gunthorpe
  Cc: bpf, David Woodhouse, iommu, Joerg Roedel, Kevin Tian, linux-doc,
	linux-kselftest, llvm, Nathan Chancellor, Nick Desaulniers,
	Miguel Ojeda, Robin Murphy, Shuah Khan, Suravee Suthikulpanit,
	Tom Rix, Will Deacon, Alex Williamson, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, Eric Farman, Jason Wang, Jean-Philippe Brucker,
	Joao Martins, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu,
	Keqian Zhu

On 11/10/22 21:49, Jonathan Corbet wrote:
> So the report you're referring to is
> 
>   https://lore.kernel.org/linux-next/20221110182938.40ce2651@canb.auug.org.au/
> 

Ah, I forgot to refer to that link!

> ?  If so, this change will not fix the problem.  That error:
> 
>> drivers/iommu/iommufd/device.c:1: warning: no structured comments found
>> drivers/iommu/iommufd/main.c:1: warning: no structured comments found
> 
> is caused by using .. kernel-doc:: directives to extract documentation
> from files where none exists - thus "no structured comments found".
> 

-ENOENT files :)

> The *real* problem, methinks, is that the directives are added in patch 4
> of the series, but the documentation doesn't show up until later.  So
> the real fix would be to simply move this patch down.  Or just not worry
> about it, since it all works out in the end and nobody will be bisecting
> a docs build.
> 
> Bagas, you are *again* misadvising people.  Please stop doing that!
> 

OK, thanks.

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces
  2022-11-08  0:48 ` [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces Jason Gunthorpe
@ 2022-11-11  5:37   ` Tian, Kevin
  2022-11-14 16:44     ` Jason Gunthorpe
  2022-11-14 13:33   ` Eric Auger
  1 sibling, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-11  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> +static int __iommu_take_dma_ownership(struct iommu_group *group, void
> *owner)
> +{
> +	int ret;
> +
> +	if (WARN_ON(!owner))
> +		return -EINVAL;

move this to iommu_device_claim_dma_owner(), just like how it's checked
in the group version.
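
a standalone sketch of what that relocation could look like (toy types
and names, not the kernel signatures), mirroring how the group path
validates the cookie at the public entry point:

#include <errno.h>
#include <stddef.h>

struct toy_group { int unused; };

/* internal helper: callers guarantee owner != NULL */
static int toy_take_dma_ownership(struct toy_group *group, void *owner)
{
        (void)group;
        (void)owner;
        return 0;
}

/* public entry point: validate the cookie here, as the group API does */
int toy_claim_dma_owner(struct toy_group *group, void *owner)
{
        if (owner == NULL)      /* the WARN_ON(!owner) check moves up here */
                return -EINVAL;
        return toy_take_dma_ownership(group, owner);
}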

> +
> +	if ((group->domain && group->domain != group->default_domain) ||
> +	    !xa_empty(&group->pasid_array))
> +		return -EBUSY;

the check of pasid_array is a new addition in this version. it's probably
worth a comment here.

otherwise,

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 4/17] iommufd: Document overview of iommufd
  2022-11-08 17:10   ` [PATCH v4 4/17] " Jason Gunthorpe
@ 2022-11-11  5:59     ` Tian, Kevin
  2022-11-14 15:14       ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-11  5:59 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, November 9, 2022 1:10 AM
> 
> +/**
> + * iommufd_device_unbind - Undo iommufd_device_bind()
> + * @idev: Device returned by iommufd_device_bind()
> + *
> + * Release the device from iommufd control. The DMA ownership will
> return back
> + * to unowned with blocked DMA. This invalidates the iommufd_device

unowned but not blocked DMA. iommu_device_release_dma_owner()
will decide what the state will be then, e.g. attached back to the default
domain in most cases.

> +/**
> + * iommufd_device_detach - Disconnect a device to an iommu_domain
> + * @idev: device to detach
> + *
> + * Undoes iommufd_device_attach(). This disconnects the idev from the

'Undoes' -> 'Undo'

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 05/17] iommufd: File descriptor, context, kconfig and makefiles
  2022-11-08  0:48 ` [PATCH v4 05/17] iommufd: File descriptor, context, kconfig and makefiles Jason Gunthorpe
@ 2022-11-11  6:07   ` Tian, Kevin
  0 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2022-11-11  6:07 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> This is the basic infrastructure of a new miscdevice to hold the iommufd
> IOCTL API.
> 
> It provides:
>  - A miscdevice to create file descriptors to run the IOCTL interface over
> 
>  - A table based ioctl dispatch and centralized extendable pre-validation
>    step
> 
>  - An xarray mapping userspace ID's to kernel objects. The design has
>    multiple inter-related objects held within in a single IOMMUFD fd
> 
>  - A simple usage count to build a graph of object relations and protect
>    against hostile userspace racing ioctls
> 
> The only IOCTL provided in this patch is the generic 'destroy any object
> by handle' operation.
> 
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 07/17] iommufd: PFN handling for iopt_pages
  2022-11-08  0:49 ` [PATCH v4 07/17] iommufd: PFN handling for iopt_pages Jason Gunthorpe
@ 2022-11-11  9:56   ` Tian, Kevin
  2022-11-14 17:20     ` Jason Gunthorpe
  2022-11-11 11:09   ` Tian, Kevin
  1 sibling, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-11  9:56 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> +
> +struct interval_tree_double_span_iter {
> +	struct rb_root_cached *itrees[2];
> +	struct interval_tree_span_iter spans[2];
> +	union {
> +		unsigned long start_hole;
> +		unsigned long start_used;
> +	};
> +	union {
> +		unsigned long last_hole;
> +		unsigned long last_used;
> +	};
> +	/* 0 = hole, 1 = used span[0], 2 = used span[1], -1 done iteration */
> +	int is_used;
> +};

this lacks a comment on how it is expected to be used, as done for
struct interval_tree_span_iter. e.g. there is no value
representing 'used by both spans', which implies this is used
to find valid ranges on either side. Those points should be spelled out.

> +/*
> + * The IOVA to PFN map. The mapper automatically copies the PFNs into
> multiple

what is the mapper?

> + * Be cautious of overflow, an IOVA can go all the way up to U64_MAX, so
> + * last_iova + 1 can overflow. An iopt_pages index will always be much less
> than
> + * ULONG_MAX< so last_index + 1 cannot overflow.

remove trailing '<' after ULONG_MAX
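
a small standalone demonstration of the overflow note in that quoted
comment (toy userspace code, not from the series):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* an inclusive last_iova of U64_MAX is representable ... */
        uint64_t start_iova = UINT64_MAX - 4095;
        uint64_t last_iova = UINT64_MAX;
        uint64_t length = last_iova - start_iova + 1;

        /* ... but the exclusive form last_iova + 1 wraps back to 0 */
        printf("length = %" PRIu64 "\n", length);            /* 4096 */
        printf("end    = %" PRIu64 "\n", last_iova + 1);     /* 0 */
        return 0;
}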

> +
> +	ret = iommu_unmap(domain, iova, size);
> +	/*
> +	 * It is a logic error in this code or a driver bug if the IOMMU unmaps
> +	 * something other than exactly as requested. This implies that the
> +	 * iommu driver may not fail unmap for reasons beyond bad
> agruments.
> +	 * Particularly, the iommu driver may not do a memory allocation on
> the
> +	 * unmap path.
> +	 */

didn't understand the last sentence.

> +static void batch_skip_carry(struct pfn_batch *batch, unsigned int skip_pfns)

add a comment similar to batch_clear_carry()

> +{
> +	if (!batch->total_pfns)
> +		return;
> +	skip_pfns = min(batch->total_pfns, skip_pfns);
> +	batch->pfns[0] += skip_pfns;
> +	batch->npfns[0] -= skip_pfns;

what if skip_pfns exceeds batch->npfns[0]? looks like this works only
if batch->total_pfns == batch->npfns[0]...
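
a toy reproduction of the scenario being asked about, under the
assumption that the batch could still hold more than one run when the
skip is applied (simplified types, not the kernel code):

#include <stdio.h>

struct toy_batch {
        unsigned long pfns[2];
        unsigned int npfns[2];
        unsigned int total_pfns;
};

static void toy_skip_carry(struct toy_batch *b, unsigned int skip_pfns)
{
        if (!b->total_pfns)
                return;
        if (skip_pfns > b->total_pfns)
                skip_pfns = b->total_pfns;
        b->pfns[0] += skip_pfns;
        b->npfns[0] -= skip_pfns;    /* underflows if skip_pfns > npfns[0] */
}

int main(void)
{
        /* two runs of pfns: 2 starting at 100, 3 starting at 300 */
        struct toy_batch b = {
                .pfns = { 100, 300 },
                .npfns = { 2, 3 },
                .total_pfns = 5,
        };

        toy_skip_carry(&b, 4);  /* skip crosses into the second run */
        printf("npfns[0] = %u\n", b.npfns[0]);  /* wraps to a huge value */
        return 0;
}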

> +/* true if the pfn could be added, false otherwise */
> +static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
> +{
> +	/* FIXME: U16 is too small */

performance or functional impact?

what would be the fix? and why can't it be done now?

more comments are welcome.

> +static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
> +			unsigned int offset, size_t npages)
> +{
> +	unsigned int cur = 0;
> +
> +	while (offset) {
> +		if (batch->npfns[cur] > offset)
> +			break;
> +		offset -= batch->npfns[cur];
> +		cur++;
> +	}

'offset' usually means byte-addressed. 'index' is a better fit in
this context.

<to be continued>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 07/17] iommufd: PFN handling for iopt_pages
  2022-11-08  0:49 ` [PATCH v4 07/17] iommufd: PFN handling for iopt_pages Jason Gunthorpe
  2022-11-11  9:56   ` Tian, Kevin
@ 2022-11-11 11:09   ` Tian, Kevin
  2022-11-14 17:24     ` Jason Gunthorpe
  1 sibling, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-11 11:09 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

<...2nd part>

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> +
> +/* pfn_reader_user is just the pin_user_pages() path */
> +struct pfn_reader_user {
> +	struct page **upages;
> +	size_t upages_len;
> +	unsigned long upages_start;
> +	unsigned long upages_end;
> +	unsigned int gup_flags;
> +	int locked;

document what locked==-1/0/1 means

> +/* This is the "modern"  and faster accounting method used by io_uring */

double spaces before 'and'

> +/* Process a single span in the access_itree */

in all storages.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 00/17] IOMMUFD Generic interface
  2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
                   ` (17 preceding siblings ...)
  2022-11-08  1:09 ` S390 testing for IOMMUFD Jason Gunthorpe
@ 2022-11-11 15:51 ` Shameerali Kolothum Thodi
  2022-11-12 12:44   ` Yi Liu
  18 siblings, 1 reply; 99+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-11-11 15:51 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle, Yi Liu,
	zhukeqian



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 08 November 2022 00:49
> To: bpf@vger.kernel.org; Jonathan Corbet <corbet@lwn.net>; David
> Woodhouse <dwmw2@infradead.org>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> linux-doc@vger.kernel.org; linux-kselftest@vger.kernel.org;
> llvm@lists.linux.dev; Nathan Chancellor <nathan@kernel.org>; Nick
> Desaulniers <ndesaulniers@google.com>; Miguel Ojeda <ojeda@kernel.org>;
> Robin Murphy <robin.murphy@arm.com>; Shuah Khan <shuah@kernel.org>;
> Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>; Tom Rix
> <trix@redhat.com>; Will Deacon <will@kernel.org>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Lu Baolu
> <baolu.lu@linux.intel.com>; Chaitanya Kulkarni <chaitanyak@nvidia.com>;
> Cornelia Huck <cohuck@redhat.com>; Daniel Jordan
> <daniel.m.jordan@oracle.com>; David Gibson
> <david@gibson.dropbear.id.au>; Eric Auger <eric.auger@redhat.com>; Eric
> Farman <farman@linux.ibm.com>; Jason Wang <jasowang@redhat.com>;
> Jean-Philippe Brucker <jean-philippe@linaro.org>; Joao Martins
> <joao.m.martins@oracle.com>; kvm@vger.kernel.org; Matthew Rosato
> <mjrosato@linux.ibm.com>; Michael S. Tsirkin <mst@redhat.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Niklas Schnelle <schnelle@linux.ibm.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
> Liu <yi.l.liu@intel.com>; zhukeqian <zhukeqian1@huawei.com>
> Subject: [PATCH v4 00/17] IOMMUFD Generic interface
[...]
> 
> - Userspace page tables aka 'nested translation' for ARM and Intel iommu
>   drivers:
>   https://github.com/nicolinc/iommufd/commits/iommufd_nesting

Hi Eric/Yi/Nicolin,

Could you please provide the latest kernel/QEMU branch for the ARM nesting support?
The above link points to Yi's git, but I'm not sure which one is the latest/stable
one to have a play with.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 00/17] IOMMUFD Generic interface
  2022-11-11 15:51 ` [PATCH v4 00/17] IOMMUFD Generic interface Shameerali Kolothum Thodi
@ 2022-11-12 12:44   ` Yi Liu
  2023-01-10 11:35     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 99+ messages in thread
From: Yi Liu @ 2022-11-12 12:44 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Jason Gunthorpe, bpf, Jonathan Corbet,
	David Woodhouse, iommu, Joerg Roedel, Kevin Tian, linux-doc,
	linux-kselftest, llvm, Nathan Chancellor, Nick Desaulniers,
	Miguel Ojeda, Robin Murphy, Shuah Khan, Suravee Suthikulpanit,
	Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle, zhukeqian

Hi Shameer,

On 2022/11/11 23:51, Shameerali Kolothum Thodi wrote:
> 
> 
>> -----Original Message-----
>> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
>> Sent: 08 November 2022 00:49
>> To: bpf@vger.kernel.org; Jonathan Corbet <corbet@lwn.net>; David
>> Woodhouse <dwmw2@infradead.org>; iommu@lists.linux.dev; Joerg Roedel
>> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
>> linux-doc@vger.kernel.org; linux-kselftest@vger.kernel.org;
>> llvm@lists.linux.dev; Nathan Chancellor <nathan@kernel.org>; Nick
>> Desaulniers <ndesaulniers@google.com>; Miguel Ojeda <ojeda@kernel.org>;
>> Robin Murphy <robin.murphy@arm.com>; Shuah Khan <shuah@kernel.org>;
>> Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>; Tom Rix
>> <trix@redhat.com>; Will Deacon <will@kernel.org>
>> Cc: Alex Williamson <alex.williamson@redhat.com>; Lu Baolu
>> <baolu.lu@linux.intel.com>; Chaitanya Kulkarni <chaitanyak@nvidia.com>;
>> Cornelia Huck <cohuck@redhat.com>; Daniel Jordan
>> <daniel.m.jordan@oracle.com>; David Gibson
>> <david@gibson.dropbear.id.au>; Eric Auger <eric.auger@redhat.com>; Eric
>> Farman <farman@linux.ibm.com>; Jason Wang <jasowang@redhat.com>;
>> Jean-Philippe Brucker <jean-philippe@linaro.org>; Joao Martins
>> <joao.m.martins@oracle.com>; kvm@vger.kernel.org; Matthew Rosato
>> <mjrosato@linux.ibm.com>; Michael S. Tsirkin <mst@redhat.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; Niklas Schnelle <schnelle@linux.ibm.com>;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
>> Liu <yi.l.liu@intel.com>; zhukeqian <zhukeqian1@huawei.com>
>> Subject: [PATCH v4 00/17] IOMMUFD Generic interface
> [...]
>>
>> - Userspace page tables aka 'nested translation' for ARM and Intel iommu
>>    drivers:
>>    https://github.com/nicolinc/iommufd/commits/iommufd_nesting
> 
> Hi Eric/Yi/Nicolin,
> 
> Could you please provide a latest Kernel/Qemu branch for the ARM nesting support?
> The above link points to Yi's git, but not sure which one is latest/stable to
> have a play.

Nicolin and I are working on the new version for nesting support. The
kernel branch below is our latest progress so far. As the naming suggests,
it's still WIP. We also need to work out a QEMU version, so we still need
some time before sharing it with you.

https://github.com/yiliu1765/iommufd/tree/wip/iommufd-v6.1-rc3-nesting

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 08/17] iommufd: Algorithms for PFN storage
  2022-11-08  0:49 ` [PATCH v4 08/17] iommufd: Algorithms for PFN storage Jason Gunthorpe
@ 2022-11-14  5:50   ` Tian, Kevin
  2022-11-14 18:02     ` Jason Gunthorpe
  2022-11-14 19:19   ` [PATCH v4 8/17] " Jason Gunthorpe
  1 sibling, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-14  5:50 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> @@ -171,7 +183,7 @@ static struct iopt_area
> *iopt_pages_find_domain_area(struct iopt_pages *pages,
>   */
>  struct pfn_batch {
>  	unsigned long *pfns;
> -	u16 *npfns;
> +	u32 *npfns;

why not make it u32 and remove the later FIXME directly in patch 7?

>  static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
>  {
> -	/* FIXME: U16 is too small */
> +	const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));

use lowercase i.e. max_npfns.

> +static void __iopt_area_unfill_domain(struct iopt_area *area,
> +				      struct iopt_pages *pages,
> +				      struct iommu_domain *domain,
> +				      unsigned long last_index)
> +{
> +	struct interval_tree_double_span_iter span;
> +	unsigned long start_index = iopt_area_index(area);
> +	unsigned long unmapped_end_index = start_index;
> +	u64 backup[BATCH_BACKUP_SIZE];
> +	struct pfn_batch batch;
> +
> +	lockdep_assert_held(&pages->mutex);
> +
> +	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
> +	interval_tree_for_each_double_span(&span, &pages-
> >domains_itree,
> +					   &pages->access_itree, start_index,
> +					   last_index) {
> +		if (span.is_used) {
> +			batch_skip_carry(&batch,
> +					 span.last_used - span.start_used + 1);
> +			continue;
> +		}
> +		iopt_area_unpin_domain(&batch, area, pages, domain,
> +				       span.start_hole, span.last_hole,
> +				       &unmapped_end_index, last_index);
> +	}
> +	if (unmapped_end_index != last_index + 1)
> +		iopt_area_unmap_domain_range(area, domain,
> unmapped_end_index,
> +					     last_index);

Add a comment marking that this is for the last trailing used span, whose
pages are not contiguous with the previous span.

Btw, it is not easy to understand how this function plus unpin_domain()
actually work. More comments are welcome to help readability.

> +/*
> + * This can do everything and is fully coherent with what a iommu_domain
> would
> + * see.
> + */
> +static int iopt_pages_rw_slow(struct iopt_pages *pages,

Can you elaborate on what guarantees coherency in this function and how it
differs from the other rw variations?

> +/**
> + * iopt_pages_remove_access() - Release an in-kernel access for PFNs
> + * @area: The source of PFNs
> + * @start_index: First page index
> + * @last_index: Inclusive last page index
> + *
> + * Undo iopt_pages_add_access() and unpin the pages if necessary. The
> caller
> + * must stop using the PFNs before calling this.
> + */
> +void iopt_pages_remove_access(struct iopt_area *area, unsigned long
> start_index,
> +			      unsigned long last_index)

this is called iopt_pages_xxx() but the first parameter is iopt_area.

also it's not balanced with iopt_pages_add_access() which requires the
caller to hold pages->mutex and populate area->num_accesses.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-08  0:49 ` [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping Jason Gunthorpe
@ 2022-11-14  7:28   ` Tian, Kevin
  2022-11-14 18:43     ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-14  7:28 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> +
> +/*
> + * Automatically find a block of IOVA that is not being used and not reserved.
> + * Does not return a 0 IOVA even if it is valid.

what is the problem with 0? should this be documented in uAPI?

> +	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
> +				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
> +		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
> +			allowed_span.start_used = PAGE_SIZE;
> +			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
> +			allowed_span.is_hole = false;
> +		}

statically initialize it when iopt is created?

> +
> +		if (!__alloc_iova_check_used(&allowed_span, length,
> +					     iova_alignment, page_offset))
> +			continue;
> +
> +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> +					    allowed_span.start_used,
> +					    allowed_span.last_used) {
> +			if (!__alloc_iova_check_hole(&area_span, length,
> +						     iova_alignment,
> +						     page_offset))
> +				continue;
> +
> +			interval_tree_for_each_span(&reserved_span,
> +						    &iopt->reserved_itree,
> +						    area_span.start_used,
> +						    area_span.last_used) {
> +				if (!__alloc_iova_check_hole(
> +					    &reserved_span, length,
> +					    iova_alignment, page_offset))
> +					continue;

this could be simplified by using the double span iterator.

> +static int iopt_check_iova(struct io_pagetable *iopt, unsigned long iova,
> +			   unsigned long length)
> +{
> +	unsigned long last;
> +
> +	lockdep_assert_held(&iopt->iova_rwsem);
> +
> +	if ((iova & (iopt->iova_alignment - 1)))
> +		return -EINVAL;
> +
> +	if (check_add_overflow(iova, length - 1, &last))
> +		return -EOVERFLOW;
> +
> +	/* No reserved IOVA intersects the range */
> +	if (iopt_reserved_iter_first(iopt, iova, last))
> +		return -ENOENT;

vfio type1 returns -EINVAL

> +
> +	/* Check that there is not already a mapping in the range */
> +	if (iopt_area_iter_first(iopt, iova, last))
> +		return -EADDRINUSE;

vfio type1 returns -EEXIST

> +static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long
> start,
> +				 unsigned long end, unsigned long

s/end/last/

> +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> +		    unsigned long length, unsigned long *unmapped)
> +{
> +	unsigned long iova_end;

s/iova_end/iova_last/

> +static int iopt_calculate_iova_alignment(struct io_pagetable *iopt)
> +{
> +	unsigned long new_iova_alignment;
> +	struct iommufd_access *access;
> +	struct iommu_domain *domain;
> +	unsigned long index;
> +
> +	lockdep_assert_held_write(&iopt->iova_rwsem);
> +	lockdep_assert_held(&iopt->domains_rwsem);
> +
> +	if (iopt->disable_large_pages)
> +		new_iova_alignment = PAGE_SIZE;
> +	else
> +		new_iova_alignment = 1;

I didn't understand why we start searching for the alignment from a
smaller value when large pages are enabled. What is the
connection here?

> +	interval_tree_remove(&area->node, &iopt->area_itree);
> +	rc = iopt_insert_area(iopt, lhs, area->pages, start_iova,
> +			      iopt_area_start_byte(area, start_iova),
> +			      (new_start - 1) - start_iova + 1,
> +			      area->iommu_prot);
> +	if (WARN_ON(rc))
> +		goto err_insert;
> +
> +	rc = iopt_insert_area(iopt, rhs, area->pages, new_start,
> +			      iopt_area_start_byte(area, new_start),
> +			      last_iova - new_start + 1, area->iommu_prot);
> +	if (WARN_ON(rc))
> +		goto err_remove_lhs;
> +
> +	lhs->storage_domain = area->storage_domain;
> +	lhs->num_accesses = area->num_accesses;
> +	lhs->pages = area->pages;
> +	rhs->storage_domain = area->storage_domain;
> +	rhs->num_accesses = area->num_accesses;

if an access only spans one side, is it correct for both split sides
to keep the access count?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable
  2022-11-08  0:49 ` [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
  2022-11-08 13:27   ` Bagas Sanjaya
@ 2022-11-14  7:46   ` Tian, Kevin
  1 sibling, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2022-11-14  7:46 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> Connect the IOAS to its IOCTL interface. This exposes most of the
> functionality in the io_pagetable to userspace.
> 
> This is intended to be the core of the generic interface that IOMMUFD will
> provide. Every IOMMU driver should be able to implement an
> iommu_domain
> that is compatible with this generic mechanism.
> 
> It is also designed to be easy to use for simple non virtual machine
> monitor users, like DPDK:
>  - Universal simple support for all IOMMUs (no PPC special path)
>  - An IOVA allocator that considers the aperture and the allowed/reserved
>    ranges
>  - io_pagetable allows any number of iommu_domains to be connected to
> the
>    IOAS
>  - Automatic allocation and re-use of iommu_domains
> 
> Along with room in the design to add non-generic features to cater to
> specific HW functionality.
> 
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

with kernel-doc warning fixed:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices
  2022-11-08  0:49 ` [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices Jason Gunthorpe
  2022-11-08 14:34   ` Yi Liu
@ 2022-11-14  7:59   ` Tian, Kevin
  1 sibling, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2022-11-14  7:59 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> Add the four functions external drivers need to connect physical DMA to
> the IOMMUFD:
> 
> iommufd_device_bind() / iommufd_device_unbind()
>   Register the device with iommufd and establish security isolation.
> 
> iommufd_device_attach() / iommufd_device_detach()
>   Connect a bound device to a page table
> 
> Binding a device creates a device object ID in the uAPI, however the
> generic API provides no IOCTLs to manipulate them.
> 
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access
  2022-11-08  0:49 ` [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access Jason Gunthorpe
@ 2022-11-14  8:25   ` Tian, Kevin
  2022-11-14 19:05     ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-14  8:25 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 8, 2022 8:49 AM
> 
> Callers must provide a callback that immediately unpins any IOVA being
> used within a range. This happens if userspace unmaps the IOVA under the
> pin.

Given the 'must', let's add a check on the unmap callback in
iommufd_access_create().

> +struct iommufd_access *
> +iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
> +		      const struct iommufd_access_ops *ops, void *data)
> +{
> +	struct iommufd_access *access;
> +	struct iommufd_object *obj;
> +	int rc;
> +
> +	/*
> +	 * There is no uAPI for the access object, but to keep things
> symmetric
> +	 * use the object infrastructure anyhow.
> +	 */
> +	access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
> +	if (IS_ERR(access))
> +		return access;
> +
> +	access->data = data;
> +	access->ops = ops;
> +	access->ictx = ictx;

I didn't see other places referencing access->ictx before this func
succeeds. But just to be robust it's always cleaner to put this line
together with iommufd_ctx_get().
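
i.e. something like this (just a sketch of the suggested ordering):

	/* take the reference right where the pointer is stored */
	access->ictx = ictx;
	iommufd_ctx_get(ictx);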

The rest looks good to me:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces
  2022-11-08  0:48 ` [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces Jason Gunthorpe
  2022-11-11  5:37   ` Tian, Kevin
@ 2022-11-14 13:33   ` Eric Auger
  2022-11-14 16:58     ` Jason Gunthorpe
  1 sibling, 1 reply; 99+ messages in thread
From: Eric Auger @ 2022-11-14 13:33 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Hi,
On 11/8/22 01:48, Jason Gunthorpe wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
>
> These complement the group interfaces used by VFIO and are for use by
> iommufd. The main difference is that multiple devices in the same group
> can all share the ownership by passing the same ownership pointer.
>
> Move the common code into shared functions.
>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/iommu.c | 124 +++++++++++++++++++++++++++++++++---------
>  include/linux/iommu.h |  12 ++++
>  2 files changed, 110 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 6ca377f4fbf9e9..4cb14e44e40f83 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3108,41 +3108,52 @@ static int __iommu_group_alloc_blocking_domain(struct iommu_group *group)
>  	return 0;
>  }
>  
> +static int __iommu_take_dma_ownership(struct iommu_group *group, void *owner)
> +{
> +	int ret;
> +
> +	if (WARN_ON(!owner))
> +		return -EINVAL;
> +
> +	if ((group->domain && group->domain != group->default_domain) ||
> +	    !xa_empty(&group->pasid_array))
> +		return -EBUSY;
> +
> +	ret = __iommu_group_alloc_blocking_domain(group);
> +	if (ret)
> +		return ret;
> +	ret = __iommu_group_set_domain(group, group->blocking_domain);
> +	if (ret)
> +		return ret;
> +
> +	group->owner = owner;
> +	group->owner_cnt++;
> +	return 0;
> +}
> +
>  /**
>   * iommu_group_claim_dma_owner() - Set DMA ownership of a group
>   * @group: The group.
>   * @owner: Caller specified pointer. Used for exclusive ownership.
>   *
> - * This is to support backward compatibility for vfio which manages
> - * the dma ownership in iommu_group level. New invocations on this
> - * interface should be prohibited.
> + * This is to support backward compatibility for vfio which manages the dma
> + * ownership in iommu_group level. New invocations on this interface should be
> + * prohibited. Only a single owner may exist for a group.
>   */
>  int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner)
>  {
>  	int ret = 0;
>  
> +	if (WARN_ON(!owner))
> +		return -EINVAL;
> +
>  	mutex_lock(&group->mutex);
>  	if (group->owner_cnt) {
>  		ret = -EPERM;
>  		goto unlock_out;
> -	} else {
> -		if ((group->domain && group->domain != group->default_domain) ||
> -		    !xa_empty(&group->pasid_array)) {
> -			ret = -EBUSY;
> -			goto unlock_out;
> -		}
> -
> -		ret = __iommu_group_alloc_blocking_domain(group);
> -		if (ret)
> -			goto unlock_out;
> -
> -		ret = __iommu_group_set_domain(group, group->blocking_domain);
> -		if (ret)
> -			goto unlock_out;
> -		group->owner = owner;
>  	}
>  
> -	group->owner_cnt++;
> +	ret = __iommu_take_dma_ownership(group, owner);
>  unlock_out:
>  	mutex_unlock(&group->mutex);
>  
> @@ -3151,30 +3162,91 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner)
>  EXPORT_SYMBOL_GPL(iommu_group_claim_dma_owner);
>  
>  /**
> - * iommu_group_release_dma_owner() - Release DMA ownership of a group
> - * @group: The group.
> + * iommu_device_claim_dma_owner() - Set DMA ownership of a device
> + * @dev: The device.
> + * @owner: Caller specified pointer. Used for exclusive ownership.
>   *
> - * Release the DMA ownership claimed by iommu_group_claim_dma_owner().
> + * Claim the DMA ownership of a device. Multiple devices in the same group may
> + * concurrently claim ownership if they present the same owner value. Returns 0
> + * on success and error code on failure
>   */
> -void iommu_group_release_dma_owner(struct iommu_group *group)
> +int iommu_device_claim_dma_owner(struct device *dev, void *owner)
>  {
> -	int ret;
> +	struct iommu_group *group = iommu_group_get(dev);
> +	int ret = 0;
> +
> +	if (!group)
> +		return -ENODEV;
>  
>  	mutex_lock(&group->mutex);
> +	if (group->owner_cnt) {
> +		if (group->owner != owner) {
> +			ret = -EPERM;
> +			goto unlock_out;
> +		}
> +		group->owner_cnt++;
> +		goto unlock_out;
> +	}
> +
> +	ret = __iommu_take_dma_ownership(group, owner);
> +unlock_out:
> +	mutex_unlock(&group->mutex);
> +	iommu_group_put(group);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_device_claim_dma_owner);
> +
> +static void __iommu_release_dma_ownership(struct iommu_group *group)
> +{
> +	int ret;
> +
>  	if (WARN_ON(!group->owner_cnt || !group->owner ||
>  		    !xa_empty(&group->pasid_array)))
> -		goto unlock_out;
> +		return;
>  
>  	group->owner_cnt = 0;
>  	group->owner = NULL;
>  	ret = __iommu_group_set_domain(group, group->default_domain);
>  	WARN(ret, "iommu driver failed to attach the default domain");
> +}
>  
> -unlock_out:
> +/**
> + * iommu_group_release_dma_owner() - Release DMA ownership of a group
> + * @group: The group.
> + *
> + * Release the DMA ownership claimed by iommu_group_claim_dma_owner().
> + */
> +void iommu_group_release_dma_owner(struct iommu_group *group)
> +{
> +	mutex_lock(&group->mutex);
> +	__iommu_release_dma_ownership(group);
>  	mutex_unlock(&group->mutex);
>  }
>  EXPORT_SYMBOL_GPL(iommu_group_release_dma_owner);
>  
> +/**
> + * iommu_device_release_dma_owner() - Release DMA ownership of a device
> + * @group: The device.
@dev: the device
> + *
> + * Release the DMA ownership claimed by iommu_device_claim_dma_owner().
> + */
> +void iommu_device_release_dma_owner(struct device *dev)
> +{
> +	struct iommu_group *group = iommu_group_get(dev);
> +
> +	mutex_lock(&group->mutex);
> +	if (group->owner_cnt > 1) {
> +		group->owner_cnt--;
> +		goto unlock_out;
> +	}
> +	__iommu_release_dma_ownership(group);
> +unlock_out:
> +	mutex_unlock(&group->mutex);

if (group->owner_cnt > 1)

	group->owner_cnt--;
else
	__iommu_release_dma_ownership(group);

mutex_unlock(&group->mutex);

iommu_group_put(group);

> +	iommu_group_put(group);
> +}
> +EXPORT_SYMBOL_GPL(iommu_device_release_dma_owner);
> +
>  /**
>   * iommu_group_dma_owner_claimed() - Query group dma ownership status
>   * @group: The group.
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index a09fd32d8cc273..1690c334e51631 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -707,6 +707,9 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner);
>  void iommu_group_release_dma_owner(struct iommu_group *group);
>  bool iommu_group_dma_owner_claimed(struct iommu_group *group);
>  
> +int iommu_device_claim_dma_owner(struct device *dev, void *owner);
> +void iommu_device_release_dma_owner(struct device *dev);
> +
>  struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>  					    struct mm_struct *mm);
>  int iommu_attach_device_pasid(struct iommu_domain *domain,
> @@ -1064,6 +1067,15 @@ static inline bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>  	return false;
>  }
>  
> +static inline void iommu_device_release_dma_owner(struct device *dev)
> +{
> +}
> +
> +static inline int iommu_device_claim_dma_owner(struct device *dev, void *owner)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline struct iommu_domain *
>  iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm)
>  {
Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 4/17] iommufd: Document overview of iommufd
  2022-11-11  5:59     ` Tian, Kevin
@ 2022-11-14 15:14       ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 15:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Fri, Nov 11, 2022 at 05:59:02AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, November 9, 2022 1:10 AM
> > 
> > +/**
> > + * iommufd_device_unbind - Undo iommufd_device_bind()
> > + * @idev: Device returned by iommufd_device_bind()
> > + *
> > + * Release the device from iommufd control. The DMA ownership will
> > return back
> > + * to unowned with blocked DMA. This invalidates the iommufd_device
> 
> unowned but not blocked DMA. iommu_device_release_dma_owner()
> will decide what the state will be then, e.g. attached back to the default
> domain in most cases.

Woops

 * Release the device from iommufd control. The DMA ownership will return back
 * to unowned with DMA controlled by the DMA API. This invalidates the
 * iommufd_device pointer, other APIs that consume it must not be called
 * concurrently.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces
  2022-11-11  5:37   ` Tian, Kevin
@ 2022-11-14 16:44     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 16:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Fri, Nov 11, 2022 at 05:37:36AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> > +static int __iommu_take_dma_ownership(struct iommu_group *group, void
> > *owner)
> > +{
> > +	int ret;
> > +
> > +	if (WARN_ON(!owner))
> > +		return -EINVAL;
> 
> move to iommu_device_claim_dma_owner(). just like how it's checked
> in the group version.

Ok, like this:

@@ -3112,9 +3112,6 @@ static int __iommu_take_dma_ownership(struct iommu_group *group, void *owner)
 {
 	int ret;
 
-	if (WARN_ON(!owner))
-		return -EINVAL;
-
 	if ((group->domain && group->domain != group->default_domain) ||
 	    !xa_empty(&group->pasid_array))
 		return -EBUSY;
@@ -3177,6 +3174,8 @@ int iommu_device_claim_dma_owner(struct device *dev, void *owner)
 
 	if (!group)
 		return -ENODEV;
+	if (WARN_ON(!owner))
+		return -EINVAL;
 
 	mutex_lock(&group->mutex);
 	if (group->owner_cnt) {

> > +	if ((group->domain && group->domain != group->default_domain) ||
> > +	    !xa_empty(&group->pasid_array))
> > +		return -EBUSY;
> 
> the check of pasid_array is a new addition in this version. it's probably
> worth a comment here.

It is just the merge resolution with the SVA series; the entire if is
being copied from someplace else.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces
  2022-11-14 13:33   ` Eric Auger
@ 2022-11-14 16:58     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 16:58 UTC (permalink / raw)
  To: Eric Auger
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Farman, Jason Wang, Jean-Philippe Brucker,
	Joao Martins, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu,
	Keqian Zhu

On Mon, Nov 14, 2022 at 02:33:41PM +0100, Eric Auger wrote:
> > +/**
> > + * iommu_device_release_dma_owner() - Release DMA ownership of a device
> > + * @group: The device.
> @dev: the device
> > + *
> > + * Release the DMA ownership claimed by iommu_device_claim_dma_owner().
> > + */
> > +void iommu_device_release_dma_owner(struct device *dev)
> > +{
> > +	struct iommu_group *group = iommu_group_get(dev);
> > +
> > +	mutex_lock(&group->mutex);
> > +	if (group->owner_cnt > 1) {
> > +		group->owner_cnt--;
> > +		goto unlock_out;
> > +	}
> > +	__iommu_release_dma_ownership(group);
> > +unlock_out:
> > +	mutex_unlock(&group->mutex);
> 
> if (group->owner_cnt > 1)
> 
> 	group->owner_cnt--;
> else
> 	__iommu_release_dma_ownership(group);
> 
> mutex_unlock(&group->mutex);
> 
> iommu_group_put(group);

Sure, thanks

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 07/17] iommufd: PFN handling for iopt_pages
  2022-11-11  9:56   ` Tian, Kevin
@ 2022-11-14 17:20     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 17:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Fri, Nov 11, 2022 at 09:56:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> > 
> > +
> > +struct interval_tree_double_span_iter {
> > +	struct rb_root_cached *itrees[2];
> > +	struct interval_tree_span_iter spans[2];
> > +	union {
> > +		unsigned long start_hole;
> > +		unsigned long start_used;
> > +	};
> > +	union {
> > +		unsigned long last_hole;
> > +		unsigned long last_used;
> > +	};
> > +	/* 0 = hole, 1 = used span[0], 2 = used span[1], -1 done iteration */
> > +	int is_used;
> > +};
> 
> This lacks a comment on how it is expected to be used, as done for
> struct interval_tree_span_iter. E.g. there is no value
> representing 'used by both spans', which implies this is used
> to find a valid range on either side. Those points should be spelled out.

/*
 * This is a variation of the general interval_tree_span_iter that computes the
 * spans over the union of two different interval trees. Used ranges are broken
 * up and reported based on the tree that provides the interval. The first span
 * always takes priority. Like interval_tree_span_iter it is greedy and the same
 * value of is_used will not repeat on two iteration cycles.
 */
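
For illustration, a usage sketch in the style of the existing callers in
this series (itree_a/itree_b are placeholder names, not real fields):

	struct interval_tree_double_span_iter span;

	interval_tree_for_each_double_span(&span, &itree_a, &itree_b,
					   start_index, last_index) {
		if (span.is_used)
			/* [start_used, last_used] covered by tree is_used (1 or 2) */;
		else
			/* [start_hole, last_hole] is in neither tree */;
	}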

> 
> > +/*
> > + * The IOVA to PFN map. The mapper automatically copies the PFNs into
> > multiple
> 
> what is the mapper?

Let's just say "The map automatically"

> > +	ret = iommu_unmap(domain, iova, size);
> > +	/*
> > +	 * It is a logic error in this code or a driver bug if the IOMMU unmaps
> > +	 * something other than exactly as requested. This implies that the
> > +	 * iommu driver may not fail unmap for reasons beyond bad
> > agruments.
> > +	 * Particularly, the iommu driver may not do a memory allocation on
> > the
> > +	 * unmap path.
> > +	 */
> 
> didn't understand the last sentence.

Unmap path means its domain_ops->unmap

> > +static void batch_skip_carry(struct pfn_batch *batch, unsigned int skip_pfns)
> 
> add a comment similar to batch_clear_carry()

It seems OK as is; the comment describing what the carry is sits just a few
lines above.

> > +{
> > +	if (!batch->total_pfns)
> > +		return;
> > +	skip_pfns = min(batch->total_pfns, skip_pfns);
> > +	batch->pfns[0] += skip_pfns;
> > +	batch->npfns[0] -= skip_pfns;
> 
> what if skip_pfns exceeds batch->npfns[0]? it looks like this works only
> if batch->total_pfns == batch->npfns[0]...

Right, at this point the batch has only 1 pfn and total_pfns ==
batch->npfns[0]

Let's add an assertion:

@@ -239,6 +239,8 @@ static void batch_skip_carry(struct pfn_batch *batch, unsigned int skip_pfns)
 {
        if (!batch->total_pfns)
                return;
+       if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+               WARN_ON(batch->total_pfns != batch->npfns[0]);
        skip_pfns = min(batch->total_pfns, skip_pfns);

> 
> > +/* true if the pfn could be added, false otherwise */
> > +static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
> > +{
> > +	/* FIXME: U16 is too small */
> 
> performance or functional impact?
> 
> what would be the fix? and why can't it be done now?
> 
> more comments are welcome.

Er, as you noticed this was fixed and the rebase to fix it was
botched. It will be u32
 
> > +static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
> > +			unsigned int offset, size_t npages)
> > +{
> > +	unsigned int cur = 0;
> > +
> > +	while (offset) {
> > +		if (batch->npfns[cur] > offset)
> > +			break;
> > +		offset -= batch->npfns[cur];
> > +		cur++;
> > +	}
> 
> 'offset' usually means byte-addressed. 'index' is a better fit in
> this context.

It is the offset into the logical page array held in the batch.

'first_page_off' would be clearer

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 07/17] iommufd: PFN handling for iopt_pages
  2022-11-11 11:09   ` Tian, Kevin
@ 2022-11-14 17:24     ` Jason Gunthorpe
  2022-11-15  2:59       ` Tian, Kevin
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 17:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Fri, Nov 11, 2022 at 11:09:08AM +0000, Tian, Kevin wrote:
> <...2nd part>
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> > +
> > +/* pfn_reader_user is just the pin_user_pages() path */
> > +struct pfn_reader_user {
> > +	struct page **upages;
> > +	size_t upages_len;
> > +	unsigned long upages_start;
> > +	unsigned long upages_end;
> > +	unsigned int gup_flags;
> > +	int locked;
> 
> document what locked==-1/0/1 means

+       /*
+        * 1 means mmget() and mmap_read_lock(), 0 means only mmget(), -1 is
+        * neither
+        */


> > +/* Process a single span in the access_itree */
> 
> in all storages.

+/* Process a single span to load it from the proper storage */

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 08/17] iommufd: Algorithms for PFN storage
  2022-11-14  5:50   ` Tian, Kevin
@ 2022-11-14 18:02     ` Jason Gunthorpe
  2022-11-15  3:06       ` Tian, Kevin
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 18:02 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Mon, Nov 14, 2022 at 05:50:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> > 
> > @@ -171,7 +183,7 @@ static struct iopt_area
> > *iopt_pages_find_domain_area(struct iopt_pages *pages,
> >   */
> >  struct pfn_batch {
> >  	unsigned long *pfns;
> > -	u16 *npfns;
> > +	u32 *npfns;
> 
> why not make it u32 and remove the later FIXME directly in patch 7?

Rebase error, I fixed it

> >  static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
> >  {
> > -	/* FIXME: U16 is too small */
> > +	const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));
> 
> use lowercase i.e. max_npfns.

It is customary to use caps for constants, e.g. enum values and things.

> > +static void __iopt_area_unfill_domain(struct iopt_area *area,
> > +				      struct iopt_pages *pages,
> > +				      struct iommu_domain *domain,
> > +				      unsigned long last_index)
> > +{
> > +	struct interval_tree_double_span_iter span;
> > +	unsigned long start_index = iopt_area_index(area);
> > +	unsigned long unmapped_end_index = start_index;
> > +	u64 backup[BATCH_BACKUP_SIZE];
> > +	struct pfn_batch batch;
> > +
> > +	lockdep_assert_held(&pages->mutex);
> > +
> > +	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
> > +	interval_tree_for_each_double_span(&span, &pages-
> > >domains_itree,
> > +					   &pages->access_itree, start_index,
> > +					   last_index) {
> > +		if (span.is_used) {
> > +			batch_skip_carry(&batch,
> > +					 span.last_used - span.start_used + 1);
> > +			continue;
> > +		}
> > +		iopt_area_unpin_domain(&batch, area, pages, domain,
> > +				       span.start_hole, span.last_hole,
> > +				       &unmapped_end_index, last_index);
> > +	}
> > +	if (unmapped_end_index != last_index + 1)
> > +		iopt_area_unmap_domain_range(area, domain,
> > unmapped_end_index,
> > +					     last_index);
> 
> Add a comment marking that this is for the last trailing used span, whose
> pages are not contiguous with the previous span.

	/*
	 * If the range ends in a access then we do the residual unmap without
	 * any unpins.
	 */

> Btw, it is not easy to understand how this function plus unpin_domain()
> actually work. More comments are welcome to help readability.

	/*
	 * For security we must not unpin something that is still DMA mapped,
	 * so this must unmap any IOVA before we go ahead and unpin the pages.
	 * This creates a complexity where we need to skip over unpinning pages
	 * held in the xarray, but continue to unmap from the domain.
	 *
	 * The domain unmap cannot stop in the middle of a contiguous range of
	 * PFNs. To solve this problem the unpinning step will read ahead to the
	 * end of any contiguous span, unmap that whole span, and then only
	 * unpin the leading part that does not have any accesses. The residual
	 * PFNs that were unmapped but not unpinned are called a "carry" in the
	 * batch as they are moved to the front of the PFN list and continue on
	 * to the next iteration(s).
	 */

> > +/*
> > + * This can do everything and is fully coherent with what a iommu_domain
> > would
> > + * see.
> > + */
> > +static int iopt_pages_rw_slow(struct iopt_pages *pages,
> 
> Can you elaborate on what guarantees coherency in this function and how it
> differs from the other rw variations?

/*
 * This uses the pfn_reader instead of taking a shortcut by using the mm. It can
 * do every scenario and is fully consistent with what an iommu_domain would
 * see.
 */

> > + * iopt_pages_remove_access() - Release an in-kernel access for PFNs
> > + * @area: The source of PFNs
> > + * @start_index: First page index
> > + * @last_index: Inclusive last page index
> > + *
> > + * Undo iopt_pages_add_access() and unpin the pages if necessary. The
> > caller
> > + * must stop using the PFNs before calling this.
> > + */
> > +void iopt_pages_remove_access(struct iopt_area *area, unsigned long
> > start_index,
> > +			      unsigned long last_index)
> 
> this is called iopt_pages_xxx() but the first parameter is iopt_area.
> 
> also it's not balanced with iopt_pages_add_access() which requires the
> caller to hold pages->mutex and populate area->num_accesses.

OK, see below

Thanks,
Jason

--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -565,7 +565,7 @@ void iommufd_access_unpin_pages(struct iommufd_access *access,
 
 	down_read(&iopt->iova_rwsem);
 	iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
-		iopt_pages_remove_access(
+		iopt_area_remove_access(
 			area, iopt_area_iova_to_index(area, iter.cur_iova),
 			iopt_area_iova_to_index(
 				area,
@@ -650,15 +650,10 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 			goto err_remove;
 		}
 
-		mutex_lock(&area->pages->mutex);
-		rc = iopt_pages_add_access(area->pages, index, last_index,
-					   out_pages, flags);
-		if (rc) {
-			mutex_unlock(&area->pages->mutex);
+		rc = iopt_pages_add_access(area, index, last_index, out_pages,
+					   flags);
+		if (rc)
 			goto err_remove;
-		}
-		area->num_accesses++;
-		mutex_unlock(&area->pages->mutex);
 		out_pages += last_index - index + 1;
 	}
 	if (!iopt_area_contig_done(&iter)) {
@@ -673,7 +668,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 	if (iova < iter.cur_iova) {
 		last_iova = iter.cur_iova - 1;
 		iopt_for_each_contig_area(&iter, area, iopt, iova, last_iova)
-			iopt_pages_remove_access(
+			iopt_area_remove_access(
 				area,
 				iopt_area_iova_to_index(area, iter.cur_iova),
 				iopt_area_iova_to_index(
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 3b85fa344f6be3..68bc3957534dd7 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -221,10 +221,10 @@ int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
 void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
 			      unsigned long last);
 
-int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start,
-			unsigned long last, struct page **out_pages,
-			unsigned int flags);
-void iopt_pages_remove_access(struct iopt_area *area, unsigned long start,
+int iopt_area_add_access(struct iopt_area *area, unsigned long start,
+			 unsigned long last, struct page **out_pages,
+			 unsigned int flags);
+void iopt_area_remove_access(struct iopt_area *area, unsigned long start,
 			    unsigned long last);
 int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
 			 void *data, unsigned long length, unsigned int flags);
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 01d2447eac4ede..e5f267d9e2b491 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1807,8 +1826,8 @@ iopt_pages_get_exact_access(struct iopt_pages *pages, unsigned long index,
 }
 
 /**
- * iopt_pages_add_access() - Record an in-knerel access for PFNs
- * @pages: The source of PFNs
+ * iopt_area_add_access() - Record an in-knerel access for PFNs
+ * @area: The source of PFNs
  * @start_index: First page index
  * @last_index: Inclusive last page index
  * @out_pages: Output list of struct page's representing the PFNs
@@ -1819,40 +1838,49 @@ iopt_pages_get_exact_access(struct iopt_pages *pages, unsigned long index,
  *
  * This should be undone through a matching call to iopt_pages_remove_access()
  */
-int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start_index,
+int iopt_area_add_access(struct iopt_area *area, unsigned long start_index,
 			  unsigned long last_index, struct page **out_pages,
 			  unsigned int flags)
 {
+	struct iopt_pages *pages = area->pages;
 	struct iopt_pages_access *access;
 	int rc;
 
 	if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
 		return -EPERM;
 
+	mutex_lock(&pages->mutex);
 	access = iopt_pages_get_exact_access(pages, start_index, last_index);
 	if (access) {
 		refcount_inc(&access->refcount);
 		iopt_pages_fill_from_xarray(pages, start_index, last_index,
 					    out_pages);
+		mutex_unlock(&pages->mutex);
 		return 0;
 	}
 
 	access = kzalloc(sizeof(*access), GFP_KERNEL_ACCOUNT);
-	if (!access)
-		return -ENOMEM;
+	if (!access) {
+		rc = -ENOMEM;
+		goto err_unlock;
+	}
 
 	rc = iopt_pages_fill_xarray(pages, start_index, last_index, out_pages);
 	if (rc)
-		goto out_free;
+		goto err_free;
 
 	access->node.start = start_index;
 	access->node.last = last_index;
 	refcount_set(&access->refcount, 1);
+	area->num_accesses++;
 	interval_tree_insert(&access->node, &pages->access_itree);
+	mutex_unlock(&pages->mutex);
 	return 0;
 
-out_free:
+err_free:
 	kfree(access);
+err_unlock:
+	mutex_unlock(&pages->mutex);
 	return rc;
 }
 
@@ -1865,11 +1893,11 @@ int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start_index,
  * Undo iopt_pages_add_access() and unpin the pages if necessary. The caller
  * must stop using the PFNs before calling this.
  */
-void iopt_pages_remove_access(struct iopt_area *area, unsigned long start_index,
-			      unsigned long last_index)
+void iopt_area_remove_access(struct iopt_area *area, unsigned long start_index,
+			     unsigned long last_index)
 {
-	struct iopt_pages_access *access;
 	struct iopt_pages *pages = area->pages;
+	struct iopt_pages_access *access;
 
 	mutex_lock(&pages->mutex);
 	access = iopt_pages_get_exact_access(pages, start_index, last_index);

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-14  7:28   ` Tian, Kevin
@ 2022-11-14 18:43     ` Jason Gunthorpe
  2022-11-15  3:13       ` Tian, Kevin
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 18:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Mon, Nov 14, 2022 at 07:28:47AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> > 
> > +
> > +/*
> > + * Automatically find a block of IOVA that is not being used and not reserved.
> > + * Does not return a 0 IOVA even if it is valid.
> 
> what is the problem with 0? should this be documented in uAPI?

0 is commonly used as an errant value for uninitialized things. We
don't automatically map it into a process mm because it can cause
security problems if we don't trap a bogus 0/NULL pointer reference.

The same logic applies here too; the allocator should not return 0, keeping
it reserved as an unmapped IOVA page to catch bugs.

I don't think it needs to be documented
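
Conceptually it amounts to something like this (an illustrative helper only,
not something in the series):

	/* never hand out the first or the last IOVA page */
	static bool iova_is_allocatable(unsigned long iova)
	{
		return iova >= PAGE_SIZE && iova <= ULONG_MAX - PAGE_SIZE;
	}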

> > +	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
> > +				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
> > +		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
> > +			allowed_span.start_used = PAGE_SIZE;
> > +			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
> > +			allowed_span.is_hole = false;
> > +		}
> 
> statically initialize it when iopt is created?

allowed_span is a stack variable?

> > +		if (!__alloc_iova_check_used(&allowed_span, length,
> > +					     iova_alignment, page_offset))
> > +			continue;
> > +
> > +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> > +					    allowed_span.start_used,
> > +					    allowed_span.last_used) {
> > +			if (!__alloc_iova_check_hole(&area_span, length,
> > +						     iova_alignment,
> > +						     page_offset))
> > +				continue;
> > +
> > +			interval_tree_for_each_span(&reserved_span,
> > +						    &iopt->reserved_itree,
> > +						    area_span.start_used,
> > +						    area_span.last_used) {
> > +				if (!__alloc_iova_check_hole(
> > +					    &reserved_span, length,
> > +					    iova_alignment, page_offset))
> > +					continue;
> 
> this could be simplified by using the double span iterator.

It is subtly not compatible; the double span iterator looks for used areas.
This is looking for a used area in the allowed_itree, a hole in the
area_itree, and a hole in the reserved_itree.

I don't think IOVA allocation should be a fast path, so it is not worth
a lot of effort to micro-optimize this.

> > +static int iopt_check_iova(struct io_pagetable *iopt, unsigned long iova,
> > +			   unsigned long length)
> > +{
> > +	unsigned long last;
> > +
> > +	lockdep_assert_held(&iopt->iova_rwsem);
> > +
> > +	if ((iova & (iopt->iova_alignment - 1)))
> > +		return -EINVAL;
> > +
> > +	if (check_add_overflow(iova, length - 1, &last))
> > +		return -EOVERFLOW;
> > +
> > +	/* No reserved IOVA intersects the range */
> > +	if (iopt_reserved_iter_first(iopt, iova, last))
> > +		return -ENOENT;
> 
> vfio type1 returns -EINVAL
> 
> > +
> > +	/* Check that there is not already a mapping in the range */
> > +	if (iopt_area_iter_first(iopt, iova, last))
> > +		return -EADDRINUSE;
> 
> vfio type1 returns -EEXIST

Hmm, I guess we can change them here; it is a bit annoying for the test
suite though.

> > +static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long
> > start,
> > +				 unsigned long end, unsigned long
> 
> s/end/last/
> 
> > +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> > +		    unsigned long length, unsigned long *unmapped)
> > +{
> > +	unsigned long iova_end;
> 
> s/iova_end/iova_last/

yep
 
> > +static int iopt_calculate_iova_alignment(struct io_pagetable *iopt)
> > +{
> > +	unsigned long new_iova_alignment;
> > +	struct iommufd_access *access;
> > +	struct iommu_domain *domain;
> > +	unsigned long index;
> > +
> > +	lockdep_assert_held_write(&iopt->iova_rwsem);
> > +	lockdep_assert_held(&iopt->domains_rwsem);
> > +
> > +	if (iopt->disable_large_pages)
> > +		new_iova_alignment = PAGE_SIZE;
> > +	else
> > +		new_iova_alignment = 1;
> 
> I didn't understand why we start searching for the alignment from a
> smaller value when large pages are enabled. What is the
> connection here?

'disable_large_pages' is a tiny bit misnamed; what it really does is
ensure that every iommu_map call is exactly PAGE_SIZE, not more (large
pages) and not less (what this is protecting against).

So if a domain has less than PAGE_SIZE we upgrade to
PAGE_SIZE. Otherwise we allow using the lowest possible alignment.

This allows userspace to always work in PAGE_SIZE units without fear
of problems, e.g. with sub-page-size units becoming weird or something.
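
Roughly, the calculation amounts to something like this (a sketch with
assumed iteration details, not the exact code):

	unsigned long align = iopt->disable_large_pages ? PAGE_SIZE : 1;

	/* every domain raises the alignment to its smallest supported page size */
	xa_for_each(&iopt->domains, index, domain)
		align = max_t(unsigned long, align,
			      1UL << __ffs(domain->pgsize_bitmap));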

> > +	interval_tree_remove(&area->node, &iopt->area_itree);
> > +	rc = iopt_insert_area(iopt, lhs, area->pages, start_iova,
> > +			      iopt_area_start_byte(area, start_iova),
> > +			      (new_start - 1) - start_iova + 1,
> > +			      area->iommu_prot);
> > +	if (WARN_ON(rc))
> > +		goto err_insert;
> > +
> > +	rc = iopt_insert_area(iopt, rhs, area->pages, new_start,
> > +			      iopt_area_start_byte(area, new_start),
> > +			      last_iova - new_start + 1, area->iommu_prot);
> > +	if (WARN_ON(rc))
> > +		goto err_remove_lhs;
> > +
> > +	lhs->storage_domain = area->storage_domain;
> > +	lhs->num_accesses = area->num_accesses;
> > +	lhs->pages = area->pages;
> > +	rhs->storage_domain = area->storage_domain;
> > +	rhs->num_accesses = area->num_accesses;
> 
> if an access only spans one side, is it correct for both split sides
> to keep the access count?

Er, this is actually completely broken, woops. A removal of an access
will trigger a WARN_ON since the access_itree element is very likely
no longer correct.

Ah.. So the only use case here is unmapping and you can't unmap
something that has an access established, except in some pathological
case where the access does not intersect with what is being mapped.

There is no way to tell which iopt_pages_access entries are connected to
which areas, so without spending some memory this can't be fixed up. I
think it is not a real issue, as mdev plus this ancient VFIO interface is
probably not something that exists in the real world.

+       /*
+        * Splitting is not permitted if an access exists, we don't track enough
+        * information to split existing accesses.
+        */
+       if (area->num_accesses) {
+               rc = -EINVAL;
+               goto err_unlock;
+       }
+
@@ -1041,10 +1050,8 @@ static int iopt_area_split(struct iopt_area *area, unsigned long iova)
                goto err_remove_lhs;
 
        lhs->storage_domain = area->storage_domain;
-       lhs->num_accesses = area->num_accesses;
        lhs->pages = area->pages;
        rhs->storage_domain = area->storage_domain;
-       rhs->num_accesses = area->num_accesses;
        rhs->pages = area->pages;
        kref_get(&rhs->pages->kref);
        kfree(area);

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access
  2022-11-14  8:25   ` Tian, Kevin
@ 2022-11-14 19:05     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 19:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Mon, Nov 14, 2022 at 08:25:35AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> > 
> > Callers must provide a callback that immediately unpins any IOVA being
> > used within a range. This happens if userspace unmaps the IOVA under the
> > pin.
> 
> Given the 'must', let's add a check on the unmap callback in
> iommufd_access_create().

What I did is put it on the iommufd_access_pin_pages() call, since it
is fine to use rw without providing a callback.
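
i.e. something like this near the top of iommufd_access_pin_pages() (a
sketch; the exact wording may differ):

	/* The driver must supply an unmap callback before it may pin pages */
	if (!access->ops->unmap)
		return -EINVAL;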

> > +struct iommufd_access *
> > +iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
> > +		      const struct iommufd_access_ops *ops, void *data)
> > +{
> > +	struct iommufd_access *access;
> > +	struct iommufd_object *obj;
> > +	int rc;
> > +
> > +	/*
> > +	 * There is no uAPI for the access object, but to keep things
> > symmetric
> > +	 * use the object infrastructure anyhow.
> > +	 */
> > +	access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
> > +	if (IS_ERR(access))
> > +		return access;
> > +
> > +	access->data = data;
> > +	access->ops = ops;
> > +	access->ictx = ictx;
> 
> I didn't see other places referencing access->ictx before this func
> succeeds. But just to be robust it's always cleaner to put this line
> together with iommufd_ctx_get().

OK

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 8/17] iommufd: Algorithms for PFN storage
  2022-11-08  0:49 ` [PATCH v4 08/17] iommufd: Algorithms for PFN storage Jason Gunthorpe
  2022-11-14  5:50   ` Tian, Kevin
@ 2022-11-14 19:19   ` Jason Gunthorpe
  1 sibling, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-14 19:19 UTC (permalink / raw)
  To: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Mon, Nov 07, 2022 at 08:49:01PM -0400, Jason Gunthorpe wrote:

> +/*
> + * Each interval represents an active iopt_access_pages(), it acts as an
> + * interval lock that keeps the PFNs pinned and stored in the xarray.
> + */
> +struct iopt_pages_access {
> +	struct interval_tree_node node;
> +	refcount_t refcount;

I noticed this is never used unlocked; everything holds the
pages->mutex for the pages that holds the node in its itree.

So this can just be an unsigned int; we don't need the atomic.

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2022-11-10  9:30   ` [PATCH v4 04/17] " Bagas Sanjaya
@ 2022-11-14 20:50   ` Eric Auger
  2022-11-15  0:52     ` Jason Gunthorpe
  3 siblings, 1 reply; 99+ messages in thread
From: Eric Auger @ 2022-11-14 20:50 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Hi,

On 11/8/22 01:48, Jason Gunthorpe wrote:
> From: Kevin Tian <kevin.tian@intel.com>
>
> Add iommufd into the documentation tree, and supply initial documentation.
> Much of this is linked from code comments by kdoc.
>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/userspace-api/index.rst   |   1 +
>  Documentation/userspace-api/iommufd.rst | 222 ++++++++++++++++++++++++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/iommufd.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index c78da9ce0ec44e..f16337bdb8520f 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -25,6 +25,7 @@ place where this information is gathered.
>     ebpf/index
>     ioctl/index
>     iommu
> +   iommufd
>     media/index
>     netlink/index
>     sysfs-platform_profile
> diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
> new file mode 100644
> index 00000000000000..64a135f3055adc
> --- /dev/null
> +++ b/Documentation/userspace-api/iommufd.rst
> @@ -0,0 +1,222 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=======
> +IOMMUFD
> +=======
> +
> +:Author: Jason Gunthorpe
> +:Author: Kevin Tian
> +
> +Overview
> +========
> +
> +IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
> +IO page tables from userspace using file descriptors. It intends to be general
> +and consumable by any driver that wants to expose DMA to userspace. These
> +drivers are eventually expected to deprecate any internal IOMMU logic if exists
they may already/historically implement (eg. vfio_iommu_type1.c)?
> +(e.g. vfio_iommu_type1.c).
> +
> +At minimum iommufd provides universal support of managing I/O address spaces and
> +I/O page tables for all IOMMUs, with room in the design to add non-generic
> +features to cater to specific hardware functionality.
> +
> +In this context the capital letter (IOMMUFD) refers to the subsystem while the
> +small letter (iommufd) refers to the file descriptors created via /dev/iommu for
> +use by userspace.
> +
> +Key Concepts
> +============
> +
> +User Visible Objects
> +--------------------
> +
> +Following IOMMUFD objects are exposed to userspace:
> +
> +- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS), allowing map/unmap
> +  of user space memory into ranges of I/O Virtual Address (IOVA).
> +
> +  The IOAS is a functional replacement for the VFIO container, and like the VFIO
> +  container it copies an IOVA map to a list of iommu_domains held within it.
> +
> +- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
> +  external driver.
> +
> +- IOMMUFD_OBJ_HW_PAGETABLE, representing an actual hardware I/O page table
> +  (i.e. a single struct iommu_domain) managed by the iommu driver.
> +
> +  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and
> +  it will synchronize its mapping with each member HW_PAGETABLE.
> +
> +All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
> +
> +The diagram below shows relationship between user-visible objects and kernel
> +datastructures (external to iommufd), with numbers referred to operations
> +creating the objects and links::
> +
> +  _________________________________________________________
> + |                         iommufd                         |
> + |       [1]                                               |
> + |  _________________                                      |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |        [3]                 [2]      |
> + | |                 |    ____________         __________  |
> + | |      IOAS       |<--|            |<------|          | |
> + | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
> + | |                 |   |____________|       |__________| |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |_________________|         |                   |       |
> + |         |                   |                   |       |
> + |_________|___________________|___________________|_______|
> +           |                   |                   |
> +           |              _____v______      _______v_____
> +           | PFN storage |            |    |             |
> +           |------------>|iommu_domain|    |struct device|
> +                         |____________|    |_____________|
> +
> +1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can
> +   hold multiple IOAS objects. IOAS is the most generic object and does not
> +   expose interfaces that are specific to single IOMMU drivers. All operations
> +   on the IOAS must operate equally on each of the iommu_domains inside of it.
> +
> +2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
> +   to bind a device to an iommufd. The driver is expected to implement proper a
s/proper/properly?
> +   set of ioctls to allow userspace to initiate the binding operation.
> +   Successful completion of this operation establishes the desired DMA ownership
> +   over the device. The driver must also set the driver_managed_dma flag and
> +   must not touch the device until this operation succeeds.
> +
> +3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
> +   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
> +   allows userspace to initiate the attaching operation. If a compatible
> +   pagetable already exists then it is reused for the attachment. Otherwise a
> +   new pagetable object and iommu_domain is created. Successful completion of
> +   this operation sets up the linkages among IOAS, device and iommu_domain. Once
> +   this completes the device could do DMA.
> +
> +   Every iommu_domain inside the IOAS is also represented to userspace as a
> +   HW_PAGETABLE object.
> +
> +   .. note::
> +
> +      Future IOMMUFD updates will provide an API to create and manipulate the
> +      HW_PAGETABLE directly.
> +
> +A device can only bind to an iommufd due to DMA ownership claim and attach to at
> +most one IOAS object (no support of PASID yet).
> +
> +Currently only PCI device is allowed to use IOMMUFD.
is it still true? device_bind() now takes a struct device *

In [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for
physical devices "PCI" is used at several places
but shouldn't it be removed now?

> +
> +Kernel Datastructure
> +--------------------
> +
> +User visible objects are backed by following datastructures:
> +
> +- iommufd_ioas for IOMMUFD_OBJ_IOAS.
> +- iommufd_device for IOMMUFD_OBJ_DEVICE.
> +- iommufd_hw_pagetable for IOMMUFD_OBJ_HW_PAGETABLE.
> +
> +Several terminologies when looking at these datastructures:
> +
> +- Automatic domain - refers to an iommu domain created automatically when
> +  attaching a device to an IOAS object. This is compatible to the semantics of
> +  VFIO type1.
> +
> +- Manual domain - refers to an iommu domain designated by the user as the
> +  target pagetable to be attached to by a device. Though currently there are
> +  no uAPIs to directly create such domain, the datastructure and algorithms
> +  are ready for handling that use case.
> +
> +- In-kernel user - refers to something like a VFIO mdev that is using the
> +  IOMMUFD access interface to access the IOAS. This starts by creating an
> +  iommufd_access object that is similar to the domain binding a physical device
> +  would do. The access object will then allow converting IOVA ranges into struct
> +  page * lists, or doing direct read/write to an IOVA.
> +
> +iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are
> +mapped to memory pages, composed of:
> +
> +- struct io_pagetable holding the IOVA map
> +- struct iopt_areas representing populated portions of IOVA
> +- struct iopt_pages representing the storage of PFNs
> +- struct iommu_domain representing the IO page table in the IOMMU
> +- struct iopt_pages_access representing in-kernel users of PFNs
> +- struct xarray pinned_pfns holding a list of pages pinned by in-kernel users
> +
> +Each iopt_pages represents a logical linear array of full PFNs. The PFNs are
> +ultimately derived from userspave VAs via an mm_struct. Once they have been
> +pinned the PFN is stored in IOPTEs of an iommu_domain or inside the pinned_pages
s/is/are
> +xarray if they have been pinned through an iommufd_access.
> +
> +PFN have to be copied between all combinations of storage locations, depending
> +on what domains are present and what kinds of in-kernel "software access" users
> +exists. The mechanism ensures that a page is pinned only once.
> +
> +An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a
> +list of iommu_domains that mirror the IOVA to PFN map.
> +
> +Multiple io_pagetable-s, through their iopt_area-s, can share a single
> +iopt_pages which avoids multi-pinning and double accounting of page
> +consumption.
> +
> +iommufd_ioas is sharable between subsystems, e.g. VFIO and VDPA, as long as
> +devices managed by different subsystems are bound to a same iommufd.
> +
> +IOMMUFD User API
> +================
> +
> +.. kernel-doc:: include/uapi/linux/iommufd.h
> +
> +IOMMUFD Kernel API
> +==================
> +
> +The IOMMUFD kAPI is device-centric with group-related tricks managed behind the
> +scene. This allows the external drivers calling such kAPI to implement a simple
> +device-centric uAPI for connecting its device to an iommufd, instead of
> +explicitly imposing the group semantics in its uAPI as VFIO does.
> +
> +.. kernel-doc:: drivers/iommu/iommufd/device.c
> +   :export:
> +
> +VFIO and IOMMUFD
> +----------------
> +
> +Connecting a VFIO device to iommufd can be done in two ways.
> +
> +First is a VFIO compatible way by directly implementing the /dev/vfio/vfio
> +container IOCTLs by mapping them into io_pagetable operations. Doing so allows
> +the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to
> +/dev/iommufd or extending VFIO to SET_CONTAINER using an iommufd instead of a
> +container fd.
> +
> +The second approach directly extends VFIO to support a new set of device-centric
> +user API based on aforementioned IOMMUFD kernel API. It requires userspace
> +change but better matches the IOMMUFD API semantics and easier to support new
> +iommufd features when comparing it to the first approach.
> +
> +Currently both approaches are still work-in-progress.
> +
> +There are still a few gaps to be resolved to catch up with VFIO type1, as
> +documented in iommufd_vfio_check_extension().
> +
> +Future TODOs
> +============
> +
> +Currently IOMMUFD supports only kernel-managed I/O page table, similar to VFIO
> +type1. New features on the radar include:
> +
> + - Binding iommu_domain's to PASID/SSID
> + - Userspace page tables, for ARM, x86 and S390
> + - Kernel bypass'd invalidation of user page tables
> + - Re-use of the KVM page table in the IOMMU
> + - Dirty page tracking in the IOMMU
> + - Runtime Increase/Decrease of IOPTE size
> + - PRI support with faults resolved in userspace
Thanks

Eric


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 04/17] iommufd: Document overview of iommufd
  2022-11-14 20:50   ` Eric Auger
@ 2022-11-15  0:52     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-15  0:52 UTC (permalink / raw)
  To: Eric Auger
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Farman, Jason Wang, Jean-Philippe Brucker,
	Joao Martins, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu,
	Keqian Zhu

On Mon, Nov 14, 2022 at 09:50:56PM +0100, Eric Auger wrote:

> > +IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
> > +IO page tables from userspace using file descriptors. It intends to be general
> > +and consumable by any driver that wants to expose DMA to userspace. These
> > +drivers are eventually expected to deprecate any internal IOMMU logic if exists
> they may already/historically implement (eg. vfio_iommu_type1.c)?

Done

> > +2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
> > +   to bind a device to an iommufd. The driver is expected to implement proper a
> s/proper/properly?
> > +   set of ioctls to allow userspace to initiate the binding operation.
> > +   Successful completion of this operation establishes the desired DMA ownership
> > +   over the device. The driver must also set the driver_managed_dma flag and
> > +   must not touch the device until this operation succeeds.

I don't know what this was supposed to say, let's delete the word "proper"

> > +3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
> > +   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
> > +   allows userspace to initiate the attaching operation. If a compatible
> > +   pagetable already exists then it is reused for the attachment. Otherwise a
> > +   new pagetable object and iommu_domain is created. Successful completion of
> > +   this operation sets up the linkages among IOAS, device and iommu_domain. Once
> > +   this completes the device could do DMA.
> > +
> > +   Every iommu_domain inside the IOAS is also represented to userspace as a
> > +   HW_PAGETABLE object.
> > +
> > +   .. note::
> > +
> > +      Future IOMMUFD updates will provide an API to create and manipulate the
> > +      HW_PAGETABLE directly.
> > +
> > +A device can only bind to an iommufd due to DMA ownership claim and attach to at
> > +most one IOAS object (no support of PASID yet).
> > +
> > +Currently only PCI device is allowed to use IOMMUFD.
> is it still true? device_bind() now takes a struct device *
> 
> In [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for
> physical devices "PCI" is used at several places
> but shouldn't it be removed now?

Right, gone
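
For reference, the bind entry point is plain struct device now, roughly
(as it stands in this series, modulo any argument-name churn):

struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
					   struct device *dev, u32 *id);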

> > +- struct io_pagetable holding the IOVA map
> > +- struct iopt_areas representing populated portions of IOVA
> > +- struct iopt_pages representing the storage of PFNs
> > +- struct iommu_domain representing the IO page table in the IOMMU
> > +- struct iopt_pages_access representing in-kernel users of PFNs
> > +- struct xarray pinned_pfns holding a list of pages pinned by in-kernel users
> > +
> > +Each iopt_pages represents a logical linear array of full PFNs. The PFNs are
> > +ultimately derived from userspave VAs via an mm_struct. Once they have been
> > +pinned the PFN is stored in IOPTEs of an iommu_domain or inside the pinned_pages
> s/is/are

Ah, it should be "Once they have been pinned the PFNs are stored in
IOPTEs" as the whole thing is referring to plural PFNs

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 07/17] iommufd: PFN handling for iopt_pages
  2022-11-14 17:24     ` Jason Gunthorpe
@ 2022-11-15  2:59       ` Tian, Kevin
  0 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2022-11-15  2:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 15, 2022 1:25 AM
> 
> On Fri, Nov 11, 2022 at 11:09:08AM +0000, Tian, Kevin wrote:
> > <...2nd part>
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, November 8, 2022 8:49 AM
> > > +
> > > +/* pfn_reader_user is just the pin_user_pages() path */
> > > +struct pfn_reader_user {
> > > +	struct page **upages;
> > > +	size_t upages_len;
> > > +	unsigned long upages_start;
> > > +	unsigned long upages_end;
> > > +	unsigned int gup_flags;
> > > +	int locked;
> >
> > document what locked==-1/0/1 means
> 
> +       /*
> +        * 1 means mmget() and mmap_read_lock(), 0 means only mmget(), -1 is
> +        * neither
> +        */
> 
> 
> > > +/* Process a single span in the access_itree */
> >
> > in all storages.
> 
> +/* Process a single span to load it from the proper storage */
> 

with your replies:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 08/17] iommufd: Algorithms for PFN storage
  2022-11-14 18:02     ` Jason Gunthorpe
@ 2022-11-15  3:06       ` Tian, Kevin
  2022-11-15 14:49         ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-15  3:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 15, 2022 2:03 AM
> -		mutex_lock(&area->pages->mutex);
> -		rc = iopt_pages_add_access(area->pages, index, last_index,
> -					   out_pages, flags);
> -		if (rc) {
> -			mutex_unlock(&area->pages->mutex);
> +		rc = iopt_pages_add_access(area, index, last_index,
> out_pages,
> +					   flags);
> +		if (rc)

iopt_area_add_access(), which I suppose you have already fixed when
compiling this change. but just in case... 😊

> @@ -1819,40 +1838,49 @@ iopt_pages_get_exact_access(struct iopt_pages
> *pages, unsigned long index,
>   *
>   * This should be undone through a matching call to
> iopt_pages_remove_access()

iopt_area_remove_access()

> @@ -1865,11 +1893,11 @@ int iopt_pages_add_access(struct iopt_pages
> *pages, unsigned long start_index,
>   * Undo iopt_pages_add_access() and unpin the pages if necessary. The

iopt_area_add_access()

with above,

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-14 18:43     ` Jason Gunthorpe
@ 2022-11-15  3:13       ` Tian, Kevin
  2022-11-15 15:05         ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-15  3:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 15, 2022 2:44 AM
> 
> On Mon, Nov 14, 2022 at 07:28:47AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, November 8, 2022 8:49 AM
> > >
> > > +
> > > +/*
> > > + * Automatically find a block of IOVA that is not being used and not
> reserved.
> > > + * Does not return a 0 IOVA even if it is valid.
> >
> > what is the problem with 0? should this be documented in uAPI?
> 
> 0 is commonly used as an errant value for uninitialized things. We
> don't automatically map it into a process mm because it can cause
> security problems if we don't trap a bogus 0/NULL pointer reference.
> 
> The same logic applies here too, the allocator should not return 0 to
> reserve it as an unmapped IOVA page to catch bugs.

CPU doesn't reference IOVA. Where do such bugs exist?

> 
> I don't think it needs to be documented

this again causes a subtle difference between automatic allocation
and fixed iova. If we really think address 0 is something related
to bugs, then why is it allowed with fixed iova?

> 
> > > +		if (!__alloc_iova_check_used(&allowed_span, length,
> > > +					     iova_alignment, page_offset))
> > > +			continue;
> > > +
> > > +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> > > +					    allowed_span.start_used,
> > > +					    allowed_span.last_used) {
> > > +			if (!__alloc_iova_check_hole(&area_span, length,
> > > +						     iova_alignment,
> > > +						     page_offset))
> > > +				continue;
> > > +
> > > +			interval_tree_for_each_span(&reserved_span,
> > > +						    &iopt->reserved_itree,
> > > +						    area_span.start_used,
> > > +						    area_span.last_used) {
> > > +				if (!__alloc_iova_check_hole(
> > > +					    &reserved_span, length,
> > > +					    iova_alignment, page_offset))
> > > +					continue;
> >
> > this could be simplified by double span.
> 
> It is subtly not compatible, the double span looks for used areas.
> This is looking for a used area in the allowed_itree, a hole in the
> area_itree, and a hole in the reserved_itree.

the inner two loops can be replaced by double span, since both
are skipping used areas.

> 
> I don't think IOVA allocation should be a fast path so it is not worth
> alot of effort to micro-optimize this.

but I'm not insisting on changing them now. It's trivial.

> > > +	if (iopt->disable_large_pages)
> > > +		new_iova_alignment = PAGE_SIZE;
> > > +	else
> > > +		new_iova_alignment = 1;
> >
> > I didn't understand why we start searching alignment from a
> > smaller value when large pages is enabled. what is the
> > connection here?
> 
> 'disable_large_pages' is a tiny bit misnamed, what it really does is
> ensure that every iommu_map call is exactly PAGE_SIZE, not more (large
> pages) and not less (what this is protecting against).
> 
> So if a domain has less than PAGE_SIZE we upgrade to
> PAGE_SIZE. Otherwise we allow using the lowest possible alignment.
> 
> This allows userspace to always work in PAGE_SIZE units without fear
> of problems, eg with sub-page-size units becoming weird or something.

above are good stuff in a comment.

> 
> > > +	interval_tree_remove(&area->node, &iopt->area_itree);
> > > +	rc = iopt_insert_area(iopt, lhs, area->pages, start_iova,
> > > +			      iopt_area_start_byte(area, start_iova),
> > > +			      (new_start - 1) - start_iova + 1,
> > > +			      area->iommu_prot);
> > > +	if (WARN_ON(rc))
> > > +		goto err_insert;
> > > +
> > > +	rc = iopt_insert_area(iopt, rhs, area->pages, new_start,
> > > +			      iopt_area_start_byte(area, new_start),
> > > +			      last_iova - new_start + 1, area->iommu_prot);
> > > +	if (WARN_ON(rc))
> > > +		goto err_remove_lhs;
> > > +
> > > +	lhs->storage_domain = area->storage_domain;
> > > +	lhs->num_accesses = area->num_accesses;
> > > +	lhs->pages = area->pages;
> > > +	rhs->storage_domain = area->storage_domain;
> > > +	rhs->num_accesses = area->num_accesses;
> >
> > if an access only spans one side, is it correct to have both split sides
> > keep the access number?
> 
> Er, this is actually completely broken, woops. A removal of an access
> will trigger a WARN_ON since the access_itree element is very likely
> no longer correct.
> 
> Ah.. So the only use case here is unmapping and you can't unmap
> something that has an access established, except in some pathological
> case where the access does not intersect with what is being mapped.
> 
> There is no way to tell which iopt_pages_access are connected to which
> areas, so without spending some memory this can't be fixed up. I think
> it is not a real issue as mdev plus this ancient VFIO interface is
> probably not something that exists in the real world..
> 
> +       /*
> +        * Splitting is not permitted if an access exists, we don't track enough
> +        * information to split existing accesses.
> +        */
> +       if (area->num_accesses) {
> +               rc = -EINVAL;
> +               goto err_unlock;
> +       }
> +
> @@ -1041,10 +1050,8 @@ static int iopt_area_split(struct iopt_area *area,
> unsigned long iova)
>                 goto err_remove_lhs;
> 
>         lhs->storage_domain = area->storage_domain;
> -       lhs->num_accesses = area->num_accesses;
>         lhs->pages = area->pages;
>         rhs->storage_domain = area->storage_domain;
> -       rhs->num_accesses = area->num_accesses;
>         rhs->pages = area->pages;
>         kref_get(&rhs->pages->kref);
>         kfree(area);
> 

this change makes sense to me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree
  2022-11-08  0:48 ` [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree Jason Gunthorpe
@ 2022-11-15 14:14   ` Eric Auger
  2022-11-15 16:44     ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Auger @ 2022-11-15 14:14 UTC (permalink / raw)
  To: Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Hi Jason,

On 11/8/22 01:48, Jason Gunthorpe wrote:
> The span iterator travels over the indexes of the interval_tree, not the
> nodes, and classifies spans of indexes as either 'used' or 'hole'.
>
> 'used' spans are fully covered by nodes in the tree and 'hole' spans have
> no node intersecting the span.
>
> This is done greedily such that spans are maximally sized and every
> iteration step switches between used/hole.
>
> As an example a trivial allocator can be written as:
>
> 	for (interval_tree_span_iter_first(&span, itree, 0, ULONG_MAX);
> 	     !interval_tree_span_iter_done(&span);
> 	     interval_tree_span_iter_next(&span))
> 		if (span.is_hole &&
> 		    span.last_hole - span.start_hole >= allocation_size - 1)
> 			return span.start_hole;
>
> With all the tricky boundary conditions handled by the library code.
>
> The following iommufd patches have several algorithms for its overlapping
> node interval trees that are significantly simplified with this kind of
> iteration primitive. As it seems generally useful, put it into lib/.
>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  .clang-format                 |   1 +
>  include/linux/interval_tree.h |  58 +++++++++++++++
>  lib/Kconfig                   |   4 ++
>  lib/interval_tree.c           | 132 ++++++++++++++++++++++++++++++++++
>  4 files changed, 195 insertions(+)
>
> diff --git a/.clang-format b/.clang-format
> index 1247d54f9e49fa..96d07786dcfb46 100644
> --- a/.clang-format
> +++ b/.clang-format
> @@ -440,6 +440,7 @@ ForEachMacros:
>    - 'inet_lhash2_for_each_icsk'
>    - 'inet_lhash2_for_each_icsk_continue'
>    - 'inet_lhash2_for_each_icsk_rcu'
> +  - 'interval_tree_for_each_span'
>    - 'intlist__for_each_entry'
>    - 'intlist__for_each_entry_safe'
>    - 'kcore_copy__for_each_phdr'
> diff --git a/include/linux/interval_tree.h b/include/linux/interval_tree.h
> index 288c26f50732d7..2b8026a3990645 100644
> --- a/include/linux/interval_tree.h
> +++ b/include/linux/interval_tree.h
> @@ -27,4 +27,62 @@ extern struct interval_tree_node *
>  interval_tree_iter_next(struct interval_tree_node *node,
>  			unsigned long start, unsigned long last);
>  
> +/**
> + * struct interval_tree_span_iter - Find used and unused spans.
> + * @start_hole: Start of an interval for a hole when is_hole == 1
> + * @last_hole: Inclusive end of an interval for a hole when is_hole == 1
> + * @start_used: Start of a used interval when is_hole == 0
> + * @last_used: Inclusive end of a used interval when is_hole == 0
> + * @is_hole: 0 == used, 1 == is_hole, -1 == done iteration
> + *
> + * This iterator travels over spans in an interval tree. It does not return
> + * nodes but classifies each span as either a hole, where no nodes intersect, or
> + * a used, which is fully covered by nodes. Each iteration step toggles between
> + * hole and used until the entire range is covered. The returned spans always
> + * fully cover the requested range.
> + *
> + * The iterator is greedy, it always returns the largest hole or used possible,
> + * consolidating all consecutive nodes.
> + *
> + * Use interval_tree_span_iter_done() to detect end of iteration.
> + */
> +struct interval_tree_span_iter {
> +	/* private: not for use by the caller */
> +	struct interval_tree_node *nodes[2];
> +	unsigned long first_index;
> +	unsigned long last_index;
> +
> +	/* public: */
> +	union {
> +		unsigned long start_hole;
> +		unsigned long start_used;
> +	};
> +	union {
> +		unsigned long last_hole;
> +		unsigned long last_used;
> +	};
> +	int is_hole;
> +};
> +
> +void interval_tree_span_iter_first(struct interval_tree_span_iter *state,
> +				   struct rb_root_cached *itree,
> +				   unsigned long first_index,
> +				   unsigned long last_index);
> +void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter,
> +				     struct rb_root_cached *itree,
> +				     unsigned long new_index);
> +void interval_tree_span_iter_next(struct interval_tree_span_iter *state);
> +
> +static inline bool
> +interval_tree_span_iter_done(struct interval_tree_span_iter *state)
> +{
> +	return state->is_hole == -1;
> +}
> +
> +#define interval_tree_for_each_span(span, itree, first_index, last_index)      \
> +	for (interval_tree_span_iter_first(span, itree,                        \
> +					   first_index, last_index);           \
> +	     !interval_tree_span_iter_done(span);                              \
> +	     interval_tree_span_iter_next(span))
> +
>  #endif	/* _LINUX_INTERVAL_TREE_H */
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 9bbf8a4b2108e6..c6c323fd251721 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -479,6 +479,10 @@ config INTERVAL_TREE
>  
>  	  for more information.
>  
> +config INTERVAL_TREE_SPAN_ITER
> +	bool
> +	depends on INTERVAL_TREE
> +
>  config XARRAY_MULTI
>  	bool
>  	help
> diff --git a/lib/interval_tree.c b/lib/interval_tree.c
> index 593ce56ece5050..d2882db8fa2a07 100644
> --- a/lib/interval_tree.c
> +++ b/lib/interval_tree.c
> @@ -15,3 +15,135 @@ EXPORT_SYMBOL_GPL(interval_tree_insert);
>  EXPORT_SYMBOL_GPL(interval_tree_remove);
>  EXPORT_SYMBOL_GPL(interval_tree_iter_first);
>  EXPORT_SYMBOL_GPL(interval_tree_iter_next);
> +
> +#ifdef CONFIG_INTERVAL_TREE_SPAN_ITER
Maybe add in a kernel doc that a prerequisite is state.nodes[1] must be
populated
> +static void
> +interval_tree_span_iter_next_gap(struct interval_tree_span_iter *state)
> +{
> +	struct interval_tree_node *cur = state->nodes[1];
> +
> +	/*
> +	 * Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a
> +	 * contiguous span of nodes. This makes nodes[0]->last the end of that
> +	 * contiguous span of valid indexes that started at the original
I would suggest s/contiguous span/contiguous used span and remove "of
valid indexes"
> +	 * nodes[1]->start. nodes[1] is now the next node and a hole is between
nodes[1] is now the first node starting the next used span. A hole span
is between nodes[0]->last and nodes[1]->start
> +	 * nodes[0] and [1].
> +	 */
> +	state->nodes[0] = cur;
> +	do {
> +		if (cur->last > state->nodes[0]->last)
> +			state->nodes[0] = cur;
> +		cur = interval_tree_iter_next(cur, state->first_index,
> +					      state->last_index);
> +	} while (cur && (state->nodes[0]->last >= cur->start ||
> +			 state->nodes[0]->last + 1 == cur->start));
> +	state->nodes[1] = cur;
> +}
> +
> +void interval_tree_span_iter_first(struct interval_tree_span_iter *iter,
> +				   struct rb_root_cached *itree,
> +				   unsigned long first_index,
> +				   unsigned long last_index)
> +{
> +	iter->first_index = first_index;
> +	iter->last_index = last_index;
> +	iter->nodes[0] = NULL;
> +	iter->nodes[1] =
> +		interval_tree_iter_first(itree, first_index, last_index);
> +	if (!iter->nodes[1]) {
> +		/* No nodes intersect the span, whole span is hole */
> +		iter->start_hole = first_index;
> +		iter->last_hole = last_index;
> +		iter->is_hole = 1;
> +		return;
> +	}
> +	if (iter->nodes[1]->start > first_index) {
> +		/* Leading hole on first iteration */
> +		iter->start_hole = first_index;
> +		iter->last_hole = iter->nodes[1]->start - 1;
> +		iter->is_hole = 1;
> +		interval_tree_span_iter_next_gap(iter);
> +		return;
> +	}
> +
> +	/* Starting inside a used */
> +	iter->start_used = first_index;
> +	iter->is_hole = 0;
> +	interval_tree_span_iter_next_gap(iter);
> +	iter->last_used = iter->nodes[0]->last;
> +	if (iter->last_used >= last_index) {
> +		iter->last_used = last_index;
> +		iter->nodes[0] = NULL;
> +		iter->nodes[1] = NULL;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(interval_tree_span_iter_first);
> +
> +void interval_tree_span_iter_next(struct interval_tree_span_iter *iter)
> +{
> +	if (!iter->nodes[0] && !iter->nodes[1]) {
> +		iter->is_hole = -1;
> +		return;
> +	}
> +
> +	if (iter->is_hole) {
> +		iter->start_used = iter->last_hole + 1;
> +		iter->last_used = iter->nodes[0]->last;
> +		if (iter->last_used >= iter->last_index) {
> +			iter->last_used = iter->last_index;
> +			iter->nodes[0] = NULL;
> +			iter->nodes[1] = NULL;
> +		}
> +		iter->is_hole = 0;
> +		return;
> +	}
> +
> +	if (!iter->nodes[1]) {
> +		/* Trailing hole */
> +		iter->start_hole = iter->nodes[0]->last + 1;
> +		iter->last_hole = iter->last_index;
> +		iter->nodes[0] = NULL;
> +		iter->is_hole = 1;
> +		return;
> +	}
> +
> +	/* must have both nodes[0] and [1], interior hole */
> +	iter->start_hole = iter->nodes[0]->last + 1;
> +	iter->last_hole = iter->nodes[1]->start - 1;
> +	iter->is_hole = 1;
> +	interval_tree_span_iter_next_gap(iter);
> +}
> +EXPORT_SYMBOL_GPL(interval_tree_span_iter_next);
> +
> +/*
> + * Advance the iterator index to a specific position. The returned used/hole is
> + * updated to start at new_index. This is faster than calling
> + * interval_tree_span_iter_first() as it can avoid full searches in several
> + * cases where the iterator is already set.
> + */
> +void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter,
> +				     struct rb_root_cached *itree,
> +				     unsigned long new_index)
> +{
> +	if (iter->is_hole == -1)
> +		return;
> +
> +	iter->first_index = new_index;
check new_index > iter->first_index?
> +	if (new_index > iter->last_index) {
> +		iter->is_hole = -1;
> +		return;
> +	}
> +
> +	/* Rely on the union aliasing hole/used */
> +	if (iter->start_hole <= new_index && new_index <= iter->last_hole) {
> +		iter->start_hole = new_index;
> +		return;
> +	}
> +	if (new_index == iter->last_hole + 1)
> +		interval_tree_span_iter_next(iter);
> +	else
> +		interval_tree_span_iter_first(iter, itree, new_index,
> +					      iter->last_index);
> +}
> +EXPORT_SYMBOL_GPL(interval_tree_span_iter_advance);
> +#endif

Besides that, looks good to me
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 08/17] iommufd: Algorithms for PFN storage
  2022-11-15  3:06       ` Tian, Kevin
@ 2022-11-15 14:49         ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-15 14:49 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Tue, Nov 15, 2022 at 03:06:57AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 15, 2022 2:03 AM
> > -		mutex_lock(&area->pages->mutex);
> > -		rc = iopt_pages_add_access(area->pages, index, last_index,
> > -					   out_pages, flags);
> > -		if (rc) {
> > -			mutex_unlock(&area->pages->mutex);
> > +		rc = iopt_pages_add_access(area, index, last_index,
> > out_pages,
> > +					   flags);
> > +		if (rc)
> 
> iopt_area_add_access(), which I suppose you have already fixed when
> compiling this change. but just in case... 😊
> 
> > @@ -1819,40 +1838,49 @@ iopt_pages_get_exact_access(struct iopt_pages
> > *pages, unsigned long index,
> >   *
> >   * This should be undone through a matching call to
> > iopt_pages_remove_access()
> 
> iopt_area_remove_access()
> 
> > @@ -1865,11 +1893,11 @@ int iopt_pages_add_access(struct iopt_pages
> > *pages, unsigned long start_index,
> >   * Undo iopt_pages_add_access() and unpin the pages if necessary. The
> 
> iopt_area_add_access()
> 
> with above,

Yep, I got those
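
i.e. the kdoc now names the area variants - roughly what the prototypes
look like after the rename (sketch, parameter names may differ):

int iopt_area_add_access(struct iopt_area *area, unsigned long start_index,
			 unsigned long last_index, struct page **out_pages,
			 unsigned int flags);
void iopt_area_remove_access(struct iopt_area *area, unsigned long start_index,
			     unsigned long last_index);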

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-15  3:13       ` Tian, Kevin
@ 2022-11-15 15:05         ` Jason Gunthorpe
  2022-11-16  0:09           ` Tian, Kevin
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-15 15:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Tue, Nov 15, 2022 at 03:13:56AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, November 15, 2022 2:44 AM
> > 
> > On Mon, Nov 14, 2022 at 07:28:47AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, November 8, 2022 8:49 AM
> > > >
> > > > +
> > > > +/*
> > > > + * Automatically find a block of IOVA that is not being used and not
> > reserved.
> > > > + * Does not return a 0 IOVA even if it is valid.
> > >
> > > what is the problem with 0? should this be documented in uAPI?
> > 
> > 0 is commonly used as an errant value for uninitialized things. We
> > don't automatically map it into a process mm because it can cause
> > security problems if we don't trap a bogus 0/NULL pointer reference.
> > 
> > The same logic applies here too, the allocator should not return 0 to
> > reserve it as an unmapped IOVA page to catch bugs.
> 
> CPU doesn't reference IOVA. Where do such bugs exist?

SW is always buggy and SW programs the DMA address, so it could leave
a 0 behind or something during the programming.

> > I don't think it needs to be documented
> 
> this again causes a subtle difference between automatic allocation
> and fixed iova. If we really think address 0 is something related
> to bug, then why is it allowed with fixed iova?

Because fixed can do anything up to the limits of the HW. It is like
mmap, where MAP_FIXED can allocate 0 as well, but automatic allocation
will not.
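
To make the analogy concrete, from userspace the only difference is the
FIXED_IOVA flag. A rough sketch against the uAPI in this series (assuming
IOMMU_IOAS_MAP / IOMMU_IOAS_MAP_FIXED_IOVA land as posted; error handling
trimmed):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Automatic allocation: the kernel picks the IOVA and never hands out 0 */
static int map_auto(int iommufd, uint32_t ioas_id, void *buf, size_t len,
		    uint64_t *out_iova)
{
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.user_va = (uintptr_t)buf,
		.length = len,
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		return -1;
	*out_iova = map.iova;	/* kernel-chosen IOVA comes back here */
	return 0;
}

/* Fixed IOVA: like MAP_FIXED, any IOVA the HW supports is accepted, 0 included */
static int map_fixed(int iommufd, uint32_t ioas_id, void *buf, size_t len,
		     uint64_t iova)
{
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
			 IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.user_va = (uintptr_t)buf,
		.length = len,
		.iova = iova,
	};

	return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
}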

> > > > +		if (!__alloc_iova_check_used(&allowed_span, length,
> > > > +					     iova_alignment, page_offset))
> > > > +			continue;
> > > > +
> > > > +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> > > > +					    allowed_span.start_used,
> > > > +					    allowed_span.last_used) {
> > > > +			if (!__alloc_iova_check_hole(&area_span, length,
> > > > +						     iova_alignment,
> > > > +						     page_offset))
> > > > +				continue;
> > > > +
> > > > +			interval_tree_for_each_span(&reserved_span,
> > > > +						    &iopt->reserved_itree,
> > > > +						    area_span.start_used,
> > > > +						    area_span.last_used) {
> > > > +				if (!__alloc_iova_check_hole(
> > > > +					    &reserved_span, length,
> > > > +					    iova_alignment, page_offset))
> > > > +					continue;
> > >
> > > this could be simplified by double span.
> > 
> > It is subtly not compatible, the double span looks for used areas.
> > This is looking for a used area in the allowed_itree, a hole in the
> > area_itree, and a hole in the reserved_itree.
> 
> the inner two loops can be replaced by double span, since both
> are skipping used areas.

The 2nd loop is looking for a used on allowed and the 3rd loop is
looking for a hole in reserved. To fix it we'd have to invert allowed
to work like reserved - which complicates the uAPI code.

> > > > +	if (iopt->disable_large_pages)
> > > > +		new_iova_alignment = PAGE_SIZE;
> > > > +	else
> > > > +		new_iova_alignment = 1;
> > >
> > > I didn't understand why we start searching alignment from a
> > > smaller value when large pages is enabled. what is the
> > > connection here?
> > 
> > 'disable_large_pages' is a tiny bit misnamed, what it really does is
> > ensure that every iommu_map call is exactly PAGE_SIZE, not more (large
> > pages) and not less (what this is protecting against).
> > 
> > So if a domain has less than PAGE_SIZE we upgrade to
> > PAGE_SIZE. Otherwise we allow using the lowest possible alignment.
> > 
> > This allows userspace to always work in PAGE_SIZE units without fear
> > of problems, eg with sub-page-size units becoming weird or something.
> 
> above are good stuff in a comment.

This is the comment:

/*
 * This is part of the VFIO compatibility support for VFIO_TYPE1_IOMMU. That
 * mode permits splitting a mapped area up, and then one of the splits is
 * unmapped. Doing this normally would cause us to violate our invariant of
 * pairing map/unmap. Thus, to support old VFIO compatibility disable support
 * for batching consecutive PFNs. All PFNs mapped into the iommu are done in
 * PAGE_SIZE units, not larger or smaller.
 */
static int batch_iommu_map_small(struct iommu_domain *domain,
				 unsigned long iova, phys_addr_t paddr,
				 size_t size, int prot)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree
  2022-11-15 14:14   ` Eric Auger
@ 2022-11-15 16:44     ` Jason Gunthorpe
  0 siblings, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-15 16:44 UTC (permalink / raw)
  To: Eric Auger
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	Kevin Tian, linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Farman, Jason Wang, Jean-Philippe Brucker,
	Joao Martins, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu,
	Keqian Zhu

On Tue, Nov 15, 2022 at 03:14:00PM +0100, Eric Auger wrote:
> > diff --git a/lib/interval_tree.c b/lib/interval_tree.c
> > index 593ce56ece5050..d2882db8fa2a07 100644
> > --- a/lib/interval_tree.c
> > +++ b/lib/interval_tree.c
> > @@ -15,3 +15,135 @@ EXPORT_SYMBOL_GPL(interval_tree_insert);
> >  EXPORT_SYMBOL_GPL(interval_tree_remove);
> >  EXPORT_SYMBOL_GPL(interval_tree_iter_first);
> >  EXPORT_SYMBOL_GPL(interval_tree_iter_next);
> > +
> > +#ifdef CONFIG_INTERVAL_TREE_SPAN_ITER

> Maybe add in a kernel doc that a prerequisite is state.nodes[1] must be
> populated

Sure, let's just move the below comment up a bit:

/*
 * Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a contiguous
 * span of nodes. This makes nodes[0]->last the end of that contiguous used span
 * of indexes that started at the original nodes[1]->start. nodes[1] is now the
 * first node starting the next used span. A hole span is between nodes[0]->last
 * and nodes[1]->start. nodes[1] must be !NULL.
 */

> > +/*
> > + * Advance the iterator index to a specific position. The returned used/hole is
> > + * updated to start at new_index. This is faster than calling
> > + * interval_tree_span_iter_first() as it can avoid full searches in several
> > + * cases where the iterator is already set.
> > + */
> > +void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter,
> > +				     struct rb_root_cached *itree,
> > +				     unsigned long new_index)
> > +{
> > +	if (iter->is_hole == -1)
> > +		return;
> > +
> > +	iter->first_index = new_index;
> check new_index > iter->first_index?

It is odd but it actually works out OK if that is violated. I guess a
WARN_ON would be appropriate but I've avoided adding assertions to
this code..
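
If we did add one it would just be something like this at the top of
interval_tree_span_iter_advance() (illustrative only, not something I
plan to send):

	if (WARN_ON(new_index < iter->first_index))
		return;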

> > +	if (new_index > iter->last_index) {
> > +		iter->is_hole = -1;
> > +		return;
> > +	}
> > +
> > +	/* Rely on the union aliasing hole/used */
> > +	if (iter->start_hole <= new_index && new_index <= iter->last_hole) {
> > +		iter->start_hole = new_index;
> > +		return;
> > +	}
> > +	if (new_index == iter->last_hole + 1)
> > +		interval_tree_span_iter_next(iter);
> > +	else
> > +		interval_tree_span_iter_first(iter, itree, new_index,
> > +					      iter->last_index);

This call will reset iter->first_index to new_index, and even if it is
outside the original bounds everything will work. Of course if the
caller does some 'backwards' advance then they are probably going to
be very sad and likely hit an infinite loop, but that applies to all
kinds of backwards advances, not just going before the original
bounds.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-15 15:05         ` Jason Gunthorpe
@ 2022-11-16  0:09           ` Tian, Kevin
  2022-11-16  0:32             ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Tian, Kevin @ 2022-11-16  0:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 15, 2022 11:05 PM
> 
> On Tue, Nov 15, 2022 at 03:13:56AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, November 15, 2022 2:44 AM
> > >
> > > On Mon, Nov 14, 2022 at 07:28:47AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Tuesday, November 8, 2022 8:49 AM
> > > > >
> > > > > +
> > > > > +/*
> > > > > + * Automatically find a block of IOVA that is not being used and not
> > > reserved.
> > > > > + * Does not return a 0 IOVA even if it is valid.
> > > >
> > > > what is the problem with 0? should this be documented in uAPI?
> > >
> > > 0 is commonly used as an errant value for uninitialized things. We
> > > don't automatically map it into a process mm because it can cause
> > > security problems if we don't trap a bogus 0/NULL pointer reference.
> > >
> > > The same logic applies here too, the allocator should not return 0 to
> > > reserve it as an unmapped IOVA page to catch bugs.
> >
> > CPU doesn't reference IOVA. Where do such bugs exist?
> 
> SW is always buggy and SW programs the DMA address, so it could leave
> a 0 behind or something during the programming.

Address 0 is never a bug in DMA to IOVA. If it is, it will be out of the
aperture or in the reserved IOVA list.

The DMA API is also an auto-iova scheme from the driver's p.o.v., while it
doesn't impose any restriction on address 0.

> > > > > +		if (!__alloc_iova_check_used(&allowed_span, length,
> > > > > +					     iova_alignment, page_offset))
> > > > > +			continue;
> > > > > +
> > > > > +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> > > > > +					    allowed_span.start_used,
> > > > > +					    allowed_span.last_used) {
> > > > > +			if (!__alloc_iova_check_hole(&area_span, length,
> > > > > +						     iova_alignment,
> > > > > +						     page_offset))
> > > > > +				continue;
> > > > > +
> > > > > +			interval_tree_for_each_span(&reserved_span,
> > > > > +						    &iopt->reserved_itree,
> > > > > +						    area_span.start_used,
> > > > > +						    area_span.last_used) {
> > > > > +				if (!__alloc_iova_check_hole(
> > > > > +					    &reserved_span, length,
> > > > > +					    iova_alignment, page_offset))
> > > > > +					continue;
> > > >
> > > > this could be simplified by double span.
> > >
> > > It is subtly not compatible, the double span looks for used areas.
> > > This is looking for a used area in the allowed_itree, a hole in the
> > > area_itree, and a hole in the reserved_itree.
> >
> > the inner two loops can be replaced by double span, since both
> > are skipping used areas.
> 
> The 2nd loop is looking for a used on allowed and the 3rd loop is
> looking for a hole in reserved. To fix it we'd have to invert allowed
> to work like reserved - which complicates the uAPI code.

The 1st loop finds an allowed range which can hold the requested length.

The 2nd loop finds an *unused* hole in the allowed range.

The 3rd loop further looks for a hole in reserved.

The last two both try to find a hole.

> 
> > > > > +	if (iopt->disable_large_pages)
> > > > > +		new_iova_alignment = PAGE_SIZE;
> > > > > +	else
> > > > > +		new_iova_alignment = 1;
> > > >
> > > > I didn't understand why we start searching alignment from a
> > > > smaller value when large pages is enabled. what is the
> > > > connection here?
> > >
> > > 'disable_large_pages' is a tiny bit misnamed, what it really does is
> > > ensure that every iommu_map call is exactly PAGE_SIZE, not more (large
> > > pages) and not less (what this is protecting against).
> > >
> > > So if a domain has less than PAGE_SIZE we upgrade to
> > > PAGE_SIZE. Otherwise we allow using the lowest possible alignment.
> > >
> > > This allows userspace to always work in PAGE_SIZE units without fear
> > > of problems, eg with sub-page-size units becoming weird or something.
> >
> > above are good stuff in a comment.
> 
> This is the comment:
> 
> /*
>  * This is part of the VFIO compatibility support for VFIO_TYPE1_IOMMU. That
>  * mode permits splitting a mapped area up, and then one of the splits is
>  * unmapped. Doing this normally would cause us to violate our invariant of
>  * pairing map/unmap. Thus, to support old VFIO compatibility disable support
>  * for batching consecutive PFNs. All PFNs mapped into the iommu are done in
>  * PAGE_SIZE units, not larger or smaller.
>  */
> static int batch_iommu_map_small(struct iommu_domain *domain,
> 				 unsigned long iova, phys_addr_t paddr,
> 				 size_t size, int prot)
> 

I meant a comment in iopt_calculate_iova_alignment().

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-16  0:09           ` Tian, Kevin
@ 2022-11-16  0:32             ` Jason Gunthorpe
  2022-11-16  2:30               ` Tian, Kevin
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2022-11-16  0:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

On Wed, Nov 16, 2022 at 12:09:52AM +0000, Tian, Kevin wrote:

> > > > 0 is commonly used as an errant value for uninitialized things. We
> > > > don't automatically map it into a process mm because it can cause
> > > > security problems if we don't trap a bogus 0/NULL pointer reference.
> > > >
> > > > The same logic applies here too, the allocator should not return 0 to
> > > > reserve it as an unmapped IOVA page to catch bugs.
> > >
> > > CPU doesn't reference IOVA. Where do such bugs exist?
> > 
> > SW is always buggy and SW programs the DMA address, so it could leave
> > a 0 behind or something during the programming.
> 
> address 0 is never a bug in DMA to IOVA. if it is, it will be out of the
> aperture or in the reserved IOVA list. 

It is a SW bug in the sense that 0 is commonly an uninitialized value or
uninitialized memory.

> DMA API is also a auto-iova scheme from driver p.o.v while it doesn't
> impose any restriction on address 0.

It probably shouldn't do that. It also allocates -1ULL which causes
real bugs too. :(

 
> > > > > > +		if (!__alloc_iova_check_used(&allowed_span, length,
> > > > > > +					     iova_alignment, page_offset))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> > > > > > +					    allowed_span.start_used,
> > > > > > +					    allowed_span.last_used) {
> > > > > > +			if (!__alloc_iova_check_hole(&area_span, length,
> > > > > > +						     iova_alignment,
> > > > > > +						     page_offset))
> > > > > > +				continue;
> > > > > > +
> > > > > > +			interval_tree_for_each_span(&reserved_span,
> > > > > > +						    &iopt->reserved_itree,
> > > > > > +						    area_span.start_used,
> > > > > > +						    area_span.last_used) {
> > > > > > +				if (!__alloc_iova_check_hole(
> > > > > > +					    &reserved_span, length,
> > > > > > +					    iova_alignment, page_offset))
> > > > > > +					continue;
> > > > >
> > > > > this could be simplified by double span.
> > > >
> > > > It is subtly not compatible, the double span looks for used areas.
> > > > This is looking for a used area in the allowed_itree, a hole in the
> > > > area_itree, and a hole in the reserved_itree.
> > >
> > > the inner two loops can be replaced by double span, since both
> > > are skipping used areas.
> > 
> > The 2nd loop is looking for a used on allowed and the 3rd loop is
> > looking for a hole in reserved. To fix it we'd have to invert allowed
> > to work like reserved - which complicates the uAPI code.
> 
> The 1st loop finds an allowed range which can hold requested length
> 
> The 2nd loop finds an *unused* hole in the allowed range
> 
> The 3rd loop further looks for a hole in reserved.
> 
> last two both try to find a hole.

Ooh, OK, I read that in the wrong order - you know, I looked at this
many times to see if it could use the double span..

Ugh, that is a pain, the double_span.h isn't set up for two .c files to
use it.

Anyhow, so like this:

	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
			allowed_span.start_used = PAGE_SIZE;
			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
			allowed_span.is_hole = false;
		}

		if (!__alloc_iova_check_used(&allowed_span, length,
					     iova_alignment, page_offset))
			continue;

		interval_tree_for_each_double_span(
			&used_span, &iopt->reserved_itree, &iopt->area_itree,
			allowed_span.start_used, allowed_span.last_used) {
			if (!__alloc_iova_check_hole(&used_span, length,
						     iova_alignment,
						     page_offset))
				continue;

			*iova = used_span.start_hole;
			return 0;
		}
	}


> > This is the comment:
> > 
> > /*
> >  * This is part of the VFIO compatibility support for VFIO_TYPE1_IOMMU. That
> >  * mode permits splitting a mapped area up, and then one of the splits is
> >  * unmapped. Doing this normally would cause us to violate our invariant of
> >  * pairing map/unmap. Thus, to support old VFIO compatibility disable support
> >  * for batching consecutive PFNs. All PFNs mapped into the iommu are done in
> >  * PAGE_SIZE units, not larger or smaller.
> >  */
> > static int batch_iommu_map_small(struct iommu_domain *domain,
> > 				 unsigned long iova, phys_addr_t paddr,
> > 				 size_t size, int prot)
> > 
> 
> I meant a comment in iopt_calculate_iova_alignment().

How about "see batch_iommu_map_small()"?

Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping
  2022-11-16  0:32             ` Jason Gunthorpe
@ 2022-11-16  2:30               ` Tian, Kevin
  0 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2022-11-16  2:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: bpf, Jonathan Corbet, David Woodhouse, iommu, Joerg Roedel,
	linux-doc, linux-kselftest, llvm, Nathan Chancellor,
	Nick Desaulniers, Miguel Ojeda, Robin Murphy, Shuah Khan,
	Suravee Suthikulpanit, Tom Rix, Will Deacon, Alex Williamson,
	Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, November 16, 2022 8:33 AM
> > > This is the comment:
> > >
> > > /*
> > >  * This is part of the VFIO compatibility support for VFIO_TYPE1_IOMMU. That
> > >  * mode permits splitting a mapped area up, and then one of the splits is
> > >  * unmapped. Doing this normally would cause us to violate our invariant of
> > >  * pairing map/unmap. Thus, to support old VFIO compatibility disable support
> > >  * for batching consecutive PFNs. All PFNs mapped into the iommu are done in
> > >  * PAGE_SIZE units, not larger or smaller.
> > >  */
> > > static int batch_iommu_map_small(struct iommu_domain *domain,
> > > 				 unsigned long iova, phys_addr_t paddr,
> > > 				 size_t size, int prot)
> > >
> >
> > I meant a comment in iopt_calculate_iova_alignment().
> 
> How about "see batch_iommu_map_small()"?
> 

yes, that works.

btw is there a better name to reflect strict PAGE_SIZE-only mapping?

disable_large_pages literally implies only that mappings larger than PAGE_SIZE
are disallowed, which easily raises confusion about why it also rules out
sub-page alignment.
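For reference, a sketch of how that agreed cross-reference might read in
context; the function here is a hypothetical stand-in, not the actual
iopt_calculate_iova_alignment() from the series, which presumably also has to
fold in per-domain and per-access alignment requirements:

	/*
	 * Hypothetical stand-in for the alignment calculation; only the
	 * comment is the point.
	 */
	static unsigned long iopt_min_iova_alignment(bool disable_large_pages)
	{
		/*
		 * See batch_iommu_map_small(): in VFIO type1 compatibility
		 * mode every PFN is mapped in exact PAGE_SIZE units, so the
		 * IOVA must be PAGE_SIZE aligned as well - no larger
		 * alignment is required and no sub-page offset is allowed.
		 */
		if (disable_large_pages)
			return PAGE_SIZE;
		return 1;
	}

Seen that way, the flag effectively means "map in exact PAGE_SIZE units",
which is what the naming question above is getting at.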

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 00/17] IOMMUFD Generic interface
  2022-11-12 12:44   ` Yi Liu
@ 2023-01-10 11:35     ` Shameerali Kolothum Thodi
  2023-01-10 13:49       ` Jason Gunthorpe
  0 siblings, 1 reply; 99+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-01-10 11:35 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe, bpf, Jonathan Corbet, David Woodhouse,
	iommu, Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest,
	llvm, Nathan Chancellor, Nick Desaulniers, Miguel Ojeda,
	Robin Murphy, Shuah Khan, Suravee Suthikulpanit, Tom Rix,
	Will Deacon
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle, zhukeqian



> -----Original Message-----
> From: Yi Liu [mailto:yi.l.liu@intel.com]
> Sent: 12 November 2022 12:45
> Subject: Re: [PATCH v4 00/17] IOMMUFD Generic interface
> 
> Hi Shameer,
> 
> On 2022/11/11 23:51, Shameerali Kolothum Thodi wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> >> Sent: 08 November 2022 00:49
> >> Subject: [PATCH v4 00/17] IOMMUFD Generic interface
> > [...]
> >>
> >> - Userspace page tables aka 'nested translation' for ARM and Intel iommu
> >>    drivers:
> >>    https://github.com/nicolinc/iommufd/commits/iommufd_nesting
> >
> > Hi Eric/Yi/Nicolin,
> >
> > Could you please provide a latest Kernel/Qemu branch for the ARM nesting
> support?
> > The above link points to Yi's git, but not sure which one is latest/stable to
> > have a play.
> 
> Nicolin and I are working on the new version of the nesting support. The below
> kernel branch is our latest progress so far. As the name suggests, it's still
> wip. We also need to work out a Qemu version, so we still need some time
> before sharing it with you.
> 
> https://github.com/yiliu1765/iommufd/tree/wip/iommufd-v6.1-rc3-nesting

Hi Yi,

Thanks for that. I attempted ARM vSVA support based on your above branch
and the related Qemu branch. With a few hacks and additional patches the
prototype code works well on a HiSilicon ARM platform.

Please find the corresponding branches here:
https://github.com/hisilicon/kernel-dev/tree/iommufd-v6.1-rc3-nesting-arm-vSVA
https://github.com/hisilicon/qemu/tree/qemu-iommufd-6.1-rc3-arm-vSVA

Please let me know if there are any more recent branches for ARM support.

Thanks,
Shameer
 
> 
> --
> Regards,
> Yi Liu


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 00/17] IOMMUFD Generic interface
  2023-01-10 11:35     ` Shameerali Kolothum Thodi
@ 2023-01-10 13:49       ` Jason Gunthorpe
  2023-01-10 15:16         ` Joao Martins
  0 siblings, 1 reply; 99+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 13:49 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Yi Liu, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon,
	Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle, zhukeqian

On Tue, Jan 10, 2023 at 11:35:59AM +0000, Shameerali Kolothum Thodi wrote:

> Thanks for that. I attempted ARM vSVA support based on your above branch
> and the related Qemu branch. With a few hacks and additional patches the
> prototype code works well on a HiSilicon ARM platform.

Nice!
 
> Please find the corresponding branches here:
> https://github.com/hisilicon/kernel-dev/tree/iommufd-v6.1-rc3-nesting-arm-vSVA

We need to have a big think about how the PASID/PRI caps should be
working in VFIO..

The whole PRI thing needs to be its own series and needs a careful look.

Can you also look at the dirty tracking stuff? I'd really like to see
that done for the huawei vfio live migration driver

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 00/17] IOMMUFD Generic interface
  2023-01-10 13:49       ` Jason Gunthorpe
@ 2023-01-10 15:16         ` Joao Martins
  2023-01-10 15:18           ` Jason Gunthorpe
  2023-01-10 15:30           ` Shameerali Kolothum Thodi
  0 siblings, 2 replies; 99+ messages in thread
From: Joao Martins @ 2023-01-10 15:16 UTC (permalink / raw)
  To: Jason Gunthorpe, Shameerali Kolothum Thodi
  Cc: Yi Liu, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon,
	Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, zhukeqian

On 10/01/2023 13:49, Jason Gunthorpe wrote:
> Can you also look at the dirty tracking stuff? I'd really like to see
> that done for the huawei vfio live migration driver
> 

He did look and provided comments based on his testing of the initial series
(IIUC from what we spoke about at LPC). v2 should be simpler, though I am still
working out the merging of the unit tests into iommufd.

My plan once I post the v2 is to hand over the SMMUv3 part back to Shameerali,
given that he has hardware with the feature and can iterate more
meaningfully than me.

	Joao

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v4 00/17] IOMMUFD Generic interface
  2023-01-10 15:16         ` Joao Martins
@ 2023-01-10 15:18           ` Jason Gunthorpe
  2023-01-10 15:30           ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 99+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 15:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Shameerali Kolothum Thodi, Yi Liu, bpf, Jonathan Corbet,
	David Woodhouse, iommu, Joerg Roedel, Kevin Tian, linux-doc,
	linux-kselftest, llvm, Nathan Chancellor, Nick Desaulniers,
	Miguel Ojeda, Robin Murphy, Shuah Khan, Suravee Suthikulpanit,
	Tom Rix, Will Deacon, Alex Williamson, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Auger, Eric Farman, Jason Wang, Jean-Philippe Brucker, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, zhukeqian

On Tue, Jan 10, 2023 at 03:16:23PM +0000, Joao Martins wrote:
> On 10/01/2023 13:49, Jason Gunthorpe wrote:
> > Can you also look at the dirty tracking stuff? I'd really like to see
> > that done for the huawei vfio live migration driver
> > 
> 
> He did look and provided comments based on his testing of the initial series
> (IIUC from what we spoke about at LPC). v2 should be simpler, though I am still
> working out the merging of the unit tests into iommufd.
> 
> My plan once I post the v2 is to hand over the SMMUv3 part back to Shameerali,
> given that he has hardware with the feature and can iterate more
> meaningfully than me.

Sounds good

Thanks,
Jason

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v4 00/17] IOMMUFD Generic interface
  2023-01-10 15:16         ` Joao Martins
  2023-01-10 15:18           ` Jason Gunthorpe
@ 2023-01-10 15:30           ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 99+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-01-10 15:30 UTC (permalink / raw)
  To: Joao Martins, Jason Gunthorpe
  Cc: Yi Liu, bpf, Jonathan Corbet, David Woodhouse, iommu,
	Joerg Roedel, Kevin Tian, linux-doc, linux-kselftest, llvm,
	Nathan Chancellor, Nick Desaulniers, Miguel Ojeda, Robin Murphy,
	Shuah Khan, Suravee Suthikulpanit, Tom Rix, Will Deacon,
	Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, Jason Wang,
	Jean-Philippe Brucker, kvm, Matthew Rosato, Michael S. Tsirkin,
	Nicolin Chen, Niklas Schnelle, zhukeqian



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 10 January 2023 15:16
> Subject: Re: [PATCH v4 00/17] IOMMUFD Generic interface
> 
> On 10/01/2023 13:49, Jason Gunthorpe wrote:
> > Can you also look at the dirty tracking stuff? I'd really like to see
> > that done for the huawei vfio live migration driver
> >
> 
> He did look and provided comments based on his testing of the initial series
> (IIUC from what we spoke about at LPC). v2 should be simpler, though I am still
> working out the merging of the unit tests into iommufd.
> 
> My plan once I post the v2 is to hand over the SMMUv3 part back to
> Shameerali, given that he has hardware with the feature and can iterate
> more meaningfully than me.

No problem. Happy to help.

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2023-01-10 15:30 UTC | newest]

Thread overview: 99+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-08  0:48 [PATCH v4 00/17] IOMMUFD Generic interface Jason Gunthorpe
2022-11-08  0:48 ` [PATCH v4 01/17] iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCY Jason Gunthorpe
2022-11-08  0:48 ` [PATCH v4 02/17] iommu: Add device-centric DMA ownership interfaces Jason Gunthorpe
2022-11-11  5:37   ` Tian, Kevin
2022-11-14 16:44     ` Jason Gunthorpe
2022-11-14 13:33   ` Eric Auger
2022-11-14 16:58     ` Jason Gunthorpe
2022-11-08  0:48 ` [PATCH v4 03/17] interval-tree: Add a utility to iterate over spans in an interval tree Jason Gunthorpe
2022-11-15 14:14   ` Eric Auger
2022-11-15 16:44     ` Jason Gunthorpe
2022-11-08  0:48 ` [PATCH v4 04/17] iommufd: Document overview of iommufd Jason Gunthorpe
2022-11-08  3:45   ` Bagas Sanjaya
2022-11-08 17:10   ` [PATCH v4 4/17] " Jason Gunthorpe
2022-11-11  5:59     ` Tian, Kevin
2022-11-14 15:14       ` Jason Gunthorpe
2022-11-10  9:30   ` [PATCH v4 04/17] " Bagas Sanjaya
2022-11-10 14:49     ` Jonathan Corbet
2022-11-10 14:54       ` Jason Gunthorpe
2022-11-10 15:10         ` Jonathan Corbet
2022-11-10 15:23           ` Jason Gunthorpe
2022-11-10 15:28             ` Jonathan Corbet
2022-11-10 15:29               ` Jason Gunthorpe
2022-11-10 15:52                 ` Jonathan Corbet
2022-11-10 16:54                   ` Jason Gunthorpe
2022-11-11  1:46       ` Bagas Sanjaya
2022-11-14 20:50   ` Eric Auger
2022-11-15  0:52     ` Jason Gunthorpe
2022-11-08  0:48 ` [PATCH v4 05/17] iommufd: File descriptor, context, kconfig and makefiles Jason Gunthorpe
2022-11-11  6:07   ` Tian, Kevin
2022-11-08  0:48 ` [PATCH v4 06/17] kernel/user: Allow user::locked_vm to be usable for iommufd Jason Gunthorpe
2022-11-08  0:49 ` [PATCH v4 07/17] iommufd: PFN handling for iopt_pages Jason Gunthorpe
2022-11-11  9:56   ` Tian, Kevin
2022-11-14 17:20     ` Jason Gunthorpe
2022-11-11 11:09   ` Tian, Kevin
2022-11-14 17:24     ` Jason Gunthorpe
2022-11-15  2:59       ` Tian, Kevin
2022-11-08  0:49 ` [PATCH v4 08/17] iommufd: Algorithms for PFN storage Jason Gunthorpe
2022-11-14  5:50   ` Tian, Kevin
2022-11-14 18:02     ` Jason Gunthorpe
2022-11-15  3:06       ` Tian, Kevin
2022-11-15 14:49         ` Jason Gunthorpe
2022-11-14 19:19   ` [PATCH v4 8/17] " Jason Gunthorpe
2022-11-08  0:49 ` [PATCH v4 09/17] iommufd: Data structure to provide IOVA to PFN mapping Jason Gunthorpe
2022-11-14  7:28   ` Tian, Kevin
2022-11-14 18:43     ` Jason Gunthorpe
2022-11-15  3:13       ` Tian, Kevin
2022-11-15 15:05         ` Jason Gunthorpe
2022-11-16  0:09           ` Tian, Kevin
2022-11-16  0:32             ` Jason Gunthorpe
2022-11-16  2:30               ` Tian, Kevin
2022-11-08  0:49 ` [PATCH v4 10/17] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
2022-11-08 13:27   ` Bagas Sanjaya
2022-11-08 17:01     ` Jason Gunthorpe
2022-11-14  7:46   ` Tian, Kevin
2022-11-08  0:49 ` [PATCH v4 11/17] iommufd: Add a HW pagetable object Jason Gunthorpe
2022-11-08  0:49 ` [PATCH v4 12/17] iommufd: Add kAPI toward external drivers for physical devices Jason Gunthorpe
2022-11-08 14:34   ` Yi Liu
2022-11-08 17:57     ` Jason Gunthorpe
2022-11-14  7:59   ` Tian, Kevin
2022-11-08  0:49 ` [PATCH v4 13/17] iommufd: Add kAPI toward external drivers for kernel access Jason Gunthorpe
2022-11-14  8:25   ` Tian, Kevin
2022-11-14 19:05     ` Jason Gunthorpe
2022-11-08  0:49 ` [PATCH v4 14/17] iommufd: vfio container FD ioctl compatibility Jason Gunthorpe
2022-11-08  0:49 ` [PATCH v4 15/17] iommufd: Add a selftest Jason Gunthorpe
2022-11-08  1:01   ` Jason Gunthorpe
2022-11-08  5:48   ` Nicolin Chen
2022-11-08 13:27     ` Jason Gunthorpe
2022-11-09 23:51   ` Matthew Rosato
2022-11-08  0:49 ` [PATCH v4 16/17] iommufd: Add some fault injection points Jason Gunthorpe
2022-11-08  7:25   ` Nicolin Chen
2022-11-08 12:37     ` Jason Gunthorpe
2022-11-08  0:49 ` [PATCH v4 17/17] iommufd: Add additional invariant assertions Jason Gunthorpe
2022-11-08  1:09 ` S390 testing for IOMMUFD Jason Gunthorpe
2022-11-08 10:12   ` Christian Borntraeger
2022-11-08 14:04     ` Anthony Krowiak
2022-11-09 14:49       ` Anthony Krowiak
2022-11-09 16:12         ` Jason Gunthorpe
2022-11-09 19:13           ` Anthony Krowiak
2022-11-09 20:43             ` Jason Gunthorpe
2022-11-09 19:09         ` Anthony Krowiak
2022-11-08 13:50   ` Matthew Rosato
2022-11-08 13:54     ` Jason Gunthorpe
2022-11-08 14:19       ` Eric Farman
2022-11-08 14:37         ` Jason Gunthorpe
2022-11-08 15:29           ` Eric Farman
2022-11-08 19:18             ` Matthew Rosato
2022-11-08 20:04               ` Jason Gunthorpe
2022-11-08 20:17                 ` Eric Farman
2022-11-08 19:34             ` Jason Gunthorpe
2022-11-08 20:07               ` Eric Farman
2022-11-08 20:10                 ` Jason Gunthorpe
2022-11-09  0:41               ` [PATCH 1/2] vfio/ccw: Convert to use vfio_dma_rw() kernel test robot
2022-11-11 15:51 ` [PATCH v4 00/17] IOMMUFD Generic interface Shameerali Kolothum Thodi
2022-11-12 12:44   ` Yi Liu
2023-01-10 11:35     ` Shameerali Kolothum Thodi
2023-01-10 13:49       ` Jason Gunthorpe
2023-01-10 15:16         ` Joao Martins
2023-01-10 15:18           ` Jason Gunthorpe
2023-01-10 15:30           ` Shameerali Kolothum Thodi
