* [PATCH RFC v2 00/13] IOMMUFD Generic interface
@ 2022-09-02 19:59 ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

iommufd is the user API to control the IOMMU subsystem as it relates to
managing IO page tables that point at user space memory.

It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
container) which is the VFIO specific interface for a similar idea.

We see a broad need for extended features, some being highly IOMMU device
specific:
 - Binding iommu_domain's to PASID/SSID
 - Userspace page tables, for ARM, x86 and S390
 - Kernel bypass'd invalidation of user page tables
 - Re-use of the KVM page table in the IOMMU
 - Dirty page tracking in the IOMMU
 - Runtime Increase/Decrease of IOPTE size
 - PRI support with faults resolved in userspace

There is also a need to access these features beyond just VFIO, from VDPA
for instance. Other classes of accelerator HW are touching on these areas
now too.

The pre-v1 series proposed re-using the VFIO type 1 data structure;
however, it was suggested that if we are doing this big update then we
should also come up with an improved data structure that solves the
limitations VFIO type 1 has. Notably this addresses:

 - Multiple IOAS/'containers' and multiple domains inside a single FD

 - Single-pin operation no matter how many domains and containers use
   a page

 - A fine-grained locking scheme supporting user-managed concurrency for
   multi-threaded map/unmap

 - A pre-registration mechanism to optimize vIOMMU use cases by
   pre-pinning pages

 - An extended ioctl API that can manage these new objects and expose
   domains directly to user space

 - Domains that are shareable between subsystems, e.g. VFIO and VDPA

The bulk of this code is a new data structure design to track how the
IOVAs are mapped to PFNs.

iommufd intends to be general and consumable by any driver that wants to
DMA to userspace. From a driver perspective it can largely be dropped in
in place of iommu_attach_device() and provides a uniform, full feature set
to all consumers.
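
As a rough driver-side sketch only, assuming the iommufd_device_bind() /
iommufd_device_attach() / iommufd_device_unbind() helpers added by the kAPI
patches later in this series (signatures approximated here, not
authoritative):

	#include <linux/iommufd.h>

	static int example_enable_user_dma(struct iommufd_ctx *ictx,
					   struct device *dev, u32 ioas_id)
	{
		struct iommufd_device *idev;
		u32 device_id;
		int rc;

		/* Claim DMA ownership and create the iommufd DEVICE object */
		idev = iommufd_device_bind(ictx, dev, &device_id);
		if (IS_ERR(idev))
			return PTR_ERR(idev);

		/* Attach to an IOAS; a compatible iommu_domain is found or created */
		rc = iommufd_device_attach(idev, &ioas_id);
		if (rc)
			iommufd_device_unbind(idev);
		return rc;
	}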

As this is a larger project, this series is the first step: it provides
the iommufd "generic interface", which is designed to be suitable for
applications like DPDK and VMM flows that are not optimized to specific HW
scenarios. It is close to being a drop-in replacement for the existing
VFIO type 1.
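
To give a feel for the generic interface, here is a minimal userspace
sketch. It is illustrative only: the struct and flag names are assumptions
patterned on include/uapi/linux/iommufd.h from this series and may not
match this RFC exactly:

	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	/* Create an IOAS and map a user buffer at a kernel-chosen IOVA */
	static int example_map(void *buf, size_t len, __u64 *iova)
	{
		int iommufd = open("/dev/iommu", O_RDWR);
		struct iommu_ioas_alloc alloc_cmd = { .size = sizeof(alloc_cmd) };
		struct iommu_ioas_map map_cmd = {
			.size = sizeof(map_cmd),
			.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
			.user_va = (uintptr_t)buf,
			.length = len,
		};

		if (iommufd < 0 || ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_cmd))
			return -1;
		map_cmd.ioas_id = alloc_cmd.out_ioas_id;
		if (ioctl(iommufd, IOMMU_IOAS_MAP, &map_cmd))
			return -1;
		*iova = map_cmd.iova;	/* IOVA selected by the kernel */
		return iommufd;
	}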

Several follow-on series are being prepared:

- Patches integrating with qemu in native mode:
  https://github.com/yiliu1765/qemu/commits/qemu-iommufd-6.0-rc2

- A complete integration with VFIO now exists that covers "emulated" mdev
  use cases, and can pass testing with qemu/etc in compatibility mode:
  https://github.com/jgunthorpe/linux/commits/vfio_iommufd

- A draft providing system iommu dirty tracking on top of iommufd,
  including iommu driver implementations:
  https://github.com/jpemartins/linux/commits/x86-iommufd

  This pairs with patches providing a similar API to support VFIO-device
  tracking, giving a complete vfio solution:
  https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com/

- Userspace page tables aka 'nested translation' for ARM and Intel iommu
  drivers:
  https://github.com/nicolinc/iommufd/commits/iommufd_nesting

- "device centric" vfio series to expose the vfio_device FD directly as a
  normal cdev, and provide an extended API allowing dynamically changing
  the IOAS binding:
  https://github.com/yiliu1765/iommufd/commits/iommufd-v6.0-rc2-nesting-0901

- Drafts for PASID and PRI interfaces are included above as well

Overall, enough work is done now to show the merit of the new API design,
with at least draft solutions to many of the main problems.

Several people have contributed directly to this work: Eric Auger, Joao
Martins, Kevin Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have
participated in the discussions that led here and provided ideas. Thanks
to all!

The v1 iommufd series has been used to guide a large amount of preparatory
work that has now been merged. The general theme is to organize things in
a way that makes injecting iommufd natural:

 - VFIO live migration support with mlx5 and hisi_acc drivers.
   These series need a dirty tracking solution to be really usable.
   https://lore.kernel.org/kvm/20220224142024.147653-1-yishaih@nvidia.com/
   https://lore.kernel.org/kvm/20220308184902.2242-1-shameerali.kolothum.thodi@huawei.com/

 - Significantly rework the VFIO gvt mdev and remove struct
   mdev_parent_ops
   https://lore.kernel.org/lkml/20220411141403.86980-1-hch@lst.de/

 - Rework how PCIe no-snoop blocking works
   https://lore.kernel.org/kvm/0-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com/

 - Consolidate dma ownership into the iommu core code
   https://lore.kernel.org/linux-iommu/20220418005000.897664-1-baolu.lu@linux.intel.com/

 - Make all vfio driver interfaces use struct vfio_device consistently
   https://lore.kernel.org/kvm/0-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com/

 - Remove the vfio_group from the kvm/vfio interface
   https://lore.kernel.org/kvm/0-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com/

 - Simplify locking in vfio
   https://lore.kernel.org/kvm/0-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com/

 - Remove the vfio notifier scheme that faces drivers
   https://lore.kernel.org/kvm/0-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com/

 - Improve the driver facing API for vfio pin/unpin pages to make the
   presence of struct page clear
   https://lore.kernel.org/kvm/20220723020256.30081-1-nicolinc@nvidia.com/

 - Clean up in the Intel IOMMU driver
   https://lore.kernel.org/linux-iommu/20220301020159.633356-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220510023407.2759143-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220514014322.2927339-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220706025524.2904370-1-baolu.lu@linux.intel.com/
   https://lore.kernel.org/linux-iommu/20220702015610.2849494-1-baolu.lu@linux.intel.com/

 - Rework s390 vfio drivers
   https://lore.kernel.org/kvm/20220707135737.720765-1-farman@linux.ibm.com/

 - Normalize vfio ioctl handling
   https://lore.kernel.org/kvm/0-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com/

This amounts to about 168 patches applied since March; thank you to
everyone involved in all this work!

Currently there are a number of supporting series still in progress:
 - Simplify and consolidate iommu_domain/device compatibility checking
   https://lore.kernel.org/linux-iommu/20220815181437.28127-1-nicolinc@nvidia.com/

 - Align iommu SVA support with the domain-centric model
   https://lore.kernel.org/linux-iommu/20220826121141.50743-1-baolu.lu@linux.intel.com/

 - VFIO API for dirty tracking (aka dma logging) managed inside a PCI
   device, with mlx5 implementation
   https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com

 - Introduce a struct device sysfs presence for struct vfio_device
   https://lore.kernel.org/kvm/20220901143747.32858-1-kevin.tian@intel.com/

 - Complete restructuring the vfio mdev model
   https://lore.kernel.org/kvm/20220822062208.152745-1-hch@lst.de/

 - DMABUF exporter support for VFIO to allow PCI P2P with VFIO
   https://lore.kernel.org/r/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com

 - Isolate VFIO container code in preparation for iommufd to provide an
   alternative implementation of it all
   https://lore.kernel.org/kvm/0-v1-a805b607f1fb+17b-vfio_container_split_jgg@nvidia.com

 - Start to provide iommu_domain ops for power
   https://lore.kernel.org/all/20220714081822.3717693-1-aik@ozlabs.ru/

Right now there is no more preparatory work sketched out, so this is the
last of it.

This series remains an RFC as there are still several important FIXMEs to
deal with first, but things are on track for non-RFC in the near future.

This is on github: https://github.com/jgunthorpe/linux/commits/iommufd

v2:
 - Rebase to v6.0-rc3
 - Improve comments
 - Change to an iterative destruction approach to avoid cycles
 - Near rewrite of the vfio facing implementation, supported by a complete
   implementation on the vfio side
 - New IOMMU_IOAS_ALLOW_IOVAS API as discussed. It allows userspace to
   assert that ranges of IOVA must always be mappable, to be used by a VMM
   that has promised a guest a certain availability of IOVA. It may help
   guide PPC's multi-window implementation (see the sketch after this
   changelog).
 - Rework how unmap_iova works; the user can now unmap the whole IOAS
 - The no-snoop / wbinvd support is implemented
 - Bug fixes
 - Test suite improvements
 - Lots of smaller changes (the interdiff is 3k lines)
v1: https://lore.kernel.org/r/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
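
For the IOMMU_IOAS_ALLOW_IOVAS item above, a sketch of how a VMM might use
it (reusing the includes from the earlier sketch; the struct layout and
field names are assumptions modelled on the uAPI header and may differ from
this RFC):

	/* Ask the kernel to guarantee these IOVA ranges stay mappable */
	static int example_allow(int iommufd, __u32 ioas_id)
	{
		/* hypothetical guest IOVA window promised to the guest */
		struct iommu_iova_range ranges[] = {
			{ .start = 0x10000, .last = 0xfedfffff },
		};
		struct iommu_ioas_allow_iovas cmd = {
			.size = sizeof(cmd),
			.ioas_id = ioas_id,
			.num_iovas = 1,
			.allowed_iovas = (uintptr_t)ranges,
		};

		/* Fails if a reserved region already intersects the ranges */
		return ioctl(iommufd, IOMMU_IOAS_ALLOW_IOVAS, &cmd);
	}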

# S390 in-kernel page table walker
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>
# AMD Dirty page tracking
Cc: Joao Martins <joao.m.martins@oracle.com>
# ARM SMMU Dirty page tracking
Cc: Keqian Zhu <zhukeqian1@huawei.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
# ARM SMMU nesting
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
# Map/unmap performance
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
# VDPA
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
# Power
Cc: David Gibson <david@gibson.dropbear.id.au>
# vfio
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
# iommu
Cc: iommu@lists.linux.dev
# Collaborators
Cc: "Chaitanya Kulkarni" <chaitanyak@nvidia.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Yi Liu <yi.l.liu@intel.com>
# s390
Cc: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Jason Gunthorpe (12):
  interval-tree: Add a utility to iterate over spans in an interval tree
  iommufd: File descriptor, context, kconfig and makefiles
  kernel/user: Allow user::locked_vm to be usable for iommufd
  iommufd: PFN handling for iopt_pages
  iommufd: Algorithms for PFN storage
  iommufd: Data structure to provide IOVA to PFN mapping
  iommufd: IOCTLs for the io_pagetable
  iommufd: Add a HW pagetable object
  iommufd: Add kAPI toward external drivers for physical devices
  iommufd: Add kAPI toward external drivers for kernel access
  iommufd: vfio container FD ioctl compatibility
  iommufd: Add a selftest

Kevin Tian (1):
  iommufd: Overview documentation

 .clang-format                                 |    1 +
 Documentation/userspace-api/index.rst         |    1 +
 .../userspace-api/ioctl/ioctl-number.rst      |    1 +
 Documentation/userspace-api/iommufd.rst       |  224 +++
 MAINTAINERS                                   |   10 +
 drivers/iommu/Kconfig                         |    1 +
 drivers/iommu/Makefile                        |    2 +-
 drivers/iommu/iommufd/Kconfig                 |   22 +
 drivers/iommu/iommufd/Makefile                |   13 +
 drivers/iommu/iommufd/device.c                |  580 +++++++
 drivers/iommu/iommufd/hw_pagetable.c          |   68 +
 drivers/iommu/iommufd/io_pagetable.c          |  984 ++++++++++++
 drivers/iommu/iommufd/io_pagetable.h          |  186 +++
 drivers/iommu/iommufd/ioas.c                  |  338 ++++
 drivers/iommu/iommufd/iommufd_private.h       |  266 ++++
 drivers/iommu/iommufd/iommufd_test.h          |   74 +
 drivers/iommu/iommufd/main.c                  |  392 +++++
 drivers/iommu/iommufd/pages.c                 | 1301 +++++++++++++++
 drivers/iommu/iommufd/selftest.c              |  626 ++++++++
 drivers/iommu/iommufd/vfio_compat.c           |  423 +++++
 include/linux/interval_tree.h                 |   47 +
 include/linux/iommufd.h                       |  101 ++
 include/linux/sched/user.h                    |    2 +-
 include/uapi/linux/iommufd.h                  |  279 ++++
 kernel/user.c                                 |    1 +
 lib/interval_tree.c                           |   98 ++
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/iommu/.gitignore      |    2 +
 tools/testing/selftests/iommu/Makefile        |   11 +
 tools/testing/selftests/iommu/config          |    2 +
 tools/testing/selftests/iommu/iommufd.c       | 1396 +++++++++++++++++
 31 files changed, 7451 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/userspace-api/iommufd.rst
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/device.c
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/ioas.c
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 drivers/iommu/iommufd/pages.c
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c
 create mode 100644 include/linux/iommufd.h
 create mode 100644 include/uapi/linux/iommufd.h
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c


base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
-- 
2.37.3


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 01/13] interval-tree: Add a utility to iterate over spans in an interval tree
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The span iterator travels over the indexes of the interval_tree, not the
nodes, and classifies spans of indexes as either 'used' or 'hole'.

'used' spans are fully covered by nodes in the tree and 'hole' spans have
no node intersecting the span.

This is done greedily such that spans are maximally sized and every
iteration step switches between used/hole.

As an example a trivial allocator can be written as:

	for (interval_tree_span_iter_first(&span, itree, 0, ULONG_MAX);
	     !interval_tree_span_iter_done(&span);
	     interval_tree_span_iter_next(&span))
		if (span.is_hole &&
		    span.last_hole - span.start_hole >= allocation_size - 1)
			return span.start_hole;

All the tricky boundary conditions are handled by the library code.
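
Equivalently, with the interval_tree_for_each_span() helper added by this
patch, a caller can for instance count the used indexes in a range
(illustrative sketch only):

	struct interval_tree_span_iter span;
	unsigned long used = 0;

	interval_tree_for_each_span(&span, itree, 0, ULONG_MAX)
		if (!span.is_hole)
			used += span.last_used - span.start_used + 1;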

The following iommufd patches have several algorithms, for two of its
interval trees with overlapping nodes, that are significantly simplified by
this kind of iteration primitive. As it seems generally useful, put it
into lib/.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .clang-format                 |  1 +
 include/linux/interval_tree.h | 47 +++++++++++++++++
 lib/interval_tree.c           | 98 +++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+)

diff --git a/.clang-format b/.clang-format
index 1247d54f9e49fa..96d07786dcfb46 100644
--- a/.clang-format
+++ b/.clang-format
@@ -440,6 +440,7 @@ ForEachMacros:
   - 'inet_lhash2_for_each_icsk'
   - 'inet_lhash2_for_each_icsk_continue'
   - 'inet_lhash2_for_each_icsk_rcu'
+  - 'interval_tree_for_each_span'
   - 'intlist__for_each_entry'
   - 'intlist__for_each_entry_safe'
   - 'kcore_copy__for_each_phdr'
diff --git a/include/linux/interval_tree.h b/include/linux/interval_tree.h
index 288c26f50732d7..d52915d451177b 100644
--- a/include/linux/interval_tree.h
+++ b/include/linux/interval_tree.h
@@ -27,4 +27,51 @@ extern struct interval_tree_node *
 interval_tree_iter_next(struct interval_tree_node *node,
 			unsigned long start, unsigned long last);
 
+/*
+ * This iterator travels over spans in an interval tree. It does not return
+ * nodes but classifies each span as either a hole, where no nodes intersect, or
+ * a used, which is fully covered by nodes. Each iteration step toggles between
+ * hole and used until the entire range is covered. The returned spans always
+ * fully cover the requested range.
+ *
+ * The iterator is greedy, it always returns the largest hole or used possible,
+ * consolidating all consecutive nodes.
+ *
+ * Only is_hole, start_hole/used and last_hole/used are part of the external
+ * interface.
+ */
+struct interval_tree_span_iter {
+	struct interval_tree_node *nodes[2];
+	unsigned long first_index;
+	unsigned long last_index;
+	union {
+		unsigned long start_hole;
+		unsigned long start_used;
+	};
+	union {
+		unsigned long last_hole;
+		unsigned long last_used;
+	};
+	/* 0 == used, 1 == is_hole, -1 == done iteration */
+	int is_hole;
+};
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *state,
+				   struct rb_root_cached *itree,
+				   unsigned long first_index,
+				   unsigned long last_index);
+void interval_tree_span_iter_next(struct interval_tree_span_iter *state);
+
+static inline bool
+interval_tree_span_iter_done(struct interval_tree_span_iter *state)
+{
+	return state->is_hole == -1;
+}
+
+#define interval_tree_for_each_span(span, itree, first_index, last_index)      \
+	for (interval_tree_span_iter_first(span, itree,                        \
+					   first_index, last_index);           \
+	     !interval_tree_span_iter_done(span);                              \
+	     interval_tree_span_iter_next(span))
+
 #endif	/* _LINUX_INTERVAL_TREE_H */
diff --git a/lib/interval_tree.c b/lib/interval_tree.c
index 593ce56ece5050..5dff0da020923f 100644
--- a/lib/interval_tree.c
+++ b/lib/interval_tree.c
@@ -15,3 +15,101 @@ EXPORT_SYMBOL_GPL(interval_tree_insert);
 EXPORT_SYMBOL_GPL(interval_tree_remove);
 EXPORT_SYMBOL_GPL(interval_tree_iter_first);
 EXPORT_SYMBOL_GPL(interval_tree_iter_next);
+
+static void
+interval_tree_span_iter_next_gap(struct interval_tree_span_iter *state)
+{
+	struct interval_tree_node *cur = state->nodes[1];
+
+	/*
+	 * Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a
+	 * contiguous span of nodes. This makes nodes[0]->last the end of that
+	 * contiguous span of valid indexes that started at the original
+	 * nodes[1]->start. nodes[1] is now the next node and a hole is between
+	 * nodes[0] and [1].
+	 */
+	state->nodes[0] = cur;
+	do {
+		if (cur->last > state->nodes[0]->last)
+			state->nodes[0] = cur;
+		cur = interval_tree_iter_next(cur, state->first_index,
+					      state->last_index);
+	} while (cur && (state->nodes[0]->last >= cur->start ||
+			 state->nodes[0]->last + 1 == cur->start));
+	state->nodes[1] = cur;
+}
+
+void interval_tree_span_iter_first(struct interval_tree_span_iter *iter,
+				   struct rb_root_cached *itree,
+				   unsigned long first_index,
+				   unsigned long last_index)
+{
+	iter->first_index = first_index;
+	iter->last_index = last_index;
+	iter->nodes[0] = NULL;
+	iter->nodes[1] =
+		interval_tree_iter_first(itree, first_index, last_index);
+	if (!iter->nodes[1]) {
+		/* No nodes intersect the span, whole span is hole */
+		iter->start_hole = first_index;
+		iter->last_hole = last_index;
+		iter->is_hole = 1;
+		return;
+	}
+	if (iter->nodes[1]->start > first_index) {
+		/* Leading hole on first iteration */
+		iter->start_hole = first_index;
+		iter->last_hole = iter->nodes[1]->start - 1;
+		iter->is_hole = 1;
+		interval_tree_span_iter_next_gap(iter);
+		return;
+	}
+
+	/* Starting inside a used */
+	iter->start_used = first_index;
+	iter->is_hole = 0;
+	interval_tree_span_iter_next_gap(iter);
+	iter->last_used = iter->nodes[0]->last;
+	if (iter->last_used >= last_index) {
+		iter->last_used = last_index;
+		iter->nodes[0] = NULL;
+		iter->nodes[1] = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_first);
+
+void interval_tree_span_iter_next(struct interval_tree_span_iter *iter)
+{
+	if (!iter->nodes[0] && !iter->nodes[1]) {
+		iter->is_hole = -1;
+		return;
+	}
+
+	if (iter->is_hole) {
+		iter->start_used = iter->last_hole + 1;
+		iter->last_used = iter->nodes[0]->last;
+		if (iter->last_used >= iter->last_index) {
+			iter->last_used = iter->last_index;
+			iter->nodes[0] = NULL;
+			iter->nodes[1] = NULL;
+		}
+		iter->is_hole = 0;
+		return;
+	}
+
+	if (!iter->nodes[1]) {
+		/* Trailing hole */
+		iter->start_hole = iter->nodes[0]->last + 1;
+		iter->last_hole = iter->last_index;
+		iter->nodes[0] = NULL;
+		iter->is_hole = 1;
+		return;
+	}
+
+	/* must have both nodes[0] and [1], interior hole */
+	iter->start_hole = iter->nodes[0]->last + 1;
+	iter->last_hole = iter->nodes[1]->start - 1;
+	iter->is_hole = 1;
+	interval_tree_span_iter_next_gap(iter);
+}
+EXPORT_SYMBOL_GPL(interval_tree_span_iter_next);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 02/13] iommufd: Overview documentation
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

From: Kevin Tian <kevin.tian@intel.com>

Add iommufd to the documentation tree.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 224 ++++++++++++++++++++++++
 2 files changed, 225 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index a61eac0c73f825..3815f013e4aebd 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
    ebpf/index
    ioctl/index
    iommu
+   iommufd
    media/index
    sysfs-platform_profile
    vduse
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 00000000000000..38035b3822fd23
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,224 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=======
+IOMMUFD
+=======
+
+:Author: Jason Gunthorpe
+:Author: Kevin Tian
+
+Overview
+========
+
+IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
+IO page tables that point at user space memory. It intends to be general and
+consumable by any driver that wants to DMA to userspace. Those drivers are
+expected to deprecate any proprietary IOMMU logic, if existing (e.g.
+vfio_iommu_type1.c).
+
+At minimum iommufd provides a universal support of managing I/O address spaces
+and I/O page tables for all IOMMUs, with room in the design to add non-generic
+features to cater to specific hardware functionality.
+
+In this context the capital letter (IOMMUFD) refers to the subsystem while the
+small letter (iommufd) refers to the file descriptors created via /dev/iommu to
+run the user API over.
+
+Key Concepts
+============
+
+User Visible Objects
+--------------------
+
+Following IOMMUFD objects are exposed to userspace:
+
+- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS) allowing map/unmap
+  of user space memory into ranges of I/O Virtual Address (IOVA).
+
+  The IOAS is a functional replacement for the VFIO container, and like the VFIO
+  container copies its IOVA map to a list of iommu_domains held within it.
+
+- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
+  external driver.
+
+- IOMMUFD_OBJ_HW_PAGETABLE, wrapping an actual hardware I/O page table (i.e. a
+  single struct iommu_domain) managed by the iommu driver.
+
+  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and the
+  IOAS will synchronize its mapping with each member HW_PAGETABLE.
+
+All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
+
+Linkage between user-visible objects and external kernel datastructures are
+reflected by dotted line arrows below, with numbers referring to certain
+operations creating the objects and links::
+
+  _________________________________________________________
+ |                         iommufd                         |
+ |       [1]                                               |
+ |  _________________                                      |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |                                     |
+ | |                 |        [3]                 [2]      |
+ | |                 |    ____________         __________  |
+ | |      IOAS       |<--|            |<------|          | |
+ | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
+ | |                 |   |____________|       |__________| |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |                 |         |                   |       |
+ | |_________________|         |                   |       |
+ |         |                   |                   |       |
+ |_________|___________________|___________________|_______|
+           |                   |                   |
+           |              _____v______      _______v_____
+           | PFN storage |            |    |             |
+           |------------>|iommu_domain|    |struct device|
+                         |____________|    |_____________|
+
+1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. One iommufd can
+   hold multiple IOAS objects. IOAS is the most generic object and does not
+   expose interfaces that are specific to single IOMMU drivers. All operations
+   on the IOAS must operate equally on each of the iommu_domains that are inside
+   it.
+
+2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
+   to bind a device to an iommufd. The external driver is expected to implement
+   proper uAPI for userspace to initiate the binding operation. Successful
+   completion of this operation establishes the desired DMA ownership over the
+   device. The external driver must set driver_managed_dma flag and must not
+   touch the device until this operation succeeds.
+
+3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
+   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
+   allows userspace to initiate the attaching operation. If a compatible
+   pagetable already exists then it is reused for the attachment. Otherwise a
+   new pagetable object (and a new iommu_domain) is created. Successful
+   completion of this operation sets up the linkages among an IOAS, a device and
+   an iommu_domain. Once this completes the device could do DMA.
+
+   Every iommu_domain inside the IOAS is also represented to userspace as a
+   HW_PAGETABLE object.
+
+   NOTE: Future additions to IOMMUFD will provide an API to create and
+   manipulate the HW_PAGETABLE directly.
+
+One device can only bind to one iommufd (due to DMA ownership claim) and attach
+to at most one IOAS object (no support of PASID yet).
+
+Currently only PCI device is allowed.
+
+Kernel Datastructure
+--------------------
+
+User visible objects are backed by following datastructures:
+
+- iommufd_ioas for IOMMUFD_OBJ_IOAS.
+- iommufd_device for IOMMUFD_OBJ_DEVICE.
+- iommufd_hw_pagetable for IOMMUFD_OBJ_HW_PAGETABLE.
+
+Several terminologies when looking at these datastructures:
+
+- Automatic domain, referring to an iommu domain created automatically when
+  attaching a device to an IOAS object. This is compatible to the semantics of
+  VFIO type1.
+
+- Manual domain, referring to an iommu domain designated by the user as the
+  target pagetable to be attached to by a device. Though currently no user API
+  for userspace to directly create such domain, the datastructure and algorithms
+  are ready for that usage.
+
+- In-kernel user, referring to something like a VFIO mdev that is accessing the
+  IOAS and using a 'struct page \*' for CPU based access. Such users require an
+  isolation granularity smaller than what an iommu domain can afford. They must
+  manually enforce the IOAS constraints on DMA buffers before those buffers can
+  be accessed by mdev. Though no kernel API for an external driver to bind a
+  mdev, the datastructure and algorithms are ready for such usage.
+
+iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are
+mapped to memory pages, composed of:
+
+- struct io_pagetable holding the IOVA map
+- struct iopt_areas representing populated portions of IOVA
+- struct iopt_pages representing the storage of PFNs
+- struct iommu_domain representing the IO page table in the IOMMU
+- struct iopt_pages_user representing in-kernel users of PFNs
+- struct xarray pinned_pfns holding a list of pages pinned by
+   in-kernel Users
+
+The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
+represents a logical linear array of full PFNs. PFNs are stored in a tiered
+scheme:
+
+ 1) iopt_pages::pinned_pfns xarray
+ 2) An iommu_domain
+ 3) The origin of the PFNs, i.e. the userspace pointer
+
+PFN have to be copied between all combinations of tiers, depending on the
+configuration (i.e. attached domains and in-kernel users).
+
+An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a
+list of iommu_domains that mirror the IOVA to PFN map.
+
+Multiple io_pagetable's, through their iopt_area's, can share a single
+iopt_pages which avoids multi-pinning and double accounting of page consumption.
+
+iommufd_ioas is sharable between subsystems, e.g. VFIO and VDPA, as long as
+devices managed by different subsystems are bound to a same iommufd.
+
+IOMMUFD User API
+================
+
+.. kernel-doc:: include/uapi/linux/iommufd.h
+
+IOMMUFD Kernel API
+==================
+
+The IOMMUFD kAPI is device-centric with group-related tricks managed behind the
+scene. This allows the external driver calling such kAPI to implement a simple
+device-centric uAPI for connecting its device to an iommufd, instead of
+explicitly imposing the group semantics in its uAPI (as VFIO does).
+
+.. kernel-doc:: drivers/iommu/iommufd/device.c
+   :export:
+
+VFIO and IOMMUFD
+----------------
+
+Connecting VFIO device to iommufd can be done in two approaches.
+
+First is a VFIO compatible way by directly implementing the /dev/vfio/vfio
+container IOCTLs by mapping them into io_pagetable operations. Doing so allows
+the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to
+/dev/iommufd or extending VFIO to SET_CONTAINER using an iommufd instead of a
+container fd.
+
+The second approach directly extends VFIO to support a new set of device-centric
+user API based on aforementioned IOMMUFD kernel API. It requires userspace
+change but better matches the IOMMUFD API semantics and easier to support new
+iommufd features when comparing it to the first approach.
+
+Currently both approaches are still work-in-progress.
+
+There are still a few gaps to be resolved to catch up with VFIO type1, as
+documented in iommufd_vfio_check_extension().
+
+Future TODOs
+============
+
+Currently IOMMUFD supports only kernel-managed I/O page table, similar to VFIO
+type1. New features on the radar include:
+
+ - Binding iommu_domain's to PASID/SSID
+ - Userspace page tables, for ARM, x86 and S390
+ - Kernel bypass'd invalidation of user page tables
+ - Re-use of the KVM page table in the IOMMU
+ - Dirty page tracking in the IOMMU
+ - Runtime Increase/Decrease of IOPTE size
+ - PRI support with faults resolved in userspace
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 03/13] iommufd: File descriptor, context, kconfig and makefiles
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the basic infrastructure of a new miscdevice to hold the iommufd
IOCTL API.

It provides:
 - A miscdevice to create file descriptors to run the IOCTL interface over

 - A table-based ioctl dispatch and centralized extendable pre-validation
   step

 - An xarray mapping user IDs to kernel objects. The design has multiple
   inter-related objects held within a single IOMMUFD fd

 - A simple usage count to build a graph of object relations and protect
   against hostile userspace racing ioctls

The only IOCTL provided in this patch is the generic 'destroy any object
by handle' operation.
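
For illustration, userspace invokes it like any other iommufd ioctl (a sketch
using the uapi added below; 'object_id' is a placeholder and error handling is
omitted):

  int fd = open("/dev/iommu", O_RDWR);
  struct iommu_destroy cmd = {
          .size = sizeof(cmd),    /* first u32 is always the structure size */
          .id = object_id,        /* ID of any previously created object */
  };

  if (ioctl(fd, IOMMU_DESTROY, &cmd))
          perror("IOMMU_DESTROY"); /* EBUSY: something still references it */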

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  10 +
 drivers/iommu/Kconfig                         |   1 +
 drivers/iommu/Makefile                        |   2 +-
 drivers/iommu/iommufd/Kconfig                 |  13 +
 drivers/iommu/iommufd/Makefile                |   5 +
 drivers/iommu/iommufd/iommufd_private.h       | 110 ++++++
 drivers/iommu/iommufd/main.c                  | 345 ++++++++++++++++++
 include/linux/iommufd.h                       |  31 ++
 include/uapi/linux/iommufd.h                  |  55 +++
 10 files changed, 572 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd_private.h
 create mode 100644 drivers/iommu/iommufd/main.c
 create mode 100644 include/linux/iommufd.h
 create mode 100644 include/uapi/linux/iommufd.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 3b985b19f39d12..4387e787411ebe 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -105,6 +105,7 @@ Code  Seq#    Include File                                           Comments
 '8'   all                                                            SNP8023 advanced NIC card
                                                                      <mailto:mcr@solidum.com>
 ';'   64-7F  linux/vfio.h
+';'   80-FF  linux/iommufd.h
 '='   00-3f  uapi/linux/ptp_clock.h                                  <mailto:richardcochran@gmail.com>
 '@'   00-0F  linux/radeonfb.h                                        conflict!
 '@'   00-0F  drivers/video/aty/aty128fb.c                            conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 589517372408ca..abd041f5e00f4c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10609,6 +10609,16 @@ L:	linux-mips@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/sgi/ioc3-eth.c
 
+IOMMU FD
+M:	Jason Gunthorpe <jgg@nvidia.com>
+M:	Kevin Tian <kevin.tian@intel.com>
+L:	iommu@lists.linux-foundation.org
+S:	Maintained
+F:	Documentation/userspace-api/iommufd.rst
+F:	drivers/iommu/iommufd/
+F:	include/uapi/linux/iommufd.h
+F:	include/linux/iommufd.h
+
 IOMAP FILESYSTEM LIBRARY
 M:	Christoph Hellwig <hch@infradead.org>
 M:	Darrick J. Wong <djwong@kernel.org>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 5c5cb5bee8b626..9ff3d2830f9559 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -177,6 +177,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
 	bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 44475a9b3eeaf9..6d2bc288324704 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/
+obj-y += amd/ intel/ arm/ iommufd/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 00000000000000..fddd453bb0e764
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+	tristate "IOMMU Userspace API"
+	select INTERVAL_TREE
+	select IOMMU_API
+	default n
+	help
+	  Provides /dev/iommu the user API to control the IOMMU subsystem as
+	  it relates to managing IO page tables that point at user space memory.
+
+	  This would commonly be used in combination with VFIO.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index 00000000000000..a07a8cffe937c6
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+iommufd-y := \
+	main.o
+
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
new file mode 100644
index 00000000000000..a65208d6442be7
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __IOMMUFD_PRIVATE_H
+#define __IOMMUFD_PRIVATE_H
+
+#include <linux/rwsem.h>
+#include <linux/xarray.h>
+#include <linux/refcount.h>
+#include <linux/uaccess.h>
+
+struct iommufd_ctx {
+	struct file *file;
+	struct xarray objects;
+};
+
+struct iommufd_ctx *iommufd_fget(int fd);
+
+struct iommufd_ucmd {
+	struct iommufd_ctx *ictx;
+	void __user *ubuffer;
+	u32 user_size;
+	void *cmd;
+};
+
+/* Copy the response in ucmd->cmd back to userspace. */
+static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
+				       size_t cmd_len)
+{
+	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
+			 min_t(size_t, ucmd->user_size, cmd_len)))
+		return -EFAULT;
+	return 0;
+}
+
+enum iommufd_object_type {
+	IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+};
+
+/* Base struct for all objects with a userspace ID handle. */
+struct iommufd_object {
+	struct rw_semaphore destroy_rwsem;
+	refcount_t users;
+	enum iommufd_object_type type;
+	unsigned int id;
+};
+
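+/*
+ * Take both protections a live object carries: the read side of destroy_rwsem
+ * and a users reference. Fails if the object is concurrently being destroyed.
+ */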
+static inline bool iommufd_lock_obj(struct iommufd_object *obj)
+{
+	if (!down_read_trylock(&obj->destroy_rwsem))
+		return false;
+	if (!refcount_inc_not_zero(&obj->users)) {
+		up_read(&obj->destroy_rwsem);
+		return false;
+	}
+	return true;
+}
+
+struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
+					  enum iommufd_object_type type);
+static inline void iommufd_put_object(struct iommufd_object *obj)
+{
+	refcount_dec(&obj->users);
+	up_read(&obj->destroy_rwsem);
+}
+
+/**
+ * iommufd_put_object_keep_user() - Release part of the refcount on obj
+ * @obj: Object to release
+ *
+ * Objects have two protections to ensure that userspace has a consistent
+ * experience with destruction. Normally objects are locked so that destroy will
+ * block while there are concurrent users, and wait for the object to be
+ * unlocked.
+ *
+ * However, destroy can also be blocked by holding users reference counts on the
+ * objects; in that case destroy will immediately return EBUSY and will not wait
+ * for the reference counts to go to zero.
+ *
+ * This function releases only the destroy lock while keeping the users refcount,
+ * so a concurrent destroy returns EBUSY rather than waiting.
+ *
+ * It should be used in places where the users will be held beyond a single
+ * system call.
+ */
+static inline void iommufd_put_object_keep_user(struct iommufd_object *obj)
+{
+	up_read(&obj->destroy_rwsem);
+}
+void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj);
+void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
+				      struct iommufd_object *obj);
+void iommufd_object_finalize(struct iommufd_ctx *ictx,
+			     struct iommufd_object *obj);
+bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
+				 struct iommufd_object *obj);
+struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
+					     size_t size,
+					     enum iommufd_object_type type);
+
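+/*
+ * Allocate a driver-level object that embeds a struct iommufd_object named
+ * 'obj' as its first member; the BUILD_BUG_ON_ZERO() enforces that layout so
+ * the base pointer can be converted back with container_of().
+ */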
+#define iommufd_object_alloc(ictx, ptr, type)                                  \
+	container_of(_iommufd_object_alloc(                                    \
+			     ictx,                                             \
+			     sizeof(*(ptr)) + BUILD_BUG_ON_ZERO(               \
+						      offsetof(typeof(*(ptr)), \
+							       obj) != 0),     \
+			     type),                                            \
+		     typeof(*(ptr)), obj)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
new file mode 100644
index 00000000000000..a5b1e2302ba59d
--- /dev/null
+++ b/drivers/iommu/iommufd/main.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ *
+ * iommufd provides control over the IOMMU HW objects created by IOMMU kernel
+ * drivers. IOMMU HW objects revolve around IO page tables that map incoming DMA
+ * addresses (IOVA) to CPU addresses.
+ *
+ * The API is divided into a general portion that is intended to work with any
+ * kernel IOMMU driver, and a device-specific portion that is intended to be
+ * used with a userspace HW driver paired with the specific kernel driver. This
+ * mechanism allows all the unique functionalities in individual IOMMUs to be
+ * exposed to userspace control.
+ */
+#define pr_fmt(fmt) "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/bug.h>
+#include <uapi/linux/iommufd.h>
+#include <linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+struct iommufd_object_ops {
+	void (*destroy)(struct iommufd_object *obj);
+};
+static struct iommufd_object_ops iommufd_object_ops[];
+
+struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
+					     size_t size,
+					     enum iommufd_object_type type)
+{
+	struct iommufd_object *obj;
+	int rc;
+
+	obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+	obj->type = type;
+	init_rwsem(&obj->destroy_rwsem);
+	refcount_set(&obj->users, 1);
+
+	/*
+	 * Reserve an ID in the xarray but do not publish the pointer yet since
+	 * the caller hasn't initialized it yet. Once the pointer is published
+	 * in the xarray and visible to other threads we can't reliably destroy
+	 * it anymore, so the caller must complete all errorable operations
+	 * before calling iommufd_object_finalize().
+	 */
+	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
+		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
+	if (rc)
+		goto out_free;
+	return obj;
+out_free:
+	kfree(obj);
+	return ERR_PTR(rc);
+}
+
+/*
+ * Allow concurrent access to the object. This should only be done once the
+ * system call that created the object is guaranteed to succeed.
+ */
+void iommufd_object_finalize(struct iommufd_ctx *ictx,
+			     struct iommufd_object *obj)
+{
+	void *old;
+
+	old = xa_store(&ictx->objects, obj->id, obj, GFP_KERNEL);
+	/* obj->id was returned from xa_alloc() so the xa_store() cannot fail */
+	WARN_ON(old);
+}
+
+/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */
+void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj)
+{
+	void *old;
+
+	old = xa_erase(&ictx->objects, obj->id);
+	WARN_ON(old);
+	kfree(obj);
+}
+
+/*
+ * Abort an object that has been fully initialized and needs destroy, but has
+ * not been finalized.
+ */
+void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
+				      struct iommufd_object *obj)
+{
+	iommufd_object_ops[obj->type].destroy(obj);
+	iommufd_object_abort(ictx, obj);
+}
+
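+/*
+ * Look up an object of the given type by its userspace ID and lock it against
+ * destruction. The caller must release it with iommufd_put_object() (or the
+ * _keep_user variant).
+ */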
+struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
+					  enum iommufd_object_type type)
+{
+	struct iommufd_object *obj;
+
+	xa_lock(&ictx->objects);
+	obj = xa_load(&ictx->objects, id);
+	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
+	    !iommufd_lock_obj(obj))
+		obj = ERR_PTR(-ENOENT);
+	xa_unlock(&ictx->objects);
+	return obj;
+}
+
+/*
+ * The caller holds a users refcount and wants to destroy the object. Returns
+ * true if the object was destroyed. In all cases the caller no longer has a
+ * reference on obj.
+ */
+bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
+				 struct iommufd_object *obj)
+{
+	/*
+	 * The purpose of the destroy_rwsem is to ensure deterministic
+	 * destruction of objects used by external drivers and destroyed by this
+	 * function. Any temporary increment of the refcount must hold the read
+	 * side of this, such as during ioctl execution.
+	 */
+	down_write(&obj->destroy_rwsem);
+	xa_lock(&ictx->objects);
+	refcount_dec(&obj->users);
+	if (!refcount_dec_if_one(&obj->users)) {
+		xa_unlock(&ictx->objects);
+		up_write(&obj->destroy_rwsem);
+		return false;
+	}
+	__xa_erase(&ictx->objects, obj->id);
+	xa_unlock(&ictx->objects);
+	up_write(&obj->destroy_rwsem);
+
+	iommufd_object_ops[obj->type].destroy(obj);
+	kfree(obj);
+	return true;
+}
+
+static int iommufd_destroy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_destroy *cmd = ucmd->cmd;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	iommufd_put_object_keep_user(obj);
+	if (!iommufd_object_destroy_user(ucmd->ictx, obj))
+		return -EBUSY;
+	return 0;
+}
+
+static int iommufd_fops_open(struct inode *inode, struct file *filp)
+{
+	struct iommufd_ctx *ictx;
+
+	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL_ACCOUNT);
+	if (!ictx)
+		return -ENOMEM;
+
+	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT);
+	ictx->file = filp;
+	filp->private_data = ictx;
+	return 0;
+}
+
+static int iommufd_fops_release(struct inode *inode, struct file *filp)
+{
+	struct iommufd_ctx *ictx = filp->private_data;
+	struct iommufd_object *obj;
+
+	/* Destroy the graph from depth first */
+	while (!xa_empty(&ictx->objects)) {
+		unsigned int destroyed = 0;
+		unsigned long index;
+
+		xa_for_each (&ictx->objects, index, obj) {
+			/*
+			 * Since we are in release elevated users must come from
+			 * other objects holding the users. We will eventually
+			 * destroy the object that holds this one and the next
+			 * pass will progress it.
+			 */
+			if (!refcount_dec_if_one(&obj->users))
+				continue;
+			destroyed++;
+			xa_erase(&ictx->objects, index);
+			iommufd_object_ops[obj->type].destroy(obj);
+			kfree(obj);
+		}
+		/* Bug related to users refcount */
+		if (WARN_ON(!destroyed))
+			break;
+	}
+	kfree(ictx);
+	return 0;
+}
+
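+/*
+ * One union member per ioctl argument struct, so the stack buffer in
+ * iommufd_fops_ioctl() is always large enough for any supported command.
+ */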
+union ucmd_buffer {
+	struct iommu_destroy destroy;
+};
+
+struct iommufd_ioctl_op {
+	unsigned int size;
+	unsigned int min_size;
+	unsigned int ioctl_num;
+	int (*execute)(struct iommufd_ucmd *ucmd);
+};
+
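+/*
+ * Describe one dispatch table entry: _struct is the uapi argument struct and
+ * _last is its final field, giving the minimum size userspace must supply for
+ * the size-based extensibility scheme.
+ */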
+#define IOCTL_OP(_ioctl, _fn, _struct, _last)                                  \
+	[_IOC_NR(_ioctl) - IOMMUFD_CMD_BASE] = {                               \
+		.size = sizeof(_struct) +                                      \
+			BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) <          \
+					  sizeof(_struct)),                    \
+		.min_size = offsetofend(_struct, _last),                       \
+		.ioctl_num = _ioctl,                                           \
+		.execute = _fn,                                                \
+	}
+static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
+	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+};
+
+static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct iommufd_ucmd ucmd = {};
+	struct iommufd_ioctl_op *op;
+	union ucmd_buffer buf;
+	unsigned int nr;
+	int ret;
+
+	ucmd.ictx = filp->private_data;
+	ucmd.ubuffer = (void __user *)arg;
+	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
+	if (ret)
+		return ret;
+
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return -ENOIOCTLCMD;
+	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
+	if (op->ioctl_num != cmd)
+		return -ENOIOCTLCMD;
+	if (ucmd.user_size < op->min_size)
+		return -EOPNOTSUPP;
+
+	ucmd.cmd = &buf;
+	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
+				    ucmd.user_size);
+	if (ret)
+		return ret;
+	ret = op->execute(&ucmd);
+	return ret;
+}
+
+static const struct file_operations iommufd_fops = {
+	.owner = THIS_MODULE,
+	.open = iommufd_fops_open,
+	.release = iommufd_fops_release,
+	.unlocked_ioctl = iommufd_fops_ioctl,
+};
+
+/**
+ * iommufd_ctx_get - Get a context reference
+ * @ictx: Context to get
+ *
+ * The caller must already hold a valid reference to ictx.
+ */
+void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+	get_file(ictx->file);
+}
+EXPORT_SYMBOL_GPL(iommufd_ctx_get);
+
+/**
+ * iommufd_ctx_from_file - Acquires a reference to the iommufd context
+ * @file: File to obtain the reference from
+ *
+ * Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. The struct file
+ * remains owned by the caller and the caller must still do fput. On success
+ * the caller is responsible to call iommufd_ctx_put().
+ */
+struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
+{
+	struct iommufd_ctx *ictx;
+
+	if (file->f_op != &iommufd_fops)
+		return ERR_PTR(-EBADFD);
+	ictx = file->private_data;
+	iommufd_ctx_get(ictx);
+	return ictx;
+}
+EXPORT_SYMBOL_GPL(iommufd_ctx_from_file);
+
+/**
+ * iommufd_ctx_put - Put back a reference
+ * @ictx: Context to put back
+ */
+void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+	fput(ictx->file);
+}
+EXPORT_SYMBOL_GPL(iommufd_ctx_put);
+
+static struct iommufd_object_ops iommufd_object_ops[] = {
+};
+
+static struct miscdevice iommu_misc_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "iommu",
+	.fops = &iommufd_fops,
+	.nodename = "iommu",
+	.mode = 0660,
+};
+
+static int __init iommufd_init(void)
+{
+	int ret;
+
+	ret = misc_register(&iommu_misc_dev);
+	if (ret) {
+		pr_err("Failed to register misc device\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit iommufd_exit(void)
+{
+	misc_deregister(&iommu_misc_dev);
+}
+
+module_init(iommufd_init);
+module_exit(iommufd_exit);
+
+MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 00000000000000..c8bbed542e923c
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+
+struct iommufd_ctx;
+struct file;
+
+void iommufd_ctx_get(struct iommufd_ctx *ictx);
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
+void iommufd_ctx_put(struct iommufd_ctx *ictx);
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
+{
+       return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
new file mode 100644
index 00000000000000..2f7f76ec6db4cb
--- /dev/null
+++ b/include/uapi/linux/iommufd.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_H
+#define _UAPI_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl mechanism follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has a
+ *    non-zero value in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics overflowed.
+ *
+ * As well as additional errnos within specific ioctls.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+#endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 04/13] kernel/user: Allow user::locked_vm to be usable for iommufd
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Following the pattern of io_uring, perf, skb, and bpf, iommufd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the correct accounting to use for ulimit because it is
per-user, and the ulimit is not supposed to be per-process. Other
places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
mm->locked_vm for accounting pinned pages, but this is only per-process
and inconsistent with the majority of the kernel.
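
For reference, the per-user accounting this enables follows a pattern along
these lines (a sketch of the general approach used elsewhere, not the literal
iommufd code; the helper name is illustrative):

  static int account_pinned(struct user_struct *user, unsigned long npages)
  {
          unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
          unsigned long cur = atomic_long_add_return(npages, &user->locked_vm);

          if (cur > lock_limit && !capable(CAP_IPC_LOCK)) {
                  atomic_long_sub(npages, &user->locked_vm);
                  return -ENOMEM;
          }
          return 0;
  }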

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/linux/sched/user.h | 2 +-
 kernel/user.c              | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index f054d0360a7533..4cc52698e214e2 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -25,7 +25,7 @@ struct user_struct {
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
 	defined(CONFIG_NET) || defined(CONFIG_IO_URING) || \
-	defined(CONFIG_VFIO_PCI_ZDEV_KVM)
+	defined(CONFIG_VFIO_PCI_ZDEV_KVM) || IS_ENABLED(CONFIG_IOMMUFD)
 	atomic_long_t locked_vm;
 #endif
 #ifdef CONFIG_WATCH_QUEUE
diff --git a/kernel/user.c b/kernel/user.c
index e2cf8c22b539a7..d667debeafd609 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
 		free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 05/13] iommufd: PFN handling for iopt_pages
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The top of the data structure provides an IO Address Space (IOAS) that is
similar to a VFIO container. The IOAS allows map/unmap of memory into
ranges of IOVA called iopt_areas. Multiple domains and in-kernel
users (like VFIO mdevs) can be attached to the IOAS to access the PFNs
that those IOVA areas cover.

The IO Address Space (IOAS) datastructure is composed of:
 - struct io_pagetable holding the IOVA map
 - struct iopt_areas representing populated portions of IOVA
 - struct iopt_pages representing the storage of PFNs
 - struct iommu_domain representing each IO page table in the system IOMMU
 - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
   mdevs)
 - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
   users

This patch introduces the lowest part of the datastructure - the movement
of PFNs in a tiered storage scheme:
 1) iopt_pages::pinned_pfns xarray
 2) Multiple iommu_domains
 3) The origin of the PFNs, i.e. the userspace pointer

PFNs have to be copied between all combinations of tiers, depending on the
configuration.

The interface is an iterator called a 'pfn_reader' which determines which
tier each PFN is stored in and loads it into a list of PFNs held in a
struct pfn_batch.

Each step of the iterator will fill up the pfn_batch, then the caller can
use the pfn_batch to send the PFNs to the required destination. Repeating
this loop will read all the PFNs in an IOVA range.

The pfn_reader and pfn_batch also keep track of the pinned page accounting.

While PFNs are always stored and accessed as full PAGE_SIZE units, the
iommu_domain tier can store them with a sub-page offset/length to support
IOMMUs with an IOPTE size smaller than PAGE_SIZE.
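
As a usage sketch only (the caller below is illustrative and not part of
this patch, though every function it calls is added here), filling one
destination tier from the reader loop looks like:

/*
 * Illustrative caller, not part of this patch. Assumes pages->mutex is
 * held, as required by pfn_reader_init().
 */
static int example_fill_domain(struct iopt_area *area, struct iopt_pages *pages,
			       struct iommu_domain *domain)
{
	struct pfn_reader pfns;
	int rc;

	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
			      iopt_area_last_index(area));
	if (rc)
		return rc;

	while (!pfn_reader_done(&pfns)) {
		/* Push the current pfn_batch into the destination tier */
		rc = batch_to_domain(&pfns.batch, domain, area,
				     pfns.batch_start_index);
		if (rc)
			break;
		rc = pfn_reader_next(&pfns);
		if (rc)
			break;
	}
	pfn_reader_destroy(&pfns);
	return rc;
}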

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/io_pagetable.h    | 103 ++++
 drivers/iommu/iommufd/iommufd_private.h |  23 +
 drivers/iommu/iommufd/pages.c           | 718 ++++++++++++++++++++++++
 4 files changed, 846 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.h
 create mode 100644 drivers/iommu/iommufd/pages.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index a07a8cffe937c6..05a0e91e30afad 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
-	main.o
+	main.o \
+	pages.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
new file mode 100644
index 00000000000000..24a0f1a9de6197
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ */
+#ifndef __IO_PAGETABLE_H
+#define __IO_PAGETABLE_H
+
+#include <linux/interval_tree.h>
+#include <linux/mutex.h>
+#include <linux/kref.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+
+struct iommu_domain;
+
+/*
+ * Each io_pagetable is composed of intervals of areas which cover regions of
+ * the iova that are backed by something. iova not covered by areas is not
+ * populated in the page table. Each area is fully populated with pages.
+ *
+ * iovas are in byte units, but must be iopt->iova_alignment aligned.
+ *
+ * pages can be NULL, this means some other thread is still working on setting
+ * up or tearing down the area. When observed under the write side of the
+ * domains_rwsem a NULL pages must mean the area is still being set up and no
+ * domains are filled.
+ *
+ * storage_domain points at an arbitrary iommu_domain that is holding the PFNs
+ * for this area. It is locked by the pages->mutex. This simplifies the locking
+ * as the pages code can rely on the storage_domain without having to get the
+ * iopt->domains_rwsem.
+ *
+ * The io_pagetable::iova_rwsem protects node
+ * The iopt_pages::mutex protects pages_node
+ * iopt and iommu_prot are immutable
+ * The pages::mutex protects num_users
+ */
+struct iopt_area {
+	struct interval_tree_node node;
+	struct interval_tree_node pages_node;
+	struct io_pagetable *iopt;
+	struct iopt_pages *pages;
+	struct iommu_domain *storage_domain;
+	/* How many bytes into the first page the area starts */
+	unsigned int page_offset;
+	/* IOMMU_READ, IOMMU_WRITE, etc */
+	int iommu_prot;
+	unsigned int num_users;
+};
+
+static inline unsigned long iopt_area_index(struct iopt_area *area)
+{
+	return area->pages_node.start;
+}
+
+static inline unsigned long iopt_area_last_index(struct iopt_area *area)
+{
+	return area->pages_node.last;
+}
+
+static inline unsigned long iopt_area_iova(struct iopt_area *area)
+{
+	return area->node.start;
+}
+
+static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
+{
+	return area->node.last;
+}
+
+/*
+ * This holds a pinned page list for multiple areas of IO address space. The
+ * pages always originate from a linear chunk of userspace VA. Multiple
+ * io_pagetable's, through their iopt_area's, can share a single iopt_pages
+ * which avoids multi-pinning and double accounting of page consumption.
+ *
+ * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
+ * the start of the uptr and extend to npages. pages are pinned dynamically
+ * according to the intervals in the users_itree and domains_itree, npinned
+ * records the current number of pages pinned.
+ */
+struct iopt_pages {
+	struct kref kref;
+	struct mutex mutex;
+	size_t npages;
+	size_t npinned;
+	size_t last_npinned;
+	struct task_struct *source_task;
+	struct mm_struct *source_mm;
+	struct user_struct *source_user;
+	void __user *uptr;
+	bool writable:1;
+	bool has_cap_ipc_lock:1;
+
+	struct xarray pinned_pfns;
+	/* Of iopt_pages_user::node */
+	struct rb_root_cached users_itree;
+	/* Of iopt_area::pages_node */
+	struct rb_root_cached domains_itree;
+};
+
+#endif
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index a65208d6442be7..47a824897bc222 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,29 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+/*
+ * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
+ * domains and permits sharing of PFNs between io_pagetable instances. This
+ * supports both a design where IOAS's are 1:1 with a domain (eg because the
+ * domain is HW customized), or where the IOAS is 1:N with multiple generic
+ * domains.  The io_pagetable holds an interval tree of iopt_areas which point
+ * to shared iopt_pages which hold the pfns mapped to the page table.
+ *
+ * The locking order is domains_rwsem -> iova_rwsem -> pages::mutex
+ */
+struct io_pagetable {
+	struct rw_semaphore domains_rwsem;
+	struct xarray domains;
+	unsigned int next_domain_id;
+
+	struct rw_semaphore iova_rwsem;
+	struct rb_root_cached area_itree;
+	/* IOVA that cannot become reserved, struct iopt_allowed */
+	struct rb_root_cached allowed_itree;
+	/* IOVA that cannot be allocated, struct iopt_reserved */
+	struct rb_root_cached reserved_itree;
+};
+
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
new file mode 100644
index 00000000000000..a5c369c94b2f11
--- /dev/null
+++ b/drivers/iommu/iommufd/pages.c
@@ -0,0 +1,718 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The iopt_pages is the center of the storage and motion of PFNs. Each
+ * iopt_pages represents a logical linear array of full PFNs. The array is 0
+ * based and has npages in it. Accessors use 'index' to refer to the entry in
+ * this logical array, regardless of its storage location.
+ *
+ * PFNs are stored in a tiered scheme:
+ *  1) iopt_pages::pinned_pfns xarray
+ *  2) An iommu_domain
+ *  3) The origin of the PFNs, i.e. the userspace pointer
+ *
+ * PFNs have to be copied between all combinations of tiers, depending on the
+ * configuration.
+ *
+ * When a PFN is taken out of the userspace pointer it is pinned exactly once.
+ * The storage locations of the PFN's index are tracked in the two interval
+ * trees. If no interval includes the index then it is not pinned.
+ *
+ * If users_itree includes the PFN's index then an in-kernel user has requested
+ * the page. The PFN is stored in the xarray so other requestors can continue to
+ * find it.
+ *
+ * If the domains_itree includes the PFN's index then an iommu_domain is storing
+ * the PFN and it can be read back using iommu_iova_to_phys(). To avoid
+ * duplicating storage the xarray is not used if only iommu_domains are using
+ * the PFN's index.
+ *
+ * As a general principle this is designed so that destroy never fails. This
+ * means removing an iommu_domain or releasing an in-kernel user will not fail
+ * due to insufficient memory. In practice this means some cases have to hold
+ * PFNs in the xarray even though they are also being stored in an iommu_domain.
+ *
+ * While the iopt_pages can use an iommu_domain as storage, it does not have an
+ * IOVA itself. Instead the iopt_area represents a range of IOVA and uses the
+ * iopt_pages as the PFN provider. Multiple iopt_areas can share the iopt_pages
+ * and reference their own slice of the PFN array, with sub page granularity.
+ *
+ * In this file the term 'last' indicates an inclusive and closed interval, eg
+ * [0,0] refers to a single PFN. 'end' means an open range, eg [0,0) refers to
+ * no PFNs.
+ */
+#include <linux/overflow.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+
+#include "io_pagetable.h"
+
+#define TEMP_MEMORY_LIMIT 65536
+#define BATCH_BACKUP_SIZE 32
+
+/*
+ * More memory makes pin_user_pages() and the batching more efficient, but as
+ * this is only a performance optimization don't try too hard to get it. A 64k
+ * allocation can hold about 26M of 4k pages and 13G of 2M pages in a
+ * pfn_batch. Various destroy paths cannot fail and provide a small amount of
+ * stack memory as a backup contingency. If backup_len is given this cannot
+ * fail.
+ */
+static void *temp_kmalloc(size_t *size, void *backup, size_t backup_len)
+{
+	void *res;
+
+	if (*size < backup_len)
+		return backup;
+	*size = min_t(size_t, *size, TEMP_MEMORY_LIMIT);
+	res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+	if (res)
+		return res;
+	*size = PAGE_SIZE;
+	if (backup_len) {
+		res = kmalloc(*size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (res)
+			return res;
+		*size = backup_len;
+		return backup;
+	}
+	return kmalloc(*size, GFP_KERNEL);
+}
+
+static void iopt_pages_add_npinned(struct iopt_pages *pages, size_t npages)
+{
+	int rc;
+
+	rc = check_add_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
+}
+
+static void iopt_pages_sub_npinned(struct iopt_pages *pages, size_t npages)
+{
+	int rc;
+
+	rc = check_sub_overflow(pages->npinned, npages, &pages->npinned);
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(rc || pages->npinned > pages->npages);
+}
+
+/*
+ * index is the number of PAGE_SIZE units from the start of the area's
+ * iopt_pages. If the iova is sub page-size then the area has an iova that
+ * covers a portion of the first and last pages in the range.
+ */
+static unsigned long iopt_area_index_to_iova(struct iopt_area *area,
+					     unsigned long index)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
+	index -= iopt_area_index(area);
+	if (index == 0)
+		return iopt_area_iova(area);
+	return iopt_area_iova(area) - area->page_offset + index * PAGE_SIZE;
+}
+
+static unsigned long iopt_area_index_to_iova_last(struct iopt_area *area,
+						  unsigned long index)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(index < iopt_area_index(area) ||
+			index > iopt_area_last_index(area));
+	if (index == iopt_area_last_index(area))
+		return iopt_area_last_iova(area);
+	return iopt_area_iova(area) - area->page_offset +
+	       (index - iopt_area_index(area) + 1) * PAGE_SIZE - 1;
+}
+
+static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
+			       size_t size)
+{
+	size_t ret;
+
+	ret = iommu_unmap(domain, iova, size);
+	/*
+	 * It is a logic error in this code or a driver bug if the IOMMU unmaps
+	 * something other than exactly as requested.
+	 */
+	WARN_ON(ret != size);
+}
+
+static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
+						     unsigned long index)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(&pages->domains_itree, index, index);
+	if (!node)
+		return NULL;
+	return container_of(node, struct iopt_area, pages_node);
+}
+
+/*
+ * A simple datastructure to hold a vector of PFNs, optimized for contiguous
+ * PFNs. This is used as a temporary holding memory for shuttling pfns from one
+ * place to another. Generally everything is made more efficient if operations
+ * work on the largest possible grouping of pfns. eg fewer lock/unlock cycles,
+ * better cache locality, etc
+ */
+struct pfn_batch {
+	unsigned long *pfns;
+	u16 *npfns;
+	unsigned int array_size;
+	unsigned int end;
+	unsigned int total_pfns;
+};
+
+static void batch_clear(struct pfn_batch *batch)
+{
+	batch->total_pfns = 0;
+	batch->end = 0;
+	batch->pfns[0] = 0;
+	batch->npfns[0] = 0;
+}
+
+static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
+			size_t backup_len)
+{
+	const size_t elmsz = sizeof(*batch->pfns) + sizeof(*batch->npfns);
+	size_t size = max_pages * elmsz;
+
+	batch->pfns = temp_kmalloc(&size, backup, backup_len);
+	if (!batch->pfns)
+		return -ENOMEM;
+	batch->array_size = size / elmsz;
+	batch->npfns = (u16 *)(batch->pfns + batch->array_size);
+	batch_clear(batch);
+	return 0;
+}
+
+static int batch_init(struct pfn_batch *batch, size_t max_pages)
+{
+	return __batch_init(batch, max_pages, NULL, 0);
+}
+
+static void batch_init_backup(struct pfn_batch *batch, size_t max_pages,
+			      void *backup, size_t backup_len)
+{
+	__batch_init(batch, max_pages, backup, backup_len);
+}
+
+static void batch_destroy(struct pfn_batch *batch, void *backup)
+{
+	if (batch->pfns != backup)
+		kfree(batch->pfns);
+}
+
+/* true if the pfn could be added, false otherwise */
+static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+{
+	/* FIXME: U16 is too small */
+	if (batch->end &&
+	    pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
+	    batch->npfns[batch->end - 1] != U16_MAX) {
+		batch->npfns[batch->end - 1]++;
+		batch->total_pfns++;
+		return true;
+	}
+	if (batch->end == batch->array_size)
+		return false;
+	batch->total_pfns++;
+	batch->pfns[batch->end] = pfn;
+	batch->npfns[batch->end] = 1;
+	batch->end++;
+	return true;
+}
+
+/*
+ * Fill the batch with pfns from the domain. When the batch is full, or it
+ * reaches last_index, the function will return. The caller should use
+ * batch->total_pfns to determine the starting point for the next iteration.
+ */
+static void batch_from_domain(struct pfn_batch *batch,
+			      struct iommu_domain *domain,
+			      struct iopt_area *area, unsigned long index,
+			      unsigned long last_index)
+{
+	unsigned int page_offset = 0;
+	unsigned long iova;
+	phys_addr_t phys;
+
+	batch_clear(batch);
+	iova = iopt_area_index_to_iova(area, index);
+	if (index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	while (index <= last_index) {
+		/*
+		 * This is pretty slow, it would be nice to get the page size
+		 * back from the driver, or have the driver directly fill the
+		 * batch.
+		 */
+		phys = iommu_iova_to_phys(domain, iova) - page_offset;
+		if (!batch_add_pfn(batch, PHYS_PFN(phys)))
+			return;
+		iova += PAGE_SIZE - page_offset;
+		page_offset = 0;
+		index++;
+	}
+}
+
+static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
+			   struct iopt_area *area, unsigned long start_index)
+{
+	unsigned long last_iova = iopt_area_last_iova(area);
+	unsigned int page_offset = 0;
+	unsigned long start_iova;
+	unsigned long next_iova;
+	unsigned int cur = 0;
+	unsigned long iova;
+	int rc;
+
+	/* The first index might be a partial page */
+	if (start_index == iopt_area_index(area))
+		page_offset = area->page_offset;
+	next_iova = iova = start_iova =
+		iopt_area_index_to_iova(area, start_index);
+	while (cur < batch->end) {
+		next_iova = min(last_iova + 1,
+				next_iova + batch->npfns[cur] * PAGE_SIZE -
+					page_offset);
+		rc = iommu_map(domain, iova,
+			       PFN_PHYS(batch->pfns[cur]) + page_offset,
+			       next_iova - iova, area->iommu_prot);
+		if (rc)
+			goto out_unmap;
+		iova = next_iova;
+		page_offset = 0;
+		cur++;
+	}
+	return 0;
+out_unmap:
+	if (start_iova != iova)
+		iommu_unmap_nofail(domain, start_iova, iova - start_iova);
+	return rc;
+}
+
+static void batch_from_xarray(struct pfn_batch *batch, struct xarray *xa,
+			      unsigned long start_index,
+			      unsigned long last_index)
+{
+	XA_STATE(xas, xa, start_index);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		if (!batch_add_pfn(batch, xa_to_value(entry)) ||
+		    start_index == last_index)
+			break;
+		start_index++;
+	}
+	rcu_read_unlock();
+}
+
+static void clear_xarray(struct xarray *xa, unsigned long index,
+			 unsigned long last)
+{
+	XA_STATE(xas, xa, index);
+	void *entry;
+
+	xas_lock(&xas);
+	xas_for_each (&xas, entry, last)
+		xas_store(&xas, NULL);
+	xas_unlock(&xas);
+}
+
+static int batch_to_xarray(struct pfn_batch *batch, struct xarray *xa,
+			   unsigned long start_index)
+{
+	XA_STATE(xas, xa, start_index);
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	do {
+		xas_lock(&xas);
+		while (cur < batch->end) {
+			void *old;
+
+			old = xas_store(&xas,
+					xa_mk_value(batch->pfns[cur] + npage));
+			if (xas_error(&xas))
+				break;
+			WARN_ON(old);
+			npage++;
+			if (npage == batch->npfns[cur]) {
+				npage = 0;
+				cur++;
+			}
+			xas_next(&xas);
+		}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	if (xas_error(&xas)) {
+		if (xas.xa_index != start_index)
+			clear_xarray(xa, start_index, xas.xa_index - 1);
+		return xas_error(&xas);
+	}
+	return 0;
+}
+
+static void batch_to_pages(struct pfn_batch *batch, struct page **pages)
+{
+	unsigned int npage = 0;
+	unsigned int cur = 0;
+
+	while (cur < batch->end) {
+		*pages++ = pfn_to_page(batch->pfns[cur] + npage);
+		npage++;
+		if (npage == batch->npfns[cur]) {
+			npage = 0;
+			cur++;
+		}
+	}
+}
+
+static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
+			     size_t npages)
+{
+	struct page **end = pages + npages;
+
+	for (; pages != end; pages++)
+		if (!batch_add_pfn(batch, page_to_pfn(*pages)))
+			break;
+}
+
+static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
+			unsigned int offset, size_t npages)
+{
+	unsigned int cur = 0;
+
+	while (offset) {
+		if (batch->npfns[cur] > offset)
+			break;
+		offset -= batch->npfns[cur];
+		cur++;
+	}
+
+	while (npages) {
+		size_t to_unpin =
+			min_t(size_t, npages, batch->npfns[cur] - offset);
+
+		unpin_user_page_range_dirty_lock(
+			pfn_to_page(batch->pfns[cur] + offset), to_unpin,
+			pages->writable);
+		iopt_pages_sub_npinned(pages, to_unpin);
+		cur++;
+		offset = 0;
+		npages -= to_unpin;
+	}
+}
+
+/*
+ * PFNs are stored in three places, in order of preference:
+ * - The iopt_pages xarray. This is only populated if there is a
+ *   iopt_pages_user
+ * - The iommu_domain under an area
+ * - The original PFN source, ie pages->source_mm
+ *
+ * This iterator reads the pfns optimizing to load according to the
+ * above order.
+ */
+struct pfn_reader {
+	struct iopt_pages *pages;
+	struct interval_tree_span_iter span;
+	struct pfn_batch batch;
+	unsigned long batch_start_index;
+	unsigned long batch_end_index;
+	unsigned long last_index;
+
+	struct page **upages;
+	size_t upages_len;
+	unsigned long upages_start;
+	unsigned long upages_end;
+
+	unsigned int gup_flags;
+};
+
+static void update_unpinned(struct iopt_pages *pages)
+{
+	unsigned long npages = pages->last_npinned - pages->npinned;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->has_cap_ipc_lock) {
+		pages->last_npinned = pages->npinned;
+		return;
+	}
+
+	if (WARN_ON(pages->npinned > pages->last_npinned) ||
+	    WARN_ON(atomic_long_read(&pages->source_user->locked_vm) < npages))
+		return;
+	atomic_long_sub(npages, &pages->source_user->locked_vm);
+	atomic64_sub(npages, &pages->source_mm->pinned_vm);
+	pages->last_npinned = pages->npinned;
+}
+
+/*
+ * Changes in the number of pages pinned are done after the pages have been
+ * read and processed. If the user does not have enough rlimit then the error
+ * unwind will unpin everything that was just pinned.
+ */
+static int update_pinned(struct iopt_pages *pages)
+{
+	unsigned long lock_limit;
+	unsigned long cur_pages;
+	unsigned long new_pages;
+	unsigned long npages;
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (pages->has_cap_ipc_lock) {
+		pages->last_npinned = pages->npinned;
+		return 0;
+	}
+
+	if (pages->npinned == pages->last_npinned)
+		return 0;
+
+	if (pages->npinned < pages->last_npinned) {
+		update_unpinned(pages);
+		return 0;
+	}
+
+	lock_limit =
+		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	npages = pages->npinned - pages->last_npinned;
+	do {
+		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
+		new_pages = cur_pages + npages;
+		if (new_pages > lock_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+	atomic64_add(npages, &pages->source_mm->pinned_vm);
+	pages->last_npinned = pages->npinned;
+	return 0;
+}
+
+static int pfn_reader_pin_pages(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+	unsigned long npages;
+	long rc;
+
+	if (!pfns->upages) {
+		/* All undone in pfn_reader_destroy() */
+		pfns->upages_len =
+			(pfns->last_index - pfns->batch_end_index + 1) *
+			sizeof(*pfns->upages);
+		pfns->upages = temp_kmalloc(&pfns->upages_len, NULL, 0);
+		if (!pfns->upages)
+			return -ENOMEM;
+
+		if (!mmget_not_zero(pages->source_mm)) {
+			kfree(pfns->upages);
+			pfns->upages = NULL;
+			return -EINVAL;
+		}
+		mmap_read_lock(pages->source_mm);
+	}
+
+	npages = min_t(unsigned long,
+		       pfns->span.last_hole - pfns->batch_end_index + 1,
+		       pfns->upages_len / sizeof(*pfns->upages));
+
+	/* FIXME use pin_user_pages_fast() if current == source_mm */
+	rc = pin_user_pages_remote(
+		pages->source_mm,
+		(uintptr_t)(pages->uptr + pfns->batch_end_index * PAGE_SIZE),
+		npages, pfns->gup_flags, pfns->upages, NULL, NULL);
+	if (rc < 0)
+		return rc;
+	if (WARN_ON(!rc))
+		return -EFAULT;
+	iopt_pages_add_npinned(pages, rc);
+	pfns->upages_start = pfns->batch_end_index;
+	pfns->upages_end = pfns->batch_end_index + rc;
+	return 0;
+}
+
+/*
+ * The batch can contain a mixture of pages that are still in use and pages that
+ * need to be unpinned. Unpin only pages that are not held anywhere else.
+ */
+static void iopt_pages_unpin(struct iopt_pages *pages, struct pfn_batch *batch,
+			     unsigned long index, unsigned long last)
+{
+	struct interval_tree_span_iter user_span;
+	struct interval_tree_span_iter area_span;
+
+	lockdep_assert_held(&pages->mutex);
+
+	interval_tree_for_each_span (&user_span, &pages->users_itree, 0, last) {
+		if (!user_span.is_hole)
+			continue;
+
+		interval_tree_for_each_span (&area_span, &pages->domains_itree,
+					     user_span.start_hole,
+					     user_span.last_hole) {
+			if (!area_span.is_hole)
+				continue;
+
+			batch_unpin(batch, pages, area_span.start_hole - index,
+				    area_span.last_hole - area_span.start_hole +
+					    1);
+		}
+	}
+}
+
+/* Process a single span in the users_itree */
+static int pfn_reader_fill_span(struct pfn_reader *pfns)
+{
+	struct interval_tree_span_iter *span = &pfns->span;
+	struct iopt_area *area;
+	int rc;
+
+	if (!span->is_hole) {
+		batch_from_xarray(&pfns->batch, &pfns->pages->pinned_pfns,
+				  pfns->batch_end_index, span->last_used);
+		return 0;
+	}
+
+	/* FIXME: This should consider the entire hole remaining */
+	area = iopt_pages_find_domain_area(pfns->pages, pfns->batch_end_index);
+	if (area) {
+		unsigned int last_index;
+
+		last_index = min(iopt_area_last_index(area), span->last_hole);
+		/* The storage_domain cannot change without the pages mutex */
+		batch_from_domain(&pfns->batch, area->storage_domain, area,
+				  pfns->batch_end_index, last_index);
+		return 0;
+	}
+
+	if (pfns->batch_end_index >= pfns->upages_end) {
+		rc = pfn_reader_pin_pages(pfns);
+		if (rc)
+			return rc;
+	}
+
+	batch_from_pages(&pfns->batch,
+			 pfns->upages +
+				 (pfns->batch_end_index - pfns->upages_start),
+			 pfns->upages_end - pfns->batch_end_index);
+	return 0;
+}
+
+static bool pfn_reader_done(struct pfn_reader *pfns)
+{
+	return pfns->batch_start_index == pfns->last_index + 1;
+}
+
+static int pfn_reader_next(struct pfn_reader *pfns)
+{
+	int rc;
+
+	batch_clear(&pfns->batch);
+	pfns->batch_start_index = pfns->batch_end_index;
+	while (pfns->batch_end_index != pfns->last_index + 1) {
+		rc = pfn_reader_fill_span(pfns);
+		if (rc)
+			return rc;
+		pfns->batch_end_index =
+			pfns->batch_start_index + pfns->batch.total_pfns;
+		if (pfns->batch_end_index != pfns->span.last_used + 1)
+			return 0;
+		interval_tree_span_iter_next(&pfns->span);
+	}
+	return 0;
+}
+
+/*
+ * Adjust the pfn_reader to start at an externally determined hole span in the
+ * users_itree.
+ */
+static int pfn_reader_seek_hole(struct pfn_reader *pfns,
+				struct interval_tree_span_iter *span)
+{
+	pfns->batch_start_index = span->start_hole;
+	pfns->batch_end_index = span->start_hole;
+	pfns->last_index = span->last_hole;
+	pfns->span = *span;
+	return pfn_reader_next(pfns);
+}
+
+static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
+			   unsigned long index, unsigned long last)
+{
+	int rc;
+
+	lockdep_assert_held(&pages->mutex);
+
+	rc = batch_init(&pfns->batch, last - index + 1);
+	if (rc)
+		return rc;
+	pfns->pages = pages;
+	pfns->batch_start_index = index;
+	pfns->batch_end_index = index;
+	pfns->last_index = last;
+	pfns->upages = NULL;
+	pfns->upages_start = 0;
+	pfns->upages_end = 0;
+	interval_tree_span_iter_first(&pfns->span, &pages->users_itree, index,
+				      last);
+
+	if (pages->writable) {
+		pfns->gup_flags = FOLL_LONGTERM | FOLL_WRITE;
+	} else {
+		/* Still need to break COWs on read */
+		pfns->gup_flags = FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE;
+	}
+	return 0;
+}
+
+static void pfn_reader_destroy(struct pfn_reader *pfns)
+{
+	if (pfns->upages) {
+		size_t npages = pfns->upages_end - pfns->batch_end_index;
+
+		mmap_read_unlock(pfns->pages->source_mm);
+		mmput(pfns->pages->source_mm);
+
+		/* Any pages not transferred to the batch are just unpinned */
+		unpin_user_pages(pfns->upages + (pfns->batch_end_index -
+						 pfns->upages_start),
+				 npages);
+		kfree(pfns->upages);
+		pfns->upages = NULL;
+	}
+
+	if (pfns->batch_start_index != pfns->batch_end_index)
+		iopt_pages_unpin(pfns->pages, &pfns->batch,
+				 pfns->batch_start_index,
+				 pfns->batch_end_index - 1);
+	batch_destroy(&pfns->batch, NULL);
+	WARN_ON(pfns->pages->last_npinned != pfns->pages->npinned);
+}
+
+static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
+			    unsigned long index, unsigned long last)
+{
+	int rc;
+
+	rc = pfn_reader_init(pfns, pages, index, last);
+	if (rc)
+		return rc;
+	rc = pfn_reader_next(pfns);
+	if (rc) {
+		pfn_reader_destroy(pfns);
+		return rc;
+	}
+	return 0;
+}
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 06/13] iommufd: Algorithms for PFN storage
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The iopt_pages represents a logical linear list of PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and users.

The three storage tiers are managed to meet these objectives:

 - If no iommu_domain or user exists then minimal memory should be
   consumed by iommufd
 - If a page has been pinned then an iopt_pages will not pin it again
 - If an in-kernel user exists then the xarray must provide the backing
   storage to avoid allocations on domain removals
 - Otherwise any iommu_domain will be used for storage

In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.
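
The per-index lookup priority implied by these objectives can be seen in
pfn_reader_fill_span() from the previous patch; roughly (a sketch of the
decision order only, user_covers() is a hypothetical predicate standing in
for the span iterator and the other variable names are illustrative):

  if (user_covers(index))
          /* 1. An in-kernel user covers the index: PFN is in the xarray */
          batch_from_xarray(&batch, &pages->pinned_pfns, index, last);
  else if ((area = iopt_pages_find_domain_area(pages, index)))
          /* 2. A domain covering the index supplies the PFN */
          batch_from_domain(&batch, area->storage_domain, area, index, last);
  else
          /* 3. Pin the PFNs from the source mm (pin_user_pages) */
          batch_from_pages(&batch, upages, npages);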

The external interface for pages has several logical operations:

  iopt_area_fill_domain() will load the PFNs from storage into a single
  domain. This is used when attaching a new domain to an existing IOAS.

  iopt_area_fill_domains() will load the PFNs from storage into multiple
  domains. This is used when creating a new IOVA map in an existing IOAS.

  iopt_pages_add_user() creates an iopt_pages_user that tracks an in-kernel
  user of PFNs. This is some external driver that might be accessing the
  IOVA using the CPU, or programming PFNs with the DMA API, e.g. a VFIO
  mdev.

  iopt_pages_fill_xarray() will load PFNs into the xarray and return a
  'struct page *' array. It is used by iopt_pages_user's to extract PFNs
  for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
  is known the xarray is already filled.
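
For an in-kernel user the expected call pattern is roughly as follows (a
sketch only: the area/pages lookup, locking and the num_users accounting are
elided, and npages/start_index/last_index are illustrative):

  struct page **out_pages;
  int rc;

  out_pages = kcalloc(npages, sizeof(*out_pages), GFP_KERNEL);
  if (!out_pages)
          return -ENOMEM;

  /* Pin the range and record the user in the users_itree */
  rc = iopt_pages_add_user(pages, start_index, last_index, out_pages,
                           true /* write */);
  if (rc)
          goto out_free;

  /* ... access the memory through out_pages[] ... */

  /* Unpins the range if no other user or domain still needs it */
  iopt_pages_remove_user(area, start_index, last_index);
out_free:
  kfree(out_pages);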

As an iopt_pages can be referred to in slices by many areas and users, it
uses interval trees to keep track of which storage tiers currently hold
the PFNs. On a page-by-page basis any request for a PFN will be satisfied
from one of the storage tiers and the PFN copied to the target domain/array.

Unfill actions are similar: on a page-by-page basis domains are unmapped,
xarray entries freed, or struct pages fully put back.
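
The bookkeeping is two interval trees keyed by page index (users_itree and
domains_itree), walked with the span iterator: a request over [start, last]
decomposes into covered spans and holes, roughly:

  struct interval_tree_span_iter span;

  interval_tree_for_each_span(&span, &pages->users_itree, start, last) {
          if (!span.is_hole) {
                  /* [start_used, last_used] is pinned and in the xarray */
          } else {
                  /* [start_hole, last_hole]: consult domains_itree, else pin */
          }
  }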

Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs
always generates the largest contiguous PFN range possible, and this
gathering can cross storage tier boundaries. For cases like 'fill domains'
care is taken to avoid duplicated work: PFNs are read once and pushed into
all domains.
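
The resulting read loop, shared by the fill paths in this patch, always moves
one contiguous batch at a time (a sketch mirroring iopt_area_fill_domain();
the error unwind is elided):

  rc = pfn_reader_first(&pfns, pages, start_index, last_index);
  if (rc)
          return rc;
  while (!pfn_reader_done(&pfns)) {
          /* push the largest contiguous PFN run gathered so far */
          rc = batch_to_domain(&pfns.batch, domain, area,
                               pfns.batch_start_index);
          if (rc)
                  break;
          rc = pfn_reader_next(&pfns);
          if (rc)
                  break;
  }
  pfn_reader_destroy(&pfns);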

The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.

This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the pfn lists instead of working with one driver call per page
to read, and one driver call per contiguous range to store.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/io_pagetable.h |  71 ++++
 drivers/iommu/iommufd/pages.c        | 579 +++++++++++++++++++++++++++
 2 files changed, 650 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 24a0f1a9de6197..fe3be8dd38240e 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -49,6 +49,14 @@ struct iopt_area {
 	unsigned int num_users;
 };
 
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain);
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain);
+
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
 	return area->pages_node.start;
@@ -69,6 +77,39 @@ static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
 	return area->node.last;
 }
 
+static inline size_t iopt_area_length(struct iopt_area *area)
+{
+	return (area->node.last - area->node.start) + 1;
+}
+
+#define __make_iopt_iter(name)                                                 \
+	static inline struct iopt_##name *iopt_##name##_iter_first(            \
+		struct io_pagetable *iopt, unsigned long start,                \
+		unsigned long last)                                            \
+	{                                                                      \
+		struct interval_tree_node *node;                               \
+                                                                               \
+		lockdep_assert_held(&iopt->iova_rwsem);                        \
+		node = interval_tree_iter_first(&iopt->name##_itree, start,    \
+						last);                         \
+		if (!node)                                                     \
+			return NULL;                                           \
+		return container_of(node, struct iopt_##name, node);           \
+	}                                                                      \
+	static inline struct iopt_##name *iopt_##name##_iter_next(             \
+		struct iopt_##name *last_node, unsigned long start,            \
+		unsigned long last)                                            \
+	{                                                                      \
+		struct interval_tree_node *node;                               \
+                                                                               \
+		node = interval_tree_iter_next(&last_node->node, start, last); \
+		if (!node)                                                     \
+			return NULL;                                           \
+		return container_of(node, struct iopt_##name, node);           \
+	}
+
+__make_iopt_iter(area)
+
 /*
  * This holds a pinned page list for multiple areas of IO address space. The
  * pages always originate from a linear chunk of userspace VA. Multiple
@@ -100,4 +141,34 @@ struct iopt_pages {
 	struct rb_root_cached domains_itree;
 };
 
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable);
+void iopt_release_pages(struct kref *kref);
+static inline void iopt_put_pages(struct iopt_pages *pages)
+{
+	kref_put(&pages->kref, iopt_release_pages);
+}
+
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages);
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages);
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last);
+
+int iopt_pages_add_user(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages,
+			bool write);
+void iopt_pages_remove_user(struct iopt_area *area, unsigned long start,
+			    unsigned long last);
+
+/*
+ * Each interval represents an active iopt_access_pages(), it acts as an
+ * interval lock that keeps the PFNs pinned and stored in the xarray.
+ */
+struct iopt_pages_user {
+	struct interval_tree_node node;
+	refcount_t refcount;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index a5c369c94b2f11..91db42dd6aaeaa 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -140,6 +140,18 @@ static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iopt_area_unmap_domain_range(struct iopt_area *area,
+					 struct iommu_domain *domain,
+					 unsigned long start_index,
+					 unsigned long last_index)
+{
+	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
+
+	iommu_unmap_nofail(domain, start_iova,
+			   iopt_area_index_to_iova_last(area, last_index) -
+				   start_iova + 1);
+}
+
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
 						     unsigned long index)
 {
@@ -716,3 +728,570 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
 	}
 	return 0;
 }
+
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable)
+{
+	struct iopt_pages *pages;
+
+	/*
+	 * The iommu API uses size_t as the length; this also protects the
+	 * DIV_ROUND_UP below from overflow
+	 */
+	if (length > SIZE_MAX - PAGE_SIZE || length == 0)
+		return ERR_PTR(-EINVAL);
+
+	pages = kzalloc(sizeof(*pages), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&pages->kref);
+	xa_init_flags(&pages->pinned_pfns, XA_FLAGS_ACCOUNT);
+	mutex_init(&pages->mutex);
+	pages->source_mm = current->mm;
+	mmgrab(pages->source_mm);
+	pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+	pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+	pages->users_itree = RB_ROOT_CACHED;
+	pages->domains_itree = RB_ROOT_CACHED;
+	pages->writable = writable;
+	pages->has_cap_ipc_lock = capable(CAP_IPC_LOCK);
+	pages->source_task = current->group_leader;
+	get_task_struct(current->group_leader);
+	pages->source_user = get_uid(current_user());
+	return pages;
+}
+
+void iopt_release_pages(struct kref *kref)
+{
+	struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
+
+	WARN_ON(!RB_EMPTY_ROOT(&pages->users_itree.rb_root));
+	WARN_ON(!RB_EMPTY_ROOT(&pages->domains_itree.rb_root));
+	WARN_ON(pages->npinned);
+	WARN_ON(!xa_empty(&pages->pinned_pfns));
+	mmdrop(pages->source_mm);
+	mutex_destroy(&pages->mutex);
+	put_task_struct(pages->source_task);
+	free_uid(pages->source_user);
+	kfree(pages);
+}
+
+/* Quickly guess if the interval tree might fully cover the given interval */
+static bool interval_tree_fully_covers(struct rb_root_cached *root,
+				       unsigned long index, unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_first(root, index, last);
+	if (!node)
+		return false;
+	return node->start <= index && node->last >= last;
+}
+
+static bool interval_tree_fully_covers_area(struct rb_root_cached *root,
+					    struct iopt_area *area)
+{
+	return interval_tree_fully_covers(root, iopt_area_index(area),
+					  iopt_area_last_index(area));
+}
+
+static void __iopt_area_unfill_domain(struct iopt_area *area,
+				      struct iopt_pages *pages,
+				      struct iommu_domain *domain,
+				      unsigned long last_index)
+{
+	unsigned long unmapped_index = iopt_area_index(area);
+	unsigned long cur_index = unmapped_index;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* Fast path if there is obviously something else using every pfn */
+	if (interval_tree_fully_covers_area(&pages->domains_itree, area) ||
+	    interval_tree_fully_covers_area(&pages->users_itree, area)) {
+		iopt_area_unmap_domain_range(area, domain,
+					     iopt_area_index(area), last_index);
+		return;
+	}
+
+	/*
+	 * unmaps must always 'cut' at a place where the pfns are not contiguous
+	 * to pair with the maps that always install contiguous pages. This
+	 * algorithm is efficient in the expected case of few pinners.
+	 */
+	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
+	while (cur_index != last_index + 1) {
+		unsigned long batch_index = cur_index;
+
+		batch_from_domain(&batch, domain, area, cur_index, last_index);
+		cur_index += batch.total_pfns;
+		iopt_area_unmap_domain_range(area, domain, unmapped_index,
+					     cur_index - 1);
+		unmapped_index = cur_index;
+		iopt_pages_unpin(pages, &batch, batch_index, cur_index - 1);
+		batch_clear(&batch);
+	}
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+static void iopt_area_unfill_partial_domain(struct iopt_area *area,
+					    struct iopt_pages *pages,
+					    struct iommu_domain *domain,
+					    unsigned long end_index)
+{
+	if (end_index != iopt_area_index(area))
+		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+}
+
+/**
+ * iopt_unmap_domain() - Unmap without unpinning PFNs in a domain
+ * @iopt: The iopt the domain is part of
+ * @domain: The domain to unmap
+ *
+ * The caller must know that unpinning is not required, usually because there
+ * are other domains in the iopt.
+ */
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain)
+{
+	struct interval_tree_span_iter span;
+
+	interval_tree_for_each_span (&span, &iopt->area_itree, 0, ULONG_MAX)
+		if (!span.is_hole)
+			iommu_unmap_nofail(domain, span.start_used,
+					   span.last_used - span.start_used +
+						   1);
+}
+
+/**
+ * iopt_area_unfill_domain() - Unmap and unpin PFNs in a domain
+ * @area: IOVA area to use
+ * @pages: page supplier for the area (area->pages is NULL)
+ * @domain: Domain to unmap from
+ *
+ * The domain should be removed from the domains_itree before calling. The
+ * domain will always be unmapped, but the PFNs may not be unpinned if there are
+ * still users.
+ */
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain)
+{
+	__iopt_area_unfill_domain(area, pages, domain,
+				  iopt_area_last_index(area));
+}
+
+/**
+ * iopt_area_fill_domain() - Map PFNs from the area into a domain
+ * @area: IOVA area to use
+ * @domain: Domain to load PFNs into
+ *
+ * Read the pfns from the area's underlying iopt_pages and map them into the
+ * given domain. Called when attaching a new domain to an io_pagetable.
+ */
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
+{
+	struct pfn_reader pfns;
+	int rc;
+
+	lockdep_assert_held(&area->pages->mutex);
+
+	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		return rc;
+
+	while (!pfn_reader_done(&pfns)) {
+		rc = batch_to_domain(&pfns.batch, domain, area,
+				     pfns.batch_start_index);
+		if (rc)
+			goto out_unmap;
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+
+	rc = update_pinned(area->pages);
+	if (rc)
+		goto out_unmap;
+	goto out_destroy;
+
+out_unmap:
+	iopt_area_unfill_partial_domain(area, area->pages, domain,
+					pfns.batch_start_index);
+out_destroy:
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+/**
+ * iopt_area_fill_domains() - Install PFNs into the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area creation. The area is freshly created and not inserted in
+ * the domains_itree yet. PFNs are read and loaded into every domain held in the
+ * area's io_pagetable and the area is installed in the domains_itree.
+ *
+ * On failure all domains are left unchanged.
+ */
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct pfn_reader pfns;
+	struct iommu_domain *domain;
+	unsigned long unmap_index;
+	unsigned long index;
+	int rc;
+
+	lockdep_assert_held(&area->iopt->domains_rwsem);
+
+	if (xa_empty(&area->iopt->domains))
+		return 0;
+
+	mutex_lock(&pages->mutex);
+	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		xa_for_each (&area->iopt->domains, index, domain) {
+			rc = batch_to_domain(&pfns.batch, domain, area,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_unmap;
+		}
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+	rc = update_pinned(pages);
+	if (rc)
+		goto out_unmap;
+
+	area->storage_domain = xa_load(&area->iopt->domains, 0);
+	interval_tree_insert(&area->pages_node, &pages->domains_itree);
+	goto out_destroy;
+
+out_unmap:
+	xa_for_each (&area->iopt->domains, unmap_index, domain) {
+		unsigned long end_index = pfns.batch_start_index;
+
+		if (unmap_index <= index)
+			end_index = pfns.batch_end_index;
+
+		/*
+		 * The area is not yet part of the domains_itree so we have to
+		 * manage the unpinning specially. The last domain does the
+		 * unpin, every other domain is just unmapped.
+		 */
+		if (unmap_index != area->iopt->next_domain_id - 1) {
+			if (end_index != iopt_area_index(area))
+				iopt_area_unmap_domain_range(
+					area, domain, iopt_area_index(area),
+					end_index - 1);
+		} else {
+			iopt_area_unfill_partial_domain(area, pages, domain,
+							end_index);
+		}
+	}
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_area_unfill_domains() - unmap PFNs from the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area destruction. This unmaps the iova's covered by all the
+ * area's domains and releases the PFNs.
+ */
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct io_pagetable *iopt = area->iopt;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	mutex_lock(&pages->mutex);
+	if (!area->storage_domain)
+		goto out_unlock;
+
+	xa_for_each (&iopt->domains, index, domain)
+		if (domain != area->storage_domain)
+			iopt_area_unmap_domain_range(
+				area, domain, iopt_area_index(area),
+				iopt_area_last_index(area));
+
+	interval_tree_remove(&area->pages_node, &pages->domains_itree);
+	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	area->storage_domain = NULL;
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
+
+/*
+ * Erase entries in the pinned_pfns xarray that are not covered by any users.
+ * This does not unpin the pages; the caller must deal with the pin reference.
+ * The main purpose of this action is to save memory in the xarray.
+ */
+static void iopt_pages_clean_xarray(struct iopt_pages *pages,
+				    unsigned long index, unsigned long last)
+{
+	struct interval_tree_span_iter span;
+
+	lockdep_assert_held(&pages->mutex);
+
+	interval_tree_for_each_span (&span, &pages->users_itree, index, last)
+		if (span.is_hole)
+			clear_xarray(&pages->pinned_pfns, span.start_hole,
+				     span.last_hole);
+}
+
+/**
+ * iopt_pages_unfill_xarray() - Update the xarray after removing a user
+ * @pages: The pages to act on
+ * @start: Starting PFN index
+ * @last: Last PFN index
+ *
+ * Called when an iopt_pages_user is removed, removes pages from the itree.
+ * The user should already be removed from the users_itree.
+ */
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last)
+{
+	struct interval_tree_span_iter span;
+	struct pfn_batch batch;
+	u64 backup[BATCH_BACKUP_SIZE];
+
+	lockdep_assert_held(&pages->mutex);
+
+	if (interval_tree_fully_covers(&pages->domains_itree, start, last))
+		return iopt_pages_clean_xarray(pages, start, last);
+
+	batch_init_backup(&batch, last + 1, backup, sizeof(backup));
+	interval_tree_for_each_span (&span, &pages->users_itree, start, last) {
+		unsigned long cur_index;
+
+		if (!span.is_hole)
+			continue;
+		cur_index = span.start_hole;
+		while (cur_index != span.last_hole + 1) {
+			batch_from_xarray(&batch, &pages->pinned_pfns,
+					  cur_index, span.last_hole);
+			iopt_pages_unpin(pages, &batch, cur_index,
+					 cur_index + batch.total_pfns - 1);
+			cur_index += batch.total_pfns;
+			batch_clear(&batch);
+		}
+		clear_xarray(&pages->pinned_pfns, span.start_hole,
+			     span.last_hole);
+	}
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+/**
+ * iopt_pages_fill_from_xarray() - Fast path for reading PFNs
+ * @pages: The pages to act on
+ * @start: The first page index in the range
+ * @last: The last page index in the range
+ * @out_pages: The output array to return the pages
+ *
+ * This can be called if the caller is holding a refcount on an iopt_pages_user
+ * that is known to have already been filled. It quickly reads the pages
+ * directly from the xarray.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages)
+{
+	XA_STATE(xas, &pages->pinned_pfns, start);
+	void *entry;
+
+	rcu_read_lock();
+	while (true) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		*(out_pages++) = pfn_to_page(xa_to_value(entry));
+		if (start == last)
+			break;
+		start++;
+	}
+	rcu_read_unlock();
+}
+
+/**
+ * iopt_pages_fill_xarray() - Read PFNs
+ * @pages: The pages to act on
+ * @start: The first page index in the range
+ * @last: The last page index in the range
+ * @out_pages: The output array to return the pages, may be NULL
+ *
+ * This populates the xarray and returns the pages in out_pages. As the slow
+ * path this is able to copy pages from other storage tiers into the xarray.
+ *
+ * On failure the xarray is left unchanged.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages)
+{
+	struct interval_tree_span_iter span;
+	unsigned long xa_end = start;
+	struct pfn_reader pfns;
+	int rc;
+
+	rc = pfn_reader_init(&pfns, pages, start, last);
+	if (rc)
+		return rc;
+
+	interval_tree_for_each_span (&span, &pages->users_itree, start, last) {
+		if (!span.is_hole) {
+			if (out_pages)
+				iopt_pages_fill_from_xarray(
+					pages, span.start_used,
+					span.last_used,
+					out_pages + (span.start_used - start));
+			continue;
+		}
+
+		rc = pfn_reader_seek_hole(&pfns, &span);
+		if (rc)
+			goto out_clean_xa;
+
+		while (!pfn_reader_done(&pfns)) {
+			rc = batch_to_xarray(&pfns.batch, &pages->pinned_pfns,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_clean_xa;
+			batch_to_pages(&pfns.batch, out_pages);
+			xa_end += pfns.batch.total_pfns;
+			out_pages += pfns.batch.total_pfns;
+			rc = pfn_reader_next(&pfns);
+			if (rc)
+				goto out_clean_xa;
+		}
+	}
+
+	rc = update_pinned(pages);
+	if (rc)
+		goto out_clean_xa;
+	pfn_reader_destroy(&pfns);
+	return 0;
+
+out_clean_xa:
+	if (start != xa_end)
+		iopt_pages_unfill_xarray(pages, start, xa_end - 1);
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+static struct iopt_pages_user *
+iopt_pages_get_exact_user(struct iopt_pages *pages, unsigned long index,
+			  unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* There can be overlapping ranges in this interval tree */
+	for (node = interval_tree_iter_first(&pages->users_itree, index, last);
+	     node; node = interval_tree_iter_next(node, index, last))
+		if (node->start == index && node->last == last)
+			return container_of(node, struct iopt_pages_user, node);
+	return NULL;
+}
+
+/**
+ * iopt_pages_add_user() - Record an in-kernel user for PFNs
+ * @pages: The source of PFNs
+ * @start: First page index
+ * @last: Inclusive last page index
+ * @out_pages: Output list of struct page's representing the PFNs
+ * @write: True if the user will write to the pages
+ *
+ * Record that an in-kernel user will be accessing the pages, ensure they are
+ * pinned, and return the PFNs as a simple list of 'struct page *'.
+ *
+ * This should be undone through a matching call to iopt_pages_remove_user()
+ */
+int iopt_pages_add_user(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages, bool write)
+{
+	struct iopt_pages_user *user;
+	int rc;
+
+	if (pages->writable != write)
+		return -EPERM;
+
+	user = iopt_pages_get_exact_user(pages, start, last);
+	if (user) {
+		refcount_inc(&user->refcount);
+		iopt_pages_fill_from_xarray(pages, start, last, out_pages);
+		return 0;
+	}
+
+	user = kzalloc(sizeof(*user), GFP_KERNEL_ACCOUNT);
+	if (!user)
+		return -ENOMEM;
+
+	rc = iopt_pages_fill_xarray(pages, start, last, out_pages);
+	if (rc)
+		goto out_free;
+
+	user->node.start = start;
+	user->node.last = last;
+	refcount_set(&user->refcount, 1);
+	interval_tree_insert(&user->node, &pages->users_itree);
+	return 0;
+
+out_free:
+	kfree(user);
+	return rc;
+}
+
+/**
+ * iopt_pages_remove_user() - Release an in-kernel user for PFNs
+ * @area: The source of PFNs
+ * @start: First page index
+ * @last: Inclusive last page index
+ *
+ * Undo iopt_pages_add_user() and unpin the pages if necessary. The caller must
+ * stop using the PFNs before calling this.
+ */
+void iopt_pages_remove_user(struct iopt_area *area, unsigned long start,
+			    unsigned long last)
+{
+	struct iopt_pages_user *user;
+	struct iopt_pages *pages = area->pages;
+
+	mutex_lock(&pages->mutex);
+	user = iopt_pages_get_exact_user(pages, start, last);
+	if (WARN_ON(!user))
+		goto out_unlock;
+
+	WARN_ON(area->num_users == 0);
+	area->num_users--;
+
+	if (!refcount_dec_and_test(&user->refcount))
+		goto out_unlock;
+
+	interval_tree_remove(&user->node, &pages->users_itree);
+	iopt_pages_unfill_xarray(pages, start, last);
+	kfree(user);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 07/13] iommufd: Data structure to provide IOVA to PFN mapping
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the remainder of the IOAS data structure. Provide an object called
an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
along with a list of iommu_domains that mirror the IOVA to PFN map.

At the top this is a simple interval tree of iopt_areas indicating the map
of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
on the attached domains there is a minimum alignment for areas (which may
be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't
be mapped, and an interval tree of allowed IOVA that is always mappable.
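
Pieced together from the fields referenced in the code, the top level object
looks roughly like this (an approximation for orientation only; the exact
layout comes from the io_pagetable.h changes in this patch):

  struct io_pagetable {
          struct rw_semaphore   domains_rwsem;
          struct xarray         domains;        /* attached iommu_domains */
          unsigned int          next_domain_id;

          struct rw_semaphore   iova_rwsem;
          struct rb_root_cached area_itree;     /* IOVA -> iopt_area */
          struct rb_root_cached allowed_itree;  /* IOVA usable for new areas */
          struct rb_root_cached reserved_itree; /* IOVA that can't be mapped */
          unsigned long         iova_alignment;
  };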

The concept of a 'user' refers to something like a VFIO mdev that is
accessing the IOVA and using a 'struct page *' for CPU-based access.

Externally an API is provided that matches the requirements of the IOCTL
interface for map/unmap and domain attachment.

The API provides a 'copy' primitive to establish a new IOVA map in a
different IOAS from an existing mapping.

This is designed to support a pre-registration flow where userspace would
set up a dummy IOAS with no domains, map in memory and then establish a
user to pin all PFNs into the xarray.

Copy can then be used to create new IOVA mappings in a different IOAS,
with iommu_domains attached. Upon copy the PFNs will be read out of the
xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
overheads.
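
Put together, the flow looks roughly like this; the helper names below are
hypothetical stand-ins for the ioctl/driver plumbing added later in the
series, only the ordering is the point:

  prereg = ioas_alloc(iommufd_fd);        /* IOAS with no iommu_domains */
  ioas_map(iommufd_fd, prereg, base, len, &iova);
                                          /* creates areas, pins nothing yet */
  establish_access(prereg, iova, len);    /* in-kernel user pins all PFNs
                                           * once into the xarray */

  ioas = ioas_alloc(iommufd_fd);          /* IOAS that domains attach to */
  ioas_copy(iommufd_fd, ioas, prereg, iova, len);
                                          /* PFNs read from the xarray and
                                           * mapped, no pin_user_pages() */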

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/io_pagetable.c    | 981 ++++++++++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.h    |  12 +
 drivers/iommu/iommufd/iommufd_private.h |  33 +
 include/linux/iommufd.h                 |   8 +
 5 files changed, 1035 insertions(+)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 05a0e91e30afad..b66a8c47ff55ec 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	io_pagetable.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
new file mode 100644
index 00000000000000..7434bc8b393bbd
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -0,0 +1,981 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The io_pagetable is the top of the data structure that maps IOVAs to PFNs. The
+ * PFNs can be placed into an iommu_domain, or returned to the caller as a page
+ * list for access by an in-kernel user.
+ *
+ * The data structure uses the iopt_pages to optimize the storage of the PFNs
+ * between the domains and xarray.
+ */
+#include <linux/iommufd.h>
+#include <linux/lockdep.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+
+#include "io_pagetable.h"
+
+static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
+					     unsigned long iova)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(iova < iopt_area_iova(area) ||
+			iova > iopt_area_last_iova(area));
+	return (iova - (iopt_area_iova(area) - area->page_offset)) / PAGE_SIZE;
+}
+
+static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
+					      unsigned long iova,
+					      unsigned long last_iova)
+{
+	struct iopt_area *area;
+
+	area = iopt_area_iter_first(iopt, iova, last_iova);
+	if (!area || !area->pages || iopt_area_iova(area) != iova ||
+	    iopt_area_last_iova(area) != last_iova)
+		return NULL;
+	return area;
+}
+
+static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
+		return false;
+
+	span->start_hole = ALIGN(span->start_hole, iova_alignment) |
+			   page_offset;
+	if (span->start_hole > span->last_hole ||
+	    span->last_hole - span->start_hole < length - 1)
+		return false;
+	return true;
+}
+
+static bool __alloc_iova_check_used(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (span->is_hole || span->last_used - span->start_used < length - 1)
+		return false;
+
+	span->start_used = ALIGN(span->start_used, iova_alignment) |
+			   page_offset;
+	if (span->start_used > span->last_used ||
+	    span->last_used - span->start_used < length - 1)
+		return false;
+	return true;
+}
+
+/*
+ * Automatically find a block of IOVA that is not being used and not reserved.
+ * Does not return a 0 IOVA even if it is valid.
+ */
+static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
+			   unsigned long uptr, unsigned long length)
+{
+	struct interval_tree_span_iter reserved_span;
+	unsigned long page_offset = uptr % PAGE_SIZE;
+	struct interval_tree_span_iter allowed_span;
+	struct interval_tree_span_iter area_span;
+	unsigned long iova_alignment;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	/* Protect roundup_pow_of_two() from overflow */
+	if (length == 0 || length >= ULONG_MAX / 2)
+		return -EOVERFLOW;
+
+	/*
+	 * Keep alignment present in the uptr when building the IOVA, this
+	 * increases the chance we can map a THP.
+	 */
+	if (!uptr)
+		iova_alignment = roundup_pow_of_two(length);
+	else
+		iova_alignment =
+			min_t(unsigned long, roundup_pow_of_two(length),
+			      1UL << __ffs64(uptr));
+
+	if (iova_alignment < iopt->iova_alignment)
+		return -EINVAL;
+
+	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
+				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
+		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
+			allowed_span.start_used = PAGE_SIZE;
+			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
+			allowed_span.is_hole = false;
+		}
+
+		if (!__alloc_iova_check_used(&allowed_span, length,
+					     iova_alignment, page_offset))
+			continue;
+
+		interval_tree_for_each_span(&area_span, &iopt->area_itree,
+					    allowed_span.start_used,
+					    allowed_span.last_used) {
+			if (!__alloc_iova_check_hole(&area_span, length,
+						     iova_alignment,
+						     page_offset))
+				continue;
+
+			interval_tree_for_each_span(&reserved_span,
+						    &iopt->reserved_itree,
+						    area_span.start_used,
+						    area_span.last_used) {
+				if (!__alloc_iova_check_hole(
+					    &reserved_span, length,
+					    iova_alignment, page_offset))
+					continue;
+
+				*iova = reserved_span.start_hole;
+				return 0;
+			}
+		}
+	}
+	return -ENOSPC;
+}
+
+/*
+ * The area takes a slice of the pages from start_byte to start_byte + length
+ */
+static struct iopt_area *
+iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
+		unsigned long iova, unsigned long start_byte,
+		unsigned long length, int iommu_prot, unsigned int flags)
+{
+	struct iopt_area *area;
+	int rc;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL_ACCOUNT);
+	if (!area)
+		return ERR_PTR(-ENOMEM);
+
+	area->iopt = iopt;
+	area->iommu_prot = iommu_prot;
+	area->page_offset = start_byte % PAGE_SIZE;
+	area->pages_node.start = start_byte / PAGE_SIZE;
+	if (check_add_overflow(start_byte, length - 1,
+			       &area->pages_node.last)) {
+		rc = -EOVERFLOW;
+		goto out_free;
+	}
+	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
+	if (WARN_ON(area->pages_node.last >= pages->npages)) {
+		rc = -EOVERFLOW;
+		goto out_free;
+	}
+
+	down_write(&iopt->iova_rwsem);
+	if (flags & IOPT_ALLOC_IOVA) {
+		rc = iopt_alloc_iova(iopt, &iova,
+				     (uintptr_t)pages->uptr + start_byte,
+				     length);
+		if (rc)
+			goto out_unlock;
+	}
+
+	if (check_add_overflow(iova, length - 1, &area->node.last)) {
+		rc = -EOVERFLOW;
+		goto out_unlock;
+	}
+
+	if (!(flags & IOPT_ALLOC_IOVA)) {
+		if ((iova & (iopt->iova_alignment - 1)) ||
+		    (length & (iopt->iova_alignment - 1)) || !length) {
+			rc = -EINVAL;
+			goto out_unlock;
+		}
+
+		/* No reserved IOVA intersects the range */
+		if (iopt_reserved_iter_first(iopt, iova, area->node.last)) {
+			rc = -ENOENT;
+			goto out_unlock;
+		}
+
+		/* Check that there is not already a mapping in the range */
+		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
+			rc = -EADDRINUSE;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The area is inserted with a NULL pages indicating it is not fully
+	 * initialized yet.
+	 */
+	area->node.start = iova;
+	interval_tree_insert(&area->node, &area->iopt->area_itree);
+	up_write(&iopt->iova_rwsem);
+	return area;
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+out_free:
+	kfree(area);
+	return ERR_PTR(rc);
+}
+
+static void iopt_abort_area(struct iopt_area *area)
+{
+	down_write(&area->iopt->iova_rwsem);
+	interval_tree_remove(&area->node, &area->iopt->area_itree);
+	up_write(&area->iopt->iova_rwsem);
+	kfree(area);
+}
+
+static int iopt_finalize_area(struct iopt_area *area, struct iopt_pages *pages)
+{
+	int rc;
+
+	down_read(&area->iopt->domains_rwsem);
+	rc = iopt_area_fill_domains(area, pages);
+	if (!rc) {
+		/*
+		 * area->pages must be set inside the domains_rwsem to ensure
+		 * any newly added domains will get filled. Moves the reference
+		 * in from the caller
+		 */
+		down_write(&area->iopt->iova_rwsem);
+		area->pages = pages;
+		up_write(&area->iopt->iova_rwsem);
+	}
+	up_read(&area->iopt->domains_rwsem);
+	return rc;
+}
+
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+		   unsigned long *dst_iova, unsigned long start_bytes,
+		   unsigned long length, int iommu_prot, unsigned int flags)
+{
+	struct iopt_area *area;
+	int rc;
+
+	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
+		return -EPERM;
+
+	area = iopt_alloc_area(iopt, pages, *dst_iova, start_bytes, length,
+			       iommu_prot, flags);
+	if (IS_ERR(area))
+		return PTR_ERR(area);
+	*dst_iova = iopt_area_iova(area);
+
+	rc = iopt_finalize_area(area, pages);
+	if (rc) {
+		iopt_abort_area(area);
+		return rc;
+	}
+	return 0;
+}
+
+/**
+ * iopt_map_user_pages() - Map a user VA to an iova in the io page table
+ * @iopt: io_pagetable to act on
+ * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
+ *        the chosen iova on output. Otherwise is the iova to map to on input
+ * @uptr: User VA to map
+ * @length: Number of bytes to map
+ * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
+ * @flags: IOPT_ALLOC_IOVA or zero
+ *
+ * iova, uptr, and length must be aligned to iova_alignment. For domain backed
+ * page tables this will pin the pages and load them into the domain at iova.
+ * For non-domain page tables this will only setup a lazy reference and the
+ * caller must use iopt_access_pages() to touch them.
+ *
+ * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
+ * destroyed.
+ */
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+			void __user *uptr, unsigned long length, int iommu_prot,
+			unsigned int flags)
+{
+	struct iopt_pages *pages;
+	int rc;
+
+	pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	rc = iopt_map_pages(iopt, pages, iova, uptr - pages->uptr, length,
+			    iommu_prot, flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		return rc;
+	}
+	return 0;
+}
+
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+				  unsigned long *start_byte,
+				  unsigned long length)
+{
+	unsigned long iova_end;
+	struct iopt_pages *pages;
+	struct iopt_area *area;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return ERR_PTR(-EOVERFLOW);
+
+	down_read(&iopt->iova_rwsem);
+	area = iopt_find_exact_area(iopt, iova, iova_end);
+	if (!area) {
+		up_read(&iopt->iova_rwsem);
+		return ERR_PTR(-ENOENT);
+	}
+	pages = area->pages;
+	*start_byte = area->page_offset + iopt_area_index(area) * PAGE_SIZE;
+	kref_get(&pages->kref);
+	up_read(&iopt->iova_rwsem);
+
+	return pages;
+}
+
+static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
+				 unsigned long end, unsigned long *unmapped)
+{
+	struct iopt_area *area;
+	unsigned long unmapped_bytes = 0;
+	int rc = -ENOENT;
+
+	/*
+	 * The domains_rwsem must be held in read mode any time any area->pages
+	 * is NULL. This prevents domain attach/detach from running
+	 * concurrently with cleaning up the area.
+	 */
+	down_read(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	while ((area = iopt_area_iter_first(iopt, start, end))) {
+		unsigned long area_last = iopt_area_last_iova(area);
+		unsigned long area_first = iopt_area_iova(area);
+		struct iopt_pages *pages;
+
+		/* Userspace should not race map/unmap's of the same area */
+		if (!area->pages) {
+			rc = -EBUSY;
+			goto out_unlock_iova;
+		}
+
+		if (area_first < start || area_last > end) {
+			rc = -ENOENT;
+			goto out_unlock_iova;
+		}
+
+		/*
+		 * num_users writers must hold the iova_rwsem too, so we can
+		 * safely read it under the write side of the iova_rwsem
+		 * without the pages->mutex.
+		 */
+		if (area->num_users) {
+			start = area_first;
+			area->prevent_users = true;
+			up_write(&iopt->iova_rwsem);
+			up_read(&iopt->domains_rwsem);
+			/* Later patch calls back to drivers to unmap */
+			return -EBUSY;
+		}
+
+		pages = area->pages;
+		area->pages = NULL;
+		up_write(&iopt->iova_rwsem);
+
+		iopt_area_unfill_domains(area, pages);
+		iopt_abort_area(area);
+		iopt_put_pages(pages);
+
+		unmapped_bytes += area_last - area_first + 1;
+
+		down_write(&iopt->iova_rwsem);
+	}
+	if (unmapped_bytes)
+		rc = 0;
+
+out_unlock_iova:
+	up_write(&iopt->iova_rwsem);
+	up_read(&iopt->domains_rwsem);
+	if (unmapped)
+		*unmapped = unmapped_bytes;
+	return rc;
+}
+
+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ * @unmapped: Return number of bytes unmapped
+ *
+ * The requested range must be a superset of existing ranges.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped)
+{
+	unsigned long iova_end;
+
+	if (!length)
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return -EOVERFLOW;
+
+	return iopt_unmap_iova_range(iopt, iova, iova_end, unmapped);
+}
+
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped)
+{
+	return iopt_unmap_iova_range(iopt, 0, ULONG_MAX, unmapped);
+}
+
+/**
+ * iopt_access_pages() - Return a list of pages under the iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to access
+ * @out_pages: Output page list
+ * @write: True if access is for writing
+ *
+ * Reads @length bytes starting at iova and returns the struct page * pointers.
+ * These can be kmap'd by the caller for CPU access.
+ *
+ * The caller must perform iopt_unaccess_pages() when done to balance this.
+ *
+ * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
+ * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
+ * touch memory outside the requested iova slice.
+ */
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+		      unsigned long length, struct page **out_pages, bool write)
+{
+	unsigned long cur_iova = iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
+	     area = iopt_area_iter_next(area, iova, last_iova)) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+		unsigned long last_index;
+		unsigned long index;
+
+		/* Need contiguous areas in the access */
+		if (iopt_area_iova(area) > cur_iova || !area->pages ||
+		    area->prevent_users) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		index = iopt_area_iova_to_index(area, cur_iova);
+		last_index = iopt_area_iova_to_index(area, last);
+
+		/*
+		 * The API can only return aligned pages, so the starting point
+		 * must be at a page boundary.
+		 */
+		if ((cur_iova - (iopt_area_iova(area) - area->page_offset)) %
+		    PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		/*
+		 * and an interior ending point must be at a page boundary
+		 */
+		if (last != last_iova &&
+		    (iopt_area_last_iova(area) - cur_iova + 1) % PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		mutex_lock(&area->pages->mutex);
+		rc = iopt_pages_add_user(area->pages, index, last_index,
+					 out_pages, write);
+		if (rc) {
+			mutex_unlock(&area->pages->mutex);
+			goto out_remove;
+		}
+		area->num_users++;
+		mutex_unlock(&area->pages->mutex);
+		if (last == last_iova)
+			break;
+		cur_iova = last + 1;
+		out_pages += last_index - index;
+	}
+	if (cur_iova != last_iova)
+		goto out_remove;
+
+	up_read(&iopt->iova_rwsem);
+	return 0;
+
+out_remove:
+	if (cur_iova != iova)
+		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
+	up_read(&iopt->iova_rwsem);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(iopt_access_pages);
+
+/**
+ * iopt_unaccess_pages() - Undo iopt_access_pages
+ * @iopt: io_pagetable to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to stop accessing
+ *
+ * Release the struct pages. The caller must stop accessing them before calling
+ * this. The iova/length must exactly match the iopt_access_pages() call.
+ */
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+			 unsigned long length)
+{
+	unsigned long cur_iova = iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+
+	if (WARN_ON(!length) ||
+	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
+		return;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
+	     area = iopt_area_iter_next(area, iova, last_iova)) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+
+		iopt_pages_remove_user(area,
+				       iopt_area_iova_to_index(area, cur_iova),
+				       iopt_area_iova_to_index(area, last));
+		if (last == last_iova)
+			break;
+		cur_iova = last + 1;
+	}
+	up_read(&iopt->iova_rwsem);
+}
+EXPORT_SYMBOL_GPL(iopt_unaccess_pages);
+
+/* The caller must always free all the nodes in the allowed_iova rb_root. */
+int iopt_set_allow_iova(struct io_pagetable *iopt,
+			struct rb_root_cached *allowed_iova)
+{
+	struct iopt_allowed *allowed;
+
+	down_write(&iopt->iova_rwsem);
+	swap(*allowed_iova, iopt->allowed_itree);
+
+	for (allowed = iopt_allowed_iter_first(iopt, 0, ULONG_MAX); allowed;
+	     allowed = iopt_allowed_iter_next(allowed, 0, ULONG_MAX)) {
+		if (iopt_reserved_iter_first(iopt, allowed->node.start,
+					     allowed->node.last)) {
+			swap(*allowed_iova, iopt->allowed_itree);
+			up_write(&iopt->iova_rwsem);
+			return -EADDRINUSE;
+		}
+	}
+	up_write(&iopt->iova_rwsem);
+	return 0;
+}
+
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner)
+{
+	struct iopt_reserved *reserved;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	if (iopt_area_iter_first(iopt, start, last) ||
+	    iopt_allowed_iter_first(iopt, start, last))
+		return -EADDRINUSE;
+
+	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL_ACCOUNT);
+	if (!reserved)
+		return -ENOMEM;
+	reserved->node.start = start;
+	reserved->node.last = last;
+	reserved->owner = owner;
+	interval_tree_insert(&reserved->node, &iopt->reserved_itree);
+	return 0;
+}
+
+static void __iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+	struct iopt_reserved *reserved, *next;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	for (reserved = iopt_reserved_iter_first(iopt, 0, ULONG_MAX);
+	     reserved; reserved = next) {
+		next = iopt_reserved_iter_next(reserved, 0, ULONG_MAX);
+
+		if (reserved->owner == owner) {
+			interval_tree_remove(&reserved->node,
+					     &iopt->reserved_itree);
+			kfree(reserved);
+		}
+	}
+}
+
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+	down_write(&iopt->iova_rwsem);
+	__iopt_remove_reserved_iova(iopt, owner);
+	up_write(&iopt->iova_rwsem);
+}
+
+int iopt_init_table(struct io_pagetable *iopt)
+{
+	init_rwsem(&iopt->iova_rwsem);
+	init_rwsem(&iopt->domains_rwsem);
+	iopt->area_itree = RB_ROOT_CACHED;
+	iopt->allowed_itree = RB_ROOT_CACHED;
+	iopt->reserved_itree = RB_ROOT_CACHED;
+	xa_init_flags(&iopt->domains, XA_FLAGS_ACCOUNT);
+
+	/*
+	 * iopts start as SW tables that can use the entire size_t IOVA space
+	 * due to the use of size_t in the APIs. They have no alignment
+	 * restriction.
+	 */
+	iopt->iova_alignment = 1;
+
+	return 0;
+}
+
+void iopt_destroy_table(struct io_pagetable *iopt)
+{
+	struct interval_tree_node *node;
+
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		iopt_remove_reserved_iova(iopt, NULL);
+
+	while ((node = interval_tree_iter_first(&iopt->allowed_itree, 0,
+						ULONG_MAX))) {
+		interval_tree_remove(node, &iopt->allowed_itree);
+		kfree(container_of(node, struct iopt_allowed, node));
+	}
+
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_itree.rb_root));
+	WARN_ON(!xa_empty(&iopt->domains));
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
+}
+
+/**
+ * iopt_unfill_domain() - Unfill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to unfill
+ *
+ * This is used when removing a domain from the iopt. Every area in the iopt
+ * will be unmapped from the domain. The domain must already be removed from the
+ * domains xarray.
+ */
+static void iopt_unfill_domain(struct io_pagetable *iopt,
+			       struct iommu_domain *domain)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	/*
+	 * Some other domain is holding all the pfns still, rapidly unmap this
+	 * domain.
+	 */
+	if (iopt->next_domain_id != 0) {
+		/* Pick an arbitrary remaining domain to act as storage */
+		struct iommu_domain *storage_domain =
+			xa_load(&iopt->domains, 0);
+
+		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			struct iopt_pages *pages = area->pages;
+
+			if (!pages)
+				continue;
+
+			mutex_lock(&pages->mutex);
+			if (area->storage_domain != domain) {
+				mutex_unlock(&pages->mutex);
+				continue;
+			}
+			area->storage_domain = storage_domain;
+			mutex_unlock(&pages->mutex);
+		}
+
+
+		iopt_unmap_domain(iopt, domain);
+		return;
+	}
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (!pages)
+			continue;
+
+		mutex_lock(&pages->mutex);
+		interval_tree_remove(&area->pages_node,
+				     &pages->domains_itree);
+		WARN_ON(area->storage_domain != domain);
+		area->storage_domain = NULL;
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+}
+
+/**
+ * iopt_fill_domain() - Fill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to fill
+ *
+ * Fill the domain with PFNs from every area in the iopt. On failure the domain
+ * is left unchanged.
+ */
+static int iopt_fill_domain(struct io_pagetable *iopt,
+			    struct iommu_domain *domain)
+{
+	struct iopt_area *end_area;
+	struct iopt_area *area;
+	int rc;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (!pages)
+			continue;
+
+		mutex_lock(&pages->mutex);
+		rc = iopt_area_fill_domain(area, domain);
+		if (rc) {
+			mutex_unlock(&pages->mutex);
+			goto out_unfill;
+		}
+		if (!area->storage_domain) {
+			WARN_ON(iopt->next_domain_id != 0);
+			area->storage_domain = domain;
+			interval_tree_insert(&area->pages_node,
+					     &pages->domains_itree);
+		}
+		mutex_unlock(&pages->mutex);
+	}
+	return 0;
+
+out_unfill:
+	end_area = area;
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (area == end_area)
+			break;
+		if (!pages)
+			continue;
+		mutex_lock(&pages->mutex);
+		if (iopt->next_domain_id == 0) {
+			interval_tree_remove(&area->pages_node,
+					     &pages->domains_itree);
+			area->storage_domain = NULL;
+		}
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+	return rc;
+}
+
+/* Check that all existing areas conform to the increased IOVA alignment */
+static int iopt_check_iova_alignment(struct io_pagetable *iopt,
+				     unsigned long new_iova_alignment)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
+		if ((iopt_area_iova(area) % new_iova_alignment) ||
+		    (iopt_area_length(area) % new_iova_alignment))
+			return -EADDRINUSE;
+	return 0;
+}
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain)
+{
+	const struct iommu_domain_geometry *geometry = &domain->geometry;
+	struct iommu_domain *iter_domain;
+	unsigned int new_iova_alignment;
+	unsigned long index;
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each (&iopt->domains, index, iter_domain) {
+		if (WARN_ON(iter_domain == domain)) {
+			rc = -EEXIST;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The io page size drives the iova_alignment. Internally the iopt_pages
+	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
+	 * objects into the iommu_domain.
+	 *
+	 * An iommu_domain must always be able to accept PAGE_SIZE to be
+	 * compatible as we can't guarantee higher contiguity.
+	 */
+	new_iova_alignment =
+		max_t(unsigned long, 1UL << __ffs(domain->pgsize_bitmap),
+		      iopt->iova_alignment);
+	if (new_iova_alignment > PAGE_SIZE) {
+		rc = -EINVAL;
+		goto out_unlock;
+	}
+	if (new_iova_alignment != iopt->iova_alignment) {
+		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
+		if (rc)
+			goto out_unlock;
+	}
+
+	/* No area exists that is outside the allowed domain aperture */
+	if (geometry->aperture_start != 0) {
+		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
+				       domain);
+		if (rc)
+			goto out_reserved;
+	}
+	if (geometry->aperture_end != ULONG_MAX) {
+		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
+				       ULONG_MAX, domain);
+		if (rc)
+			goto out_reserved;
+	}
+
+	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
+	if (rc)
+		goto out_reserved;
+
+	rc = iopt_fill_domain(iopt, domain);
+	if (rc)
+		goto out_release;
+
+	iopt->iova_alignment = new_iova_alignment;
+	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
+	iopt->next_domain_id++;
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return 0;
+out_release:
+	xa_release(&iopt->domains, iopt->next_domain_id);
+out_reserved:
+	__iopt_remove_reserved_iova(iopt, domain);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain)
+{
+	struct iommu_domain *iter_domain = NULL;
+	unsigned long new_iova_alignment;
+	unsigned long index;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each (&iopt->domains, index, iter_domain)
+		if (iter_domain == domain)
+			break;
+	if (WARN_ON(iter_domain != domain) || index >= iopt->next_domain_id)
+		goto out_unlock;
+
+	/*
+	 * Compress the xarray to keep it linear by swapping the entry to erase
+	 * with the tail entry and shrinking the tail.
+	 */
+	iopt->next_domain_id--;
+	iter_domain = xa_erase(&iopt->domains, iopt->next_domain_id);
+	if (index != iopt->next_domain_id)
+		xa_store(&iopt->domains, index, iter_domain, GFP_KERNEL);
+
+	iopt_unfill_domain(iopt, domain);
+	__iopt_remove_reserved_iova(iopt, domain);
+
+	/* Recalculate the iova alignment without the domain */
+	new_iova_alignment = 1;
+	xa_for_each (&iopt->domains, index, iter_domain)
+		new_iova_alignment = max_t(unsigned long,
+					   1UL << __ffs(iter_domain->pgsize_bitmap),
+					   new_iova_alignment);
+	if (!WARN_ON(new_iova_alignment > iopt->iova_alignment))
+		iopt->iova_alignment = new_iova_alignment;
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
+/* Narrow the usable IOVA space to exclude a group's reserved regions. */
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start)
+{
+	struct iommu_resv_region *resv;
+	struct iommu_resv_region *tmp;
+	LIST_HEAD(group_resv_regions);
+	int rc;
+
+	down_write(&iopt->iova_rwsem);
+	rc = iommu_get_group_resv_regions(group, &group_resv_regions);
+	if (rc)
+		goto out_unlock;
+
+	list_for_each_entry (resv, &group_resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+
+		/*
+		 * The presence of any 'real' MSI regions should take precedence
+		 * over the software-managed one if the IOMMU driver happens to
+		 * advertise both types.
+		 */
+		if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
+			*sw_msi_start = 0;
+			sw_msi_start = NULL;
+		}
+		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
+			*sw_msi_start = resv->start;
+
+		rc = iopt_reserve_iova(iopt, resv->start,
+				       resv->length - 1 + resv->start, group);
+		if (rc)
+			goto out_reserved;
+	}
+	rc = 0;
+	goto out_free_resv;
+
+out_reserved:
+	__iopt_remove_reserved_iova(iopt, group);
+out_free_resv:
+	list_for_each_entry_safe (resv, tmp, &group_resv_regions, list)
+		kfree(resv);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index fe3be8dd38240e..7fe5a700239012 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -46,9 +46,19 @@ struct iopt_area {
 	unsigned int page_offset;
 	/* IOMMU_READ, IOMMU_WRITE, etc */
 	int iommu_prot;
+	bool prevent_users : 1;
 	unsigned int num_users;
 };
 
+struct iopt_allowed {
+	struct interval_tree_node node;
+};
+
+struct iopt_reserved {
+	struct interval_tree_node node;
+	void *owner;
+};
+
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
 void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
 
@@ -109,6 +119,8 @@ static inline size_t iopt_area_length(struct iopt_area *area)
 	}
 
 __make_iopt_iter(area)
+__make_iopt_iter(allowed)
+__make_iopt_iter(reserved)
 
 /*
  * This holds a pinned page list for multiple areas of IO address space. The
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 47a824897bc222..560ab06fbc3366 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,9 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+struct iommu_domain;
+struct iommu_group;
+
 /*
  * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
  * domains and permits sharing of PFNs between io_pagetable instances. This
@@ -30,8 +33,38 @@ struct io_pagetable {
 	struct rb_root_cached allowed_itree;
 	/* IOVA that cannot be allocated, struct iopt_reserved */
 	struct rb_root_cached reserved_itree;
+	unsigned long iova_alignment;
 };
 
+int iopt_init_table(struct io_pagetable *iopt);
+void iopt_destroy_table(struct io_pagetable *iopt);
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+				  unsigned long *start_byte,
+				  unsigned long length);
+enum { IOPT_ALLOC_IOVA = 1 << 0 };
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+			void __user *uptr, unsigned long length, int iommu_prot,
+			unsigned int flags);
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+		   unsigned long *dst_iova, unsigned long start_byte,
+		   unsigned long length, int iommu_prot, unsigned int flags);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped);
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain);
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain);
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start);
+int iopt_set_allow_iova(struct io_pagetable *iopt,
+			struct rb_root_cached *allowed_iova);
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner);
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
+
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index c8bbed542e923c..9c6ec4d66b4a92 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -10,9 +10,17 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 
+struct page;
 struct iommufd_ctx;
+struct io_pagetable;
 struct file;
 
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+		      unsigned long length, struct page **out_pages,
+		      bool write);
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+			 unsigned long length);
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 07/13] iommufd: Data structure to provide IOVA to PFN mapping
@ 2022-09-02 19:59   ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

This is the remainder of the IOAS data structure. Provide an object called
an io_pagetable that is composed of iopt_areas pointing at iopt_pages,
along with a list of iommu_domains that mirror the IOVA to PFN map.

At the top this is a simple interval tree of iopt_areas indicating the map
of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based
on the attached domains there is a minimum alignment for areas (which may
be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be
mapped, and an interval tree of allowed IOVA ranges that can always be mapped.

The concept of a 'user' refers to something like a VFIO mdev that is
accessing the IOVA and using a 'struct page *' for CPU-based access.

Externally an API is provided that matches the requirements of the IOCTL
interface for map/unmap and domain attachment.

The API provides a 'copy' primitive to establish a new IOVA map in a
different IOAS from an existing mapping.

This is designed to support a pre-registration flow where userspace would
set up a dummy IOAS with no domains, map memory into it, and then establish a
user to pin all PFNs into the xarray.

Copy can then be used to create new IOVA mappings in a different IOAS,
with iommu_domains attached. Upon copy the PFNs will be read out of the
xarray and mapped into the iommu_domains, avoiding any pin_user_pages()
overheads.
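
For the domain attachment side of the API mentioned above, a rough sketch
(again illustrative only, assuming an already initialized io_pagetable 'iopt'
and some device 'dev' in iommu_group 'group'; error handling trimmed):

	struct iommu_domain *domain;
	phys_addr_t sw_msi_start = 0;
	int rc;

	domain = iommu_domain_alloc(dev->bus);

	/* Keep the group's reserved ranges out of allocation and mapping */
	rc = iopt_table_enforce_group_resv_regions(&iopt, group, &sw_msi_start);

	/* Replicate every existing area's PFNs into the new domain */
	rc = iopt_table_add_domain(&iopt, domain);

	/* ... DMA through the domain ... */

	iopt_table_remove_domain(&iopt, domain);
	iopt_remove_reserved_iova(&iopt, group);
	iommu_domain_free(domain);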

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/io_pagetable.c    | 981 ++++++++++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.h    |  12 +
 drivers/iommu/iommufd/iommufd_private.h |  33 +
 include/linux/iommufd.h                 |   8 +
 5 files changed, 1035 insertions(+)
 create mode 100644 drivers/iommu/iommufd/io_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 05a0e91e30afad..b66a8c47ff55ec 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	io_pagetable.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
new file mode 100644
index 00000000000000..7434bc8b393bbd
--- /dev/null
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -0,0 +1,981 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * The io_pagetable is the top of the data structure that maps IOVAs to PFNs.
+ * The PFNs can be placed into an iommu_domain, or returned to the caller as a
+ * page list for access by an in-kernel user.
+ *
+ * The data structure uses iopt_pages to optimize the storage of the PFNs
+ * between the domains and the xarray.
+ */
+#include <linux/iommufd.h>
+#include <linux/lockdep.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+
+#include "io_pagetable.h"
+
+static unsigned long iopt_area_iova_to_index(struct iopt_area *area,
+					     unsigned long iova)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		WARN_ON(iova < iopt_area_iova(area) ||
+			iova > iopt_area_last_iova(area));
+	return (iova - (iopt_area_iova(area) - area->page_offset)) / PAGE_SIZE;
+}
+
+static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
+					      unsigned long iova,
+					      unsigned long last_iova)
+{
+	struct iopt_area *area;
+
+	area = iopt_area_iter_first(iopt, iova, last_iova);
+	if (!area || !area->pages || iopt_area_iova(area) != iova ||
+	    iopt_area_last_iova(area) != last_iova)
+		return NULL;
+	return area;
+}
+
+static bool __alloc_iova_check_hole(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (!span->is_hole || span->last_hole - span->start_hole < length - 1)
+		return false;
+
+	span->start_hole = ALIGN(span->start_hole, iova_alignment) |
+			   page_offset;
+	if (span->start_hole > span->last_hole ||
+	    span->last_hole - span->start_hole < length - 1)
+		return false;
+	return true;
+}
+
+static bool __alloc_iova_check_used(struct interval_tree_span_iter *span,
+				    unsigned long length,
+				    unsigned long iova_alignment,
+				    unsigned long page_offset)
+{
+	if (span->is_hole || span->last_used - span->start_used < length - 1)
+		return false;
+
+	span->start_used = ALIGN(span->start_used, iova_alignment) |
+			   page_offset;
+	if (span->start_used > span->last_used ||
+	    span->last_used - span->start_used < length - 1)
+		return false;
+	return true;
+}
+
+/*
+ * Automatically find a block of IOVA that is not being used and not reserved.
+ * Does not return a 0 IOVA even if it is valid.
+ */
+static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
+			   unsigned long uptr, unsigned long length)
+{
+	struct interval_tree_span_iter reserved_span;
+	unsigned long page_offset = uptr % PAGE_SIZE;
+	struct interval_tree_span_iter allowed_span;
+	struct interval_tree_span_iter area_span;
+	unsigned long iova_alignment;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	/* Protect roundup_pow_of_two() from overflow */
+	if (length == 0 || length >= ULONG_MAX / 2)
+		return -EOVERFLOW;
+
+	/*
+	 * Keep alignment present in the uptr when building the IOVA, this
+	 * increases the chance we can map a THP.
+	 */
+	if (!uptr)
+		iova_alignment = roundup_pow_of_two(length);
+	else
+		iova_alignment =
+			min_t(unsigned long, roundup_pow_of_two(length),
+			      1UL << __ffs64(uptr));
+
+	if (iova_alignment < iopt->iova_alignment)
+		return -EINVAL;
+
+	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
+				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
+		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
+			allowed_span.start_used = PAGE_SIZE;
+			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
+			allowed_span.is_hole = false;
+		}
+
+		if (!__alloc_iova_check_used(&allowed_span, length,
+					     iova_alignment, page_offset))
+			continue;
+
+		interval_tree_for_each_span(&area_span, &iopt->area_itree,
+					    allowed_span.start_used,
+					    allowed_span.last_used) {
+			if (!__alloc_iova_check_hole(&area_span, length,
+						     iova_alignment,
+						     page_offset))
+				continue;
+
+			interval_tree_for_each_span(&reserved_span,
+						    &iopt->reserved_itree,
+						    area_span.start_used,
+						    area_span.last_used) {
+				if (!__alloc_iova_check_hole(
+					    &reserved_span, length,
+					    iova_alignment, page_offset))
+					continue;
+
+				*iova = reserved_span.start_hole;
+				return 0;
+			}
+		}
+	}
+	return -ENOSPC;
+}
+
+/*
+ * The area takes a slice of the pages from start_byte to start_byte + length
+ */
+static struct iopt_area *
+iopt_alloc_area(struct io_pagetable *iopt, struct iopt_pages *pages,
+		unsigned long iova, unsigned long start_byte,
+		unsigned long length, int iommu_prot, unsigned int flags)
+{
+	struct iopt_area *area;
+	int rc;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL_ACCOUNT);
+	if (!area)
+		return ERR_PTR(-ENOMEM);
+
+	area->iopt = iopt;
+	area->iommu_prot = iommu_prot;
+	area->page_offset = start_byte % PAGE_SIZE;
+	area->pages_node.start = start_byte / PAGE_SIZE;
+	if (check_add_overflow(start_byte, length - 1,
+			       &area->pages_node.last)) {
+		rc = -EOVERFLOW;
+		goto out_free;
+	}
+	area->pages_node.last = area->pages_node.last / PAGE_SIZE;
+	if (WARN_ON(area->pages_node.last >= pages->npages)) {
+		rc = -EOVERFLOW;
+		goto out_free;
+	}
+
+	down_write(&iopt->iova_rwsem);
+	if (flags & IOPT_ALLOC_IOVA) {
+		rc = iopt_alloc_iova(iopt, &iova,
+				     (uintptr_t)pages->uptr + start_byte,
+				     length);
+		if (rc)
+			goto out_unlock;
+	}
+
+	if (check_add_overflow(iova, length - 1, &area->node.last)) {
+		rc = -EOVERFLOW;
+		goto out_unlock;
+	}
+
+	if (!(flags & IOPT_ALLOC_IOVA)) {
+		if ((iova & (iopt->iova_alignment - 1)) ||
+		    (length & (iopt->iova_alignment - 1)) || !length) {
+			rc = -EINVAL;
+			goto out_unlock;
+		}
+
+		/* No reserved IOVA intersects the range */
+		if (iopt_reserved_iter_first(iopt, iova, area->node.last)) {
+			rc = -ENOENT;
+			goto out_unlock;
+		}
+
+		/* Check that there is not already a mapping in the range */
+		if (iopt_area_iter_first(iopt, iova, area->node.last)) {
+			rc = -EADDRINUSE;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The area is inserted with a NULL pages indicating it is not fully
+	 * initialized yet.
+	 */
+	area->node.start = iova;
+	interval_tree_insert(&area->node, &area->iopt->area_itree);
+	up_write(&iopt->iova_rwsem);
+	return area;
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+out_free:
+	kfree(area);
+	return ERR_PTR(rc);
+}
+
+static void iopt_abort_area(struct iopt_area *area)
+{
+	down_write(&area->iopt->iova_rwsem);
+	interval_tree_remove(&area->node, &area->iopt->area_itree);
+	up_write(&area->iopt->iova_rwsem);
+	kfree(area);
+}
+
+static int iopt_finalize_area(struct iopt_area *area, struct iopt_pages *pages)
+{
+	int rc;
+
+	down_read(&area->iopt->domains_rwsem);
+	rc = iopt_area_fill_domains(area, pages);
+	if (!rc) {
+		/*
+		 * area->pages must be set inside the domains_rwsem to ensure
+		 * any newly added domains will get filled. Moves the reference
+		 * in from the caller
+		 */
+		down_write(&area->iopt->iova_rwsem);
+		area->pages = pages;
+		up_write(&area->iopt->iova_rwsem);
+	}
+	up_read(&area->iopt->domains_rwsem);
+	return rc;
+}
+
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+		   unsigned long *dst_iova, unsigned long start_bytes,
+		   unsigned long length, int iommu_prot, unsigned int flags)
+{
+	struct iopt_area *area;
+	int rc;
+
+	if ((iommu_prot & IOMMU_WRITE) && !pages->writable)
+		return -EPERM;
+
+	area = iopt_alloc_area(iopt, pages, *dst_iova, start_bytes, length,
+			       iommu_prot, flags);
+	if (IS_ERR(area))
+		return PTR_ERR(area);
+	*dst_iova = iopt_area_iova(area);
+
+	rc = iopt_finalize_area(area, pages);
+	if (rc) {
+		iopt_abort_area(area);
+		return rc;
+	}
+	return 0;
+}
+
+/**
+ * iopt_map_user_pages() - Map a user VA to an iova in the io page table
+ * @iopt: io_pagetable to act on
+ * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
+ *        the chosen iova on output. Otherwise is the iova to map to on input
+ * @uptr: User VA to map
+ * @length: Number of bytes to map
+ * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
+ * @flags: IOPT_ALLOC_IOVA or zero
+ *
+ * iova, uptr, and length must be aligned to iova_alignment. For domain backed
+ * page tables this will pin the pages and load them into the domain at iova.
+ * For non-domain page tables this will only setup a lazy reference and the
+ * caller must use iopt_access_pages() to touch them.
+ *
+ * iopt_unmap_iova() must be called to undo this before the io_pagetable can be
+ * destroyed.
+ */
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+			void __user *uptr, unsigned long length, int iommu_prot,
+			unsigned int flags)
+{
+	struct iopt_pages *pages;
+	int rc;
+
+	pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	rc = iopt_map_pages(iopt, pages, iova, uptr - pages->uptr, length,
+			    iommu_prot, flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		return rc;
+	}
+	return 0;
+}
+
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+				  unsigned long *start_byte,
+				  unsigned long length)
+{
+	unsigned long iova_end;
+	struct iopt_pages *pages;
+	struct iopt_area *area;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return ERR_PTR(-EOVERFLOW);
+
+	down_read(&iopt->iova_rwsem);
+	area = iopt_find_exact_area(iopt, iova, iova_end);
+	if (!area) {
+		up_read(&iopt->iova_rwsem);
+		return ERR_PTR(-ENOENT);
+	}
+	pages = area->pages;
+	*start_byte = area->page_offset + iopt_area_index(area) * PAGE_SIZE;
+	kref_get(&pages->kref);
+	up_read(&iopt->iova_rwsem);
+
+	return pages;
+}
+
+static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
+				 unsigned long end, unsigned long *unmapped)
+{
+	struct iopt_area *area;
+	unsigned long unmapped_bytes = 0;
+	int rc = -ENOENT;
+
+	/*
+	 * The domains_rwsem must be held in read mode any time any area->pages
+	 * is NULL. This prevents domain attach/detach from running
+	 * concurrently with cleaning up the area.
+	 */
+	down_read(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	while ((area = iopt_area_iter_first(iopt, start, end))) {
+		unsigned long area_last = iopt_area_last_iova(area);
+		unsigned long area_first = iopt_area_iova(area);
+		struct iopt_pages *pages;
+
+		/* Userspace should not race map/unmap's of the same area */
+		if (!area->pages) {
+			rc = -EBUSY;
+			goto out_unlock_iova;
+		}
+
+		if (area_first < start || area_last > end) {
+			rc = -ENOENT;
+			goto out_unlock_iova;
+		}
+
+		/*
+		 * num_users writers must hold the iova_rwsem too, so we can
+		 * safely read it under the write side of the iova_rwsem
+		 * without the pages->mutex.
+		 */
+		if (area->num_users) {
+			start = area_first;
+			area->prevent_users = true;
+			up_write(&iopt->iova_rwsem);
+			up_read(&iopt->domains_rwsem);
+			/* Later patch calls back to drivers to unmap */
+			return -EBUSY;
+		}
+
+		pages = area->pages;
+		area->pages = NULL;
+		up_write(&iopt->iova_rwsem);
+
+		iopt_area_unfill_domains(area, pages);
+		iopt_abort_area(area);
+		iopt_put_pages(pages);
+
+		unmapped_bytes += area_last - area_first + 1;
+
+		down_write(&iopt->iova_rwsem);
+	}
+	if (unmapped_bytes)
+		rc = 0;
+
+out_unlock_iova:
+	up_write(&iopt->iova_rwsem);
+	up_read(&iopt->domains_rwsem);
+	if (unmapped)
+		*unmapped = unmapped_bytes;
+	return rc;
+}
+
+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ * @unmapped: Return number of bytes unmapped
+ *
+ * The requested range must be a superset of existing ranges.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped)
+{
+	unsigned long iova_end;
+
+	if (!length)
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &iova_end))
+		return -EOVERFLOW;
+
+	return iopt_unmap_iova_range(iopt, iova, iova_end, unmapped);
+}
+
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped)
+{
+	return iopt_unmap_iova_range(iopt, 0, ULONG_MAX, unmapped);
+}
+
+/**
+ * iopt_access_pages() - Return a list of pages under the iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to access
+ * @out_pages: Output page list
+ * @write: True if access is for writing
+ *
+ * Reads @length bytes starting at iova and returns the struct page * pointers.
+ * These can be kmap'd by the caller for CPU access.
+ *
+ * The caller must perform iopt_unaccess_pages() when done to balance this.
+ *
+ * iova can be unaligned from PAGE_SIZE. The first returned byte starts at
+ * page_to_phys(out_pages[0]) + (iova % PAGE_SIZE). The caller promises not to
+ * touch memory outside the requested iova slice.
+ */
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+		      unsigned long length, struct page **out_pages, bool write)
+{
+	unsigned long cur_iova = iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+	int rc;
+
+	if (!length)
+		return -EINVAL;
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
+	     area = iopt_area_iter_next(area, iova, last_iova)) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+		unsigned long last_index;
+		unsigned long index;
+
+		/* Need contiguous areas in the access */
+		if (iopt_area_iova(area) > cur_iova || !area->pages ||
+		    area->prevent_users) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		index = iopt_area_iova_to_index(area, cur_iova);
+		last_index = iopt_area_iova_to_index(area, last);
+
+		/*
+		 * The API can only return aligned pages, so the starting point
+		 * must be at a page boundary.
+		 */
+		if ((cur_iova - (iopt_area_iova(area) - area->page_offset)) %
+		    PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		/*
+		 * and an interior ending point must be at a page boundary
+		 */
+		if (last != last_iova &&
+		    (iopt_area_last_iova(area) - cur_iova + 1) % PAGE_SIZE) {
+			rc = -EINVAL;
+			goto out_remove;
+		}
+
+		mutex_lock(&area->pages->mutex);
+		rc = iopt_pages_add_user(area->pages, index, last_index,
+					 out_pages, write);
+		if (rc) {
+			mutex_unlock(&area->pages->mutex);
+			goto out_remove;
+		}
+		area->num_users++;
+		mutex_unlock(&area->pages->mutex);
+		if (last == last_iova)
+			break;
+		cur_iova = last + 1;
+		out_pages += last_index - index;
+	}
+	if (cur_iova != last_iova)
+		goto out_remove;
+
+	up_read(&iopt->iova_rwsem);
+	return 0;
+
+out_remove:
+	if (cur_iova != iova)
+		iopt_unaccess_pages(iopt, iova, cur_iova - iova);
+	up_read(&iopt->iova_rwsem);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(iopt_access_pages);
+
+/**
+ * iopt_unaccess_pages() - Undo iopt_access_pages
+ * @iopt: io_pagetable to act on
+ * @iova: Starting IOVA
+ * @length: Number of bytes to stop accessing
+ *
+ * Release the struct pages. The caller must stop accessing them before calling
+ * this. The iova/length must exactly match the iopt_access_pages() call.
+ */
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+			 unsigned long length)
+{
+	unsigned long cur_iova = iova;
+	unsigned long last_iova;
+	struct iopt_area *area;
+
+	if (WARN_ON(!length) ||
+	    WARN_ON(check_add_overflow(iova, length - 1, &last_iova)))
+		return;
+
+	down_read(&iopt->iova_rwsem);
+	for (area = iopt_area_iter_first(iopt, iova, last_iova); area;
+	     area = iopt_area_iter_next(area, iova, last_iova)) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+
+		iopt_pages_remove_user(area,
+				       iopt_area_iova_to_index(area, cur_iova),
+				       iopt_area_iova_to_index(area, last));
+		if (last == last_iova)
+			break;
+		cur_iova = last + 1;
+	}
+	up_read(&iopt->iova_rwsem);
+}
+EXPORT_SYMBOL_GPL(iopt_unaccess_pages);
+
+/* The caller must always free all the nodes in the allowed_iova rb_root. */
+int iopt_set_allow_iova(struct io_pagetable *iopt,
+			struct rb_root_cached *allowed_iova)
+{
+	struct iopt_allowed *allowed;
+
+	down_write(&iopt->iova_rwsem);
+	swap(*allowed_iova, iopt->allowed_itree);
+
+	for (allowed = iopt_allowed_iter_first(iopt, 0, ULONG_MAX); allowed;
+	     allowed = iopt_allowed_iter_next(allowed, 0, ULONG_MAX)) {
+		if (iopt_reserved_iter_first(iopt, allowed->node.start,
+					     allowed->node.last)) {
+			swap(*allowed_iova, iopt->allowed_itree);
+			up_write(&iopt->iova_rwsem);
+			return -EADDRINUSE;
+		}
+	}
+	up_write(&iopt->iova_rwsem);
+	return 0;
+}
+
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner)
+{
+	struct iopt_reserved *reserved;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	if (iopt_area_iter_first(iopt, start, last) ||
+	    iopt_allowed_iter_first(iopt, start, last))
+		return -EADDRINUSE;
+
+	reserved = kzalloc(sizeof(*reserved), GFP_KERNEL_ACCOUNT);
+	if (!reserved)
+		return -ENOMEM;
+	reserved->node.start = start;
+	reserved->node.last = last;
+	reserved->owner = owner;
+	interval_tree_insert(&reserved->node, &iopt->reserved_itree);
+	return 0;
+}
+
+static void __iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+	struct iopt_reserved *reserved, *next;
+
+	lockdep_assert_held_write(&iopt->iova_rwsem);
+
+	for (reserved = iopt_reserved_iter_first(iopt, 0, ULONG_MAX);
+	     reserved; reserved = next) {
+		next = iopt_reserved_iter_next(reserved, 0, ULONG_MAX);
+
+		if (reserved->owner == owner) {
+			interval_tree_remove(&reserved->node,
+					     &iopt->reserved_itree);
+			kfree(reserved);
+		}
+	}
+}
+
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner)
+{
+	down_write(&iopt->iova_rwsem);
+	__iopt_remove_reserved_iova(iopt, owner);
+	up_write(&iopt->iova_rwsem);
+}
+
+int iopt_init_table(struct io_pagetable *iopt)
+{
+	init_rwsem(&iopt->iova_rwsem);
+	init_rwsem(&iopt->domains_rwsem);
+	iopt->area_itree = RB_ROOT_CACHED;
+	iopt->allowed_itree = RB_ROOT_CACHED;
+	iopt->reserved_itree = RB_ROOT_CACHED;
+	xa_init_flags(&iopt->domains, XA_FLAGS_ACCOUNT);
+
+	/*
+	 * iopts start as SW tables that can use the entire size_t IOVA space
+	 * due to the use of size_t in the APIs. They have no alignment
+	 * restriction.
+	 */
+	iopt->iova_alignment = 1;
+
+	return 0;
+}
+
+void iopt_destroy_table(struct io_pagetable *iopt)
+{
+	struct interval_tree_node *node;
+
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		iopt_remove_reserved_iova(iopt, NULL);
+
+	while ((node = interval_tree_iter_first(&iopt->allowed_itree, 0,
+						ULONG_MAX))) {
+		interval_tree_remove(node, &iopt->allowed_itree);
+		kfree(container_of(node, struct iopt_allowed, node));
+	}
+
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_itree.rb_root));
+	WARN_ON(!xa_empty(&iopt->domains));
+	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
+}
+
+/**
+ * iopt_unfill_domain() - Unfill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to unfill
+ *
+ * This is used when removing a domain from the iopt. Every area in the iopt
+ * will be unmapped from the domain. The domain must already be removed from the
+ * domains xarray.
+ */
+static void iopt_unfill_domain(struct io_pagetable *iopt,
+			       struct iommu_domain *domain)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	/*
+	 * Some other domain is holding all the pfns still, rapidly unmap this
+	 * domain.
+	 */
+	if (iopt->next_domain_id != 0) {
+		/* Pick an arbitrary remaining domain to act as storage */
+		struct iommu_domain *storage_domain =
+			xa_load(&iopt->domains, 0);
+
+		for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+		     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+			struct iopt_pages *pages = area->pages;
+
+			if (!pages)
+				continue;
+
+			mutex_lock(&pages->mutex);
+			if (area->storage_domain != domain) {
+				mutex_unlock(&pages->mutex);
+				continue;
+			}
+			area->storage_domain = storage_domain;
+			mutex_unlock(&pages->mutex);
+		}
+
+
+		iopt_unmap_domain(iopt, domain);
+		return;
+	}
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (!pages)
+			continue;
+
+		mutex_lock(&pages->mutex);
+		interval_tree_remove(&area->pages_node,
+				     &pages->domains_itree);
+		WARN_ON(area->storage_domain != domain);
+		area->storage_domain = NULL;
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+}
+
+/**
+ * iopt_fill_domain() - Fill a domain with PFNs
+ * @iopt: io_pagetable to act on
+ * @domain: domain to fill
+ *
+ * Fill the domain with PFNs from every area in the iopt. On failure the domain
+ * is left unchanged.
+ */
+static int iopt_fill_domain(struct io_pagetable *iopt,
+			    struct iommu_domain *domain)
+{
+	struct iopt_area *end_area;
+	struct iopt_area *area;
+	int rc;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held_write(&iopt->domains_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (!pages)
+			continue;
+
+		mutex_lock(&pages->mutex);
+		rc = iopt_area_fill_domain(area, domain);
+		if (rc) {
+			mutex_unlock(&pages->mutex);
+			goto out_unfill;
+		}
+		if (!area->storage_domain) {
+			WARN_ON(iopt->next_domain_id != 0);
+			area->storage_domain = domain;
+			interval_tree_insert(&area->pages_node,
+					     &pages->domains_itree);
+		}
+		mutex_unlock(&pages->mutex);
+	}
+	return 0;
+
+out_unfill:
+	end_area = area;
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		struct iopt_pages *pages = area->pages;
+
+		if (area == end_area)
+			break;
+		if (!pages)
+			continue;
+		mutex_lock(&pages->mutex);
+		if (iopt->next_domain_id == 0) {
+			interval_tree_remove(&area->pages_node,
+					     &pages->domains_itree);
+			area->storage_domain = NULL;
+		}
+		iopt_area_unfill_domain(area, pages, domain);
+		mutex_unlock(&pages->mutex);
+	}
+	return rc;
+}
+
+/* Check that all existing areas conform to the increased IOVA alignment */
+static int iopt_check_iova_alignment(struct io_pagetable *iopt,
+				     unsigned long new_iova_alignment)
+{
+	struct iopt_area *area;
+
+	lockdep_assert_held(&iopt->iova_rwsem);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
+		if ((iopt_area_iova(area) % new_iova_alignment) ||
+		    (iopt_area_length(area) % new_iova_alignment))
+			return -EADDRINUSE;
+	return 0;
+}
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain)
+{
+	const struct iommu_domain_geometry *geometry = &domain->geometry;
+	struct iommu_domain *iter_domain;
+	unsigned int new_iova_alignment;
+	unsigned long index;
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each (&iopt->domains, index, iter_domain) {
+		if (WARN_ON(iter_domain == domain)) {
+			rc = -EEXIST;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * The io page size drives the iova_alignment. Internally the iopt_pages
+	 * works in PAGE_SIZE units and we adjust when mapping sub-PAGE_SIZE
+	 * objects into the iommu_domain.
+	 *
+	 * An iommu_domain must always be able to accept PAGE_SIZE to be
+	 * compatible as we can't guarantee higher contiguity.
+	 */
+	new_iova_alignment =
+		max_t(unsigned long, 1UL << __ffs(domain->pgsize_bitmap),
+		      iopt->iova_alignment);
+	if (new_iova_alignment > PAGE_SIZE) {
+		rc = -EINVAL;
+		goto out_unlock;
+	}
+	if (new_iova_alignment != iopt->iova_alignment) {
+		rc = iopt_check_iova_alignment(iopt, new_iova_alignment);
+		if (rc)
+			goto out_unlock;
+	}
+
+	/* Ensure no area exists outside the allowed domain aperture */
+	if (geometry->aperture_start != 0) {
+		rc = iopt_reserve_iova(iopt, 0, geometry->aperture_start - 1,
+				       domain);
+		if (rc)
+			goto out_reserved;
+	}
+	if (geometry->aperture_end != ULONG_MAX) {
+		rc = iopt_reserve_iova(iopt, geometry->aperture_end + 1,
+				       ULONG_MAX, domain);
+		if (rc)
+			goto out_reserved;
+	}
+
+	rc = xa_reserve(&iopt->domains, iopt->next_domain_id, GFP_KERNEL);
+	if (rc)
+		goto out_reserved;
+
+	rc = iopt_fill_domain(iopt, domain);
+	if (rc)
+		goto out_release;
+
+	iopt->iova_alignment = new_iova_alignment;
+	xa_store(&iopt->domains, iopt->next_domain_id, domain, GFP_KERNEL);
+	iopt->next_domain_id++;
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return 0;
+out_release:
+	xa_release(&iopt->domains, iopt->next_domain_id);
+out_reserved:
+	__iopt_remove_reserved_iova(iopt, domain);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain)
+{
+	struct iommu_domain *iter_domain = NULL;
+	unsigned long new_iova_alignment;
+	unsigned long index;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+
+	xa_for_each (&iopt->domains, index, iter_domain)
+		if (iter_domain == domain)
+			break;
+	if (WARN_ON(iter_domain != domain) || index >= iopt->next_domain_id)
+		goto out_unlock;
+
+	/*
+	 * Compress the xarray to keep it linear by swapping the entry to erase
+	 * with the tail entry and shrinking the tail.
+	 */
+	iopt->next_domain_id--;
+	iter_domain = xa_erase(&iopt->domains, iopt->next_domain_id);
+	if (index != iopt->next_domain_id)
+		xa_store(&iopt->domains, index, iter_domain, GFP_KERNEL);
+
+	iopt_unfill_domain(iopt, domain);
+	__iopt_remove_reserved_iova(iopt, domain);
+
+	/* Recalculate the iova alignment without the domain */
+	new_iova_alignment = 1;
+	xa_for_each (&iopt->domains, index, iter_domain)
+		new_iova_alignment = max_t(unsigned long,
+					   1UL << __ffs(iter_domain->pgsize_bitmap),
+					   new_iova_alignment);
+	if (!WARN_ON(new_iova_alignment > iopt->iova_alignment))
+		iopt->iova_alignment = new_iova_alignment;
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
+/* Narrow the valid IOVA space to exclude a group's reserved regions */
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start)
+{
+	struct iommu_resv_region *resv;
+	struct iommu_resv_region *tmp;
+	LIST_HEAD(group_resv_regions);
+	int rc;
+
+	down_write(&iopt->iova_rwsem);
+	rc = iommu_get_group_resv_regions(group, &group_resv_regions);
+	if (rc)
+		goto out_unlock;
+
+	list_for_each_entry (resv, &group_resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+
+		/*
+		 * The presence of any 'real' MSI regions should take precedence
+		 * over the software-managed one if the IOMMU driver happens to
+		 * advertise both types.
+		 */
+		if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
+			*sw_msi_start = 0;
+			sw_msi_start = NULL;
+		}
+		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
+			*sw_msi_start = resv->start;
+
+		rc = iopt_reserve_iova(iopt, resv->start,
+				       resv->length - 1 + resv->start, group);
+		if (rc)
+			goto out_reserved;
+	}
+	rc = 0;
+	goto out_free_resv;
+
+out_reserved:
+	__iopt_remove_reserved_iova(iopt, group);
+out_free_resv:
+	list_for_each_entry_safe (resv, tmp, &group_resv_regions, list)
+		kfree(resv);
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index fe3be8dd38240e..7fe5a700239012 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -46,9 +46,19 @@ struct iopt_area {
 	unsigned int page_offset;
 	/* IOMMU_READ, IOMMU_WRITE, etc */
 	int iommu_prot;
+	bool prevent_users : 1;
 	unsigned int num_users;
 };
 
+struct iopt_allowed {
+	struct interval_tree_node node;
+};
+
+struct iopt_reserved {
+	struct interval_tree_node node;
+	void *owner;
+};
+
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
 void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
 
@@ -109,6 +119,8 @@ static inline size_t iopt_area_length(struct iopt_area *area)
 	}
 
 __make_iopt_iter(area)
+__make_iopt_iter(allowed)
+__make_iopt_iter(reserved)
 
 /*
  * This holds a pinned page list for multiple areas of IO address space. The
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 47a824897bc222..560ab06fbc3366 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,9 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+struct iommu_domain;
+struct iommu_group;
+
 /*
  * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
  * domains and permits sharing of PFNs between io_pagetable instances. This
@@ -30,8 +33,38 @@ struct io_pagetable {
 	struct rb_root_cached allowed_itree;
 	/* IOVA that cannot be allocated, struct iopt_reserved */
 	struct rb_root_cached reserved_itree;
+	unsigned long iova_alignment;
 };
 
+int iopt_init_table(struct io_pagetable *iopt);
+void iopt_destroy_table(struct io_pagetable *iopt);
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+				  unsigned long *start_byte,
+				  unsigned long length);
+enum { IOPT_ALLOC_IOVA = 1 << 0 };
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+			void __user *uptr, unsigned long length, int iommu_prot,
+			unsigned int flags);
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+		   unsigned long *dst_iova, unsigned long start_byte,
+		   unsigned long length, int iommu_prot, unsigned int flags);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, unsigned long *unmapped);
+int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
+
+int iopt_table_add_domain(struct io_pagetable *iopt,
+			  struct iommu_domain *domain);
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+			      struct iommu_domain *domain);
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+					  struct iommu_group *group,
+					  phys_addr_t *sw_msi_start);
+int iopt_set_allow_iova(struct io_pagetable *iopt,
+			struct rb_root_cached *allowed_iova);
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+		      unsigned long last, void *owner);
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
+
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index c8bbed542e923c..9c6ec4d66b4a92 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -10,9 +10,17 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 
+struct page;
 struct iommufd_ctx;
+struct io_pagetable;
 struct file;
 
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+		      unsigned long length, struct page **out_pages,
+		      bool write);
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+			 unsigned long length);
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread
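
As a side note on the io_pagetable API above: iopt_table_add_domain()
replays every existing IOVA -> PFN mapping into the new iommu_domain and
iopt_table_remove_domain() unfills it again. A minimal sketch of how a
consumer would pair the two calls around a domain's lifetime follows; it is
an editor's illustration, not part of the patch, and everything other than
the iopt_* declarations above and the core iommu_domain_alloc()/free() APIs
is a placeholder.

static int example_attach_domain(struct io_pagetable *iopt, struct device *dev)
{
	struct iommu_domain *domain;
	int rc;

	domain = iommu_domain_alloc(dev->bus);
	if (!domain)
		return -ENOMEM;

	/* Replays all existing IOVA -> PFN mappings into the new domain */
	rc = iopt_table_add_domain(iopt, domain);
	if (rc) {
		iommu_domain_free(domain);
		return rc;
	}

	/*
	 * While the domain sits in iopt->domains every later map/unmap on
	 * @iopt is mirrored into it along with the other domains.
	 */

	/* Unfills the domain and drops it from the domains xarray */
	iopt_table_remove_domain(iopt, domain);
	iommu_domain_free(domain);
	return 0;
}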

* [PATCH RFC v2 08/13] iommufd: IOCTLs for the io_pagetable
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.

This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.

It is also designed to be easy to use for simple users that are not
virtual machine monitors, like DPDK:
 - Universal simple support for all IOMMUs (no PPC special path)
 - An IOVA allocator that considers the aperture and the allowed/reserved
   ranges
 - io_pagetable allows any number of iommu_domains to be connected to the
   IOAS

The design also leaves room to add non-generic features that cater to
specific HW functionality.
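
As a rough illustration of the resulting flow, the sketch below allocates
an IOAS, queries its valid IOVA ranges, maps an anonymous buffer at a
kernel-chosen IOVA and unmaps it again. This is an editor's example, not
part of the series: it assumes the iommufd character device is exposed as
/dev/iommu and that the uapi header added here is installed as
<linux/iommufd.h>; error handling is abbreviated.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/iommufd.h>

int main(void)
{
	size_t len = 1024 * 1024;
	int fd = open("/dev/iommu", O_RDWR);
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_iova_ranges ranges = { .size = sizeof(ranges) };
	struct iommu_ioas_map map = { .size = sizeof(map) };
	struct iommu_ioas_unmap unmap = { .size = sizeof(unmap) };
	void *buf;

	if (fd < 0 || ioctl(fd, IOMMU_IOAS_ALLOC, &alloc))
		return 1;

	/*
	 * With no room for the flex array the ioctl may fail with EMSGSIZE,
	 * but out_num_iovas is still written back so a second, larger call
	 * can be sized to fetch the actual ranges.
	 */
	ranges.ioas_id = alloc.out_ioas_id;
	ioctl(fd, IOMMU_IOAS_IOVA_RANGES, &ranges);
	printf("IOAS %u has %u valid IOVA range(s)\n", alloc.out_ioas_id,
	       ranges.out_num_iovas);

	/* Map at a kernel-chosen IOVA (IOMMU_IOAS_MAP_FIXED_IOVA not set) */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
	map.ioas_id = alloc.out_ioas_id;
	map.user_va = (uintptr_t)buf;
	map.length = len;
	if (ioctl(fd, IOMMU_IOAS_MAP, &map))
		return 1;
	printf("mapped %zu bytes at IOVA 0x%llx\n", len,
	       (unsigned long long)map.iova);

	/* Tear the mapping down; length is updated to the bytes unmapped */
	unmap.ioas_id = alloc.out_ioas_id;
	unmap.iova = map.iova;
	unmap.length = len;
	ioctl(fd, IOMMU_IOAS_UNMAP, &unmap);
	return 0;
}

IOMMU_IOAS_COPY follows the same shape: given a source IOAS and an
iova/length pair that exactly matches an earlier MAP, it re-establishes the
mapping in a destination IOAS while sharing the already-pinned pages.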

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/ioas.c            | 316 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  28 +++
 drivers/iommu/iommufd/main.c            |  20 ++
 include/uapi/linux/iommufd.h            | 188 ++++++++++++++
 5 files changed, 553 insertions(+)
 create mode 100644 drivers/iommu/iommufd/ioas.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index b66a8c47ff55ec..2b4f36f1b72f9d 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
 	io_pagetable.o \
+	ioas.o \
 	main.o \
 	pages.o
 
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
new file mode 100644
index 00000000000000..f9f545158a4891
--- /dev/null
+++ b/drivers/iommu/iommufd/ioas.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/interval_tree.h>
+#include <linux/iommufd.h>
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+
+#include "io_pagetable.h"
+
+void iommufd_ioas_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_ioas *ioas = container_of(obj, struct iommufd_ioas, obj);
+	int rc;
+
+	rc = iopt_unmap_all(&ioas->iopt, NULL);
+	WARN_ON(rc && rc != -ENOENT);
+	iopt_destroy_table(&ioas->iopt);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_object_alloc(ictx, ioas, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(ioas))
+		return ioas;
+
+	rc = iopt_init_table(&ioas->iopt);
+	if (rc)
+		goto out_abort;
+	return ioas;
+
+out_abort:
+	iommufd_object_abort(ictx, &ioas->obj);
+	return ERR_PTR(rc);
+}
+
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_alloc *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	if (cmd->flags)
+		return -EOPNOTSUPP;
+
+	ioas = iommufd_ioas_alloc(ucmd->ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	cmd->out_ioas_id = ioas->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_table;
+	iommufd_object_finalize(ucmd->ictx, &ioas->obj);
+	return 0;
+
+out_table:
+	iommufd_ioas_destroy(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_iova_ranges __user *uptr = ucmd->ubuffer;
+	struct iommu_ioas_iova_ranges *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	struct interval_tree_span_iter span;
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	max_iovas = cmd->size - sizeof(*cmd);
+	if (max_iovas % sizeof(cmd->out_valid_iovas[0]))
+		return -EINVAL;
+	max_iovas /= sizeof(cmd->out_valid_iovas[0]);
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	cmd->out_num_iovas = 0;
+	interval_tree_for_each_span (&span, &ioas->iopt.reserved_itree,
+				     0, ULONG_MAX) {
+		if (!span.is_hole)
+			continue;
+		if (cmd->out_num_iovas < max_iovas) {
+			rc = put_user((u64)span.start_hole,
+				      &uptr->out_valid_iovas[cmd->out_num_iovas]
+					       .start);
+			if (rc)
+				goto out_put;
+			rc = put_user(
+				(u64)span.last_hole,
+				&uptr->out_valid_iovas[cmd->out_num_iovas].last);
+			if (rc)
+				goto out_put;
+		}
+		cmd->out_num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->out_num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_ioas_load_iovas(struct rb_root_cached *itree,
+				   struct iommu_iova_range __user *ranges,
+				   u32 num)
+{
+	u32 i;
+
+	for (i = 0; i != num; i++) {
+		struct iommu_iova_range range;
+		struct iopt_allowed *allowed;
+
+		if (copy_from_user(&range, ranges + i, sizeof(range)))
+			return -EFAULT;
+
+		if (range.start >= range.last)
+			return -EINVAL;
+
+		if (interval_tree_iter_first(itree, range.start, range.last))
+			return -EINVAL;
+
+		allowed = kzalloc(sizeof(*allowed), GFP_KERNEL_ACCOUNT);
+		if (!allowed)
+			return -ENOMEM;
+		allowed->node.start = range.start;
+		allowed->node.last = range.last;
+
+		interval_tree_insert(&allowed->node, itree);
+	}
+	return 0;
+}
+
+int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_allow_iovas *cmd = ucmd->cmd;
+	struct rb_root_cached allowed_iova = RB_ROOT_CACHED;
+	struct interval_tree_node *node;
+	struct iommufd_ioas *ioas;
+	struct io_pagetable *iopt;
+	int rc = 0;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	iopt = &ioas->iopt;
+
+	rc = iommufd_ioas_load_iovas(&allowed_iova,
+				      u64_to_user_ptr(cmd->allowed_iovas),
+				      cmd->num_iovas);
+	if (rc)
+		goto out_free;
+
+	rc = iopt_set_allow_iova(iopt, &allowed_iova);
+out_free:
+	while ((node = interval_tree_iter_first(&allowed_iova, 0, ULONG_MAX))) {
+		interval_tree_remove(node, &allowed_iova);
+		kfree(container_of(node, struct iopt_allowed, node));
+	}
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int conv_iommu_prot(u32 map_flags)
+{
+	int iommu_prot;
+
+	/*
+	 * We provide no manual cache coherency ioctls to userspace and most
+	 * architectures make the CPU ops for cache flushing privileged.
+	 * Therefore we require the underlying IOMMU to support CPU coherent
+	 * operation. Support for IOMMU_CACHE is enforced by the
+	 * dev_is_dma_coherent() test during bind.
+	 */
+	iommu_prot = IOMMU_CACHE;
+	if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
+		iommu_prot |= IOMMU_WRITE;
+	if (map_flags & IOMMU_IOAS_MAP_READABLE)
+		iommu_prot |= IOMMU_READ;
+	return iommu_prot;
+}
+
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_map *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	unsigned int flags = 0;
+	unsigned long iova;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)) ||
+	    cmd->__reserved)
+		return -EOPNOTSUPP;
+	if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->iova;
+	rc = iopt_map_user_pages(&ioas->iopt, &iova,
+				 u64_to_user_ptr(cmd->user_va), cmd->length,
+				 conv_iommu_prot(cmd->flags), flags);
+	if (rc)
+		goto out_put;
+
+	cmd->iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_copy *cmd = ucmd->cmd;
+	struct iommufd_ioas *src_ioas;
+	struct iommufd_ioas *dst_ioas;
+	struct iopt_pages *pages;
+	unsigned int flags = 0;
+	unsigned long iova;
+	unsigned long start_byte;
+	int rc;
+
+	if ((cmd->flags &
+	     ~(IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_WRITEABLE |
+	       IOMMU_IOAS_MAP_READABLE)))
+		return -EOPNOTSUPP;
+	if (cmd->length >= ULONG_MAX)
+		return -EOVERFLOW;
+
+	src_ioas = iommufd_get_ioas(ucmd, cmd->src_ioas_id);
+	if (IS_ERR(src_ioas))
+		return PTR_ERR(src_ioas);
+	/* FIXME: copy is not limited to an exact match anymore */
+	pages = iopt_get_pages(&src_ioas->iopt, cmd->src_iova, &start_byte,
+			       cmd->length);
+	iommufd_put_object(&src_ioas->obj);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	dst_ioas = iommufd_get_ioas(ucmd, cmd->dst_ioas_id);
+	if (IS_ERR(dst_ioas)) {
+		iopt_put_pages(pages);
+		return PTR_ERR(dst_ioas);
+	}
+
+	if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+		flags = IOPT_ALLOC_IOVA;
+	iova = cmd->dst_iova;
+	rc = iopt_map_pages(&dst_ioas->iopt, pages, &iova, start_byte,
+			    cmd->length, conv_iommu_prot(cmd->flags), flags);
+	if (rc) {
+		iopt_put_pages(pages);
+		goto out_put_dst;
+	}
+
+	cmd->dst_iova = iova;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put_dst:
+	iommufd_put_object(&dst_ioas->obj);
+	return rc;
+}
+
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+	unsigned long unmapped = 0;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (cmd->iova == 0 && cmd->length == U64_MAX) {
+		rc = iopt_unmap_all(&ioas->iopt, &unmapped);
+		if (rc)
+			goto out_put;
+	} else {
+		if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX) {
+			rc = -EOVERFLOW;
+			goto out_put;
+		}
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length,
+				     &unmapped);
+		if (rc)
+			goto out_put;
+	}
+
+	cmd->length = unmapped;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 560ab06fbc3366..0ef6b9bf4916eb 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -92,6 +92,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_IOAS,
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -163,4 +164,31 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
 			     type),                                            \
 		     typeof(*(ptr)), obj)
 
+/*
+ * The IO Address Space (IOAS) pagetable is a virtual page table backed by the
+ * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
+ * mapping is copied into all of the associated domains and made available to
+ * in-kernel users.
+ */
+struct iommufd_ioas {
+	struct iommufd_object obj;
+	struct io_pagetable iopt;
+};
+
+static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
+						    u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_IOAS),
+			    struct iommufd_ioas, obj);
+}
+
+struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
+int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_ioas_destroy(struct iommufd_object *obj);
+int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index a5b1e2302ba59d..55b42eeb141b20 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -204,6 +204,11 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
 
 union ucmd_buffer {
 	struct iommu_destroy destroy;
+	struct iommu_ioas_alloc alloc;
+	struct iommu_ioas_allow_iovas allow_iovas;
+	struct iommu_ioas_iova_ranges iova_ranges;
+	struct iommu_ioas_map map;
+	struct iommu_ioas_unmap unmap;
 };
 
 struct iommufd_ioctl_op {
@@ -224,6 +229,18 @@ struct iommufd_ioctl_op {
 	}
 static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
+	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
+		 struct iommu_ioas_alloc, out_ioas_id),
+	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
+		 struct iommu_ioas_allow_iovas, allowed_iovas),
+	IOCTL_OP(IOMMU_IOAS_COPY, iommufd_ioas_copy, struct iommu_ioas_copy,
+		 src_iova),
+	IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
+		 struct iommu_ioas_iova_ranges, __reserved),
+	IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
+		 __reserved),
+	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
+		 length),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -310,6 +327,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_GPL(iommufd_ctx_put);
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_IOAS] = {
+		.destroy = iommufd_ioas_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2f7f76ec6db4cb..b7b0ac4016bb70 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -37,6 +37,12 @@
 enum {
 	IOMMUFD_CMD_BASE = 0x80,
 	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_ALLOW_IOVAS,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_UNMAP,
 };
 
 /**
@@ -52,4 +58,186 @@ struct iommu_destroy {
 };
 #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
 
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_iova_range
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
+ *                   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
+ * is not allowed. out_num_iovas will be set to the total number of iovas and
+ * the out_valid_iovas[] will be filled in as space permits. size should include
+ * the allocated flex array.
+ *
+ * The allowed ranges are dependent on the HW path the DMA operation takes, and
+ * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
+ * full range, and each attached device will narrow the ranges based on that
+ * device's HW restrictions. Detaching a device can widen the ranges. Userspace
+ * should query ranges after every attach/detach to know what IOVAs are valid
+ * for mapping.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 out_num_iovas;
+	__u32 __reserved;
+	struct iommu_valid_iovas {
+		__aligned_u64 start;
+		__aligned_u64 last;
+	} out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * struct iommu_ioas_allow_iovas - ioctl(IOMMU_IOAS_ALLOW_IOVAS)
+ * @size: sizeof(struct iommu_ioas_allow_iovas)
+ * @ioas_id: IOAS ID to allow IOVAs from
+ * @num_iovas: Number of elements in the allowed_iovas array
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to array of struct iommu_iova_range
+ *
+ * Ensure a range of IOVAs is always available for allocation. If this call
+ * succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges
+ * that are narrower than the ranges provided here. This call will fail if
+ * IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.
+ *
+ * When an IOAS is first created the IOVA_RANGES will be maximally sized, and as
+ * devices are attached the IOVA will narrow based on the device restrictions.
+ * When an allowed range is specified any narrowing will be refused, ie device
+ * attachment can fail if the device requires limiting within the allowed range.
+ *
+ * Automatic IOVA allocation is also impacted by this call, ie MAP will allocate
+ * within the allowed IOVAs if they are present.
+ *
+ * This call replaces the entire allowed list with the given list.
+ */
+struct iommu_ioas_allow_iovas {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+};
+#define IOMMU_IOAS_ALLOW_IOVAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOW_IOVAS)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ *
+ * This may be used to efficiently clone a subset of an IOAS to another, or as a
+ * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
+ * establishing equivalent new mappings, as internal resources are shared, and
+ * the kernel will pin the user memory only once.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap, and return back the bytes unmapped
+ *
+ * Unmap an IOVA range. The iova/length must be a superset of a previously
+ * mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or
+ * truncating ranges is not allowed. The values 0 to U64_MAX will unmap
+ * everything.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
 #endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 09/13] iommufd: Add a HW pagetable object
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

The hw_pagetable object exposes the kernel's internal struct iommu_domain
objects to userspace. An iommu_domain is required when any DMA device
attaches to an IOAS in order to control the IO page table through the iommu
driver.

For compatibility with VFIO the hw_pagetable is automatically created when
a DMA device is attached to the IOAS. If a compatible iommu_domain already
exists then the hw_pagetable associated with it is used for the
attachment.

In the initial series there is no iommufd uAPI for the hw_pagetable
object. The next patch provides driver-facing APIs for IO page table
attachment that allow drivers to accept either an IOAS or a hw_pagetable
ID and for the driver to return the hw_pagetable ID that was auto-selected
from an IOAS. The expectation is the driver will provide uAPI through its
own FD for attaching its device to iommufd. This allows userspace to learn
the mapping of devices to iommu_domains and to override the automatic
attachment.

The future HW-specific interface will allow userspace to create
hw_pagetable objects using iommu_domains with IOMMU-driver-specific
parameters. This infrastructure will allow linking those domains to IOASs
and devices.
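
A rough sketch of the auto-selection described above follows. This is an
editor's illustration, not code from the series: the real attach logic
arrives in the next patch, the function name is made up, and object
refcounting/finalization, attaching the group to a newly created domain and
iopt_table_add_domain() are all omitted. It only leans on things introduced
in this patch (ioas->mutex, ioas->hwpt_list, hwpt->auto_domain,
iommufd_hw_pagetable_alloc()) plus the core iommu_attach_group() API, where
"compatible" simply means the attach succeeds.

/* Editor's sketch only; see the caveats above. */
static struct iommufd_hw_pagetable *
example_auto_get_hwpt(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
		      struct iommu_group *group, struct device *dev)
{
	struct iommufd_hw_pagetable *hwpt;

	mutex_lock(&ioas->mutex);
	/* Reuse the first existing domain this group can attach to */
	list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
		if (!iommu_attach_group(hwpt->domain, group)) {
			mutex_unlock(&ioas->mutex);
			return hwpt;
		}
	}

	/* Otherwise wrap a freshly allocated iommu_domain in a hw_pagetable */
	hwpt = iommufd_hw_pagetable_alloc(ictx, ioas, dev);
	if (!IS_ERR(hwpt)) {
		hwpt->auto_domain = true;
		list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
	}
	mutex_unlock(&ioas->mutex);
	return hwpt;
}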

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |  1 +
 drivers/iommu/iommufd/hw_pagetable.c    | 68 +++++++++++++++++++++++++
 drivers/iommu/iommufd/ioas.c            | 20 ++++++++
 drivers/iommu/iommufd/iommufd_private.h | 36 +++++++++++++
 drivers/iommu/iommufd/main.c            |  3 ++
 5 files changed, 128 insertions(+)
 create mode 100644 drivers/iommu/iommufd/hw_pagetable.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2b4f36f1b72f9d..e13e971aa28c60 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
new file mode 100644
index 00000000000000..c7e05ec7a11380
--- /dev/null
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommu.h>
+
+#include "iommufd_private.h"
+
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_hw_pagetable *hwpt =
+		container_of(obj, struct iommufd_hw_pagetable, obj);
+
+	WARN_ON(!list_empty(&hwpt->devices));
+
+	iommu_domain_free(hwpt->domain);
+	refcount_dec(&hwpt->ioas->obj.users);
+	mutex_destroy(&hwpt->devices_lock);
+}
+
+/**
+ * iommufd_hw_pagetable_alloc() - Get an iommu_domain for a device
+ * @ictx: iommufd context
+ * @ioas: IOAS to associate the domain with
+ * @dev: Device to get an iommu_domain for
+ *
+ * Allocate a new iommu_domain and return it as a hw_pagetable.
+ */
+struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
+			   struct device *dev)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	hwpt = iommufd_object_alloc(ictx, hwpt, IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(hwpt))
+		return hwpt;
+
+	hwpt->domain = iommu_domain_alloc(dev->bus);
+	if (!hwpt->domain) {
+		rc = -ENOMEM;
+		goto out_abort;
+	}
+
+	/*
+	 * If the IOMMU can block non-coherent operations (ie PCIe TLPs with
+	 * no-snoop set) then always turn it on. We currently don't have a uAPI
+	 * to allow userspace to restore coherency if it wants to use no-snoop
+	 * TLPs.
+	 */
+	if (hwpt->domain->ops->enforce_cache_coherency)
+		hwpt->enforce_cache_coherency =
+			hwpt->domain->ops->enforce_cache_coherency(
+				hwpt->domain);
+
+	INIT_LIST_HEAD(&hwpt->devices);
+	INIT_LIST_HEAD(&hwpt->hwpt_item);
+	mutex_init(&hwpt->devices_lock);
+	/* Pairs with iommufd_hw_pagetable_destroy() */
+	refcount_inc(&ioas->obj.users);
+	hwpt->ioas = ioas;
+	return hwpt;
+
+out_abort:
+	iommufd_object_abort(ictx, &hwpt->obj);
+	return ERR_PTR(rc);
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index f9f545158a4891..42b9a04188a116 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -17,6 +17,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
 	rc = iopt_unmap_all(&ioas->iopt, NULL);
 	WARN_ON(rc && rc != -ENOENT);
 	iopt_destroy_table(&ioas->iopt);
+	mutex_destroy(&ioas->mutex);
 }
 
 struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
@@ -31,6 +32,9 @@ struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
 	rc = iopt_init_table(&ioas->iopt);
 	if (rc)
 		goto out_abort;
+
+	INIT_LIST_HEAD(&ioas->hwpt_list);
+	mutex_init(&ioas->mutex);
 	return ioas;
 
 out_abort:
@@ -314,3 +318,19 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
+
+bool iommufd_ioas_enforced_coherent(struct iommufd_ioas *ioas)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	bool ret = true;
+
+	mutex_lock(&ioas->mutex);
+	list_for_each_entry(hwpt, &ioas->hwpt_list, hwpt_item) {
+		if (!hwpt->enforce_cache_coherency) {
+			ret = false;
+			break;
+		}
+	}
+	mutex_unlock(&ioas->mutex);
+	return ret;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 0ef6b9bf4916eb..4f628800bc2b71 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -92,6 +92,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 };
 
@@ -169,10 +170,20 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
  * io_pagetable object. It is a user controlled mapping of IOVA -> PFNs. The
  * mapping is copied into all of the associated domains and made available to
  * in-kernel users.
+ *
+ * Every iommu_domain that is created is wrapped in an iommufd_hw_pagetable
+ * object. When we go to attach a device to an IOAS we need to get an
+ * iommu_domain and the iommufd_hw_pagetable that wraps it.
+ *
+ * An iommu_domain & iommufd_hw_pagetable will be automatically selected
+ * for a device based on the hwpt_list. If no suitable iommu_domain
+ * is found a new iommu_domain will be created.
  */
 struct iommufd_ioas {
 	struct iommufd_object obj;
 	struct io_pagetable iopt;
+	struct mutex mutex;
+	struct list_head hwpt_list;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
@@ -182,6 +193,7 @@ static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
 					       IOMMUFD_OBJ_IOAS),
 			    struct iommufd_ioas, obj);
 }
+bool iommufd_ioas_enforced_coherent(struct iommufd_ioas *ioas);
 
 struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx);
 int iommufd_ioas_alloc_ioctl(struct iommufd_ucmd *ucmd);
@@ -191,4 +203,28 @@ int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+
+/*
+ * A HW pagetable is called an iommu_domain inside the kernel. This user object
+ * allows directly creating and inspecting the domains. Domains that have kernel
+ * owned page tables will be associated with an iommufd_ioas that provides the
+ * IOVA to PFN map.
+ */
+struct iommufd_hw_pagetable {
+	struct iommufd_object obj;
+	struct iommufd_ioas *ioas;
+	struct iommu_domain *domain;
+	bool auto_domain : 1;
+	bool enforce_cache_coherency : 1;
+	/* Head at iommufd_ioas::hwpt_list */
+	struct list_head hwpt_item;
+	struct mutex devices_lock;
+	struct list_head devices;
+};
+
+struct iommufd_hw_pagetable *
+iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
+			   struct device *dev);
+void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
+
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 55b42eeb141b20..2a9b581cacffb6 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -330,6 +330,9 @@ static struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
+	[IOMMUFD_OBJ_HW_PAGETABLE] = {
+		.destroy = iommufd_hw_pagetable_destroy,
+	},
 };
 
 static struct miscdevice iommu_misc_dev = {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 10/13] iommufd: Add kAPI toward external drivers for physical devices
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Add the four functions external drivers need to connect physical DMA to
the IOMMUFD:

iommufd_device_bind() / iommufd_device_unbind()
  Register the device with iommufd and establish security isolation.

iommufd_device_attach() / iommufd_device_detach()
  Connect a bound device to a page table

Binding a device creates a device object ID in the uAPI; however, the
generic API provides no IOCTLs to manipulate it.
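
As a rough usage sketch (the "my_driver" structure and function names below
are hypothetical, only the iommufd_* calls come from this patch, and error
unwinding is trimmed), an external driver would consume the four functions
along these lines:

#include <linux/iommufd.h>

struct my_driver {
	struct device *dev;
	struct iommufd_device *idev;
};

static int my_driver_enable_iommufd(struct my_driver *md,
				    struct iommufd_ctx *ictx, u32 *pt_id)
{
	u32 dev_id;
	int rc;

	/* Take ownership of the device's IOMMU group */
	md->idev = iommufd_device_bind(ictx, md->dev, &dev_id);
	if (IS_ERR(md->idev))
		return PTR_ERR(md->idev);

	/*
	 * *pt_id is the IOAS or HW pagetable ID provided by userspace; on
	 * success it is updated to the hw_pagetable that was attached.
	 */
	rc = iommufd_device_attach(md->idev, pt_id, 0);
	if (rc) {
		iommufd_device_unbind(md->idev);
		return rc;
	}
	return 0;
}

static void my_driver_disable_iommufd(struct my_driver *md)
{
	iommufd_device_detach(md->idev);
	iommufd_device_unbind(md->idev);
}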

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   1 +
 drivers/iommu/iommufd/device.c          | 396 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |   4 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  14 +
 5 files changed, 418 insertions(+)
 create mode 100644 drivers/iommu/iommufd/device.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index e13e971aa28c60..ca28a135b9675f 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 iommufd-y := \
+	device.o \
 	hw_pagetable.o \
 	io_pagetable.o \
 	ioas.o \
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
new file mode 100644
index 00000000000000..23b101db846f40
--- /dev/null
+++ b/drivers/iommu/iommufd/device.c
@@ -0,0 +1,396 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
+#include <linux/pci.h>
+#include <linux/irqdomain.h>
+#include <linux/dma-iommu.h>
+#include <linux/dma-map-ops.h>
+
+#include "iommufd_private.h"
+
+/*
+ * An iommufd_device object represents the binding relationship between a
+ * consuming driver and the iommufd. These objects are created/destroyed by
+ * external drivers, not by userspace.
+ */
+struct iommufd_device {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_hw_pagetable *hwpt;
+	/* Head at iommufd_hw_pagetable::devices */
+	struct list_head devices_item;
+	/* always the physical device */
+	struct device *dev;
+	struct iommu_group *group;
+};
+
+void iommufd_device_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_device *idev =
+		container_of(obj, struct iommufd_device, obj);
+
+	iommu_group_release_dma_owner(idev->group);
+	iommu_group_put(idev->group);
+	iommufd_ctx_put(idev->ictx);
+}
+
+/**
+ * iommufd_device_bind - Bind a physical device to an iommu fd
+ * @ictx: iommufd file descriptor
+ * @dev: Pointer to a physical PCI device struct
+ * @id: Output ID number to return to userspace for this device
+ *
+ * A successful bind establishes an ownership over the device and returns
+ * struct iommufd_device pointer, otherwise returns error pointer.
+ *
+ * A driver using this API must set driver_managed_dma and must not touch
+ * the device until this routine succeeds and establishes ownership.
+ *
+ * Binding a PCI device places the entire RID under iommufd control.
+ *
+ * The caller must undo this with iommufd_device_unbind()
+ */
+struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
+					   struct device *dev, u32 *id)
+{
+	struct iommufd_device *idev;
+	struct iommu_group *group;
+	int rc;
+
+       /*
+        * iommufd always sets IOMMU_CACHE because we offer no way for userspace
+        * to restore cache coherency.
+        */
+	if (!iommu_capable(dev->bus, IOMMU_CAP_CACHE_COHERENCY))
+		return ERR_PTR(-EINVAL);
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return ERR_PTR(-ENODEV);
+
+	/*
+	 * FIXME: Use a device-centric iommu api, this won't work with
+	 * multi-device groups
+	 */
+	rc = iommu_group_claim_dma_owner(group, ictx);
+	if (rc)
+		goto out_group_put;
+
+	idev = iommufd_object_alloc(ictx, idev, IOMMUFD_OBJ_DEVICE);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_release_owner;
+	}
+	idev->ictx = ictx;
+	iommufd_ctx_get(ictx);
+	idev->dev = dev;
+	/* The calling driver is a user until iommufd_device_unbind() */
+	refcount_inc(&idev->obj.users);
+	/* group refcount moves into iommufd_device */
+	idev->group = group;
+
+	/*
+	 * If the caller fails after this success it must call
+	 * iommufd_device_unbind() which is safe since we hold this refcount.
+	 * This also means the device is a leaf in the graph and no other object
+	 * can take a reference on it.
+	 */
+	iommufd_object_finalize(ictx, &idev->obj);
+	*id = idev->obj.id;
+	return idev;
+
+out_release_owner:
+	iommu_group_release_dma_owner(group);
+out_group_put:
+	iommu_group_put(group);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_bind);
+
+void iommufd_device_unbind(struct iommufd_device *idev)
+{
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(idev->ictx, &idev->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_unbind);
+
+/**
+ * iommufd_device_enforced_coherent - True if no-snoop TLPs are blocked
+ * @idev: device to query
+ *
+ * This can only be called if the device is attached, and the caller must ensure
+ * that this is not raced with iommufd_device_attach() /
+ * iommufd_device_detach().
+ */
+bool iommufd_device_enforced_coherent(struct iommufd_device *idev)
+{
+	return iommufd_ioas_enforced_coherent(idev->hwpt->ioas);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_enforced_coherent);
+
+static int iommufd_device_setup_msi(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *hwpt,
+				    phys_addr_t sw_msi_start,
+				    unsigned int flags)
+{
+	int rc;
+
+	/*
+	 * IOMMU_CAP_INTR_REMAP means that the platform is isolating MSI,
+	 * nothing further to do.
+	 */
+	if (iommu_capable(idev->dev->bus, IOMMU_CAP_INTR_REMAP))
+		return 0;
+
+	/*
+	 * On ARM systems that set the global IRQ_DOMAIN_FLAG_MSI_REMAP every
+	 * allocated iommu_domain will block interrupts by default and this
+	 * special flow is needed to turn them back on.
+	 */
+	if (irq_domain_check_msi_remap()) {
+		if (WARN_ON(!sw_msi_start))
+			return -EPERM;
+		/*
+		 * iommu_get_msi_cookie() can only be called once per domain,
+		 * it returns -EBUSY on later calls.
+		 */
+		if (hwpt->msi_cookie)
+			return 0;
+		rc = iommu_get_msi_cookie(hwpt->domain, sw_msi_start);
+		if (rc && rc != -ENODEV)
+			return rc;
+		hwpt->msi_cookie = true;
+		return 0;
+	}
+
+	/*
+	 * Otherwise the platform has an MSI window that is not isolated. For
+	 * historical compat with VFIO allow a module parameter to ignore the
+	 * insecurity.
+	 */
+	if (!(flags & IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT))
+		return -EPERM;
+	return 0;
+}
+
+static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
+					   struct iommu_group *group)
+{
+	struct iommufd_device *cur_dev;
+
+	list_for_each_entry (cur_dev, &hwpt->devices, devices_item)
+		if (cur_dev->group == group)
+			return true;
+	return false;
+}
+
+static int iommufd_device_do_attach(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *hwpt,
+				    unsigned int flags)
+{
+	int rc;
+
+	mutex_lock(&hwpt->devices_lock);
+	/*
+	 * FIXME: Use a device-centric iommu api. For now check if the
+	 * hw_pagetable already has a device of the same group joined to tell if
+	 * we are the first and need to attach the group.
+	 */
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		phys_addr_t sw_msi_start = 0;
+
+		rc = iommu_attach_group(hwpt->domain, idev->group);
+		if (rc)
+			goto out_unlock;
+
+		/*
+		 * hwpt is now the exclusive owner of the group so this is the
+		 * first time enforce is called for this group.
+		 */
+		rc = iopt_table_enforce_group_resv_regions(
+			&hwpt->ioas->iopt, idev->group, &sw_msi_start);
+		if (rc)
+			goto out_detach;
+		rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
+		if (rc)
+			goto out_iova;
+
+		if (list_empty(&hwpt->devices)) {
+			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
+						   hwpt->domain);
+			if (rc)
+				goto out_iova;
+		}
+	}
+
+	idev->hwpt = hwpt;
+	refcount_inc(&hwpt->obj.users);
+	list_add(&idev->devices_item, &hwpt->devices);
+	mutex_unlock(&hwpt->devices_lock);
+	return 0;
+
+out_iova:
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+out_detach:
+	iommu_detach_group(hwpt->domain, idev->group);
+out_unlock:
+	mutex_unlock(&hwpt->devices_lock);
+	return rc;
+}
+
+/*
+ * When automatically managing the domains we search for a compatible domain in
+ * the iopt and if one is found use it, otherwise create a new domain.
+ * Automatic domain selection will never pick a manually created domain.
+ */
+static int iommufd_device_auto_get_domain(struct iommufd_device *idev,
+					  struct iommufd_ioas *ioas,
+					  unsigned int flags)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	/*
+	 * There is no differentiation when domains are allocated, so any domain
+	 * that is willing to attach to the device is interchangeable with any
+	 * other.
+	 */
+	mutex_lock(&ioas->mutex);
+	list_for_each_entry (hwpt, &ioas->hwpt_list, hwpt_item) {
+		if (!hwpt->auto_domain ||
+		    !refcount_inc_not_zero(&hwpt->obj.users))
+			continue;
+
+		/*
+		 * FIXME: if the group is already attached to a domain make sure
+		 * this returns EMEDIUMTYPE
+		 */
+		rc = iommufd_device_do_attach(idev, hwpt, flags);
+		refcount_dec(&hwpt->obj.users);
+		if (rc) {
+			if (rc == -EMEDIUMTYPE)
+				continue;
+			goto out_unlock;
+		}
+		goto out_unlock;
+	}
+
+	hwpt = iommufd_hw_pagetable_alloc(idev->ictx, ioas, idev->dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_unlock;
+	}
+	hwpt->auto_domain = true;
+
+	rc = iommufd_device_do_attach(idev, hwpt, flags);
+	if (rc)
+		goto out_abort;
+	list_add_tail(&hwpt->hwpt_item, &ioas->hwpt_list);
+
+	mutex_unlock(&ioas->mutex);
+	iommufd_object_finalize(idev->ictx, &hwpt->obj);
+	return 0;
+
+out_abort:
+	iommufd_object_abort_and_destroy(idev->ictx, &hwpt->obj);
+out_unlock:
+	mutex_unlock(&ioas->mutex);
+	return rc;
+}
+
+/**
+ * iommufd_device_attach - Connect a device to an iommu_domain
+ * @idev: device to attach
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
+ *         Output the IOMMUFD_OBJ_HW_PAGETABLE ID
+ * @flags: Optional flags
+ *
+ * This connects the device to an iommu_domain, either automatically or manually
+ * selected. Once this completes the device could do DMA.
+ *
+ * The caller should return the resulting pt_id back to userspace.
+ * This function is undone by calling iommufd_device_detach().
+ */
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags)
+{
+	struct iommufd_object *pt_obj;
+	int rc;
+
+	pt_obj = iommufd_get_object(idev->ictx, *pt_id, IOMMUFD_OBJ_ANY);
+	if (IS_ERR(pt_obj))
+		return PTR_ERR(pt_obj);
+
+	switch (pt_obj->type) {
+	case IOMMUFD_OBJ_HW_PAGETABLE: {
+		struct iommufd_hw_pagetable *hwpt =
+			container_of(pt_obj, struct iommufd_hw_pagetable, obj);
+
+		rc = iommufd_device_do_attach(idev, hwpt, flags);
+		if (rc)
+			goto out_put_pt_obj;
+
+		mutex_lock(&hwpt->ioas->mutex);
+		list_add_tail(&hwpt->hwpt_item, &hwpt->ioas->hwpt_list);
+		mutex_unlock(&hwpt->ioas->mutex);
+		break;
+	}
+	case IOMMUFD_OBJ_IOAS: {
+		struct iommufd_ioas *ioas =
+			container_of(pt_obj, struct iommufd_ioas, obj);
+
+		rc = iommufd_device_auto_get_domain(idev, ioas, flags);
+		if (rc)
+			goto out_put_pt_obj;
+		break;
+	}
+	default:
+		rc = -EINVAL;
+		goto out_put_pt_obj;
+	}
+
+	refcount_inc(&idev->obj.users);
+	*pt_id = idev->hwpt->obj.id;
+	rc = 0;
+
+out_put_pt_obj:
+	iommufd_put_object(pt_obj);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(iommufd_device_attach);
+
+void iommufd_device_detach(struct iommufd_device *idev)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+	mutex_lock(&hwpt->ioas->mutex);
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
+		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+		iommu_detach_group(hwpt->domain, idev->group);
+	}
+	mutex_unlock(&hwpt->devices_lock);
+	mutex_unlock(&hwpt->ioas->mutex);
+
+	if (hwpt->auto_domain)
+		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+	else
+		refcount_dec(&hwpt->obj.users);
+
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_detach);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 4f628800bc2b71..0ede92b0aa32b4 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -92,6 +92,7 @@ static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 enum iommufd_object_type {
 	IOMMUFD_OBJ_NONE,
 	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
+	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 };
@@ -216,6 +217,7 @@ struct iommufd_hw_pagetable {
 	struct iommu_domain *domain;
 	bool auto_domain : 1;
 	bool enforce_cache_coherency : 1;
+	bool msi_cookie : 1;
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
 	struct mutex devices_lock;
@@ -227,4 +229,6 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct device *dev);
 void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
+void iommufd_device_destroy(struct iommufd_object *obj);
+
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 2a9b581cacffb6..b09dbfc8009dc2 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -327,6 +327,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_GPL(iommufd_ctx_put);
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_DEVICE] = {
+		.destroy = iommufd_device_destroy,
+	},
 	[IOMMUFD_OBJ_IOAS] = {
 		.destroy = iommufd_ioas_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 9c6ec4d66b4a92..477c3ea098f637 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -9,12 +9,26 @@
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/err.h>
+#include <linux/device.h>
 
 struct page;
+struct iommufd_device;
 struct iommufd_ctx;
 struct io_pagetable;
 struct file;
 
+struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
+					   struct device *dev, u32 *id);
+void iommufd_device_unbind(struct iommufd_device *idev);
+bool iommufd_device_enforced_coherent(struct iommufd_device *idev);
+
+enum {
+	IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT = 1 << 0,
+};
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+			  unsigned int flags);
+void iommufd_device_detach(struct iommufd_device *idev);
+
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long length, struct page **out_pages,
 		      bool write);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 11/13] iommufd: Add kAPI toward external drivers for kernel access
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Kernel access is the mode that VFIO "mdevs" use. In this case there is no
struct device and no IOMMU connection. iommufd acts as a record keeper for
accesses and returns the actual struct pages back to the caller to use
however they need, e.g. with kmap or the DMA API.

Each caller must create a struct iommufd_access with
iommufd_access_create(), similar to how iommufd_device_bind() works. Using
this struct the caller can access blocks of IOVA using
iommufd_access_pin_pages() or iommufd_access_rw().

Callers must provide a callback that immediately unpins any IOVA being
used within a range. This happens if userspace unmaps the IOVA under the
pin.

The implementation forwards the access requests directly to the same
iopt infrastructure that manages the iopt_pages_user.
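
As a rough usage sketch (the "my_mdev" names below are hypothetical, only
the iommufd_access_* calls come from this patch), an mdev-style driver
would consume the access kAPI along these lines:

#include <linux/iommufd.h>
#include <linux/mm.h>

struct my_mdev {
	struct iommufd_access *access;
	struct page *ring_page;
	unsigned long ring_iova;
};

/* Called when userspace unmaps IOVA that the driver may have pinned */
static void my_mdev_unmap(void *data, unsigned long iova,
			  unsigned long length)
{
	struct my_mdev *mdev = data;

	/* Must drop any pins inside [iova, iova + length) before returning */
	if (mdev->ring_iova >= iova && mdev->ring_iova < iova + length)
		iommufd_access_unpin_pages(mdev->access, mdev->ring_iova,
					   PAGE_SIZE);
}

static const struct iommufd_access_ops my_mdev_access_ops = {
	.unmap = my_mdev_unmap,
};

static int my_mdev_start(struct my_mdev *mdev, struct iommufd_ctx *ictx,
			 u32 ioas_id)
{
	mdev->access = iommufd_access_create(ictx, ioas_id,
					     &my_mdev_access_ops, mdev);
	if (IS_ERR(mdev->access))
		return PTR_ERR(mdev->access);

	/* Pin one page of IOVA and get the struct page back for kmap/DMA */
	return iommufd_access_pin_pages(mdev->access, mdev->ring_iova,
					PAGE_SIZE, &mdev->ring_page, true);
}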

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 123 ++++++++++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    |   7 +-
 drivers/iommu/iommufd/ioas.c            |   2 +
 drivers/iommu/iommufd/iommufd_private.h |   5 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  40 ++++++++
 6 files changed, 178 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 23b101db846f40..d34bdbcb84a40d 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -394,3 +394,126 @@ void iommufd_device_detach(struct iommufd_device *idev)
 	refcount_dec(&idev->obj.users);
 }
 EXPORT_SYMBOL_GPL(iommufd_device_detach);
+
+struct iommufd_access_priv {
+	struct iommufd_object obj;
+	struct iommufd_access pub;
+	struct iommufd_ctx *ictx;
+	struct iommufd_ioas *ioas;
+	const struct iommufd_access_ops *ops;
+	void *data;
+	u32 ioas_access_list_id;
+};
+
+void iommufd_access_destroy_object(struct iommufd_object *obj)
+{
+	struct iommufd_access_priv *access =
+		container_of(obj, struct iommufd_access_priv, obj);
+
+	WARN_ON(xa_erase(&access->ioas->access_list,
+			 access->ioas_access_list_id) != access);
+	iommufd_ctx_put(access->ictx);
+	refcount_dec(&access->ioas->obj.users);
+}
+
+struct iommufd_access *
+iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+		      const struct iommufd_access_ops *ops, void *data)
+{
+	struct iommufd_access_priv *access;
+	struct iommufd_object *obj;
+	int rc;
+
+	/*
+	 * FIXME: should this be an object? It is much like a device but I can't
+	 * foresee a use for it right now. On the other hand it costs almost
+	 * nothing to do, so may as well..
+	 */
+	access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
+	if (IS_ERR(access))
+		return &access->pub;
+
+	obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(obj)) {
+		rc = PTR_ERR(obj);
+		goto out_abort;
+	}
+	access->ioas = container_of(obj, struct iommufd_ioas, obj);
+	iommufd_put_object_keep_user(obj);
+
+	rc = xa_alloc(&access->ioas->access_list, &access->ioas_access_list_id,
+		      access, xa_limit_16b, GFP_KERNEL_ACCOUNT);
+	if (rc)
+		goto out_put_ioas;
+
+	/* The calling driver is a user until iommufd_access_destroy() */
+	refcount_inc(&access->obj.users);
+	access->ictx = ictx;
+	access->data = data;
+	access->pub.iopt = &access->ioas->iopt;
+	iommufd_ctx_get(ictx);
+	iommufd_object_finalize(ictx, &access->obj);
+	return &access->pub;
+out_put_ioas:
+	refcount_dec(&access->ioas->obj.users);
+out_abort:
+	iommufd_object_abort(ictx, &access->obj);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(iommufd_access_create);
+
+void iommufd_access_destroy(struct iommufd_access *access_pub)
+{
+	struct iommufd_access_priv *access =
+		container_of(access_pub, struct iommufd_access_priv, pub);
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(access->ictx, &access->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_GPL(iommufd_access_destroy);
+
+/**
+ * iommufd_access_notify_unmap - Notify users of an iopt to stop using it
+ * @iopt - iopt to work on
+ * @iova - Starting iova in the iopt
+ * @length - Number of bytes
+ *
+ * After this function returns there should be no users attached to the pages
+ * linked to this iopt that intersect with iova,length. Anyone that has attached
+ * a user through iopt_access_pages() needs to detach it through
+ * iommufd_access_unpin_pages() before this function returns.
+ *
+ * The unmap callback may not call or wait for an iommufd_access_destroy() to
+ * complete. Once iommufd_access_destroy() returns no ops are running and no
+ * future ops will be called.
+ */
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+				 unsigned long length)
+{
+	struct iommufd_ioas *ioas =
+		container_of(iopt, struct iommufd_ioas, iopt);
+	struct iommufd_access_priv *access;
+	unsigned long index;
+
+	xa_lock(&ioas->access_list);
+	xa_for_each(&ioas->access_list, index, access) {
+		if (!iommufd_lock_obj(&access->obj))
+			continue;
+		xa_unlock(&ioas->access_list);
+
+		access->ops->unmap(access->data, iova, length);
+
+		iommufd_put_object(&access->obj);
+		xa_lock(&ioas->access_list);
+	}
+	xa_unlock(&ioas->access_list);
+}
+
+int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, bool write)
+{
+	/* FIXME implement me */
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(iommufd_access_rw);
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 7434bc8b393bbd..dfc7362b78c6fb 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -349,6 +349,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 	 * is NULL. This prevents domain attach/detatch from running
 	 * concurrently with cleaning up the area.
 	 */
+again:
 	down_read(&iopt->domains_rwsem);
 	down_write(&iopt->iova_rwsem);
 	while ((area = iopt_area_iter_first(iopt, start, end))) {
@@ -377,8 +378,10 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 			area->prevent_users = true;
 			up_write(&iopt->iova_rwsem);
 			up_read(&iopt->domains_rwsem);
-			/* Later patch calls back to drivers to unmap */
-			return -EBUSY;
+			iommufd_access_notify_unmap(iopt, area_first,
+						    iopt_area_length(area));
+			WARN_ON(READ_ONCE(area->num_users));
+			goto again;
 		}
 
 		pages = area->pages;
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 42b9a04188a116..7222af13551828 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -17,6 +17,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
 	rc = iopt_unmap_all(&ioas->iopt, NULL);
 	WARN_ON(rc && rc != -ENOENT);
 	iopt_destroy_table(&ioas->iopt);
+	WARN_ON(!xa_empty(&ioas->access_list));
 	mutex_destroy(&ioas->mutex);
 }
 
@@ -35,6 +36,7 @@ struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
 
 	INIT_LIST_HEAD(&ioas->hwpt_list);
 	mutex_init(&ioas->mutex);
+	xa_init_flags(&ioas->access_list, XA_FLAGS_ALLOC);
 	return ioas;
 
 out_abort:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 0ede92b0aa32b4..540b36c0befa5e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -52,6 +52,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length, unsigned long *unmapped);
 int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
 
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+				 unsigned long length);
 int iopt_table_add_domain(struct io_pagetable *iopt,
 			  struct iommu_domain *domain);
 void iopt_table_remove_domain(struct io_pagetable *iopt,
@@ -95,6 +97,7 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
+	IOMMUFD_OBJ_ACCESS,
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -185,6 +188,7 @@ struct iommufd_ioas {
 	struct io_pagetable iopt;
 	struct mutex mutex;
 	struct list_head hwpt_list;
+	struct xarray access_list;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
@@ -231,4 +235,5 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
 void iommufd_device_destroy(struct iommufd_object *obj);
 
+void iommufd_access_destroy_object(struct iommufd_object *obj);
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index b09dbfc8009dc2..ed64b84b3b9337 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -327,6 +327,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_GPL(iommufd_ctx_put);
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_ACCESS] = {
+		.destroy = iommufd_access_destroy_object,
+	},
 	[IOMMUFD_OBJ_DEVICE] = {
 		.destroy = iommufd_device_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 477c3ea098f637..c072e400f3e645 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -13,10 +13,15 @@
 
 struct page;
 struct iommufd_device;
+struct iommufd_access;
 struct iommufd_ctx;
 struct io_pagetable;
 struct file;
 
+struct iommufd_access {
+	struct io_pagetable *iopt;
+};
+
 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
 					   struct device *dev, u32 *id);
 void iommufd_device_unbind(struct iommufd_device *idev);
@@ -29,17 +34,46 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
 			  unsigned int flags);
 void iommufd_device_detach(struct iommufd_device *idev);
 
+struct iommufd_access_ops {
+	void (*unmap)(void *data, unsigned long iova, unsigned long length);
+};
+
+struct iommufd_access *
+iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+		      const struct iommufd_access_ops *ops, void *data);
+void iommufd_access_destroy(struct iommufd_access *access);
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long length, struct page **out_pages,
 		      bool write);
 void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
 			 unsigned long length);
 
+static inline int iommufd_access_pin_pages(struct iommufd_access *access,
+					   unsigned long iova,
+					   unsigned long length,
+					   struct page **out_pages, bool write)
+{
+	if (!IS_ENABLED(CONFIG_IOMMUFD))
+		return -EOPNOTSUPP;
+	return iopt_access_pages(access->iopt, iova, length, out_pages, write);
+}
+
+static inline void iommufd_access_unpin_pages(struct iommufd_access *access,
+					      unsigned long iova,
+					      unsigned long length)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD))
+		iopt_unaccess_pages(access->iopt, iova, length);
+}
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
 struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
 void iommufd_ctx_put(struct iommufd_ctx *ictx);
+
+int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, bool write);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -49,5 +83,11 @@ static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
 }
+
+static inline int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, bool write)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 11/13] iommufd: Add kAPI toward external drivers for kernel access
@ 2022-09-02 19:59   ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Kernel access is the mode that VFIO "mdevs" use. In this case there is no
struct device and no IOMMU connection. iommufd acts as a record keeper for
accesses and returns the actual struct pages back to the caller to use
however they need, e.g. with kmap or the DMA API.

Each caller must create a struct iommufd_access with
iommufd_access_create(), similar to how iommufd_device_bind() works. Using
this struct the caller can access blocks of IOVA using
iommufd_access_pin_pages() or iommufd_access_rw().

Callers must provide a callback that immediately unpins any IOVA being
used within a range. This happens if userspace unmaps the IOVA under the
pin.

The implementation forwards the access requests directly to the same
iopt infrastructure that manages the iopt_pages_user.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 123 ++++++++++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    |   7 +-
 drivers/iommu/iommufd/ioas.c            |   2 +
 drivers/iommu/iommufd/iommufd_private.h |   5 +
 drivers/iommu/iommufd/main.c            |   3 +
 include/linux/iommufd.h                 |  40 ++++++++
 6 files changed, 178 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 23b101db846f40..d34bdbcb84a40d 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -394,3 +394,126 @@ void iommufd_device_detach(struct iommufd_device *idev)
 	refcount_dec(&idev->obj.users);
 }
 EXPORT_SYMBOL_GPL(iommufd_device_detach);
+
+struct iommufd_access_priv {
+	struct iommufd_object obj;
+	struct iommufd_access pub;
+	struct iommufd_ctx *ictx;
+	struct iommufd_ioas *ioas;
+	const struct iommufd_access_ops *ops;
+	void *data;
+	u32 ioas_access_list_id;
+};
+
+void iommufd_access_destroy_object(struct iommufd_object *obj)
+{
+	struct iommufd_access_priv *access =
+		container_of(obj, struct iommufd_access_priv, obj);
+
+	WARN_ON(xa_erase(&access->ioas->access_list,
+			 access->ioas_access_list_id) != access);
+	iommufd_ctx_put(access->ictx);
+	refcount_dec(&access->ioas->obj.users);
+}
+
+struct iommufd_access *
+iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+		      const struct iommufd_access_ops *ops, void *data)
+{
+	struct iommufd_access_priv *access;
+	struct iommufd_object *obj;
+	int rc;
+
+	/*
+	 * FIXME: should this be an object? It is much like a device but I can't
+	 * foresee a use for it right now. On the other hand it costs almost
+	 * nothing to do, so may as well.
+	 */
+	access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
+	if (IS_ERR(access))
+		return ERR_CAST(access);
+
+	obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
+	if (IS_ERR(obj)) {
+		rc = PTR_ERR(obj);
+		goto out_abort;
+	}
+	access->ioas = container_of(obj, struct iommufd_ioas, obj);
+	iommufd_put_object_keep_user(obj);
+
+	rc = xa_alloc(&access->ioas->access_list, &access->ioas_access_list_id,
+		      access, xa_limit_16b, GFP_KERNEL_ACCOUNT);
+	if (rc)
+		goto out_put_ioas;
+
+	/* The calling driver is a user until iommufd_access_destroy() */
+	refcount_inc(&access->obj.users);
+	access->ictx = ictx;
+	access->data = data;
+	access->pub.iopt = &access->ioas->iopt;
+	iommufd_ctx_get(ictx);
+	iommufd_object_finalize(ictx, &access->obj);
+	return &access->pub;
+out_put_ioas:
+	refcount_dec(&access->ioas->obj.users);
+out_abort:
+	iommufd_object_abort(ictx, &access->obj);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(iommufd_access_create);
+
+void iommufd_access_destroy(struct iommufd_access *access_pub)
+{
+	struct iommufd_access_priv *access =
+		container_of(access_pub, struct iommufd_access_priv, pub);
+	bool was_destroyed;
+
+	was_destroyed = iommufd_object_destroy_user(access->ictx, &access->obj);
+	WARN_ON(!was_destroyed);
+}
+EXPORT_SYMBOL_GPL(iommufd_access_destroy);
+
+/**
+ * iommufd_access_notify_unmap - Notify users of an iopt to stop using it
+ * @iopt: iopt to work on
+ * @iova: Starting iova in the iopt
+ * @length: Number of bytes
+ *
+ * After this function returns there should be no users attached to the pages
+ * linked to this iopt that intersect with iova,length. Anyone that has attached
+ * a user through iopt_access_pages() needs to detach it through
+ * iommufd_access_unpin_pages() before this function returns.
+ *
+ * The unmap callback must not call or wait for iommufd_access_destroy() to
+ * complete. Once iommufd_access_destroy() returns no ops are running and no
+ * future ops will be called.
+ */
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+				 unsigned long length)
+{
+	struct iommufd_ioas *ioas =
+		container_of(iopt, struct iommufd_ioas, iopt);
+	struct iommufd_access_priv *access;
+	unsigned long index;
+
+	xa_lock(&ioas->access_list);
+	xa_for_each(&ioas->access_list, index, access) {
+		if (!iommufd_lock_obj(&access->obj))
+			continue;
+		xa_unlock(&ioas->access_list);
+
+		access->ops->unmap(access->data, iova, length);
+
+		iommufd_put_object(&access->obj);
+		xa_lock(&ioas->access_list);
+	}
+	xa_unlock(&ioas->access_list);
+}
+
+int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, bool write)
+{
+	/* FIXME implement me */
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(iommufd_access_rw);
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 7434bc8b393bbd..dfc7362b78c6fb 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -349,6 +349,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 	 * is NULL. This prevents domain attach/detatch from running
 	 * concurrently with cleaning up the area.
 	 */
+again:
 	down_read(&iopt->domains_rwsem);
 	down_write(&iopt->iova_rwsem);
 	while ((area = iopt_area_iter_first(iopt, start, end))) {
@@ -377,8 +378,10 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 			area->prevent_users = true;
 			up_write(&iopt->iova_rwsem);
 			up_read(&iopt->domains_rwsem);
-			/* Later patch calls back to drivers to unmap */
-			return -EBUSY;
+			iommufd_access_notify_unmap(iopt, area_first,
+						    iopt_area_length(area));
+			WARN_ON(READ_ONCE(area->num_users));
+			goto again;
 		}
 
 		pages = area->pages;
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 42b9a04188a116..7222af13551828 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -17,6 +17,7 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
 	rc = iopt_unmap_all(&ioas->iopt, NULL);
 	WARN_ON(rc && rc != -ENOENT);
 	iopt_destroy_table(&ioas->iopt);
+	WARN_ON(!xa_empty(&ioas->access_list));
 	mutex_destroy(&ioas->mutex);
 }
 
@@ -35,6 +36,7 @@ struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
 
 	INIT_LIST_HEAD(&ioas->hwpt_list);
 	mutex_init(&ioas->mutex);
+	xa_init_flags(&ioas->access_list, XA_FLAGS_ALLOC);
 	return ioas;
 
 out_abort:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 0ede92b0aa32b4..540b36c0befa5e 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -52,6 +52,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length, unsigned long *unmapped);
 int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
 
+void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+				 unsigned long length);
 int iopt_table_add_domain(struct io_pagetable *iopt,
 			  struct iommu_domain *domain);
 void iopt_table_remove_domain(struct io_pagetable *iopt,
@@ -95,6 +97,7 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_DEVICE,
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
+	IOMMUFD_OBJ_ACCESS,
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -185,6 +188,7 @@ struct iommufd_ioas {
 	struct io_pagetable iopt;
 	struct mutex mutex;
 	struct list_head hwpt_list;
+	struct xarray access_list;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
@@ -231,4 +235,5 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
 void iommufd_device_destroy(struct iommufd_object *obj);
 
+void iommufd_access_destroy_object(struct iommufd_object *obj);
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index b09dbfc8009dc2..ed64b84b3b9337 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -327,6 +327,9 @@ void iommufd_ctx_put(struct iommufd_ctx *ictx)
 EXPORT_SYMBOL_GPL(iommufd_ctx_put);
 
 static struct iommufd_object_ops iommufd_object_ops[] = {
+	[IOMMUFD_OBJ_ACCESS] = {
+		.destroy = iommufd_access_destroy_object,
+	},
 	[IOMMUFD_OBJ_DEVICE] = {
 		.destroy = iommufd_device_destroy,
 	},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 477c3ea098f637..c072e400f3e645 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -13,10 +13,15 @@
 
 struct page;
 struct iommufd_device;
+struct iommufd_access;
 struct iommufd_ctx;
 struct io_pagetable;
 struct file;
 
+struct iommufd_access {
+	struct io_pagetable *iopt;
+};
+
 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
 					   struct device *dev, u32 *id);
 void iommufd_device_unbind(struct iommufd_device *idev);
@@ -29,17 +34,46 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
 			  unsigned int flags);
 void iommufd_device_detach(struct iommufd_device *idev);
 
+struct iommufd_access_ops {
+	void (*unmap)(void *data, unsigned long iova, unsigned long length);
+};
+
+struct iommufd_access *
+iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
+		      const struct iommufd_access_ops *ops, void *data);
+void iommufd_access_destroy(struct iommufd_access *access);
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long length, struct page **out_pages,
 		      bool write);
 void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
 			 unsigned long length);
 
+static inline int iommufd_access_pin_pages(struct iommufd_access *access,
+					   unsigned long iova,
+					   unsigned long length,
+					   struct page **out_pages, bool write)
+{
+	if (!IS_ENABLED(CONFIG_IOMMUFD))
+		return -EOPNOTSUPP;
+	return iopt_access_pages(access->iopt, iova, length, out_pages, write);
+}
+
+static inline void iommufd_access_unpin_pages(struct iommufd_access *access,
+					      unsigned long iova,
+					      unsigned long length)
+{
+	if (IS_ENABLED(CONFIG_IOMMUFD))
+		iopt_unaccess_pages(access->iopt, iova, length);
+}
+
 void iommufd_ctx_get(struct iommufd_ctx *ictx);
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
 struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
 void iommufd_ctx_put(struct iommufd_ctx *ictx);
+
+int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, bool write);
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 {
@@ -49,5 +83,11 @@ static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
 static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
 }
+
+static inline int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
+		      void *data, size_t len, bool write)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 12/13] iommufd: vfio container FD ioctl compatibility
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations.

A userspace application can test against iommufd and confirm compatibility,
then simply make a small change to open /dev/iommu instead of
/dev/vfio/vfio.
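
As a purely illustrative sketch (not part of this patch), such a probe
could look like the following. It relies only on the existing VFIO uapi
plus the ioctls wired up below, assumes the /dev/iommu char device node,
and trims error handling:

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Returns an fd usable in place of the VFIO container fd, or -1 */
  static int open_iommufd_compat(void)
  {
          int fd = open("/dev/iommu", O_RDWR);

          if (fd < 0)
                  return -1;
          if (ioctl(fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION ||
              ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) != 1) {
                  close(fd);
                  return -1;
          }
          return fd;
  }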

For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and
then all applications will use the compatibility path with no code
changes. It is unclear if this could ever be a production configuration.

This series just provides the iommufd side of compatibility. Actually
linking this to VFIO_SET_CONTAINER is a follow-up series, with a link in
the cover letter.

Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.

Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
features while still using the VFIO style map/unmap ioctls.
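
For illustration (not part of this patch), mixing the two styles on one
fd might look like the sketch below. It assumes the compatibility IOAS
already exists (a device has been attached, or IOMMU_VFIO_IOAS_SET was
used) and trims error handling:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>
  #include <linux/vfio.h>

  /* Look up the IOAS the vfio compat ioctls operate on */
  static int compat_ioas_id(int iommufd, uint32_t *ioas_id)
  {
          struct iommu_vfio_ioas cmd = {
                  .size = sizeof(cmd),
                  .op = IOMMU_VFIO_IOAS_GET,
          };

          if (ioctl(iommufd, IOMMU_VFIO_IOAS, &cmd))
                  return -1;
          *ioas_id = cmd.ioas_id;  /* usable with native IOMMU_IOAS_* ioctls */
          return 0;
  }

  /* Legacy type1-style map on the same fd */
  static int compat_map(int iommufd, void *va, uint64_t iova, uint64_t len)
  {
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = (uintptr_t)va,
                  .iova = iova,
                  .size = len,
          };

          return ioctl(iommufd, VFIO_IOMMU_MAP_DMA, &map);
  }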

While this is enough to operate qemu, it is still a bit of a WIP with a
few gaps:

 - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
   split areas. The old mode can be implemented with a new operation to
   split an iopt_area into two without disturbing the iopt_pages or the
   domains, then unmapping a whole area as normal.

 - Resource limits rely on memory cgroups to bound what userspace can do
   instead of the module parameter dma_entry_limit.

 - Pinned page accounting uses the same system as io_uring, not the
   mm_struct based system vfio uses.

 - VFIO P2P is not implemented. The DMABUF patches for vfio are a start at
   a solution where iommufd would import a special DMABUF. This is to avoid
   further propagating the follow_pfn() security problem.

 - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
   implemented.

 - A full audit for pedantic compatibility details (e.g. errnos) has
   not yet been done.

 - powerpc SPAPR is left out, as it is not connected to the iommu_domain
   framework. My hope is that SPAPR will be moved into the iommu_domain
   framework as a special HW-specific type, and that power would then
   support the generic interface through a normal iommu_domain.

The following are not going to be implemented and we expect to remove them
from VFIO type1:

 - SW access 'dirty tracking'. As discussed in the cover letter this will
   be done in VFIO.

 - VFIO_TYPE1_NESTING_IOMMU
    https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/Makefile          |   3 +-
 drivers/iommu/iommufd/iommufd_private.h |   6 +
 drivers/iommu/iommufd/main.c            |  16 +-
 drivers/iommu/iommufd/vfio_compat.c     | 423 ++++++++++++++++++++++++
 include/linux/iommufd.h                 |   8 +
 include/uapi/linux/iommufd.h            |  36 ++
 6 files changed, 486 insertions(+), 6 deletions(-)
 create mode 100644 drivers/iommu/iommufd/vfio_compat.c

diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index ca28a135b9675f..2fdff04000b326 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -5,6 +5,7 @@ iommufd-y := \
 	io_pagetable.o \
 	ioas.o \
 	main.o \
-	pages.o
+	pages.o \
+	vfio_compat.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 540b36c0befa5e..d87227cc08a47d 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -70,6 +70,8 @@ void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
 struct iommufd_ctx {
 	struct file *file;
 	struct xarray objects;
+
+	struct iommufd_ioas *vfio_ioas;
 };
 
 struct iommufd_ctx *iommufd_fget(int fd);
@@ -81,6 +83,9 @@ struct iommufd_ucmd {
 	void *cmd;
 };
 
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg);
+
 /* Copy the response in ucmd->cmd back to userspace. */
 static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
 				       size_t cmd_len)
@@ -208,6 +213,7 @@ int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index ed64b84b3b9337..549d6a4c8f5575 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -134,6 +134,8 @@ bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
 		return false;
 	}
 	__xa_erase(&ictx->objects, obj->id);
+	if (ictx->vfio_ioas && &ictx->vfio_ioas->obj == obj)
+		ictx->vfio_ioas = NULL;
 	xa_unlock(&ictx->objects);
 	up_write(&obj->destroy_rwsem);
 
@@ -241,27 +243,31 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 		 length),
+	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
+		 __reserved),
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
 			       unsigned long arg)
 {
+	struct iommufd_ctx *ictx = filp->private_data;
 	struct iommufd_ucmd ucmd = {};
 	struct iommufd_ioctl_op *op;
 	union ucmd_buffer buf;
 	unsigned int nr;
 	int ret;
 
-	ucmd.ictx = filp->private_data;
+	nr = _IOC_NR(cmd);
+	if (nr < IOMMUFD_CMD_BASE ||
+	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
+		return iommufd_vfio_ioctl(ictx, cmd, arg);
+
+	ucmd.ictx = ictx;
 	ucmd.ubuffer = (void __user *)arg;
 	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
 	if (ret)
 		return ret;
 
-	nr = _IOC_NR(cmd);
-	if (nr < IOMMUFD_CMD_BASE ||
-	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
-		return -ENOIOCTLCMD;
 	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
 	if (op->ioctl_num != cmd)
 		return -ENOIOCTLCMD;
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
new file mode 100644
index 00000000000000..57ef97aa309985
--- /dev/null
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -0,0 +1,423 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/file.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <uapi/linux/vfio.h>
+#include <uapi/linux/iommufd.h>
+
+#include "iommufd_private.h"
+
+static struct iommufd_ioas *get_compat_ioas(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas = ERR_PTR(-ENODEV);
+
+	xa_lock(&ictx->objects);
+	if (!ictx->vfio_ioas || !iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		goto out_unlock;
+	ioas = ictx->vfio_ioas;
+out_unlock:
+	xa_unlock(&ictx->objects);
+	return ioas;
+}
+
+/**
+ * iommufd_vfio_compat_ioas_id - Return the IOAS ID that vfio should use
+ * @ictx: Context to operate on
+ * @out_ioas_id: The IOAS ID to return to vfio
+ *
+ * The compatibility IOAS is the IOAS that the vfio compatibility ioctls operate
+ * on since they do not have an IOAS ID input in their ABI. Only attaching a
+ * group should cause a default creation of the internal ioas; this returns the
+ * existing ioas if it has already been assigned somehow.
+ */
+int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, u32 *out_ioas_id)
+{
+	struct iommufd_ioas *ioas = NULL;
+	struct iommufd_ioas *out_ioas;
+
+	ioas = iommufd_ioas_alloc(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	xa_lock(&ictx->objects);
+	if (ictx->vfio_ioas && iommufd_lock_obj(&ictx->vfio_ioas->obj))
+		out_ioas = ictx->vfio_ioas;
+	else {
+		out_ioas = ioas;
+		ictx->vfio_ioas = ioas;
+	}
+	xa_unlock(&ictx->objects);
+
+	*out_ioas_id = out_ioas->obj.id;
+	if (out_ioas != ioas) {
+		iommufd_put_object(&out_ioas->obj);
+		iommufd_object_abort(ictx, &ioas->obj);
+		return 0;
+	}
+	iommufd_object_finalize(ictx, &ioas->obj);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommufd_vfio_compat_ioas_id);
+
+int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_vfio_ioas *cmd = ucmd->cmd;
+	struct iommufd_ioas *ioas;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+	switch (cmd->op) {
+	case IOMMU_VFIO_IOAS_GET:
+		ioas = get_compat_ioas(ucmd->ictx);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		cmd->ioas_id = ioas->obj.id;
+		iommufd_put_object(&ioas->obj);
+		return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+	case IOMMU_VFIO_IOAS_SET:
+		ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+		if (IS_ERR(ioas))
+			return PTR_ERR(ioas);
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = ioas;
+		xa_unlock(&ucmd->ictx->objects);
+		iommufd_put_object(&ioas->obj);
+		return 0;
+
+	case IOMMU_VFIO_IOAS_CLEAR:
+		xa_lock(&ucmd->ictx->objects);
+		ucmd->ictx->vfio_ioas = NULL;
+		xa_unlock(&ucmd->ictx->objects);
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int iommufd_vfio_map_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				void __user *arg)
+{
+	u32 supported_flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+	struct vfio_iommu_type1_dma_map map;
+	int iommu_prot = IOMMU_CACHE;
+	struct iommufd_ioas *ioas;
+	unsigned long iova;
+	int rc;
+
+	if (copy_from_user(&map, arg, minsz))
+		return -EFAULT;
+
+	if (map.argsz < minsz || map.flags & ~supported_flags)
+		return -EINVAL;
+
+	if (map.flags & VFIO_DMA_MAP_FLAG_READ)
+		iommu_prot |= IOMMU_READ;
+	if (map.flags & VFIO_DMA_MAP_FLAG_WRITE)
+		iommu_prot |= IOMMU_WRITE;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	iova = map.iova;
+	rc = iopt_map_user_pages(&ioas->iopt, &iova,
+				 u64_to_user_ptr(map.vaddr), map.size,
+				 iommu_prot, 0);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
+				  void __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+	/* VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is obsoleted by the new
+	 * dirty tracking direction:
+	 *  https://lore.kernel.org/kvm/20220731125503.142683-1-yishaih@nvidia.com/
+	 *  https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.martins@oracle.com/
+	 */
+	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	struct vfio_iommu_type1_dma_unmap unmap;
+	struct iommufd_ioas *ioas;
+	unsigned long unmapped;
+	int rc;
+
+	if (copy_from_user(&unmap, arg, minsz))
+		return -EFAULT;
+
+	if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
+		return -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
+		rc = iopt_unmap_all(&ioas->iopt, &unmapped);
+	else
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova,
+				     unmap.size, &unmapped);
+	iommufd_put_object(&ioas->obj);
+	unmap.size = unmapped;
+
+	return rc;
+}
+
+static int iommufd_vfio_cc_iommu(struct iommufd_ctx *ictx)
+{
+	struct iommufd_ioas *ioas;
+	bool rc;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	rc = iommufd_ioas_enforced_coherent(ioas);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+static int iommufd_vfio_check_extension(struct iommufd_ctx *ictx,
+					unsigned long type)
+{
+	switch (type) {
+	case VFIO_TYPE1v2_IOMMU:
+	case VFIO_UNMAP_ALL:
+		return 1;
+
+	/*
+	 * This is obsolete, and to be removed from VFIO. It was an incomplete
+	 * idea that got merged.
+	 * https://lore.kernel.org/kvm/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
+	 */
+	case VFIO_TYPE1_NESTING_IOMMU:
+		return 0;
+
+	case VFIO_DMA_CC_IOMMU:
+		return iommufd_vfio_cc_iommu(ictx);
+
+	/*
+	 * FIXME: The type1 iommu allows splitting of maps, which can fail. This is doable but
+	 * is a bunch of extra code that is only for supporting this case.
+	 */
+	case VFIO_TYPE1_IOMMU:
+	/*
+	 * FIXME: VFIO_DMA_MAP_FLAG_VADDR
+	 * https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
+	 * Wow, what a wild feature. This should have been implemented by
+	 * allowing a iopt_pages to be associated with a memfd. It can then
+	 * source mapping requests directly from a memfd without going through a
+	 * mm_struct and thus doesn't care that the original qemu exec'd itself.
+	 * The idea that userspace can flip a flag and cause kernel users to
+	 * block indefinitely is unacceptable.
+	 *
+	 * For VFIO compat we should implement this in a slightly different way.
+	 * Creating an access_user that spans the whole area will immediately
+	 * stop new faults as they will be handled from the xarray. We can then
+	 * reparent the iopt_pages to the new mm_struct and undo the
+	 * access_user. No blocking of kernel users is required, though it does
+	 * require filling the xarray with pages.
+	 */
+	case VFIO_UPDATE_VADDR:
+	default:
+		return 0;
+	}
+}
+
+static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type)
+{
+	struct iommufd_ioas *ioas = NULL;
+
+	if (type != VFIO_TYPE1v2_IOMMU)
+		return -EINVAL;
+
+	/* VFIO fails the set_iommu if there is no group */
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	iommufd_put_object(&ioas->obj);
+	return 0;
+}
+
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+	/* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
+	 * the high bits too, and we need to decide if we should report that
+	 * iommufd supports less than PAGE_SIZE alignment or stick to strict
+	 * compatibility. qemu only cares about the first set bit.
+	 */
+	return ioas->iopt.iova_alignment;
+}
+
+static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas,
+				 struct vfio_info_cap_header __user *cur,
+				 size_t avail)
+{
+	struct vfio_iommu_type1_info_cap_iova_range __user *ucap_iovas =
+		container_of(cur,
+			     struct vfio_iommu_type1_info_cap_iova_range __user,
+			     header);
+	struct vfio_iommu_type1_info_cap_iova_range cap_iovas = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE,
+			.version = 1,
+		},
+	};
+	struct interval_tree_span_iter span;
+
+	interval_tree_for_each_span (&span, &ioas->iopt.reserved_itree,
+				     0, ULONG_MAX) {
+		struct vfio_iova_range range;
+
+		if (!span.is_hole)
+			continue;
+		range.start = span.start_hole;
+		range.end = span.last_hole;
+		if (avail >= struct_size(&cap_iovas, iova_ranges,
+					 cap_iovas.nr_iovas + 1) &&
+		    copy_to_user(&ucap_iovas->iova_ranges[cap_iovas.nr_iovas],
+				 &range, sizeof(range)))
+			return -EFAULT;
+		cap_iovas.nr_iovas++;
+	}
+	if (avail >= struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas) &&
+	    copy_to_user(ucap_iovas, &cap_iovas, sizeof(cap_iovas)))
+		return -EFAULT;
+	return struct_size(&cap_iovas, iova_ranges, cap_iovas.nr_iovas);
+}
+
+static int iommufd_fill_cap_dma_avail(struct iommufd_ioas *ioas,
+				      struct vfio_info_cap_header __user *cur,
+				      size_t avail)
+{
+	struct vfio_iommu_type1_info_dma_avail cap_dma = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL,
+			.version = 1,
+		},
+		/* iommufd has no limit, return the same value as VFIO. */
+		.avail = U16_MAX,
+	};
+
+	if (avail >= sizeof(cap_dma) &&
+	    copy_to_user(cur, &cap_dma, sizeof(cap_dma)))
+		return -EFAULT;
+	return sizeof(cap_dma);
+}
+
+static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
+				       void __user *arg)
+{
+	typedef int (*fill_cap_fn)(struct iommufd_ioas *ioas,
+				   struct vfio_info_cap_header __user *cur,
+				   size_t avail);
+	static const fill_cap_fn fill_fns[] = {
+		iommufd_fill_cap_iova,
+		iommufd_fill_cap_dma_avail,
+	};
+	size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+	struct vfio_info_cap_header __user *last_cap = NULL;
+	struct vfio_iommu_type1_info info;
+	struct iommufd_ioas *ioas;
+	size_t total_cap_size;
+	int rc;
+	int i;
+
+	if (copy_from_user(&info, arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+	minsz = min_t(size_t, info.argsz, sizeof(info));
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	info.flags = VFIO_IOMMU_INFO_PGSIZES;
+	info.iova_pgsizes = iommufd_get_pagesizes(ioas);
+	info.cap_offset = 0;
+
+	total_cap_size = sizeof(info);
+	for (i = 0; i != ARRAY_SIZE(fill_fns); i++) {
+		int cap_size;
+
+		if (info.argsz > total_cap_size)
+			cap_size = fill_fns[i](ioas, arg + total_cap_size,
+					       info.argsz - total_cap_size);
+		else
+			cap_size = fill_fns[i](ioas, NULL, 0);
+		if (cap_size < 0) {
+			rc = cap_size;
+			goto out_put;
+		}
+		if (last_cap && info.argsz >= total_cap_size &&
+		    put_user(total_cap_size, &last_cap->next)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		last_cap = arg + total_cap_size;
+		total_cap_size += cap_size;
+	}
+
+	/*
+	 * If the user did not provide enough space then only some caps are
+	 * returned and the argsz will be updated to the correct amount to get
+	 * all caps.
+	 */
+	if (info.argsz >= total_cap_size)
+		info.cap_offset = sizeof(info);
+	info.argsz = total_cap_size;
+	info.flags |= VFIO_IOMMU_INFO_CAPS;
+	rc = 0;
+	if (copy_to_user(arg, &info, minsz))
+		rc = -EFAULT;
+
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* FIXME TODO:
+PowerPC SPAPR only:
+#define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+*/
+
+int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
+		       unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		return VFIO_API_VERSION;
+	case VFIO_SET_IOMMU:
+		return iommufd_vfio_set_iommu(ictx, arg);
+	case VFIO_CHECK_EXTENSION:
+		return iommufd_vfio_check_extension(ictx, arg);
+	case VFIO_IOMMU_GET_INFO:
+		return iommufd_vfio_iommu_get_info(ictx, uarg);
+	case VFIO_IOMMU_MAP_DMA:
+		return iommufd_vfio_map_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_UNMAP_DMA:
+		return iommufd_vfio_unmap_dma(ictx, cmd, uarg);
+	case VFIO_IOMMU_DIRTY_PAGES:
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return -ENOIOCTLCMD;
+}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index c072e400f3e645..050024ff68142d 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -72,6 +72,8 @@ void iommufd_ctx_get(struct iommufd_ctx *ictx);
 struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
 void iommufd_ctx_put(struct iommufd_ctx *ictx);
 
+int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx, u32 *out_ioas_id);
+
 int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 		      void *data, size_t len, bool write);
 #else /* !CONFIG_IOMMUFD */
@@ -84,6 +86,12 @@ static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
 }
 
+static inline int iommufd_vfio_compat_ioas_id(struct iommufd_ctx *ictx,
+					      u32 *out_ioas_id)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 		      void *data, size_t len, bool write)
 {
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index b7b0ac4016bb70..48c290505844d8 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -43,6 +43,7 @@ enum {
 	IOMMUFD_CMD_IOAS_IOVA_RANGES,
 	IOMMUFD_CMD_IOAS_MAP,
 	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_VFIO_IOAS,
 };
 
 /**
@@ -240,4 +241,39 @@ struct iommu_ioas_unmap {
 	__aligned_u64 length;
 };
 #define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_vfio_ioas_op
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not affected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 #endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 13/13] iommufd: Add a selftest
  2022-09-02 19:59 ` Jason Gunthorpe
@ 2022-09-02 19:59   ` Jason Gunthorpe
  -1 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Cover the essential functionality of the iommufd with a directed
test. This aims to achieve reasonable functional coverage using the
in-kernel self test framework.

It provides a mock iommu_domain that allows the test to run without any
HW, and the mocking provides a way to directly validate that the PFNs
loaded into the iommu_domain are correct.

The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will typically operate at a 2K iommu page size. This allows
exercising all of the calculations to support this mismatch.

This allows achieving high coverage of the corner cases in the iopt_pages.

However, enabling all of this requires an unusually invasive config
option, which should never be enabled in a production kernel.
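
With CONFIG_IOMMUFD=y and CONFIG_IOMMUFD_TEST=y set in the kernel under
test, the suite should be runnable through the usual kselftest flow,
roughly:

  $ make -C tools/testing/selftests TARGETS=iommu run_tests

or by building the test binary directly from tools/testing/selftests/iommu
and running the produced ./iommufd executable.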

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/Kconfig            |    9 +
 drivers/iommu/iommufd/Makefile           |    2 +
 drivers/iommu/iommufd/device.c           |   61 +
 drivers/iommu/iommufd/iommufd_private.h  |   21 +
 drivers/iommu/iommufd/iommufd_test.h     |   74 ++
 drivers/iommu/iommufd/main.c             |   12 +
 drivers/iommu/iommufd/pages.c            |    4 +
 drivers/iommu/iommufd/selftest.c         |  626 ++++++++++
 tools/testing/selftests/Makefile         |    1 +
 tools/testing/selftests/iommu/.gitignore |    2 +
 tools/testing/selftests/iommu/Makefile   |   11 +
 tools/testing/selftests/iommu/config     |    2 +
 tools/testing/selftests/iommu/iommufd.c  | 1396 ++++++++++++++++++++++
 13 files changed, 2221 insertions(+)
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index fddd453bb0e764..9b41fde7c839c5 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -11,3 +11,12 @@ config IOMMUFD
 	  This would commonly be used in combination with VFIO.
 
 	  If you don't know what to do here, say N.
+
+config IOMMUFD_TEST
+	bool "IOMMU Userspace API Test support"
+	depends on IOMMUFD
+	depends on RUNTIME_TESTING_MENU
+	default n
+	help
+	  This is dangerous; do not enable it unless you are running
+	  tools/testing/selftests/iommu.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2fdff04000b326..8aeba81800c512 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,4 +8,6 @@ iommufd-y := \
 	pages.o \
 	vfio_compat.o
 
+iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
+
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d34bdbcb84a40d..7e6ddf82f34cb9 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -517,3 +517,64 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 	return -EINVAL;
 }
 EXPORT_SYMBOL_GPL(iommufd_access_rw);
+
+#ifdef CONFIG_IOMMUFD_TEST
+/*
+ * Creating a real iommufd_device is too hard; bypass creating an iommufd_device
+ * and go directly to attaching a domain.
+ */
+struct iommufd_hw_pagetable *
+iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+			       struct iommufd_ioas *ioas,
+			       struct device *mock_dev)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	hwpt = iommufd_hw_pagetable_alloc(ictx, ioas, mock_dev);
+	if (IS_ERR(hwpt))
+		return hwpt;
+
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		goto out_hwpt;
+
+	refcount_inc(&hwpt->obj.users);
+	iommufd_object_finalize(ictx, &hwpt->obj);
+	return hwpt;
+
+out_hwpt:
+	iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
+	return ERR_PTR(rc);
+}
+
+void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
+				    struct iommufd_hw_pagetable *hwpt)
+{
+	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+	refcount_dec(&hwpt->obj.users);
+}
+
+unsigned int iommufd_access_selftest_id(struct iommufd_access *access_pub)
+{
+	struct iommufd_access_priv *access =
+		container_of(access_pub, struct iommufd_access_priv, pub);
+
+	return access->obj.id;
+}
+
+void *iommufd_access_selftest_get(struct iommufd_ctx *ictx,
+				  unsigned int access_id,
+				  struct iommufd_object **out_obj)
+{
+	struct iommufd_object *access_obj;
+
+	access_obj =
+		iommufd_get_object(ictx, access_id, IOMMUFD_OBJ_ACCESS);
+	if (IS_ERR(access_obj))
+		return ERR_CAST(access_obj);
+	*out_obj = access_obj;
+	return container_of(access_obj, struct iommufd_access_priv, obj)->data;
+}
+
+#endif
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d87227cc08a47d..0b414b6a00f061 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -103,6 +103,9 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_ACCESS,
+#ifdef CONFIG_IOMMUFD_TEST
+	IOMMUFD_OBJ_SELFTEST,
+#endif
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -242,4 +245,22 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 void iommufd_device_destroy(struct iommufd_object *obj);
 
 void iommufd_access_destroy_object(struct iommufd_object *obj);
+
+#ifdef CONFIG_IOMMUFD_TEST
+struct iommufd_access;
+struct iommufd_hw_pagetable *
+iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+			       struct iommufd_ioas *ioas,
+			       struct device *mock_dev);
+void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
+				    struct iommufd_hw_pagetable *hwpt);
+unsigned int iommufd_access_selftest_id(struct iommufd_access *access_pub);
+void *iommufd_access_selftest_get(struct iommufd_ctx *ictx,
+				  unsigned int access_id,
+				  struct iommufd_object **out_obj);
+int iommufd_test(struct iommufd_ucmd *ucmd);
+void iommufd_selftest_destroy(struct iommufd_object *obj);
+extern size_t iommufd_test_memory_limit;
+#endif
+
 #endif
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
new file mode 100644
index 00000000000000..485f44394dbe9b
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_TEST_H
+#define _UAPI_IOMMUFD_TEST_H
+
+#include <linux/types.h>
+#include <linux/iommufd.h>
+
+enum {
+	IOMMU_TEST_OP_ADD_RESERVED,
+	IOMMU_TEST_OP_MOCK_DOMAIN,
+	IOMMU_TEST_OP_MD_CHECK_MAP,
+	IOMMU_TEST_OP_MD_CHECK_REFS,
+	IOMMU_TEST_OP_CREATE_ACCESS,
+	IOMMU_TEST_OP_DESTROY_ACCESS,
+	IOMMU_TEST_OP_DESTROY_ACCESS_ITEM,
+	IOMMU_TEST_OP_ACCESS_PAGES,
+	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+};
+
+enum {
+	MOCK_APERTURE_START = 1UL << 24,
+	MOCK_APERTURE_LAST = (1UL << 31) - 1,
+};
+
+enum {
+	MOCK_FLAGS_ACCESS_WRITE = 1 << 0,
+};
+
+struct iommu_test_cmd {
+	__u32 size;
+	__u32 op;
+	__u32 id;
+	union {
+		struct {
+			__u32 device_id;
+		} mock_domain;
+		struct {
+			__aligned_u64 start;
+			__aligned_u64 length;
+		} add_reserved;
+		struct {
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} check_map;
+		struct {
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+			__u32 refs;
+		} check_refs;
+		struct {
+			__u32 out_access_id;
+		} create_access;
+		struct {
+			__u32 flags;
+			__u32 out_access_item_id;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} access_pages;
+		struct {
+			__u32 access_item_id;
+		} destroy_access_item;
+		struct {
+			__u32 limit;
+		} memory_limit;
+	};
+	__u32 last;
+};
+#define IOMMU_TEST_CMD _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE + 32)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 549d6a4c8f5575..1097e5f07f8eb9 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -25,6 +25,7 @@
 #include <linux/iommufd.h>
 
 #include "iommufd_private.h"
+#include "iommufd_test.h"
 
 struct iommufd_object_ops {
 	void (*destroy)(struct iommufd_object *obj);
@@ -211,6 +212,9 @@ union ucmd_buffer {
 	struct iommu_ioas_iova_ranges iova_ranges;
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
+#ifdef CONFIG_IOMMUFD_TEST
+	struct iommu_test_cmd test;
+#endif
 };
 
 struct iommufd_ioctl_op {
@@ -245,6 +249,9 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+#ifdef CONFIG_IOMMUFD_TEST
+	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
+#endif
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -345,6 +352,11 @@ static struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_HW_PAGETABLE] = {
 		.destroy = iommufd_hw_pagetable_destroy,
 	},
+#ifdef CONFIG_IOMMUFD_TEST
+	[IOMMUFD_OBJ_SELFTEST] = {
+		.destroy = iommufd_selftest_destroy,
+	},
+#endif
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 91db42dd6aaeaa..59a55f0a35b2af 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -48,7 +48,11 @@
 
 #include "io_pagetable.h"
 
+#ifndef CONFIG_IOMMUFD_TEST
 #define TEMP_MEMORY_LIMIT 65536
+#else
+#define TEMP_MEMORY_LIMIT iommufd_test_memory_limit
+#endif
 #define BATCH_BACKUP_SIZE 32
 
 /*
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
new file mode 100644
index 00000000000000..e9c178048a1284
--- /dev/null
+++ b/drivers/iommu/iommufd/selftest.c
@@ -0,0 +1,626 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * Kernel side components to support tools/testing/selftests/iommu
+ */
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+#include "iommufd_test.h"
+
+size_t iommufd_test_memory_limit = 65536;
+
+enum {
+	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
+
+	/*
+	 * Like a real page table, alignment requires the low bits of the address
+	 * to be zero. xarray also requires the high bit to be zero, so we store
+	 * the pfns shifted. The upper bits are used for metadata.
+	 */
+	MOCK_PFN_MASK = ULONG_MAX / MOCK_IO_PAGE_SIZE,
+
+	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
+	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+};
+
+struct mock_iommu_domain {
+	struct iommu_domain domain;
+	struct xarray pfns;
+};
+
+enum selftest_obj_type {
+	TYPE_IDEV,
+};
+
+struct selftest_obj {
+	struct iommufd_object obj;
+	enum selftest_obj_type type;
+
+	union {
+		struct {
+			struct iommufd_hw_pagetable *hwpt;
+			struct iommufd_ctx *ictx;
+			struct device mock_dev;
+		} idev;
+	};
+};
+
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+{
+	struct mock_iommu_domain *mock;
+
+	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
+		return NULL;
+
+	mock = kzalloc(sizeof(*mock), GFP_KERNEL);
+	if (!mock)
+		return NULL;
+	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
+	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
+	xa_init(&mock->pfns);
+	return &mock->domain;
+}
+
+static void mock_domain_free(struct iommu_domain *domain)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+
+	WARN_ON(!xa_empty(&mock->pfns));
+	kfree(mock);
+}
+
+static int mock_domain_map_pages(struct iommu_domain *domain,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t pgsize, size_t pgcount, int prot,
+				 gfp_t gfp, size_t *mapped)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = MOCK_PFN_START_IOVA;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			void *old;
+
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				flags = MOCK_PFN_LAST_IOVA;
+			old = xa_store(&mock->pfns, iova / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value((paddr / MOCK_IO_PAGE_SIZE) | flags),
+				       GFP_KERNEL);
+			if (xa_is_err(old))
+				return xa_err(old);
+			WARN_ON(old);
+			iova += MOCK_IO_PAGE_SIZE;
+			paddr += MOCK_IO_PAGE_SIZE;
+			*mapped += MOCK_IO_PAGE_SIZE;
+			flags = 0;
+		}
+	}
+	return 0;
+}
+
+static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
+				      unsigned long iova, size_t pgsize,
+				      size_t pgcount,
+				      struct iommu_iotlb_gather *iotlb_gather)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	bool first = true;
+	size_t ret = 0;
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+			WARN_ON(!ent);
+			/*
+			 * iommufd generates unmaps that must be a strict
+			 * superset of the maps performed, so every starting
+			 * IOVA should have been an IOVA passed to map_pages.
+			 *
+			 * The first IOVA must be present and have been the
+			 * first IOVA passed to map_pages.
+			 */
+			if (first) {
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_START_IOVA));
+				first = false;
+			}
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_LAST_IOVA));
+
+			iova += MOCK_IO_PAGE_SIZE;
+			ret += MOCK_IO_PAGE_SIZE;
+		}
+	}
+	return ret;
+}
+
+static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
+					    dma_addr_t iova)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	WARN_ON(!ent);
+	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
+}
+
+static const struct iommu_ops mock_ops = {
+	.owner = THIS_MODULE,
+	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
+	.domain_alloc = mock_domain_alloc,
+	.default_domain_ops =
+		&(struct iommu_domain_ops){
+			.free = mock_domain_free,
+			.map_pages = mock_domain_map_pages,
+			.unmap_pages = mock_domain_unmap_pages,
+			.iova_to_phys = mock_domain_iova_to_phys,
+		},
+};
+
+static inline struct iommufd_hw_pagetable *
+get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id,
+		 struct mock_iommu_domain **mock)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, mockpt_id,
+				 IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+	hwpt = container_of(obj, struct iommufd_hw_pagetable, obj);
+	if (hwpt->domain->ops != mock_ops.default_domain_ops) {
+		iommufd_put_object(&hwpt->obj);
+		return ERR_PTR(-EINVAL);
+	}
+	*mock = container_of(hwpt->domain, struct mock_iommu_domain, domain);
+	return hwpt;
+}
+
+/* Create an hw_pagetable with the mock domain so we can test the domain ops */
+static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
+				    struct iommu_test_cmd *cmd)
+{
+	static struct bus_type mock_bus = { .iommu_ops = &mock_ops };
+	struct iommufd_hw_pagetable *hwpt;
+	struct selftest_obj *sobj;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj)) {
+		rc = PTR_ERR(sobj);
+		goto out_ioas;
+	}
+	sobj->idev.ictx = ucmd->ictx;
+	sobj->type = TYPE_IDEV;
+	sobj->idev.mock_dev.bus = &mock_bus;
+
+	hwpt = iommufd_device_selftest_attach(ucmd->ictx, ioas,
+					      &sobj->idev.mock_dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_sobj;
+	}
+	sobj->idev.hwpt = hwpt;
+
+	cmd->id = hwpt->obj.id;
+	cmd->mock_domain.device_id = sobj->obj.id;
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	iommufd_put_object(&ioas->obj);
+	return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_sobj:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+out_ioas:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Add an additional reserved IOVA to the IOAS */
+static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd,
+				     unsigned int mockpt_id,
+				     unsigned long start, size_t length)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, mockpt_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	down_write(&ioas->iopt.iova_rwsem);
+	rc = iopt_reserve_iova(&ioas->iopt, start, start + length - 1, NULL);
+	up_write(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Check that every pfn under each iova matches the pfn under a user VA */
+static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd,
+				    unsigned int mockpt_id, unsigned long iova,
+				    size_t length, void __user *uptr)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc;
+
+	if (iova % MOCK_IO_PAGE_SIZE || length % MOCK_IO_PAGE_SIZE ||
+	    (uintptr_t)uptr % MOCK_IO_PAGE_SIZE)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	for (; length; length -= MOCK_IO_PAGE_SIZE) {
+		struct page *pages[1];
+		unsigned long pfn;
+		long npages;
+		void *ent;
+
+		npages = get_user_pages_fast((uintptr_t)uptr & PAGE_MASK, 1, 0,
+					     pages);
+		if (npages < 0) {
+			rc = npages;
+			goto out_put;
+		}
+		if (WARN_ON(npages != 1)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		pfn = page_to_pfn(pages[0]);
+		put_page(pages[0]);
+
+		ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+		if (!ent ||
+		    (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE !=
+			    pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {
+			rc = -EINVAL;
+			goto out_put;
+		}
+		iova += MOCK_IO_PAGE_SIZE;
+		uptr += MOCK_IO_PAGE_SIZE;
+	}
+	rc = 0;
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+/* Check that the page ref count matches, to look for missing pin/unpins */
+static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
+				      void __user *uptr, size_t length,
+				      unsigned int refs)
+{
+	if (length % PAGE_SIZE || (uintptr_t)uptr % PAGE_SIZE)
+		return -EINVAL;
+
+	for (; length; length -= PAGE_SIZE) {
+		struct page *pages[1];
+		long npages;
+
+		npages = get_user_pages_fast((uintptr_t)uptr, 1, 0, pages);
+		if (npages < 0)
+			return npages;
+		if (WARN_ON(npages != 1))
+			return -EFAULT;
+		if (!PageCompound(pages[0])) {
+			unsigned int count;
+
+			count = page_ref_count(pages[0]);
+			if (count / GUP_PIN_COUNTING_BIAS != refs) {
+				put_page(pages[0]);
+				return -EIO;
+			}
+		}
+		put_page(pages[0]);
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+struct selftest_access {
+	struct iommufd_access *access;
+	spinlock_t lock;
+	struct list_head items;
+	unsigned int next_id;
+	bool destroying;
+};
+
+struct selftest_access_item {
+	struct list_head items_elm;
+	unsigned long iova;
+	unsigned long iova_end;
+	size_t length;
+	unsigned int id;
+};
+
+static void iommufd_test_access_unmap(void *data, unsigned long iova,
+				      unsigned long length)
+{
+	struct selftest_access *staccess = data;
+	struct selftest_access_item *item;
+	unsigned long iova_end = iova + length - 1;
+
+	spin_lock(&staccess->lock);
+	list_for_each_entry(item, &staccess->items, items_elm) {
+		if (iova <= item->iova_end && iova_end >= item->iova) {
+			list_del(&item->items_elm);
+			spin_unlock(&staccess->lock);
+			iommufd_access_unpin_pages(staccess->access, item->iova,
+						   item->length);
+			kfree(item);
+			return;
+		}
+	}
+	spin_unlock(&staccess->lock);
+}
+
+static struct iommufd_access_ops selftest_access_ops = {
+	.unmap = iommufd_test_access_unmap,
+};
+
+static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
+				      unsigned int ioas_id)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_access *staccess;
+	int rc;
+
+	staccess = kzalloc(sizeof(*staccess), GFP_KERNEL_ACCOUNT);
+	if (!staccess)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&staccess->items);
+	spin_lock_init(&staccess->lock);
+
+	staccess->access = iommufd_access_create(
+		ucmd->ictx, ioas_id, &selftest_access_ops, staccess);
+	if (IS_ERR(staccess->access)) {
+		rc = PTR_ERR(staccess->access);
+		goto out_free;
+	}
+	cmd->create_access.out_access_id =
+		iommufd_access_selftest_id(staccess->access);
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_destroy;
+
+	return 0;
+
+out_destroy:
+	iommufd_access_destroy(staccess->access);
+out_free:
+	kfree(staccess);
+	return rc;
+}
+
+static int iommufd_test_destroy_access(struct iommufd_ucmd *ucmd,
+				       unsigned int access_id)
+{
+	struct selftest_access *staccess;
+	struct iommufd_object *access_obj;
+
+	staccess =
+		iommufd_access_selftest_get(ucmd->ictx, access_id, &access_obj);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+	iommufd_put_object(access_obj);
+
+	spin_lock(&staccess->lock);
+	if (!list_empty(&staccess->items) || staccess->destroying) {
+		spin_unlock(&staccess->lock);
+		return -EBUSY;
+	}
+	staccess->destroying = true;
+	spin_unlock(&staccess->lock);
+
+	/* FIXME: this holds a reference on the object even after the fd is closed */
+	iommufd_access_destroy(staccess->access);
+	kfree(staccess);
+	return 0;
+}
+
+/* Check that the pages in a page array match the pages in the user VA */
+static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
+				    size_t npages)
+{
+	for (; npages; npages--) {
+		struct page *tmp_pages[1];
+		long rc;
+
+		rc = get_user_pages_fast((uintptr_t)uptr, 1, 0, tmp_pages);
+		if (rc < 0)
+			return rc;
+		if (WARN_ON(rc != 1))
+			return -EFAULT;
+		put_page(tmp_pages[0]);
+		if (tmp_pages[0] != *pages)
+			return -EBADE;
+		pages++;
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
+				     unsigned int access_id, unsigned long iova,
+				     size_t length, void __user *uptr,
+				     u32 flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_object *access_obj;
+	struct selftest_access_item *item;
+	struct selftest_access *staccess;
+	struct page **pages;
+	size_t npages;
+	int rc;
+
+	if (flags & ~MOCK_FLAGS_ACCESS_WRITE)
+		return -EOPNOTSUPP;
+
+	staccess =
+		iommufd_access_selftest_get(ucmd->ictx, access_id, &access_obj);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	npages = (ALIGN(iova + length, PAGE_SIZE) -
+		  ALIGN_DOWN(iova, PAGE_SIZE)) /
+		 PAGE_SIZE;
+	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL_ACCOUNT);
+	if (!pages) {
+		rc = -ENOMEM;
+		goto out_put;
+	}
+
+	rc = iommufd_access_pin_pages(staccess->access, iova, length, pages,
+				      flags & MOCK_FLAGS_ACCESS_WRITE);
+	if (rc)
+		goto out_free_pages;
+
+	rc = iommufd_test_check_pages(
+		uptr - (iova - ALIGN_DOWN(iova, PAGE_SIZE)), pages, npages);
+	if (rc)
+		goto out_unaccess;
+
+	item = kzalloc(sizeof(*item), GFP_KERNEL_ACCOUNT);
+	if (!item) {
+		rc = -ENOMEM;
+		goto out_unaccess;
+	}
+
+	item->iova = iova;
+	item->length = length;
+	spin_lock(&staccess->lock);
+	item->id = staccess->next_id++;
+	list_add_tail(&item->items_elm, &staccess->items);
+	spin_unlock(&staccess->lock);
+
+	cmd->access_pages.out_access_item_id = item->id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_free_item;
+	goto out_free_pages;
+
+out_free_item:
+	spin_lock(&staccess->lock);
+	list_del(&item->items_elm);
+	spin_unlock(&staccess->lock);
+	kfree(item);
+out_unaccess:
+	iommufd_access_unpin_pages(staccess->access, iova, length);
+out_free_pages:
+	kvfree(pages);
+out_put:
+	iommufd_put_object(access_obj);
+	return rc;
+}
+
+static int iommufd_test_access_item_destroy(struct iommufd_ucmd *ucmd,
+					    unsigned int access_id,
+					    unsigned int item_id)
+{
+	struct iommufd_object *access_obj;
+	struct selftest_access_item *item;
+	struct selftest_access *staccess;
+
+	staccess =
+		iommufd_access_selftest_get(ucmd->ictx, access_id, &access_obj);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	spin_lock(&staccess->lock);
+	list_for_each_entry(item, &staccess->items, items_elm) {
+		if (item->id == item_id) {
+			list_del(&item->items_elm);
+			spin_unlock(&staccess->lock);
+			iommufd_access_unpin_pages(staccess->access, item->iova,
+						   item->length);
+			kfree(item);
+			iommufd_put_object(access_obj);
+			return 0;
+		}
+	}
+	spin_unlock(&staccess->lock);
+	iommufd_put_object(access_obj);
+	return -ENOENT;
+}
+
+void iommufd_selftest_destroy(struct iommufd_object *obj)
+{
+	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
+
+	switch (sobj->type) {
+	case TYPE_IDEV:
+		iommufd_device_selftest_detach(sobj->idev.ictx,
+					       sobj->idev.hwpt);
+		break;
+	}
+}
+
+int iommufd_test(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+
+	switch (cmd->op) {
+	case IOMMU_TEST_OP_ADD_RESERVED:
+		return iommufd_test_add_reserved(ucmd, cmd->id,
+						 cmd->add_reserved.start,
+						 cmd->add_reserved.length);
+	case IOMMU_TEST_OP_MOCK_DOMAIN:
+		return iommufd_test_mock_domain(ucmd, cmd);
+	case IOMMU_TEST_OP_MD_CHECK_MAP:
+		return iommufd_test_md_check_pa(
+			ucmd, cmd->id, cmd->check_map.iova,
+			cmd->check_map.length,
+			u64_to_user_ptr(cmd->check_map.uptr));
+	case IOMMU_TEST_OP_MD_CHECK_REFS:
+		return iommufd_test_md_check_refs(
+			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
+			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_CREATE_ACCESS:
+		return iommufd_test_create_access(ucmd, cmd->id);
+	case IOMMU_TEST_OP_DESTROY_ACCESS:
+		return iommufd_test_destroy_access(ucmd, cmd->id);
+	case IOMMU_TEST_OP_ACCESS_PAGES:
+		return iommufd_test_access_pages(
+			ucmd, cmd->id, cmd->access_pages.iova,
+			cmd->access_pages.length,
+			u64_to_user_ptr(cmd->access_pages.uptr),
+			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_DESTROY_ACCESS_ITEM:
+		return iommufd_test_access_item_destroy(
+			ucmd, cmd->id, cmd->destroy_access_item.access_item_id);
+	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
+		iommufd_test_memory_limit = cmd->memory_limit.limit;
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c2064a35688b08..58a8520542410b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -25,6 +25,7 @@ TARGETS += ftrace
 TARGETS += futex
 TARGETS += gpio
 TARGETS += intel_pstate
+TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
 TARGETS += kcmp
diff --git a/tools/testing/selftests/iommu/.gitignore b/tools/testing/selftests/iommu/.gitignore
new file mode 100644
index 00000000000000..c6bd07e7ff59b3
--- /dev/null
+++ b/tools/testing/selftests/iommu/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/iommufd
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
new file mode 100644
index 00000000000000..7bc38b3beaeb20
--- /dev/null
+++ b/tools/testing/selftests/iommu/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+CFLAGS += -D_GNU_SOURCE
+
+TEST_GEN_PROGS :=
+TEST_GEN_PROGS += iommufd
+
+include ../lib.mk
diff --git a/tools/testing/selftests/iommu/config b/tools/testing/selftests/iommu/config
new file mode 100644
index 00000000000000..6c4f901d6fed3c
--- /dev/null
+++ b/tools/testing/selftests/iommu/config
@@ -0,0 +1,2 @@
+CONFIG_IOMMUFD
+CONFIG_IOMMUFD_TEST
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
new file mode 100644
index 00000000000000..9aea459ba183ec
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -0,0 +1,1396 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/fcntl.h>
+#include <sys/ioctl.h>
+#include <assert.h>
+#include <stddef.h>
+
+#include "../kselftest_harness.h"
+
+#define __EXPORTED_HEADERS__
+#include <linux/iommufd.h>
+#include <linux/vfio.h>
+#include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+
+static void *buffer;
+
+static unsigned long PAGE_SIZE;
+static unsigned long HUGEPAGE_SIZE;
+static unsigned long BUFFER_SIZE;
+
+#define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
+
+static unsigned long get_huge_page_size(void)
+{
+	char buf[80];
+	int ret;
+	int fd;
+
+	fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
+		  O_RDONLY);
+	if (fd < 0)
+		return 2 * 1024 * 1024;
+
+	ret = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0 || ret == sizeof(buf))
+		return 2 * 1024 * 1024;
+	buf[ret] = 0;
+	return strtoul(buf, NULL, 10);
+}
+
+static __attribute__((constructor)) void setup_sizes(void)
+{
+	int rc;
+
+	PAGE_SIZE = sysconf(_SC_PAGE_SIZE);
+	HUGEPAGE_SIZE = get_huge_page_size();
+
+	BUFFER_SIZE = PAGE_SIZE * 16;
+	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
+	assert(!rc && buffer && (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+}
+
+/* Hack to make assertions more readable */
+#define _IOMMU_TEST_CMD(x) IOMMU_TEST_CMD
+
+/*
+ * Have the kernel check the refcount on pages. I don't know why a freshly
+ * mmap'd anon non-compound page starts out with a ref of 3
+ */
+#define check_refs(_ptr, _length, _refs)                                       \
+	({                                                                     \
+		struct iommu_test_cmd test_cmd = {                             \
+			.size = sizeof(test_cmd),                              \
+			.op = IOMMU_TEST_OP_MD_CHECK_REFS,                     \
+			.check_refs = { .length = _length,                     \
+					.uptr = (uintptr_t)(_ptr),             \
+					.refs = _refs },                       \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),  \
+				&test_cmd));                                   \
+	})
+
+static int _test_cmd_create_access(int fd, unsigned int ioas_id,
+				   __u32 *access_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_CREATE_ACCESS,
+		.id = ioas_id,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret)
+		return ret;
+	*access_id = cmd.create_access.out_access_id;
+	return 0;
+}
+#define test_cmd_create_access(ioas_id, access_id) \
+	ASSERT_EQ(0, _test_cmd_create_access(self->fd, ioas_id, access_id))
+
+static int _test_cmd_destroy_access(int fd, unsigned int access_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DESTROY_ACCESS,
+		.id = access_id,
+	};
+	return ioctl(fd, IOMMU_TEST_CMD, &cmd);
+}
+#define test_cmd_destroy_access(access_id) \
+	ASSERT_EQ(0, _test_cmd_destroy_access(self->fd, access_id))
+
+static int _test_cmd_destroy_access_item(int fd, unsigned int access_id,
+					 unsigned int access_item_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DESTROY_ACCESS_ITEM,
+		.id = access_id,
+		.destroy_access_item = { .access_item_id = access_item_id },
+	};
+	return ioctl(fd, IOMMU_TEST_CMD, &cmd);
+}
+#define test_cmd_destroy_access_item(access_id, access_item_id)         \
+	ASSERT_EQ(0, _test_cmd_destroy_access_item(self->fd, access_id, \
+						   access_item_id))
+
+static int _test_ioctl_destroy(int fd, unsigned int id)
+{
+	struct iommu_destroy cmd = {
+		.size = sizeof(cmd),
+		.id = id,
+	};
+	return ioctl(fd, IOMMU_DESTROY, &cmd);
+}
+#define test_ioctl_destroy(id) \
+	ASSERT_EQ(0, _test_ioctl_destroy(self->fd, id))
+
+static int _test_ioctl_ioas_alloc(int fd, __u32 *id)
+{
+	struct iommu_ioas_alloc cmd = {
+		.size = sizeof(cmd),
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_IOAS_ALLOC, &cmd);
+	if (ret)
+		return ret;
+	*id = cmd.out_ioas_id;
+	return 0;
+}
+#define test_ioctl_ioas_alloc(id)                                   \
+	({                                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_alloc(self->fd, id)); \
+		ASSERT_NE(0, *(id));                                \
+	})
+
+static void teardown_iommufd(int fd, struct __test_metadata *_metadata)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_REFS,
+		.check_refs = { .length = BUFFER_SIZE,
+				.uptr = (uintptr_t)buffer },
+	};
+
+	EXPECT_EQ(0, close(fd));
+
+	fd = open("/dev/iommu", O_RDWR);
+	EXPECT_NE(-1, fd);
+	EXPECT_EQ(0, ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),
+			   &test_cmd));
+	EXPECT_EQ(0, close(fd));
+}
+
+#define EXPECT_ERRNO(expected_errno, cmd)                                      \
+	({                                                                     \
+		ASSERT_EQ(-1, cmd);                                            \
+		EXPECT_EQ(expected_errno, errno);                              \
+	})
+
+FIXTURE(iommufd) {
+	int fd;
+};
+
+FIXTURE_SETUP(iommufd) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(iommufd) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd, simple_close)
+{
+}
+
+TEST_F(iommufd, cmd_fail)
+{
+	struct iommu_destroy cmd = { .size = sizeof(cmd), .id = 0 };
+
+	/* object id is invalid */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Bad pointer */
+	EXPECT_ERRNO(EFAULT, ioctl(self->fd, IOMMU_DESTROY, NULL));
+	/* Unknown ioctl */
+	EXPECT_ERRNO(ENOTTY,
+		     ioctl(self->fd, _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE - 1),
+			   &cmd));
+}
+
+TEST_F(iommufd, cmd_ex_fail)
+{
+	struct {
+		struct iommu_destroy cmd;
+		__u64 future;
+	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
+
+	/* object id is invalid and command is longer */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* future area is non-zero */
+	cmd.future = 1;
+	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Original command "works" */
+	cmd.cmd.size = sizeof(cmd.cmd);
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Short command fails */
+	cmd.cmd.size = sizeof(cmd.cmd) - 1;
+	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+}
+
+FIXTURE(iommufd_ioas) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint64_t base_iova;
+};
+
+FIXTURE_VARIANT(iommufd_ioas) {
+	unsigned int mock_domains;
+	unsigned int memory_limit;
+};
+
+FIXTURE_SETUP(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = variant->memory_limit},
+	};
+	unsigned int i;
+
+	if (!variant->memory_limit)
+		memlimit_cmd.memory_limit.limit = 65536;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		struct iommu_test_cmd test_cmd = {
+			.size = sizeof(test_cmd),
+			.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+			.id = self->ioas_id,
+		};
+
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_id = test_cmd.id;
+		self->base_iova = MOCK_APERTURE_START;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = 65536},
+	};
+
+	EXPECT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, no_domain) {
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain) {
+	.mock_domains = 1,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain) {
+	.mock_domains = 2,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_limit) {
+	.mock_domains = 1,
+	.memory_limit = 16,
+};
+
+TEST_F(iommufd_ioas, ioas_auto_destroy)
+{
+}
+
+TEST_F(iommufd_ioas, ioas_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+
+	if (self->domain_id) {
+		/* IOAS cannot be freed while a domain is on it */
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	} else {
+		/* Can allocate and manually free an IOAS table */
+		test_ioctl_destroy(self->ioas_id);
+	}
+}
+
+TEST_F(iommufd_ioas, ioas_area_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+		.iova = self->base_iova,
+	};
+
+	/* Adding an area does not change ability to destroy */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	if (self->domain_id)
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	else
+		test_ioctl_destroy(self->ioas_id);
+}
+
+TEST_F(iommufd_ioas, ioas_area_auto_destroy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Can allocate and automatically free an IOAS table with many areas */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Unmap fails if nothing is mapped */
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = i * PAGE_SIZE;
+		EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+	}
+
+	/* Unmap works */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Split fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	unmap_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	unmap_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Over map fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 3;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* unmap all works */
+	unmap_cmd.iova = 0;
+	unmap_cmd.length = UINT64_MAX;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+}
+
+TEST_F(iommufd_ioas, unmap_fully_contained_areas)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Give no_domain some space to rewind base_iova */
+	self->base_iova += 4 * PAGE_SIZE;
+
+	for (i = 0; i != 4; i++) {
+		map_cmd.iova = self->base_iova + i * 16 * PAGE_SIZE;
+		map_cmd.length = 8 * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+
+	/* Unmap not fully contained area doesn't work */
+	unmap_cmd.iova = self->base_iova - 4 * PAGE_SIZE;
+	unmap_cmd.length = 8 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	unmap_cmd.iova = self->base_iova + 3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE - 4 * PAGE_SIZE;
+	unmap_cmd.length = 8 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Unmap fully contained areas works */
+	unmap_cmd.iova = self->base_iova - 4 * PAGE_SIZE;
+	unmap_cmd.length = 3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE + 4 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	ASSERT_EQ(32 * PAGE_SIZE, unmap_cmd.length);
+}
+
+TEST_F(iommufd_ioas, area_auto_iova)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_iova_range ranges[1] = {};
+	struct iommu_ioas_allow_iovas allow_cmd = {
+		.size = sizeof(allow_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = 1,
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	uint64_t iovas[10];
+	int i;
+
+	/* Simple 4k pages */
+	for (i = 0; i != 10; i++) {
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Kernel automatically aligns IOVAs properly */
+	if (self->domain_id)
+		map_cmd.user_va = (uintptr_t)buffer;
+	else
+		map_cmd.user_va = 1UL << 31;
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Avoids a reserved region */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+		EXPECT_EQ(false,
+			  map_cmd.iova > test_cmd.add_reserved.start &&
+				  map_cmd.iova <
+					  test_cmd.add_reserved.start +
+						  test_cmd.add_reserved.length);
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Allowed region intersects with a reserved region */
+	ranges[0].start = PAGE_SIZE;
+	ranges[0].last = PAGE_SIZE * 600;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+
+	/* Allocate from an allowed region */
+	if (self->domain_id) {
+		ranges[0].start =  MOCK_APERTURE_START + PAGE_SIZE;
+		ranges[0].last = MOCK_APERTURE_START + PAGE_SIZE * 600 - 1;
+	} else {
+		ranges[0].start = PAGE_SIZE * 200;
+		ranges[0].last = PAGE_SIZE * 600 - 1;
+	}
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+		EXPECT_EQ(true, map_cmd.iova >= ranges[0].start);
+		EXPECT_EQ(true, map_cmd.iova <= ranges[0].last);
+		EXPECT_EQ(true,
+			  map_cmd.iova + map_cmd.length > ranges[0].start);
+		EXPECT_EQ(true,
+			  map_cmd.iova + map_cmd.length <= ranges[0].last + 1);
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, area_allowed)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_iova_range ranges[1] = {};
+	struct iommu_ioas_allow_iovas allow_cmd = {
+		.size = sizeof(allow_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = 1,
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+
+	/* Reserved intersects an allowed */
+	allow_cmd.num_iovas = 1;
+	ranges[0].start = self->base_iova;
+	ranges[0].last = ranges[0].start + PAGE_SIZE * 600;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+	test_cmd.add_reserved.start = ranges[0].start + PAGE_SIZE;
+	test_cmd.add_reserved.length = PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			   &test_cmd));
+	allow_cmd.num_iovas = 0;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+
+	/* Allowed intersects a reserved */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	allow_cmd.num_iovas = 1;
+	ranges[0].start = self->base_iova;
+	ranges[0].last = ranges[0].start + PAGE_SIZE * 600;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+}
+
+TEST_F(iommufd_ioas, copy_area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+
+	map_cmd.iova = self->base_iova;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* Copy inside a single IOAS */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = self->base_iova + PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+
+	/* Copy between IOAS's */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = 0;
+	test_ioctl_ioas_alloc(&copy_cmd.dst_ioas_id);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+}
+
+TEST_F(iommufd_ioas, iova_ranges)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE },
+	};
+	struct iommu_ioas_iova_ranges *cmd = (void *)buffer;
+
+	*cmd = (struct iommu_ioas_iova_ranges){
+		.size = BUFFER_SIZE,
+		.ioas_id = self->ioas_id,
+	};
+
+	/* Range can be read */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	if (!self->domain_id) {
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[0].last);
+	} else {
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd);
+	EXPECT_ERRNO(EMSGSIZE,
+		     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].last);
+
+	/* 2 ranges */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	cmd->size = BUFFER_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	if (!self->domain_id) {
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+		EXPECT_EQ(PAGE_SIZE * 2, cmd->out_valid_iovas[1].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[1].last);
+	} else {
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd) + sizeof(cmd->out_valid_iovas[0]);
+	if (!self->domain_id) {
+		EXPECT_ERRNO(EMSGSIZE,
+			     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+	} else {
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
+}
+
+TEST_F(iommufd_ioas, access)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_test_cmd check_map_cmd = {
+		.size = sizeof(check_map_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_MAP,
+		.check_map = { .iova = MOCK_APERTURE_START,
+			       .length = BUFFER_SIZE,
+			       .uptr = (uintptr_t)buffer },
+	};
+	uint32_t access_item_id;
+
+	test_cmd_create_access(self->ioas_id, &access_cmd.id);
+
+	/* Single map/unmap */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+
+	/* Double user */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	access_item_id = access_cmd.access_pages.out_access_item_id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+	test_cmd_destroy_access_item(access_cmd.id, access_item_id);
+
+	/* Add/remove a domain with a user */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	check_map_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),
+			&check_map_cmd));
+
+	test_ioctl_destroy(mock_cmd.mock_domain.device_id);
+	test_ioctl_destroy(mock_cmd.id);
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+	test_cmd_destroy_access(access_cmd.id);
+}
+
+FIXTURE(iommufd_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint32_t domain_ids[2];
+	int mmap_flags;
+	size_t mmap_buf_size;
+};
+
+FIXTURE_VARIANT(iommufd_mock_domain) {
+	unsigned int mock_domains;
+	bool hugepages;
+};
+
+FIXTURE_SETUP(iommufd_mock_domain)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	unsigned int i;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	ASSERT_GE(ARRAY_SIZE(self->domain_ids), variant->mock_domains);
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		test_cmd.id = self->ioas_id;
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_ids[i] = test_cmd.id;
+	}
+	self->domain_id = self->domain_ids[0];
+
+	self->mmap_flags = MAP_SHARED | MAP_ANONYMOUS;
+	self->mmap_buf_size = PAGE_SIZE * 8;
+	if (variant->hugepages) {
+		/*
+		 * MAP_POPULATE will cause the kernel to fail mmap if THPs are
+		 * not available.
+		 */
+		self->mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
+		self->mmap_buf_size = HUGEPAGE_SIZE * 2;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_mock_domain) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain) {
+	.mock_domains = 1,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains) {
+	.mock_domains = 2,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_hugepage) {
+	.mock_domains = 1,
+	.hugepages = true,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains_hugepage) {
+	.mock_domains = 2,
+	.hugepages = true,
+};
+
+/* Have the kernel check that the user pages made it to the iommu_domain */
+#define check_mock_iova(_ptr, _iova, _length)                                  \
+	({                                                                     \
+		struct iommu_test_cmd check_map_cmd = {                        \
+			.size = sizeof(check_map_cmd),                         \
+			.op = IOMMU_TEST_OP_MD_CHECK_MAP,                      \
+			.id = self->domain_id,                                 \
+			.check_map = { .iova = _iova,                          \
+				       .length = _length,                      \
+				       .uptr = (uintptr_t)(_ptr) },            \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),   \
+				&check_map_cmd));                              \
+		if (self->domain_ids[1]) {                                     \
+			check_map_cmd.id = self->domain_ids[1];                \
+			ASSERT_EQ(0,                                           \
+				  ioctl(self->fd,                              \
+					_IOMMU_TEST_CMD(                       \
+						IOMMU_TEST_OP_MD_CHECK_MAP),   \
+					&check_map_cmd));                      \
+		}                                                              \
+	})
+
+TEST_F(iommufd_mock_domain, basic)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t buf_size = self->mmap_buf_size;
+	uint8_t *buf;
+
+	/* Simple one page map */
+	map_cmd.user_va = (uintptr_t)buffer;
+	map_cmd.length = PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	check_mock_iova(buffer, map_cmd.iova, map_cmd.length);
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+
+	/* EFAULT half way through mapping */
+	ASSERT_EQ(0, munmap(buf + buf_size / 2, buf_size / 2));
+	map_cmd.user_va = (uintptr_t)buf;
+	map_cmd.length = buf_size;
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* EFAULT on first page */
+	ASSERT_EQ(0, munmap(buf, buf_size / 2));
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t test_step =
+		variant->hugepages ? (self->mmap_buf_size / 16) : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big region,
+	 * less for the hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns_copy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_test_cmd add_mock_pt = {
+		.size = sizeof(add_mock_pt),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	size_t test_step =
+		variant->hugepages ? self->mmap_buf_size / 16 : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big region,
+	 * less for the hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			unsigned int old_id;
+
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+
+			/* Add and destroy a domain while the area exists */
+			add_mock_pt.id = self->ioas_id;
+			ASSERT_EQ(0, ioctl(self->fd,
+					   _IOMMU_TEST_CMD(
+						   IOMMU_TEST_OP_MOCK_DOMAIN),
+					   &add_mock_pt));
+			old_id = self->domain_ids[1];
+			self->domain_ids[1] = add_mock_pt.id;
+
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			test_ioctl_destroy(add_mock_pt.mock_domain.device_id);
+			test_ioctl_destroy(add_mock_pt.id);
+			self->domain_ids[1] = old_id;
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, user_copy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_iova = MOCK_APERTURE_START,
+		.dst_iova = MOCK_APERTURE_START,
+		.length = BUFFER_SIZE,
+	};
+
+	/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
+	test_ioctl_ioas_alloc(&map_cmd.ioas_id);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	test_cmd_create_access(map_cmd.ioas_id, &access_cmd.id);
+
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	copy_cmd.src_ioas_id = map_cmd.ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+	check_mock_iova(buffer, map_cmd.iova, BUFFER_SIZE);
+
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+	test_cmd_destroy_access(access_cmd.id);
+	test_ioctl_destroy(map_cmd.ioas_id);
+}
+
+FIXTURE(vfio_compat_nodev) {
+	int fd;
+};
+
+FIXTURE_SETUP(vfio_compat_nodev) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(vfio_compat_nodev) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(vfio_compat_nodev, simple_ioctls)
+{
+	ASSERT_EQ(VFIO_API_VERSION, ioctl(self->fd, VFIO_GET_API_VERSION));
+	ASSERT_EQ(1, ioctl(self->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, unmap_cmd)
+{
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+	};
+
+	unmap_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.argsz = sizeof(unmap_cmd);
+	unmap_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.flags = 0;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+}
+
+TEST_F(vfio_compat_nodev, map_cmd)
+{
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+		.vaddr = (__u64)buffer,
+	};
+
+	map_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	map_cmd.argsz = sizeof(map_cmd);
+	map_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	/* Requires a domain to be attached */
+	map_cmd.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+}
+
+TEST_F(vfio_compat_nodev, info_cmd)
+{
+	struct vfio_iommu_type1_info info_cmd = {};
+
+	/* Invalid argsz */
+	info_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+
+	info_cmd.argsz = sizeof(info_cmd);
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+}
+
+TEST_F(vfio_compat_nodev, set_iommu_cmd)
+{
+	/* Requires a domain to be attached */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, vfio_ioas)
+{
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_GET,
+	};
+	__u32 ioas_id;
+
+	/* ENODEV if there is no compat ioas */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Invalid id for set */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_SET;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Valid id for set */
+	test_ioctl_ioas_alloc(&ioas_id);
+	vfio_ioas_cmd.ioas_id = ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Same id comes back from get */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(ioas_id, vfio_ioas_cmd.ioas_id);
+
+	/* Clear works */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_CLEAR;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+}
+
+FIXTURE(vfio_compat_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+};
+
+FIXTURE_SETUP(vfio_compat_mock_domain) {
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_SET,
+	};
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	/* Create what VFIO would consider a group */
+	test_ioctl_ioas_alloc(&self->ioas_id);
+	test_cmd.id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &test_cmd));
+	EXPECT_NE(0, test_cmd.id);
+
+	/* Attach it to the vfio compat */
+	vfio_ioas_cmd.ioas_id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+FIXTURE_TEARDOWN(vfio_compat_mock_domain) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(vfio_compat_mock_domain, simple_close)
+{
+}
+
+/* Return true when every byte in the range is set to the value c */
+static bool is_filled(const void *buf, uint8_t c, size_t len)
+{
+	const uint8_t *cbuf = buf;
+
+	for (; len; cbuf++, len--)
+		if (*cbuf != c)
+			return false;
+	return true;
+}
+
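+/*
+ * Execute the ioctl whose length-prefixed command is stored in buffer and
+ * check that the kernel does not write past the length it was given.
+ */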
+#define ioctl_check_buf(fd, cmd)                                               \
+	({                                                                     \
+		size_t _cmd_len = *(__u32 *)buffer;                            \
+									       \
+		memset(buffer + _cmd_len, 0xAA, BUFFER_SIZE - _cmd_len);       \
+		ASSERT_EQ(0, ioctl(fd, cmd, buffer));                          \
+		ASSERT_EQ(true, is_filled(buffer + _cmd_len, 0xAA,             \
+					  BUFFER_SIZE - _cmd_len));            \
+	})
+
+static void check_vfio_info_cap_chain(struct __test_metadata *_metadata,
+				      struct vfio_iommu_type1_info *info_cmd)
+{
+	const struct vfio_info_cap_header *cap;
+
+	ASSERT_GE(info_cmd->argsz, info_cmd->cap_offset + sizeof(*cap));
+	cap = buffer + info_cmd->cap_offset;
+	while (true) {
+		size_t cap_size;
+
+		if (cap->next)
+			cap_size = (buffer + cap->next) - (void *)cap;
+		else
+			cap_size = (buffer + info_cmd->argsz) - (void *)cap;
+
+		switch (cap->id) {
+		case VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE: {
+			struct vfio_iommu_type1_info_cap_iova_range *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(1, data->nr_iovas);
+			EXPECT_EQ(MOCK_APERTURE_START,
+				  data->iova_ranges[0].start);
+			EXPECT_EQ(MOCK_APERTURE_LAST,
+				  data->iova_ranges[0].end);
+			break;
+		}
+		case VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL: {
+			struct vfio_iommu_type1_info_dma_avail *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(sizeof(*data), cap_size);
+			break;
+		}
+		default:
+			ASSERT_EQ(false, true);
+			break;
+		}
+		if (!cap->next)
+			break;
+
+		ASSERT_GE(info_cmd->argsz, cap->next + sizeof(*cap));
+		ASSERT_GE(buffer + cap->next, (void *)cap);
+		cap = buffer + cap->next;
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, get_info)
+{
+	struct vfio_iommu_type1_info *info_cmd = buffer;
+	unsigned int i;
+	size_t caplen;
+
+	/* Pre-cap ABI */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = offsetof(struct vfio_iommu_type1_info, cap_offset),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+
+	/* Read the cap chain size */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = sizeof(*info_cmd),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+	ASSERT_EQ(0, info_cmd->cap_offset);
+	ASSERT_LT(sizeof(*info_cmd), info_cmd->argsz);
+
+	/* Read the caps, kernel should never create a corrupted caps */
+	caplen = info_cmd->argsz;
+	for (i = sizeof(*info_cmd); i < caplen; i++) {
+		*info_cmd = (struct vfio_iommu_type1_info){
+			.argsz = i,
+		};
+		ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+		ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+			  info_cmd->flags);
+		if (!info_cmd->cap_offset)
+			continue;
+		check_vfio_info_cap_chain(_metadata, info_cmd);
+	}
+}
+
+/* FIXME use fault injection to test memory failure paths */
+/* FIXME test VFIO_IOMMU_MAP_DMA */
+/* FIXME test VFIO_IOMMU_UNMAP_DMA */
+/* FIXME test 2k iova alignment */
+/* FIXME cover boundary cases for iopt_access_pages()  */
+
+TEST_HARNESS_MAIN
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH RFC v2 13/13] iommufd: Add a selftest
@ 2022-09-02 19:59   ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-02 19:59 UTC (permalink / raw)
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

Cover the essential functionality of iommufd with a directed test. This
aims to achieve reasonable functional coverage using the in-kernel
selftest framework.

It provides a mock for the iommu_domain that allows it to run without any
HW and the mocking provides a way to directly validate that the PFNs
loaded into the iommu_domain are correct.

The mock also simulates the rare case of PAGE_SIZE > iommu page size as
the mock will typically operate at a 2K iommu page size. This allows
exercising all of the calculations to support this mismatch.

This allows achieving high coverage of the corner cases in the iopt_pages.

However, enabling all of this requires an unusually invasive config
option, one that should never be turned on in a production kernel.
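
For reference, a minimal sketch of the flow the selftest exercises,
condensed from the fixtures in iommufd.c below (illustrative only: error
handling, asserts and teardown are omitted, 'buf' stands in for any
page-aligned allocation, and the mock-domain step uses the IOMMU_TEST_CMD
op from iommufd_test.h):

  int fd = open("/dev/iommu", O_RDWR);

  /* Allocate an IO address space to hold the mappings */
  struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
  ioctl(fd, IOMMU_IOAS_ALLOC, &alloc);

  /* Attach a mock iommu_domain so there are IOPTEs to validate */
  struct iommu_test_cmd mock = {
          .size = sizeof(mock),
          .op = IOMMU_TEST_OP_MOCK_DOMAIN,
          .id = alloc.out_ioas_id,
  };
  ioctl(fd, IOMMU_TEST_CMD, &mock);

  /* Map user memory; the kernel chooses and returns the IOVA */
  struct iommu_ioas_map map = {
          .size = sizeof(map),
          .ioas_id = alloc.out_ioas_id,
          .user_va = (uintptr_t)buf,
          .length = 4096,
  };
  ioctl(fd, IOMMU_IOAS_MAP, &map);
  /* map.iova now points at buf through the mock domain */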

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/Kconfig            |    9 +
 drivers/iommu/iommufd/Makefile           |    2 +
 drivers/iommu/iommufd/device.c           |   61 +
 drivers/iommu/iommufd/iommufd_private.h  |   21 +
 drivers/iommu/iommufd/iommufd_test.h     |   74 ++
 drivers/iommu/iommufd/main.c             |   12 +
 drivers/iommu/iommufd/pages.c            |    4 +
 drivers/iommu/iommufd/selftest.c         |  626 ++++++++++
 tools/testing/selftests/Makefile         |    1 +
 tools/testing/selftests/iommu/.gitignore |    2 +
 tools/testing/selftests/iommu/Makefile   |   11 +
 tools/testing/selftests/iommu/config     |    2 +
 tools/testing/selftests/iommu/iommufd.c  | 1396 ++++++++++++++++++++++
 13 files changed, 2221 insertions(+)
 create mode 100644 drivers/iommu/iommufd/iommufd_test.h
 create mode 100644 drivers/iommu/iommufd/selftest.c
 create mode 100644 tools/testing/selftests/iommu/.gitignore
 create mode 100644 tools/testing/selftests/iommu/Makefile
 create mode 100644 tools/testing/selftests/iommu/config
 create mode 100644 tools/testing/selftests/iommu/iommufd.c

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index fddd453bb0e764..9b41fde7c839c5 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -11,3 +11,12 @@ config IOMMUFD
 	  This would commonly be used in combination with VFIO.
 
 	  If you don't know what to do here, say N.
+
+config IOMMUFD_TEST
+	bool "IOMMU Userspace API Test support"
+	depends on IOMMUFD
+	depends on RUNTIME_TESTING_MENU
+	default n
+	help
+	  This is dangerous, do not enable unless running
+	  tools/testing/selftests/iommu
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 2fdff04000b326..8aeba81800c512 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,4 +8,6 @@ iommufd-y := \
 	pages.o \
 	vfio_compat.o
 
+iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
+
 obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d34bdbcb84a40d..7e6ddf82f34cb9 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -517,3 +517,64 @@ int iommufd_access_rw(struct iommufd_access *access, unsigned long iova,
 	return -EINVAL;
 }
 EXPORT_SYMBOL_GPL(iommufd_access_rw);
+
+#ifdef CONFIG_IOMMUFD_TEST
+/*
+ * Creating a real iommufd_device is too hard; bypass creating an
+ * iommufd_device and go directly to attaching a domain.
+ */
+struct iommufd_hw_pagetable *
+iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+			       struct iommufd_ioas *ioas,
+			       struct device *mock_dev)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	int rc;
+
+	hwpt = iommufd_hw_pagetable_alloc(ictx, ioas, mock_dev);
+	if (IS_ERR(hwpt))
+		return hwpt;
+
+	rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+	if (rc)
+		goto out_hwpt;
+
+	refcount_inc(&hwpt->obj.users);
+	iommufd_object_finalize(ictx, &hwpt->obj);
+	return hwpt;
+
+out_hwpt:
+	iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
+	return ERR_PTR(rc);
+}
+
+void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
+				    struct iommufd_hw_pagetable *hwpt)
+{
+	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
+	refcount_dec(&hwpt->obj.users);
+}
+
+unsigned int iommufd_access_selftest_id(struct iommufd_access *access_pub)
+{
+	struct iommufd_access_priv *access =
+		container_of(access_pub, struct iommufd_access_priv, pub);
+
+	return access->obj.id;
+}
+
+void *iommufd_access_selftest_get(struct iommufd_ctx *ictx,
+				  unsigned int access_id,
+				  struct iommufd_object **out_obj)
+{
+	struct iommufd_object *access_obj;
+
+	access_obj =
+		iommufd_get_object(ictx, access_id, IOMMUFD_OBJ_ACCESS);
+	if (IS_ERR(access_obj))
+		return ERR_CAST(access_obj);
+	*out_obj = access_obj;
+	return container_of(access_obj, struct iommufd_access_priv, obj)->data;
+}
+
+#endif
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d87227cc08a47d..0b414b6a00f061 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -103,6 +103,9 @@ enum iommufd_object_type {
 	IOMMUFD_OBJ_HW_PAGETABLE,
 	IOMMUFD_OBJ_IOAS,
 	IOMMUFD_OBJ_ACCESS,
+#ifdef CONFIG_IOMMUFD_TEST
+	IOMMUFD_OBJ_SELFTEST,
+#endif
 };
 
 /* Base struct for all objects with a userspace ID handle. */
@@ -242,4 +245,22 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 void iommufd_device_destroy(struct iommufd_object *obj);
 
 void iommufd_access_destroy_object(struct iommufd_object *obj);
+
+#ifdef CONFIG_IOMMUFD_TEST
+struct iommufd_access;
+struct iommufd_hw_pagetable *
+iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
+			       struct iommufd_ioas *ioas,
+			       struct device *mock_dev);
+void iommufd_device_selftest_detach(struct iommufd_ctx *ictx,
+				    struct iommufd_hw_pagetable *hwpt);
+unsigned int iommufd_access_selftest_id(struct iommufd_access *access_pub);
+void *iommufd_access_selftest_get(struct iommufd_ctx *ictx,
+				  unsigned int access_id,
+				  struct iommufd_object **out_obj);
+int iommufd_test(struct iommufd_ucmd *ucmd);
+void iommufd_selftest_destroy(struct iommufd_object *obj);
+extern size_t iommufd_test_memory_limit;
+#endif
+
 #endif
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
new file mode 100644
index 00000000000000..485f44394dbe9b
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_IOMMUFD_TEST_H
+#define _UAPI_IOMMUFD_TEST_H
+
+#include <linux/types.h>
+#include <linux/iommufd.h>
+
+enum {
+	IOMMU_TEST_OP_ADD_RESERVED,
+	IOMMU_TEST_OP_MOCK_DOMAIN,
+	IOMMU_TEST_OP_MD_CHECK_MAP,
+	IOMMU_TEST_OP_MD_CHECK_REFS,
+	IOMMU_TEST_OP_CREATE_ACCESS,
+	IOMMU_TEST_OP_DESTROY_ACCESS,
+	IOMMU_TEST_OP_DESTROY_ACCESS_ITEM,
+	IOMMU_TEST_OP_ACCESS_PAGES,
+	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+};
+
+enum {
+	MOCK_APERTURE_START = 1UL << 24,
+	MOCK_APERTURE_LAST = (1UL << 31) - 1,
+};
+
+enum {
+	MOCK_FLAGS_ACCESS_WRITE = 1 << 0,
+};
+
+struct iommu_test_cmd {
+	__u32 size;
+	__u32 op;
+	__u32 id;
+	union {
+		struct {
+			__u32 device_id;
+		} mock_domain;
+		struct {
+			__aligned_u64 start;
+			__aligned_u64 length;
+		} add_reserved;
+		struct {
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} check_map;
+		struct {
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+			__u32 refs;
+		} check_refs;
+		struct {
+			__u32 out_access_id;
+		} create_access;
+		struct {
+			__u32 flags;
+			__u32 out_access_item_id;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 uptr;
+		} access_pages;
+		struct {
+			__u32 access_item_id;
+		} destroy_access_item;
+		struct {
+			__u32 limit;
+		} memory_limit;
+	};
+	__u32 last;
+};
+#define IOMMU_TEST_CMD _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE + 32)
+
+#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 549d6a4c8f5575..1097e5f07f8eb9 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -25,6 +25,7 @@
 #include <linux/iommufd.h>
 
 #include "iommufd_private.h"
+#include "iommufd_test.h"
 
 struct iommufd_object_ops {
 	void (*destroy)(struct iommufd_object *obj);
@@ -211,6 +212,9 @@ union ucmd_buffer {
 	struct iommu_ioas_iova_ranges iova_ranges;
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
+#ifdef CONFIG_IOMMUFD_TEST
+	struct iommu_test_cmd test;
+#endif
 };
 
 struct iommufd_ioctl_op {
@@ -245,6 +249,9 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+#ifdef CONFIG_IOMMUFD_TEST
+	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
+#endif
 };
 
 static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
@@ -345,6 +352,11 @@ static struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_HW_PAGETABLE] = {
 		.destroy = iommufd_hw_pagetable_destroy,
 	},
+#ifdef CONFIG_IOMMUFD_TEST
+	[IOMMUFD_OBJ_SELFTEST] = {
+		.destroy = iommufd_selftest_destroy,
+	},
+#endif
 };
 
 static struct miscdevice iommu_misc_dev = {
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 91db42dd6aaeaa..59a55f0a35b2af 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -48,7 +48,11 @@
 
 #include "io_pagetable.h"
 
+#ifndef CONFIG_IOMMUFD_TEST
 #define TEMP_MEMORY_LIMIT 65536
+#else
+#define TEMP_MEMORY_LIMIT iommufd_test_memory_limit
+#endif
 #define BATCH_BACKUP_SIZE 32
 
 /*
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
new file mode 100644
index 00000000000000..e9c178048a1284
--- /dev/null
+++ b/drivers/iommu/iommufd/selftest.c
@@ -0,0 +1,626 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ *
+ * Kernel side components to support tools/testing/selftests/iommu
+ */
+#include <linux/slab.h>
+#include <linux/iommu.h>
+#include <linux/xarray.h>
+
+#include "iommufd_private.h"
+#include "iommufd_test.h"
+
+size_t iommufd_test_memory_limit = 65536;
+
+enum {
+	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
+
+	/*
+	 * Like a real page table, alignment requires the low bits of the
+	 * address to be zero. The xarray also requires the high bit to be
+	 * zero, so we store the pfns shifted; the upper bits are used for
+	 * metadata.
+	 */
+	MOCK_PFN_MASK = ULONG_MAX / MOCK_IO_PAGE_SIZE,
+
+	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
+	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+};
+
+struct mock_iommu_domain {
+	struct iommu_domain domain;
+	struct xarray pfns;
+};
+
+enum selftest_obj_type {
+	TYPE_IDEV,
+};
+
+struct selftest_obj {
+	struct iommufd_object obj;
+	enum selftest_obj_type type;
+
+	union {
+		struct {
+			struct iommufd_hw_pagetable *hwpt;
+			struct iommufd_ctx *ictx;
+			struct device mock_dev;
+		} idev;
+	};
+};
+
+static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+{
+	struct mock_iommu_domain *mock;
+
+	if (WARN_ON(iommu_domain_type != IOMMU_DOMAIN_UNMANAGED))
+		return NULL;
+
+	mock = kzalloc(sizeof(*mock), GFP_KERNEL);
+	if (!mock)
+		return NULL;
+	mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
+	mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+	mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
+	xa_init(&mock->pfns);
+	return &mock->domain;
+}
+
+static void mock_domain_free(struct iommu_domain *domain)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+
+	WARN_ON(!xa_empty(&mock->pfns));
+	kfree(mock);
+}
+
+static int mock_domain_map_pages(struct iommu_domain *domain,
+				 unsigned long iova, phys_addr_t paddr,
+				 size_t pgsize, size_t pgcount, int prot,
+				 gfp_t gfp, size_t *mapped)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = MOCK_PFN_START_IOVA;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			void *old;
+
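+			/*
+			 * Tag the first IOPTE of the whole map_pages() call
+			 * with MOCK_PFN_START_IOVA and the final one with
+			 * MOCK_PFN_LAST_IOVA so unmap_pages() can verify that
+			 * iommufd only unmaps whole previously mapped ranges.
+			 */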
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				flags = MOCK_PFN_LAST_IOVA;
+			old = xa_store(&mock->pfns, iova / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value((paddr / MOCK_IO_PAGE_SIZE) | flags),
+				       GFP_KERNEL);
+			if (xa_is_err(old))
+				return xa_err(old);
+			WARN_ON(old);
+			iova += MOCK_IO_PAGE_SIZE;
+			paddr += MOCK_IO_PAGE_SIZE;
+			*mapped += MOCK_IO_PAGE_SIZE;
+			flags = 0;
+		}
+	}
+	return 0;
+}
+
+static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
+				      unsigned long iova, size_t pgsize,
+				      size_t pgcount,
+				      struct iommu_iotlb_gather *iotlb_gather)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	bool first = true;
+	size_t ret = 0;
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);
+
+	for (; pgcount; pgcount--) {
+		size_t cur;
+
+		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
+			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+			WARN_ON(!ent);
+			/*
+			 * iommufd generates unmaps that must be a strict
+			 * superset of the maps it performed, so every starting
+			 * IOVA here should have been a starting IOVA passed to
+			 * map_pages() and the final IOVA should have been the
+			 * final IOVA of a map_pages() call.
+			 */
+			if (first) {
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_START_IOVA));
+				first = false;
+			}
+			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
+				WARN_ON(!(xa_to_value(ent) &
+					  MOCK_PFN_LAST_IOVA));
+
+			iova += MOCK_IO_PAGE_SIZE;
+			ret += MOCK_IO_PAGE_SIZE;
+		}
+	}
+	return ret;
+}
+
+static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
+					    dma_addr_t iova)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(iova % MOCK_IO_PAGE_SIZE);
+	ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	WARN_ON(!ent);
+	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
+}
+
+static const struct iommu_ops mock_ops = {
+	.owner = THIS_MODULE,
+	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
+	.domain_alloc = mock_domain_alloc,
+	.default_domain_ops =
+		&(struct iommu_domain_ops){
+			.free = mock_domain_free,
+			.map_pages = mock_domain_map_pages,
+			.unmap_pages = mock_domain_unmap_pages,
+			.iova_to_phys = mock_domain_iova_to_phys,
+		},
+};
+
+static inline struct iommufd_hw_pagetable *
+get_md_pagetable(struct iommufd_ucmd *ucmd, u32 mockpt_id,
+		 struct mock_iommu_domain **mock)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_object *obj;
+
+	obj = iommufd_get_object(ucmd->ictx, mockpt_id,
+				 IOMMUFD_OBJ_HW_PAGETABLE);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+	hwpt = container_of(obj, struct iommufd_hw_pagetable, obj);
+	if (hwpt->domain->ops != mock_ops.default_domain_ops) {
+		iommufd_put_object(&hwpt->obj);
+		return ERR_PTR(-EINVAL);
+	}
+	*mock = container_of(hwpt->domain, struct mock_iommu_domain, domain);
+	return hwpt;
+}
+
+/* Create an hw_pagetable with the mock domain so we can test the domain ops */
+static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
+				    struct iommu_test_cmd *cmd)
+{
+	static struct bus_type mock_bus = { .iommu_ops = &mock_ops };
+	struct iommufd_hw_pagetable *hwpt;
+	struct selftest_obj *sobj;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	sobj = iommufd_object_alloc(ucmd->ictx, sobj, IOMMUFD_OBJ_SELFTEST);
+	if (IS_ERR(sobj)) {
+		rc = PTR_ERR(sobj);
+		goto out_ioas;
+	}
+	sobj->idev.ictx = ucmd->ictx;
+	sobj->type = TYPE_IDEV;
+	sobj->idev.mock_dev.bus = &mock_bus;
+
+	hwpt = iommufd_device_selftest_attach(ucmd->ictx, ioas,
+					      &sobj->idev.mock_dev);
+	if (IS_ERR(hwpt)) {
+		rc = PTR_ERR(hwpt);
+		goto out_sobj;
+	}
+	sobj->idev.hwpt = hwpt;
+
+	cmd->id = hwpt->obj.id;
+	cmd->mock_domain.device_id = sobj->obj.id;
+	iommufd_object_finalize(ucmd->ictx, &sobj->obj);
+	iommufd_put_object(&ioas->obj);
+	return iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_sobj:
+	iommufd_object_abort(ucmd->ictx, &sobj->obj);
+out_ioas:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Add an additional reserved IOVA to the IOAS */
+static int iommufd_test_add_reserved(struct iommufd_ucmd *ucmd,
+				     unsigned int mockpt_id,
+				     unsigned long start, size_t length)
+{
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, mockpt_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+	down_write(&ioas->iopt.iova_rwsem);
+	rc = iopt_reserve_iova(&ioas->iopt, start, start + length - 1, NULL);
+	up_write(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
+
+/* Check that every pfn under each iova matches the pfn under a user VA */
+static int iommufd_test_md_check_pa(struct iommufd_ucmd *ucmd,
+				    unsigned int mockpt_id, unsigned long iova,
+				    size_t length, void __user *uptr)
+{
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc;
+
+	if (iova % MOCK_IO_PAGE_SIZE || length % MOCK_IO_PAGE_SIZE ||
+	    (uintptr_t)uptr % MOCK_IO_PAGE_SIZE)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	for (; length; length -= MOCK_IO_PAGE_SIZE) {
+		struct page *pages[1];
+		unsigned long pfn;
+		long npages;
+		void *ent;
+
+		npages = get_user_pages_fast((uintptr_t)uptr & PAGE_MASK, 1, 0,
+					     pages);
+		if (npages < 0) {
+			rc = npages;
+			goto out_put;
+		}
+		if (WARN_ON(npages != 1)) {
+			rc = -EFAULT;
+			goto out_put;
+		}
+		pfn = page_to_pfn(pages[0]);
+		put_page(pages[0]);
+
+		ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+		if (!ent ||
+		    (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE !=
+			    pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {
+			rc = -EINVAL;
+			goto out_put;
+		}
+		iova += MOCK_IO_PAGE_SIZE;
+		uptr += MOCK_IO_PAGE_SIZE;
+	}
+	rc = 0;
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+/* Check that the page ref count matches, to look for missing pin/unpins */
+static int iommufd_test_md_check_refs(struct iommufd_ucmd *ucmd,
+				      void __user *uptr, size_t length,
+				      unsigned int refs)
+{
+	if (length % PAGE_SIZE || (uintptr_t)uptr % PAGE_SIZE)
+		return -EINVAL;
+
+	for (; length; length -= PAGE_SIZE) {
+		struct page *pages[1];
+		long npages;
+
+		npages = get_user_pages_fast((uintptr_t)uptr, 1, 0, pages);
+		if (npages < 0)
+			return npages;
+		if (WARN_ON(npages != 1))
+			return -EFAULT;
+		if (!PageCompound(pages[0])) {
+			unsigned int count;
+
+			count = page_ref_count(pages[0]);
+			if (count / GUP_PIN_COUNTING_BIAS != refs) {
+				put_page(pages[0]);
+				return -EIO;
+			}
+		}
+		put_page(pages[0]);
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+struct selftest_access {
+	struct iommufd_access *access;
+	spinlock_t lock;
+	struct list_head items;
+	unsigned int next_id;
+	bool destroying;
+};
+
+struct selftest_access_item {
+	struct list_head items_elm;
+	unsigned long iova;
+	unsigned long iova_end;
+	size_t length;
+	unsigned int id;
+};
+
+static void iommufd_test_access_unmap(void *data, unsigned long iova,
+				      unsigned long length)
+{
+	struct selftest_access *staccess = data;
+	struct selftest_access_item *item;
+	unsigned long iova_end = iova + length - 1;
+
+	spin_lock(&staccess->lock);
+	list_for_each_entry(item, &staccess->items, items_elm) {
+		if (iova <= item->iova_end && iova_end >= item->iova) {
+			list_del(&item->items_elm);
+			spin_unlock(&staccess->lock);
+			iommufd_access_unpin_pages(staccess->access, item->iova,
+						   item->length);
+			kfree(item);
+			return;
+		}
+	}
+	spin_unlock(&staccess->lock);
+}
+
+static struct iommufd_access_ops selftest_access_ops = {
+	.unmap = iommufd_test_access_unmap,
+};
+
+static int iommufd_test_create_access(struct iommufd_ucmd *ucmd,
+				      unsigned int ioas_id)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct selftest_access *staccess;
+	int rc;
+
+	staccess = kzalloc(sizeof(*staccess), GFP_KERNEL_ACCOUNT);
+	if (!staccess)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&staccess->items);
+	spin_lock_init(&staccess->lock);
+
+	staccess->access = iommufd_access_create(
+		ucmd->ictx, ioas_id, &selftest_access_ops, staccess);
+	if (IS_ERR(staccess->access)) {
+		rc = PTR_ERR(staccess->access);
+		goto out_free;
+	}
+	cmd->create_access.out_access_id =
+		iommufd_access_selftest_id(staccess->access);
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_destroy;
+
+	return 0;
+
+out_destroy:
+	iommufd_access_destroy(staccess->access);
+out_free:
+	kfree(staccess);
+	return rc;
+}
+
+static int iommufd_test_destroy_access(struct iommufd_ucmd *ucmd,
+				       unsigned int access_id)
+{
+	struct selftest_access *staccess;
+	struct iommufd_object *access_obj;
+
+	staccess =
+		iommufd_access_selftest_get(ucmd->ictx, access_id, &access_obj);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+	iommufd_put_object(access_obj);
+
+	spin_lock(&staccess->lock);
+	if (!list_empty(&staccess->items) || staccess->destroying) {
+		spin_unlock(&staccess->lock);
+		return -EBUSY;
+	}
+	staccess->destroying = true;
+	spin_unlock(&staccess->lock);
+
+	/* FIXME: this holds a reference on the object even after the fd is closed */
+	iommufd_access_destroy(staccess->access);
+	kfree(staccess);
+	return 0;
+}
+
+/* Check that the pages in a page array match the pages in the user VA */
+static int iommufd_test_check_pages(void __user *uptr, struct page **pages,
+				    size_t npages)
+{
+	for (; npages; npages--) {
+		struct page *tmp_pages[1];
+		long rc;
+
+		rc = get_user_pages_fast((uintptr_t)uptr, 1, 0, tmp_pages);
+		if (rc < 0)
+			return rc;
+		if (WARN_ON(rc != 1))
+			return -EFAULT;
+		put_page(tmp_pages[0]);
+		if (tmp_pages[0] != *pages)
+			return -EBADE;
+		pages++;
+		uptr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
+				     unsigned int access_id, unsigned long iova,
+				     size_t length, void __user *uptr,
+				     u32 flags)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_object *access_obj;
+	struct selftest_access_item *item;
+	struct selftest_access *staccess;
+	struct page **pages;
+	size_t npages;
+	int rc;
+
+	if (flags & ~MOCK_FLAGS_ACCESS_WRITE)
+		return -EOPNOTSUPP;
+
+	staccess =
+		iommufd_access_selftest_get(ucmd->ictx, access_id, &access_obj);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	npages = (ALIGN(iova + length, PAGE_SIZE) -
+		  ALIGN_DOWN(iova, PAGE_SIZE)) /
+		 PAGE_SIZE;
+	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL_ACCOUNT);
+	if (!pages) {
+		rc = -ENOMEM;
+		goto out_put;
+	}
+
+	rc = iommufd_access_pin_pages(staccess->access, iova, length, pages,
+				      flags & MOCK_FLAGS_ACCESS_WRITE);
+	if (rc)
+		goto out_free_pages;
+
+	rc = iommufd_test_check_pages(
+		uptr - (iova - ALIGN_DOWN(iova, PAGE_SIZE)), pages, npages);
+	if (rc)
+		goto out_unaccess;
+
+	item = kzalloc(sizeof(*item), GFP_KERNEL_ACCOUNT);
+	if (!item) {
+		rc = -ENOMEM;
+		goto out_unaccess;
+	}
+
+	item->iova = iova;
+	item->length = length;
+	spin_lock(&staccess->lock);
+	item->id = staccess->next_id++;
+	list_add_tail(&item->items_elm, &staccess->items);
+	spin_unlock(&staccess->lock);
+
+	cmd->access_pages.out_access_item_id = item->id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_free_item;
+	goto out_free_pages;
+
+out_free_item:
+	spin_lock(&staccess->lock);
+	list_del(&item->items_elm);
+	spin_unlock(&staccess->lock);
+	kfree(item);
+out_unaccess:
+	iommufd_access_unpin_pages(staccess->access, iova, length);
+out_free_pages:
+	kvfree(pages);
+out_put:
+	iommufd_put_object(access_obj);
+	return rc;
+}
+
+static int iommufd_test_access_item_destroy(struct iommufd_ucmd *ucmd,
+					    unsigned int access_id,
+					    unsigned int item_id)
+{
+	struct iommufd_object *access_obj;
+	struct selftest_access_item *item;
+	struct selftest_access *staccess;
+
+	staccess =
+		iommufd_access_selftest_get(ucmd->ictx, access_id, &access_obj);
+	if (IS_ERR(staccess))
+		return PTR_ERR(staccess);
+
+	spin_lock(&staccess->lock);
+	list_for_each_entry(item, &staccess->items, items_elm) {
+		if (item->id == item_id) {
+			list_del(&item->items_elm);
+			spin_unlock(&staccess->lock);
+			iommufd_access_unpin_pages(staccess->access, item->iova,
+						   item->length);
+			kfree(item);
+			iommufd_put_object(access_obj);
+			return 0;
+		}
+	}
+	spin_unlock(&staccess->lock);
+	iommufd_put_object(access_obj);
+	return -ENOENT;
+}
+
+void iommufd_selftest_destroy(struct iommufd_object *obj)
+{
+	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
+
+	switch (sobj->type) {
+	case TYPE_IDEV:
+		iommufd_device_selftest_detach(sobj->idev.ictx,
+					       sobj->idev.hwpt);
+		break;
+	}
+}
+
+int iommufd_test(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+
+	switch (cmd->op) {
+	case IOMMU_TEST_OP_ADD_RESERVED:
+		return iommufd_test_add_reserved(ucmd, cmd->id,
+						 cmd->add_reserved.start,
+						 cmd->add_reserved.length);
+	case IOMMU_TEST_OP_MOCK_DOMAIN:
+		return iommufd_test_mock_domain(ucmd, cmd);
+	case IOMMU_TEST_OP_MD_CHECK_MAP:
+		return iommufd_test_md_check_pa(
+			ucmd, cmd->id, cmd->check_map.iova,
+			cmd->check_map.length,
+			u64_to_user_ptr(cmd->check_map.uptr));
+	case IOMMU_TEST_OP_MD_CHECK_REFS:
+		return iommufd_test_md_check_refs(
+			ucmd, u64_to_user_ptr(cmd->check_refs.uptr),
+			cmd->check_refs.length, cmd->check_refs.refs);
+	case IOMMU_TEST_OP_CREATE_ACCESS:
+		return iommufd_test_create_access(ucmd, cmd->id);
+	case IOMMU_TEST_OP_DESTROY_ACCESS:
+		return iommufd_test_destroy_access(ucmd, cmd->id);
+	case IOMMU_TEST_OP_ACCESS_PAGES:
+		return iommufd_test_access_pages(
+			ucmd, cmd->id, cmd->access_pages.iova,
+			cmd->access_pages.length,
+			u64_to_user_ptr(cmd->access_pages.uptr),
+			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_DESTROY_ACCESS_ITEM:
+		return iommufd_test_access_item_destroy(
+			ucmd, cmd->id, cmd->destroy_access_item.access_item_id);
+	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
+		iommufd_test_memory_limit = cmd->memory_limit.limit;
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c2064a35688b08..58a8520542410b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -25,6 +25,7 @@ TARGETS += ftrace
 TARGETS += futex
 TARGETS += gpio
 TARGETS += intel_pstate
+TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
 TARGETS += kcmp
diff --git a/tools/testing/selftests/iommu/.gitignore b/tools/testing/selftests/iommu/.gitignore
new file mode 100644
index 00000000000000..c6bd07e7ff59b3
--- /dev/null
+++ b/tools/testing/selftests/iommu/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/iommufd
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
new file mode 100644
index 00000000000000..7bc38b3beaeb20
--- /dev/null
+++ b/tools/testing/selftests/iommu/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+CFLAGS += -D_GNU_SOURCE
+
+TEST_GEN_PROGS :=
+TEST_GEN_PROGS += iommufd
+
+include ../lib.mk
diff --git a/tools/testing/selftests/iommu/config b/tools/testing/selftests/iommu/config
new file mode 100644
index 00000000000000..6c4f901d6fed3c
--- /dev/null
+++ b/tools/testing/selftests/iommu/config
@@ -0,0 +1,2 @@
+CONFIG_IOMMUFD
+CONFIG_IOMMUFD_TEST
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
new file mode 100644
index 00000000000000..9aea459ba183ec
--- /dev/null
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -0,0 +1,1396 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES */
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/fcntl.h>
+#include <sys/ioctl.h>
+#include <assert.h>
+#include <stddef.h>
+
+#include "../kselftest_harness.h"
+
+#define __EXPORTED_HEADERS__
+#include <linux/iommufd.h>
+#include <linux/vfio.h>
+#include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+
+static void *buffer;
+
+static unsigned long PAGE_SIZE;
+static unsigned long HUGEPAGE_SIZE;
+static unsigned long BUFFER_SIZE;
+
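+/* Matches MOCK_IO_PAGE_SIZE in selftest.c: half the CPU page size */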
+#define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
+
+static unsigned long get_huge_page_size(void)
+{
+	char buf[80];
+	int ret;
+	int fd;
+
+	fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
+		  O_RDONLY);
+	if (fd < 0)
+		return 2 * 1024 * 1024;
+
+	ret = read(fd, buf, sizeof(buf));
+	close(fd);
+	if (ret <= 0 || ret == sizeof(buf))
+		return 2 * 1024 * 1024;
+	buf[ret] = 0;
+	return strtoul(buf, NULL, 10);
+}
+
+static __attribute__((constructor)) void setup_sizes(void)
+{
+	int rc;
+
+	PAGE_SIZE = sysconf(_SC_PAGE_SIZE);
+	HUGEPAGE_SIZE = get_huge_page_size();
+
+	BUFFER_SIZE = PAGE_SIZE * 16;
+	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
+	assert(!rc && buffer && (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+}
+
+/* Hack to make assertions more readable */
+#define _IOMMU_TEST_CMD(x) IOMMU_TEST_CMD
+
+/*
+ * Have the kernel check the refcount on pages. I don't know why a freshly
+ * mmap'd anon non-compound page starts out with a ref of 3
+ */
+#define check_refs(_ptr, _length, _refs)                                       \
+	({                                                                     \
+		struct iommu_test_cmd test_cmd = {                             \
+			.size = sizeof(test_cmd),                              \
+			.op = IOMMU_TEST_OP_MD_CHECK_REFS,                     \
+			.check_refs = { .length = _length,                     \
+					.uptr = (uintptr_t)(_ptr),             \
+					.refs = _refs },                       \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),  \
+				&test_cmd));                                   \
+	})
+
+static int _test_cmd_create_access(int fd, unsigned int ioas_id,
+				   __u32 *access_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_CREATE_ACCESS,
+		.id = ioas_id,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret)
+		return ret;
+	*access_id = cmd.create_access.out_access_id;
+	return 0;
+}
+#define test_cmd_create_access(ioas_id, access_id) \
+	ASSERT_EQ(0, _test_cmd_create_access(self->fd, ioas_id, access_id))
+
+static int _test_cmd_destroy_access(int fd, unsigned int access_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DESTROY_ACCESS,
+		.id = access_id,
+	};
+	return ioctl(fd, IOMMU_TEST_CMD, &cmd);
+}
+#define test_cmd_destroy_access(access_id) \
+	ASSERT_EQ(0, _test_cmd_destroy_access(self->fd, access_id))
+
+static int _test_cmd_destroy_access_item(int fd, unsigned int access_id,
+					 unsigned int access_item_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DESTROY_ACCESS_ITEM,
+		.id = access_id,
+		.destroy_access_item = { .access_item_id = access_item_id },
+	};
+	return ioctl(fd, IOMMU_TEST_CMD, &cmd);
+}
+#define test_cmd_destroy_access_item(access_id, access_item_id)         \
+	ASSERT_EQ(0, _test_cmd_destroy_access_item(self->fd, access_id, \
+						   access_item_id))
+
+static int _test_ioctl_destroy(int fd, unsigned int id)
+{
+	struct iommu_destroy cmd = {
+		.size = sizeof(cmd),
+		.id = id,
+	};
+	return ioctl(fd, IOMMU_DESTROY, &cmd);
+}
+#define test_ioctl_destroy(id) \
+	ASSERT_EQ(0, _test_ioctl_destroy(self->fd, id))
+
+static int _test_ioctl_ioas_alloc(int fd, __u32 *id)
+{
+	struct iommu_ioas_alloc cmd = {
+		.size = sizeof(cmd),
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_IOAS_ALLOC, &cmd);
+	if (ret)
+		return ret;
+	*id = cmd.out_ioas_id;
+	return 0;
+}
+#define test_ioctl_ioas_alloc(id)                                   \
+	({                                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_alloc(self->fd, id)); \
+		ASSERT_NE(0, *(id));                                \
+	})
+
+static void teardown_iommufd(int fd, struct __test_metadata *_metadata)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_REFS,
+		.check_refs = { .length = BUFFER_SIZE,
+				.uptr = (uintptr_t)buffer },
+	};
+
+	EXPECT_EQ(0, close(fd));
+
+	fd = open("/dev/iommu", O_RDWR);
+	EXPECT_NE(-1, fd);
+	EXPECT_EQ(0, ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_REFS),
+			   &test_cmd));
+	EXPECT_EQ(0, close(fd));
+}
+
+#define EXPECT_ERRNO(expected_errno, cmd)                                      \
+	({                                                                     \
+		ASSERT_EQ(-1, cmd);                                            \
+		EXPECT_EQ(expected_errno, errno);                              \
+	})
+
+FIXTURE(iommufd) {
+	int fd;
+};
+
+FIXTURE_SETUP(iommufd) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(iommufd) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd, simple_close)
+{
+}
+
+TEST_F(iommufd, cmd_fail)
+{
+	struct iommu_destroy cmd = { .size = sizeof(cmd), .id = 0 };
+
+	/* object id is invalid */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Bad pointer */
+	EXPECT_ERRNO(EFAULT, ioctl(self->fd, IOMMU_DESTROY, NULL));
+	/* Unknown ioctl */
+	EXPECT_ERRNO(ENOTTY,
+		     ioctl(self->fd, _IO(IOMMUFD_TYPE, IOMMUFD_CMD_BASE - 1),
+			   &cmd));
+}
+
+TEST_F(iommufd, cmd_ex_fail)
+{
+	struct {
+		struct iommu_destroy cmd;
+		__u64 future;
+	} cmd = { .cmd = { .size = sizeof(cmd), .id = 0 } };
+
+	/* object id is invalid and command is longer */
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* future area is non-zero */
+	cmd.future = 1;
+	EXPECT_ERRNO(E2BIG, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Original command "works" */
+	cmd.cmd.size = sizeof(cmd.cmd);
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+	/* Short command fails */
+	cmd.cmd.size = sizeof(cmd.cmd) - 1;
+	EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_DESTROY, &cmd));
+}
+
+FIXTURE(iommufd_ioas) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint64_t base_iova;
+};
+
+FIXTURE_VARIANT(iommufd_ioas) {
+	unsigned int mock_domains;
+	unsigned int memory_limit;
+};
+
+FIXTURE_SETUP(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = variant->memory_limit},
+	};
+	unsigned int i;
+
+	if (!variant->memory_limit)
+		memlimit_cmd.memory_limit.limit = 65536;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		struct iommu_test_cmd test_cmd = {
+			.size = sizeof(test_cmd),
+			.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+			.id = self->ioas_id,
+		};
+
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_id = test_cmd.id;
+		self->base_iova = MOCK_APERTURE_START;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_ioas) {
+	struct iommu_test_cmd memlimit_cmd = {
+		.size = sizeof(memlimit_cmd),
+		.op = IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+		.memory_limit = {.limit = 65536},
+	};
+
+	EXPECT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT),
+			   &memlimit_cmd));
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, no_domain) {
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain) {
+	.mock_domains = 1,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, two_mock_domain) {
+	.mock_domains = 2,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_ioas, mock_domain_limit) {
+	.mock_domains = 1,
+	.memory_limit = 16,
+};
+
+TEST_F(iommufd_ioas, ioas_auto_destroy)
+{
+}
+
+TEST_F(iommufd_ioas, ioas_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+
+	if (self->domain_id) {
+		/* IOAS cannot be freed while a domain is on it */
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	} else {
+		/* Can allocate and manually free an IOAS table */
+		test_ioctl_destroy(self->ioas_id);
+	}
+}
+
+TEST_F(iommufd_ioas, ioas_area_destroy)
+{
+	struct iommu_destroy destroy_cmd = {
+		.size = sizeof(destroy_cmd),
+		.id = self->ioas_id,
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+		.iova = self->base_iova,
+	};
+
+	/* Adding an area does not change ability to destroy */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	if (self->domain_id)
+		EXPECT_ERRNO(EBUSY,
+			     ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	else
+		test_ioctl_destroy(self->ioas_id);
+}
+
+TEST_F(iommufd_ioas, ioas_area_auto_destroy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Can allocate and automatically free an IOAS table with many areas */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Unmap fails if nothing is mapped */
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = i * PAGE_SIZE;
+		EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+	}
+
+	/* Unmap works */
+	for (i = 0; i != 10; i++) {
+		map_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = self->base_iova + i * PAGE_SIZE;
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Split fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	unmap_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	unmap_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Over map fails */
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 16 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE;
+	map_cmd.iova = self->base_iova + 17 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 2;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	map_cmd.length = PAGE_SIZE * 3;
+	map_cmd.iova = self->base_iova + 15 * PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* unmap all works */
+	unmap_cmd.iova = 0;
+	unmap_cmd.length = UINT64_MAX;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+}
+
+TEST_F(iommufd_ioas, unmap_fully_contained_areas)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	int i;
+
+	/* Give no_domain some space to rewind base_iova */
+	self->base_iova += 4 * PAGE_SIZE;
+
+	for (i = 0; i != 4; i++) {
+		map_cmd.iova = self->base_iova + i * 16 * PAGE_SIZE;
+		map_cmd.length = 8 * PAGE_SIZE;
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	}
+
+	/* Unmap not fully contained area doesn't work */
+	unmap_cmd.iova = self->base_iova - 4 * PAGE_SIZE;
+	unmap_cmd.length = 8 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	unmap_cmd.iova = self->base_iova + 3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE - 4 * PAGE_SIZE;
+	unmap_cmd.length = 8 * PAGE_SIZE;
+	EXPECT_ERRNO(ENOENT,
+		     ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+
+	/* Unmap fully contained areas works */
+	unmap_cmd.iova = self->base_iova - 4 * PAGE_SIZE;
+	unmap_cmd.length = 3 * 16 * PAGE_SIZE + 8 * PAGE_SIZE + 4 * PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	ASSERT_EQ(32 * PAGE_SIZE, unmap_cmd.length);
+}
+
+TEST_F(iommufd_ioas, area_auto_iova)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_iova_range ranges[1] = {};
+	struct iommu_ioas_allow_iovas allow_cmd = {
+		.size = sizeof(allow_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = 1,
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+	uint64_t iovas[10];
+	int i;
+
+	/* Simple 4k pages */
+	for (i = 0; i != 10; i++) {
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Kernel automatically aligns IOVAs properly */
+	if (self->domain_id)
+		map_cmd.user_va = (uintptr_t)buffer;
+	else
+		map_cmd.user_va = 1UL << 31;
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Avoids a reserved region */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0,
+			  ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+		EXPECT_EQ(false,
+			  map_cmd.iova > test_cmd.add_reserved.start &&
+				  map_cmd.iova <
+					  test_cmd.add_reserved.start +
+						  test_cmd.add_reserved.length);
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+				   &unmap_cmd));
+	}
+
+	/* Allowed region intersects with a reserved region */
+	ranges[0].start = PAGE_SIZE;
+	ranges[0].last = PAGE_SIZE * 600;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+
+	/* Allocate from an allowed region */
+	if (self->domain_id) {
+		ranges[0].start =  MOCK_APERTURE_START + PAGE_SIZE;
+		ranges[0].last = MOCK_APERTURE_START + PAGE_SIZE * 600 - 1;
+	} else {
+		ranges[0].start = PAGE_SIZE * 200;
+		ranges[0].last = PAGE_SIZE * 600 - 1;
+	}
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+	for (i = 0; i != 10; i++) {
+		map_cmd.length = PAGE_SIZE * (i + 1);
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+		iovas[i] = map_cmd.iova;
+		EXPECT_EQ(0, map_cmd.iova % (1UL << (ffs(map_cmd.length)-1)));
+		EXPECT_EQ(true, map_cmd.iova >= ranges[0].start);
+		EXPECT_EQ(true, map_cmd.iova <= ranges[0].last);
+		EXPECT_EQ(true,
+			  map_cmd.iova + map_cmd.length > ranges[0].start);
+		EXPECT_EQ(true,
+			  map_cmd.iova + map_cmd.length <= ranges[0].last + 1);
+	}
+	for (i = 0; i != 10; i++) {
+		unmap_cmd.length = PAGE_SIZE * (i + 1);
+		unmap_cmd.iova = iovas[i];
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP, &unmap_cmd));
+	}
+}
+
+TEST_F(iommufd_ioas, area_allowed)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE * 4,
+				  .length = PAGE_SIZE * 100 },
+	};
+	struct iommu_iova_range ranges[1] = {};
+	struct iommu_ioas_allow_iovas allow_cmd = {
+		.size = sizeof(allow_cmd),
+		.ioas_id = self->ioas_id,
+		.num_iovas = 1,
+		.allowed_iovas = (uintptr_t)ranges,
+	};
+
+	/* Reserved intersects an allowed */
+	allow_cmd.num_iovas = 1;
+	ranges[0].start = self->base_iova;
+	ranges[0].last = ranges[0].start + PAGE_SIZE * 600;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+	test_cmd.add_reserved.start = ranges[0].start + PAGE_SIZE;
+	test_cmd.add_reserved.length = PAGE_SIZE;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			   &test_cmd));
+	allow_cmd.num_iovas = 0;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+
+	/* Allowed intersects a reserved */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	allow_cmd.num_iovas = 1;
+	ranges[0].start = self->base_iova;
+	ranges[0].last = ranges[0].start + PAGE_SIZE * 600;
+	EXPECT_ERRNO(EADDRINUSE,
+		     ioctl(self->fd, IOMMU_IOAS_ALLOW_IOVAS, &allow_cmd));
+}
+
+TEST_F(iommufd_ioas, copy_area)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.length = PAGE_SIZE,
+		.user_va = (uintptr_t)buffer,
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_ioas_id = self->ioas_id,
+		.length = PAGE_SIZE,
+	};
+
+	map_cmd.iova = self->base_iova;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* Copy inside a single IOAS */
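+	/* A copy duplicates the mapping; both IOVAs map the same pages */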
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = self->base_iova + PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+
+	/* Copy between IOAS's */
+	copy_cmd.src_iova = self->base_iova;
+	copy_cmd.dst_iova = 0;
+	test_ioctl_ioas_alloc(&copy_cmd.dst_ioas_id);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+}
+
+TEST_F(iommufd_ioas, iova_ranges)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_ADD_RESERVED,
+		.id = self->ioas_id,
+		.add_reserved = { .start = PAGE_SIZE, .length = PAGE_SIZE },
+	};
+	struct iommu_ioas_iova_ranges *cmd = (void *)buffer;
+
+	*cmd = (struct iommu_ioas_iova_ranges){
+		.size = BUFFER_SIZE,
+		.ioas_id = self->ioas_id,
+	};
+
+	/* Range can be read */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	if (!self->domain_id) {
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[0].last);
+	} else {
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
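+	/* out_num_iovas is still reported so userspace can size a retry */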
+	cmd->size = sizeof(*cmd);
+	EXPECT_ERRNO(EMSGSIZE,
+		     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	EXPECT_EQ(1, cmd->out_num_iovas);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[0].last);
+
+	/* 2 ranges */
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ADD_RESERVED),
+			&test_cmd));
+	cmd->size = BUFFER_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+	if (!self->domain_id) {
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+		EXPECT_EQ(PAGE_SIZE * 2, cmd->out_valid_iovas[1].start);
+		EXPECT_EQ(SIZE_MAX, cmd->out_valid_iovas[1].last);
+	} else {
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	memset(cmd->out_valid_iovas, 0,
+	       sizeof(cmd->out_valid_iovas[0]) * cmd->out_num_iovas);
+
+	/* Buffer too small */
+	cmd->size = sizeof(*cmd) + sizeof(cmd->out_valid_iovas[0]);
+	if (!self->domain_id) {
+		EXPECT_ERRNO(EMSGSIZE,
+			     ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(2, cmd->out_num_iovas);
+		EXPECT_EQ(0, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(PAGE_SIZE - 1, cmd->out_valid_iovas[0].last);
+	} else {
+		ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_IOVA_RANGES, cmd));
+		EXPECT_EQ(1, cmd->out_num_iovas);
+		EXPECT_EQ(MOCK_APERTURE_START, cmd->out_valid_iovas[0].start);
+		EXPECT_EQ(MOCK_APERTURE_LAST, cmd->out_valid_iovas[0].last);
+	}
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].start);
+	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
+}
+
+TEST_F(iommufd_ioas, access)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_test_cmd check_map_cmd = {
+		.size = sizeof(check_map_cmd),
+		.op = IOMMU_TEST_OP_MD_CHECK_MAP,
+		.check_map = { .iova = MOCK_APERTURE_START,
+			       .length = BUFFER_SIZE,
+			       .uptr = (uintptr_t)buffer },
+	};
+	uint32_t access_item_id;
+
+	test_cmd_create_access(self->ioas_id, &access_cmd.id);
+
+	/* Single map/unmap */
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+
+	/* Double user */
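+	/* Two access items may pin the same IOVA range independently */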
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	access_item_id = access_cmd.access_pages.out_access_item_id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+	test_cmd_destroy_access_item(access_cmd.id, access_item_id);
+
+	/* Add/remove a domain with a user */
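+	/* Pages pinned by the access user must appear in the new domain */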
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	check_map_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),
+			&check_map_cmd));
+
+	test_ioctl_destroy(mock_cmd.mock_domain.device_id);
+	test_ioctl_destroy(mock_cmd.id);
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+	test_cmd_destroy_access(access_cmd.id);
+}
+
+FIXTURE(iommufd_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+	uint32_t domain_id;
+	uint32_t domain_ids[2];
+	int mmap_flags;
+	size_t mmap_buf_size;
+};
+
+FIXTURE_VARIANT(iommufd_mock_domain) {
+	unsigned int mock_domains;
+	bool hugepages;
+};
+
+FIXTURE_SETUP(iommufd_mock_domain)
+{
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	unsigned int i;
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+	test_ioctl_ioas_alloc(&self->ioas_id);
+
+	ASSERT_GE(ARRAY_SIZE(self->domain_ids), variant->mock_domains);
+
+	for (i = 0; i != variant->mock_domains; i++) {
+		test_cmd.id = self->ioas_id;
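+		/* The kernel returns the new mock domain's id back in .id */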
+		ASSERT_EQ(0, ioctl(self->fd,
+				   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+				   &test_cmd));
+		EXPECT_NE(0, test_cmd.id);
+		self->domain_ids[i] = test_cmd.id;
+	}
+	self->domain_id = self->domain_ids[0];
+
+	self->mmap_flags = MAP_SHARED | MAP_ANONYMOUS;
+	self->mmap_buf_size = PAGE_SIZE * 8;
+	if (variant->hugepages) {
+		/*
+		 * MAP_POPULATE will cause the kernel to fail mmap if THPs are
+		 * not available.
+		 */
+		self->mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
+		self->mmap_buf_size = HUGEPAGE_SIZE * 2;
+	}
+}
+
+FIXTURE_TEARDOWN(iommufd_mock_domain) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain) {
+	.mock_domains = 1,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains) {
+	.mock_domains = 2,
+	.hugepages = false,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_hugepage) {
+	.mock_domains = 1,
+	.hugepages = true,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains_hugepage) {
+	.mock_domains = 2,
+	.hugepages = true,
+};
+
+/* Have the kernel check that the user pages made it to the iommu_domain */
+#define check_mock_iova(_ptr, _iova, _length)                                  \
+	({                                                                     \
+		struct iommu_test_cmd check_map_cmd = {                        \
+			.size = sizeof(check_map_cmd),                         \
+			.op = IOMMU_TEST_OP_MD_CHECK_MAP,                      \
+			.id = self->domain_id,                                 \
+			.check_map = { .iova = _iova,                          \
+				       .length = _length,                      \
+				       .uptr = (uintptr_t)(_ptr) },            \
+		};                                                             \
+		ASSERT_EQ(0,                                                   \
+			  ioctl(self->fd,                                      \
+				_IOMMU_TEST_CMD(IOMMU_TEST_OP_MD_CHECK_MAP),   \
+				&check_map_cmd));                              \
+		if (self->domain_ids[1]) {                                     \
+			check_map_cmd.id = self->domain_ids[1];                \
+			ASSERT_EQ(0,                                           \
+				  ioctl(self->fd,                              \
+					_IOMMU_TEST_CMD(                       \
+						IOMMU_TEST_OP_MD_CHECK_MAP),   \
+					&check_map_cmd));                      \
+		}                                                              \
+	})
+
+TEST_F(iommufd_mock_domain, basic)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t buf_size = self->mmap_buf_size;
+	uint8_t *buf;
+
+	/* Simple one page map */
+	map_cmd.user_va = (uintptr_t)buffer;
+	map_cmd.length = PAGE_SIZE;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+	check_mock_iova(buffer, map_cmd.iova, map_cmd.length);
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
+		   0);
+	ASSERT_NE(MAP_FAILED, buf);
+
+	/* EFAULT half way through mapping */
+	ASSERT_EQ(0, munmap(buf + buf_size / 2, buf_size / 2));
+	map_cmd.user_va = (uintptr_t)buf;
+	map_cmd.length = buf_size;
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	/* EFAULT on first page */
+	ASSERT_EQ(0, munmap(buf, buf_size / 2));
+	EXPECT_ERRNO(EFAULT,
+		     ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	size_t test_step =
+		variant->hugepages ? (self->mmap_buf_size / 16) : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
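+	/* Nothing is mapped yet, so no page should hold an extra reference */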
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big region,
+	 * less for hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, all_aligns_copy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_ioas_unmap unmap_cmd = {
+		.size = sizeof(unmap_cmd),
+		.ioas_id = self->ioas_id,
+	};
+	struct iommu_test_cmd add_mock_pt = {
+		.size = sizeof(add_mock_pt),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	size_t test_step =
+		variant->hugepages ? self->mmap_buf_size / 16 : MOCK_PAGE_SIZE;
+	size_t buf_size = self->mmap_buf_size;
+	unsigned int start;
+	unsigned int end;
+	uint8_t *buf;
+
+	buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1, 0);
+	ASSERT_NE(MAP_FAILED, buf);
+	check_refs(buf, buf_size, 0);
+
+	/*
+	 * Map every combination of page size and alignment within a big region,
+	 * less for hugepage case as it takes so long to finish.
+	 */
+	for (start = 0; start < buf_size; start += test_step) {
+		map_cmd.user_va = (uintptr_t)buf + start;
+		if (variant->hugepages)
+			end = buf_size;
+		else
+			end = start + MOCK_PAGE_SIZE;
+		for (; end < buf_size; end += MOCK_PAGE_SIZE) {
+			unsigned int old_id;
+
+			map_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP,
+					   &map_cmd));
+
+			/* Add and destroy a domain while the area exists */
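+			/* The new domain is populated from the existing pins */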
+			add_mock_pt.id = self->ioas_id;
+			ASSERT_EQ(0, ioctl(self->fd,
+					   _IOMMU_TEST_CMD(
+						   IOMMU_TEST_OP_MOCK_DOMAIN),
+					   &add_mock_pt));
+			old_id = self->domain_ids[1];
+			self->domain_ids[1] = add_mock_pt.id;
+
+			check_mock_iova(buf + start, map_cmd.iova,
+					map_cmd.length);
+			check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
+				   end / PAGE_SIZE * PAGE_SIZE -
+					   start / PAGE_SIZE * PAGE_SIZE,
+				   1);
+
+			test_ioctl_destroy(add_mock_pt.mock_domain.device_id);
+			test_ioctl_destroy(add_mock_pt.id);
+			self->domain_ids[1] = old_id;
+
+			unmap_cmd.iova = map_cmd.iova;
+			unmap_cmd.length = end - start;
+			ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_UNMAP,
+					   &unmap_cmd));
+		}
+	}
+	check_refs(buf, buf_size, 0);
+	ASSERT_EQ(0, munmap(buf, buf_size));
+}
+
+TEST_F(iommufd_mock_domain, user_copy)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd access_cmd = {
+		.size = sizeof(access_cmd),
+		.op = IOMMU_TEST_OP_ACCESS_PAGES,
+		.access_pages = { .iova = MOCK_APERTURE_START,
+				  .length = BUFFER_SIZE,
+				  .uptr = (uintptr_t)buffer },
+	};
+	struct iommu_ioas_copy copy_cmd = {
+		.size = sizeof(copy_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.dst_ioas_id = self->ioas_id,
+		.src_iova = MOCK_APERTURE_START,
+		.dst_iova = MOCK_APERTURE_START,
+		.length = BUFFER_SIZE,
+	};
+
+	/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
+	test_ioctl_ioas_alloc(&map_cmd.ioas_id);
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	test_cmd_create_access(map_cmd.ioas_id, &access_cmd.id);
+
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_ACCESS_PAGES),
+			&access_cmd));
+	copy_cmd.src_ioas_id = map_cmd.ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, &copy_cmd));
+	check_mock_iova(buffer, map_cmd.iova, BUFFER_SIZE);
+
+	test_cmd_destroy_access_item(
+		access_cmd.id, access_cmd.access_pages.out_access_item_id);
+	test_cmd_destroy_access(access_cmd.id);
+	test_ioctl_destroy(map_cmd.ioas_id);
+}
+
+FIXTURE(vfio_compat_nodev) {
+	int fd;
+};
+
+FIXTURE_SETUP(vfio_compat_nodev) {
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+}
+
+FIXTURE_TEARDOWN(vfio_compat_nodev) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(vfio_compat_nodev, simple_ioctls)
+{
+	ASSERT_EQ(VFIO_API_VERSION, ioctl(self->fd, VFIO_GET_API_VERSION));
+	ASSERT_EQ(1, ioctl(self->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, unmap_cmd)
+{
+	struct vfio_iommu_type1_dma_unmap unmap_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+	};
+
+	unmap_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.argsz = sizeof(unmap_cmd);
+	unmap_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+
+	unmap_cmd.flags = 0;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_UNMAP_DMA, &unmap_cmd));
+}
+
+TEST_F(vfio_compat_nodev, map_cmd)
+{
+	struct vfio_iommu_type1_dma_map map_cmd = {
+		.iova = MOCK_APERTURE_START,
+		.size = PAGE_SIZE,
+		.vaddr = (__u64)buffer,
+	};
+
+	map_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	map_cmd.argsz = sizeof(map_cmd);
+	map_cmd.flags = 1 << 31;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+
+	/* Requires a domain to be attached */
+	map_cmd.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_MAP_DMA, &map_cmd));
+}
+
+TEST_F(vfio_compat_nodev, info_cmd)
+{
+	struct vfio_iommu_type1_info info_cmd = {};
+
+	/* Invalid argsz */
+	info_cmd.argsz = 1;
+	EXPECT_ERRNO(EINVAL, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+
+	info_cmd.argsz = sizeof(info_cmd);
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_IOMMU_GET_INFO, &info_cmd));
+}
+
+TEST_F(vfio_compat_nodev, set_iommu_cmd)
+{
+	/* Requires a domain to be attached */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+TEST_F(vfio_compat_nodev, vfio_ioas)
+{
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_GET,
+	};
+	__u32 ioas_id;
+
+	/* ENODEV if there is no compat ioas */
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Invalid id for set */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_SET;
+	EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Valid id for set*/
+	test_ioctl_ioas_alloc(&ioas_id);
+	vfio_ioas_cmd.ioas_id = ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+
+	/* Same id comes back from get */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(ioas_id, vfio_ioas_cmd.ioas_id);
+
+	/* Clear works */
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_CLEAR;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	vfio_ioas_cmd.op = IOMMU_VFIO_IOAS_GET;
+	EXPECT_ERRNO(ENODEV, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+}
+
+FIXTURE(vfio_compat_mock_domain) {
+	int fd;
+	uint32_t ioas_id;
+};
+
+FIXTURE_SETUP(vfio_compat_mock_domain) {
+	struct iommu_test_cmd test_cmd = {
+		.size = sizeof(test_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+	};
+	struct iommu_vfio_ioas vfio_ioas_cmd = {
+		.size = sizeof(vfio_ioas_cmd),
+		.op = IOMMU_VFIO_IOAS_SET,
+	};
+
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	/* Create what VFIO would consider a group */
+	test_ioctl_ioas_alloc(&self->ioas_id);
+	test_cmd.id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd,
+			   _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &test_cmd));
+	EXPECT_NE(0, test_cmd.id);
+
+	/* Attach it to the vfio compat */
+	vfio_ioas_cmd.ioas_id = self->ioas_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_VFIO_IOAS, &vfio_ioas_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU));
+}
+
+FIXTURE_TEARDOWN(vfio_compat_mock_domain) {
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(vfio_compat_mock_domain, simple_close)
+{
+}
+
+/*
+ * Execute an ioctl command stored in buffer and check that the result does not
+ * overflow memory.
+ */
+static bool is_filled(const void *buf, uint8_t c, size_t len)
+{
+	const uint8_t *cbuf = buf;
+
+	for (; len; cbuf++, len--)
+		if (*cbuf != c)
+			return false;
+	return true;
+}
+
+#define ioctl_check_buf(fd, cmd)                                               \
+	({                                                                     \
+		size_t _cmd_len = *(__u32 *)buffer;                            \
+									       \
+		memset(buffer + _cmd_len, 0xAA, BUFFER_SIZE - _cmd_len);       \
+		ASSERT_EQ(0, ioctl(fd, cmd, buffer));                          \
+		ASSERT_EQ(true, is_filled(buffer + _cmd_len, 0xAA,             \
+					  BUFFER_SIZE - _cmd_len));            \
+	})
+
+static void check_vfio_info_cap_chain(struct __test_metadata *_metadata,
+				      struct vfio_iommu_type1_info *info_cmd)
+{
+	const struct vfio_info_cap_header *cap;
+
+	ASSERT_GE(info_cmd->argsz, info_cmd->cap_offset + sizeof(*cap));
+	cap = buffer + info_cmd->cap_offset;
+	while (true) {
+		size_t cap_size;
+
+		if (cap->next)
+			cap_size = (buffer + cap->next) - (void *)cap;
+		else
+			cap_size = (buffer + info_cmd->argsz) - (void *)cap;
+
+		switch (cap->id) {
+		case VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE: {
+			struct vfio_iommu_type1_info_cap_iova_range *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(1, data->nr_iovas);
+			EXPECT_EQ(MOCK_APERTURE_START,
+				  data->iova_ranges[0].start);
+			EXPECT_EQ(MOCK_APERTURE_LAST,
+				  data->iova_ranges[0].end);
+			break;
+		}
+		case VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL: {
+			struct vfio_iommu_type1_info_dma_avail *data =
+				(void *)cap;
+
+			ASSERT_EQ(1, data->header.version);
+			ASSERT_EQ(sizeof(*data), cap_size);
+			break;
+		}
+		default:
+			ASSERT_EQ(false, true);
+			break;
+		}
+		if (!cap->next)
+			break;
+
+		ASSERT_GE(info_cmd->argsz, cap->next + sizeof(*cap));
+		ASSERT_GE(buffer + cap->next, (void *)cap);
+		cap = buffer + cap->next;
+	}
+}
+
+TEST_F(vfio_compat_mock_domain, get_info)
+{
+	struct vfio_iommu_type1_info *info_cmd = buffer;
+	unsigned int i;
+	size_t caplen;
+
+	/* Pre-cap ABI */
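+	/* argsz ending before cap_offset emulates pre-cap userspace */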
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = offsetof(struct vfio_iommu_type1_info, cap_offset),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+
+	/* Read the cap chain size */
+	*info_cmd = (struct vfio_iommu_type1_info){
+		.argsz = sizeof(*info_cmd),
+	};
+	ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+	ASSERT_NE(0, info_cmd->iova_pgsizes);
+	ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+		  info_cmd->flags);
+	ASSERT_EQ(0, info_cmd->cap_offset);
+	ASSERT_LT(sizeof(*info_cmd), info_cmd->argsz);
+
+	/* Read the caps, kernel should never create a corrupted caps */
+	caplen = info_cmd->argsz;
+	for (i = sizeof(*info_cmd); i < caplen; i++) {
+		*info_cmd = (struct vfio_iommu_type1_info){
+			.argsz = i,
+		};
+		ioctl_check_buf(self->fd, VFIO_IOMMU_GET_INFO);
+		ASSERT_EQ(VFIO_IOMMU_INFO_PGSIZES | VFIO_IOMMU_INFO_CAPS,
+			  info_cmd->flags);
+		if (!info_cmd->cap_offset)
+			continue;
+		check_vfio_info_cap_chain(_metadata, info_cmd);
+	}
+}
+
+/* FIXME use fault injection to test memory failure paths */
+/* FIXME test VFIO_IOMMU_MAP_DMA */
+/* FIXME test VFIO_IOMMU_UNMAP_DMA */
+/* FIXME test 2k iova alignment */
+/* FIXME cover boundary cases for iopt_access_pages()  */
+
+TEST_HARNESS_MAIN
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 03/13] iommufd: File descriptor, context, kconfig and makefiles
  2022-09-02 19:59   ` Jason Gunthorpe
  (?)
@ 2022-09-04  8:19   ` Baolu Lu
  2022-09-09 18:46     ` Jason Gunthorpe
  -1 siblings, 1 reply; 78+ messages in thread
From: Baolu Lu @ 2022-09-04  8:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Alex Williamson, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On 2022/9/3 3:59, Jason Gunthorpe wrote:
> This is the basic infrastructure of a new miscdevice to hold the iommufd
> IOCTL API.
> 
> It provides:
>   - A miscdevice to create file descriptors to run the IOCTL interface over
> 
>   - A table based ioctl dispatch and centralized extendable pre-validation
>     step.
> 
>   - An xarray mapping user ID's to kernel objects. The design has multiple
>     inter-related objects held within in a single IOMMUFD fd
> 
>   - A simple usage count to build a graph of object relations and protect
>     against hostile userspace racing ioctls
> 
> The only IOCTL provided in this patch is the generic 'destroy any object
> by handle' operation.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>   MAINTAINERS                                   |  10 +
>   drivers/iommu/Kconfig                         |   1 +
>   drivers/iommu/Makefile                        |   2 +-
>   drivers/iommu/iommufd/Kconfig                 |  13 +
>   drivers/iommu/iommufd/Makefile                |   5 +
>   drivers/iommu/iommufd/iommufd_private.h       | 110 ++++++
>   drivers/iommu/iommufd/main.c                  | 345 ++++++++++++++++++
>   include/linux/iommufd.h                       |  31 ++
>   include/uapi/linux/iommufd.h                  |  55 +++
>   10 files changed, 572 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/iommu/iommufd/Kconfig
>   create mode 100644 drivers/iommu/iommufd/Makefile
>   create mode 100644 drivers/iommu/iommufd/iommufd_private.h
>   create mode 100644 drivers/iommu/iommufd/main.c
>   create mode 100644 include/linux/iommufd.h
>   create mode 100644 include/uapi/linux/iommufd.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 3b985b19f39d12..4387e787411ebe 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -105,6 +105,7 @@ Code  Seq#    Include File                                           Comments
>   '8'   all                                                            SNP8023 advanced NIC card
>                                                                        <mailto:mcr@solidum.com>
>   ';'   64-7F  linux/vfio.h
> +';'   80-FF  linux/iommufd.h
>   '='   00-3f  uapi/linux/ptp_clock.h                                  <mailto:richardcochran@gmail.com>
>   '@'   00-0F  linux/radeonfb.h                                        conflict!
>   '@'   00-0F  drivers/video/aty/aty128fb.c                            conflict!
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 589517372408ca..abd041f5e00f4c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10609,6 +10609,16 @@ L:	linux-mips@vger.kernel.org
>   S:	Maintained
>   F:	drivers/net/ethernet/sgi/ioc3-eth.c
>   
> +IOMMU FD
> +M:	Jason Gunthorpe <jgg@nvidia.com>
> +M:	Kevin Tian <kevin.tian@intel.com>
> +L:	iommu@lists.linux-foundation.org

This mailing list has already been replaced with iommu@lists.linux.dev.

> +S:	Maintained
> +F:	Documentation/userspace-api/iommufd.rst
> +F:	drivers/iommu/iommufd/
> +F:	include/uapi/linux/iommufd.h
> +F:	include/linux/iommufd.h
> +
>   IOMAP FILESYSTEM LIBRARY
>   M:	Christoph Hellwig <hch@infradead.org>
>   M:	Darrick J. Wong <djwong@kernel.org>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 5c5cb5bee8b626..9ff3d2830f9559 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -177,6 +177,7 @@ config MSM_IOMMU
>   
>   source "drivers/iommu/amd/Kconfig"
>   source "drivers/iommu/intel/Kconfig"
> +source "drivers/iommu/iommufd/Kconfig"
>   
>   config IRQ_REMAP
>   	bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 44475a9b3eeaf9..6d2bc288324704 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,5 +1,5 @@
>   # SPDX-License-Identifier: GPL-2.0
> -obj-y += amd/ intel/ arm/
> +obj-y += amd/ intel/ arm/ iommufd/
>   obj-$(CONFIG_IOMMU_API) += iommu.o
>   obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>   obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> new file mode 100644
> index 00000000000000..fddd453bb0e764
> --- /dev/null
> +++ b/drivers/iommu/iommufd/Kconfig
> @@ -0,0 +1,13 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config IOMMUFD
> +	tristate "IOMMU Userspace API"
> +	select INTERVAL_TREE
> +	select IOMMU_API
> +	default n
> +	help
> +	  Provides /dev/iommu the user API to control the IOMMU subsystem as
> +	  it relates to managing IO page tables that point at user space memory.
> +
> +	  This would commonly be used in combination with VFIO.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> new file mode 100644
> index 00000000000000..a07a8cffe937c6
> --- /dev/null
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +iommufd-y := \
> +	main.o
> +
> +obj-$(CONFIG_IOMMUFD) += iommufd.o
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> new file mode 100644
> index 00000000000000..a65208d6442be7
> --- /dev/null
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -0,0 +1,110 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __IOMMUFD_PRIVATE_H
> +#define __IOMMUFD_PRIVATE_H
> +
> +#include <linux/rwsem.h>
> +#include <linux/xarray.h>
> +#include <linux/refcount.h>
> +#include <linux/uaccess.h>
> +
> +struct iommufd_ctx {
> +	struct file *file;
> +	struct xarray objects;
> +};
> +
> +struct iommufd_ctx *iommufd_fget(int fd);
> +
> +struct iommufd_ucmd {
> +	struct iommufd_ctx *ictx;
> +	void __user *ubuffer;
> +	u32 user_size;
> +	void *cmd;
> +};
> +
> +/* Copy the response in ucmd->cmd back to userspace. */
> +static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
> +				       size_t cmd_len)
> +{
> +	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
> +			 min_t(size_t, ucmd->user_size, cmd_len)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +enum iommufd_object_type {
> +	IOMMUFD_OBJ_NONE,
> +	IOMMUFD_OBJ_ANY = IOMMUFD_OBJ_NONE,
> +};
> +
> +/* Base struct for all objects with a userspace ID handle. */
> +struct iommufd_object {
> +	struct rw_semaphore destroy_rwsem;
> +	refcount_t users;
> +	enum iommufd_object_type type;
> +	unsigned int id;
> +};
> +
> +static inline bool iommufd_lock_obj(struct iommufd_object *obj)
> +{
> +	if (!down_read_trylock(&obj->destroy_rwsem))
> +		return false;
> +	if (!refcount_inc_not_zero(&obj->users)) {
> +		up_read(&obj->destroy_rwsem);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
> +					  enum iommufd_object_type type);
> +static inline void iommufd_put_object(struct iommufd_object *obj)
> +{
> +	refcount_dec(&obj->users);
> +	up_read(&obj->destroy_rwsem);
> +}
> +
> +/**
> + * iommufd_put_object_keep_user() - Release part of the refcount on obj
> + * @obj - Object to release
> + *
> + * Objects have two protections to ensure that userspace has a consistent
> + * experience with destruction. Normally objects are locked so that destroy will
> + * block while there are concurrent users, and wait for the object to be
> + * unlocked.
> + *
> + * However, destroy can also be blocked by holding users reference counts on the
> + * objects, in that case destroy will immediately return EBUSY and will not wait
> + * for reference counts to go to zero.
> + *
> + * This function releases the destroy lock and destroy will return EBUSY.

This reads oddly. Does it release or acquire the destroy lock?

> + *
> + * It should be used in places where the users will be held beyond a single
> + * system call.
> + */
> +static inline void iommufd_put_object_keep_user(struct iommufd_object *obj)
> +{
> +	up_read(&obj->destroy_rwsem);
> +}
> +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj);
> +void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
> +				      struct iommufd_object *obj);
> +void iommufd_object_finalize(struct iommufd_ctx *ictx,
> +			     struct iommufd_object *obj);
> +bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
> +				 struct iommufd_object *obj);
> +struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
> +					     size_t size,
> +					     enum iommufd_object_type type);
> +
> +#define iommufd_object_alloc(ictx, ptr, type)                                  \
> +	container_of(_iommufd_object_alloc(                                    \
> +			     ictx,                                             \
> +			     sizeof(*(ptr)) + BUILD_BUG_ON_ZERO(               \
> +						      offsetof(typeof(*(ptr)), \
> +							       obj) != 0),     \
> +			     type),                                            \
> +		     typeof(*(ptr)), obj)
> +
> +#endif
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> new file mode 100644
> index 00000000000000..a5b1e2302ba59d
> --- /dev/null
> +++ b/drivers/iommu/iommufd/main.c
> @@ -0,0 +1,345 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2021 Intel Corporation
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + *
> + * iommfd provides control over the IOMMU HW objects created by IOMMU kernel
       ^^^^^^ iommufd

> + * drivers. IOMMU HW objects revolve around IO page tables that map incoming DMA
> + * addresses (IOVA) to CPU addresses.
> + *
> + * The API is divided into a general portion that is intended to work with any
> + * kernel IOMMU driver, and a device specific portion that  is intended to be
> + * used with a userspace HW driver paired with the specific kernel driver. This
> + * mechanism allows all the unique functionalities in individual IOMMUs to be
> + * exposed to userspace control.
> + */
> +#define pr_fmt(fmt) "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/bug.h>
> +#include <uapi/linux/iommufd.h>
> +#include <linux/iommufd.h>
> +
> +#include "iommufd_private.h"
> +
> +struct iommufd_object_ops {
> +	void (*destroy)(struct iommufd_object *obj);
> +};
> +static struct iommufd_object_ops iommufd_object_ops[];
> +
> +struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
> +					     size_t size,
> +					     enum iommufd_object_type type)
> +{
> +	struct iommufd_object *obj;
> +	int rc;
> +
> +	obj = kzalloc(size, GFP_KERNEL_ACCOUNT);
> +	if (!obj)
> +		return ERR_PTR(-ENOMEM);
> +	obj->type = type;
> +	init_rwsem(&obj->destroy_rwsem);
> +	refcount_set(&obj->users, 1);
> +
> +	/*
> +	 * Reserve an ID in the xarray but do not publish the pointer yet since
> +	 * the caller hasn't initialized it yet. Once the pointer is published
> +	 * in the xarray and visible to other threads we can't reliably destroy
> +	 * it anymore, so the caller must complete all errorable operations
> +	 * before calling iommufd_object_finalize().
> +	 */
> +	rc = xa_alloc(&ictx->objects, &obj->id, XA_ZERO_ENTRY,
> +		      xa_limit_32b, GFP_KERNEL_ACCOUNT);
> +	if (rc)
> +		goto out_free;
> +	return obj;
> +out_free:
> +	kfree(obj);
> +	return ERR_PTR(rc);
> +}
> +
> +/*
> + * Allow concurrent access to the object. This should only be done once the
> + * system call that created the object is guaranteed to succeed.
> + */
> +void iommufd_object_finalize(struct iommufd_ctx *ictx,
> +			     struct iommufd_object *obj)
> +{
> +	void *old;
> +
> +	old = xa_store(&ictx->objects, obj->id, obj, GFP_KERNEL);
> +	/* obj->id was returned from xa_alloc() so the xa_store() cannot fail */
> +	WARN_ON(old);
> +}
> +
> +/* Undo _iommufd_object_alloc() if iommufd_object_finalize() was not called */
> +void iommufd_object_abort(struct iommufd_ctx *ictx, struct iommufd_object *obj)
> +{
> +	void *old;
> +
> +	old = xa_erase(&ictx->objects, obj->id);
> +	WARN_ON(old);
> +	kfree(obj);
> +}
> +
> +/*
> + * Abort an object that has been fully initialized and needs destroy, but has
> + * not been finalized.
> + */
> +void iommufd_object_abort_and_destroy(struct iommufd_ctx *ictx,
> +				      struct iommufd_object *obj)
> +{
> +	iommufd_object_ops[obj->type].destroy(obj);
> +	iommufd_object_abort(ictx, obj);
> +}
> +
> +struct iommufd_object *iommufd_get_object(struct iommufd_ctx *ictx, u32 id,
> +					  enum iommufd_object_type type)
> +{
> +	struct iommufd_object *obj;
> +
> +	xa_lock(&ictx->objects);
> +	obj = xa_load(&ictx->objects, id);
> +	if (!obj || (type != IOMMUFD_OBJ_ANY && obj->type != type) ||
> +	    !iommufd_lock_obj(obj))
> +		obj = ERR_PTR(-ENOENT);
> +	xa_unlock(&ictx->objects);
> +	return obj;
> +}
> +
> +/*
> + * The caller holds a users refcount and wants to destroy the object. Returns
> + * true if the object was destroyed. In all cases the caller no longer has a
> + * reference on obj.
> + */
> +bool iommufd_object_destroy_user(struct iommufd_ctx *ictx,
> +				 struct iommufd_object *obj)
> +{
> +	/*
> +	 * The purpose of the destroy_rwsem is to ensure deterministic
> +	 * destruction of objects used by external drivers and destroyed by this
> +	 * function. Any temporary increment of the refcount must hold the read
> +	 * side of this, such as during ioctl execution.
> +	 */
> +	down_write(&obj->destroy_rwsem);
> +	xa_lock(&ictx->objects);
> +	refcount_dec(&obj->users);
> +	if (!refcount_dec_if_one(&obj->users)) {
> +		xa_unlock(&ictx->objects);
> +		up_write(&obj->destroy_rwsem);
> +		return false;
> +	}
> +	__xa_erase(&ictx->objects, obj->id);
> +	xa_unlock(&ictx->objects);
> +	up_write(&obj->destroy_rwsem);
> +
> +	iommufd_object_ops[obj->type].destroy(obj);
> +	kfree(obj);
> +	return true;
> +}
> +
> +static int iommufd_destroy(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_destroy *cmd = ucmd->cmd;
> +	struct iommufd_object *obj;
> +
> +	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
> +	if (IS_ERR(obj))
> +		return PTR_ERR(obj);
> +	iommufd_put_object_keep_user(obj);
> +	if (!iommufd_object_destroy_user(ucmd->ictx, obj))
> +		return -EBUSY;
> +	return 0;
> +}
> +
> +static int iommufd_fops_open(struct inode *inode, struct file *filp)
> +{
> +	struct iommufd_ctx *ictx;
> +
> +	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL_ACCOUNT);
> +	if (!ictx)
> +		return -ENOMEM;
> +
> +	xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT);
> +	ictx->file = filp;
> +	filp->private_data = ictx;
> +	return 0;
> +}
> +
> +static int iommufd_fops_release(struct inode *inode, struct file *filp)
> +{
> +	struct iommufd_ctx *ictx = filp->private_data;
> +	struct iommufd_object *obj;
> +
> +	/* Destroy the graph from depth first */
> +	while (!xa_empty(&ictx->objects)) {
> +		unsigned int destroyed = 0;
> +		unsigned long index;
> +
> +		xa_for_each (&ictx->objects, index, obj) {
> +			/*
> +			 * Since we are in release elevated users must come from
> +			 * other objects holding the users. We will eventually
> +			 * destroy the object that holds this one and the next
> +			 * pass will progress it.
> +			 */
> +			if (!refcount_dec_if_one(&obj->users))
> +				continue;
> +			destroyed++;
> +			xa_erase(&ictx->objects, index);
> +			iommufd_object_ops[obj->type].destroy(obj);
> +			kfree(obj);
> +		}
> +		/* Bug related to users refcount */
> +		if (WARN_ON(!destroyed))
> +			break;
> +	}
> +	kfree(ictx);
> +	return 0;
> +}
> +
> +union ucmd_buffer {
> +	struct iommu_destroy destroy;
> +};
> +
> +struct iommufd_ioctl_op {
> +	unsigned int size;
> +	unsigned int min_size;
> +	unsigned int ioctl_num;
> +	int (*execute)(struct iommufd_ucmd *ucmd);
> +};
> +
> +#define IOCTL_OP(_ioctl, _fn, _struct, _last)                                  \
> +	[_IOC_NR(_ioctl) - IOMMUFD_CMD_BASE] = {                               \
> +		.size = sizeof(_struct) +                                      \
> +			BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) <          \
> +					  sizeof(_struct)),                    \
> +		.min_size = offsetofend(_struct, _last),                       \
> +		.ioctl_num = _ioctl,                                           \
> +		.execute = _fn,                                                \
> +	}
> +static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
> +	IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
> +};
> +
> +static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
> +			       unsigned long arg)
> +{
> +	struct iommufd_ucmd ucmd = {};
> +	struct iommufd_ioctl_op *op;
> +	union ucmd_buffer buf;
> +	unsigned int nr;
> +	int ret;
> +
> +	ucmd.ictx = filp->private_data;
> +	ucmd.ubuffer = (void __user *)arg;
> +	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
> +	if (ret)
> +		return ret;
> +
> +	nr = _IOC_NR(cmd);
> +	if (nr < IOMMUFD_CMD_BASE ||
> +	    (nr - IOMMUFD_CMD_BASE) >= ARRAY_SIZE(iommufd_ioctl_ops))
> +		return -ENOIOCTLCMD;
> +	op = &iommufd_ioctl_ops[nr - IOMMUFD_CMD_BASE];
> +	if (op->ioctl_num != cmd)
> +		return -ENOIOCTLCMD;
> +	if (ucmd.user_size < op->min_size)
> +		return -EOPNOTSUPP;
> +
> +	ucmd.cmd = &buf;
> +	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
> +				    ucmd.user_size);
> +	if (ret)
> +		return ret;
> +	ret = op->execute(&ucmd);
> +	return ret;
> +}
> +
> +static const struct file_operations iommufd_fops = {
> +	.owner = THIS_MODULE,
> +	.open = iommufd_fops_open,
> +	.release = iommufd_fops_release,
> +	.unlocked_ioctl = iommufd_fops_ioctl,
> +};
> +
> +/**
> + * iommufd_ctx_get - Get a context reference
> + * @ictx - Context to get
> + *
> + * The caller must already hold a valid reference to ictx.
> + */
> +void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> +	get_file(ictx->file);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_ctx_get);
> +
> +/**
> + * iommufd_ctx_from_file - Acquires a reference to the iommufd context
> + * @file: File to obtain the reference from
> + *
> + * Returns a pointer to the iommufd_ctx, otherwise ERR_PTR. The struct file
> + * remains owned by the caller and the caller must still do fput. On success
> + * the caller is responsible to call iommufd_ctx_put().
> + */
> +struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
> +{
> +	struct iommufd_ctx *ictx;
> +
> +	if (file->f_op != &iommufd_fops)
> +		return ERR_PTR(-EBADFD);
> +	ictx = file->private_data;
> +	iommufd_ctx_get(ictx);
> +	return ictx;
> +}
> +EXPORT_SYMBOL_GPL(iommufd_ctx_from_file);
> +
> +/**
> + * iommufd_ctx_put - Put back a reference
> + * @ictx - Context to put back
> + */
> +void iommufd_ctx_put(struct iommufd_ctx *ictx)
> +{
> +	fput(ictx->file);
> +}
> +EXPORT_SYMBOL_GPL(iommufd_ctx_put);
> +
> +static struct iommufd_object_ops iommufd_object_ops[] = {
> +};
> +
> +static struct miscdevice iommu_misc_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "iommu",
> +	.fops = &iommufd_fops,
> +	.nodename = "iommu",
> +	.mode = 0660,
> +};
> +
> +static int __init iommufd_init(void)
> +{
> +	int ret;
> +
> +	ret = misc_register(&iommu_misc_dev);
> +	if (ret) {
> +		pr_err("Failed to register misc device\n");
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit iommufd_exit(void)
> +{
> +	misc_deregister(&iommu_misc_dev);
> +}
> +
> +module_init(iommufd_init);
> +module_exit(iommufd_exit);
> +
> +MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
> +MODULE_LICENSE("GPL v2");
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> new file mode 100644
> index 00000000000000..c8bbed542e923c
> --- /dev/null
> +++ b/include/linux/iommufd.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2021 Intel Corporation
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __LINUX_IOMMUFD_H
> +#define __LINUX_IOMMUFD_H
> +
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +
> +struct iommufd_ctx;
> +struct file;
> +
> +void iommufd_ctx_get(struct iommufd_ctx *ictx);
> +
> +#if IS_ENABLED(CONFIG_IOMMUFD)
> +struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
> +void iommufd_ctx_put(struct iommufd_ctx *ictx);
> +#else /* !CONFIG_IOMMUFD */
> +static inline struct iommufd_ctx *iommufd_ctx_from_file(struct file *file)
> +{
> +       return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +static inline void iommufd_ctx_put(struct iommufd_ctx *ictx)
> +{
> +}
> +#endif /* CONFIG_IOMMUFD */
> +#endif
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> new file mode 100644
> index 00000000000000..2f7f76ec6db4cb
> --- /dev/null
> +++ b/include/uapi/linux/iommufd.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
> + */
> +#ifndef _UAPI_IOMMUFD_H
> +#define _UAPI_IOMMUFD_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +#define IOMMUFD_TYPE (';')
> +
> +/**
> + * DOC: General ioctl format
> + *
> + * The ioctl mechanism follows a general format to allow for extensibility. Each
> + * ioctl is passed in a structure pointer as the argument providing the size of
> + * the structure in the first u32. The kernel checks that any structure space
> + * beyond what it understands is 0. This allows userspace to use the backward
> + * compatible portion while consistently using the newer, larger, structures.
> + *
> + * ioctls use a standard meaning for common errnos:
> + *
> + *  - ENOTTY: The IOCTL number itself is not supported at all
> + *  - E2BIG: The IOCTL number is supported, but the provided structure has
> + *    non-zero in a part the kernel does not understand.
> + *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
> + *    understood, however a known field has a value the kernel does not
> + *    understand or support.
> + *  - EINVAL: Everything about the IOCTL was understood, but a field is not
> + *    correct.
> + *  - ENOENT: An ID or IOVA provided does not exist.
> + *  - ENOMEM: Out of memory.
> + *  - EOVERFLOW: Mathematics overflowed.
> + *
> + * As well as additional errnos within specific ioctls.
> + */
> +enum {
> +	IOMMUFD_CMD_BASE = 0x80,
> +	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
> +};
> +
> +/**
> + * struct iommu_destroy - ioctl(IOMMU_DESTROY)
> + * @size: sizeof(struct iommu_destroy)
> + * @id: iommufd object ID to destroy. Can by any destroyable object type.

                                              ^^be ?

> + *
> + * Destroy any object held within iommufd.
> + */
> +struct iommu_destroy {
> +	__u32 size;
> +	__u32 id;
> +};
> +#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
> +
> +#endif

Best regards,
baolu

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 02/13] iommufd: Overview documentation
  2022-09-02 19:59   ` Jason Gunthorpe
  (?)
@ 2022-09-07  1:39   ` David Gibson
  2022-09-09 18:52     ` Jason Gunthorpe
  -1 siblings, 1 reply; 78+ messages in thread
From: David Gibson @ 2022-09-07  1:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, Eric Farman, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu


On Fri, Sep 02, 2022 at 04:59:18PM -0300, Jason Gunthorpe wrote:
> From: Kevin Tian <kevin.tian@intel.com>
> 
> Add iommufd to the documentation tree.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/userspace-api/index.rst   |   1 +
>  Documentation/userspace-api/iommufd.rst | 224 ++++++++++++++++++++++++
>  2 files changed, 225 insertions(+)
>  create mode 100644 Documentation/userspace-api/iommufd.rst
> 
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index a61eac0c73f825..3815f013e4aebd 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -25,6 +25,7 @@ place where this information is gathered.
>     ebpf/index
>     ioctl/index
>     iommu
> +   iommufd
>     media/index
>     sysfs-platform_profile
>     vduse
> diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
> new file mode 100644
> index 00000000000000..38035b3822fd23
> --- /dev/null
> +++ b/Documentation/userspace-api/iommufd.rst
> @@ -0,0 +1,224 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=======
> +IOMMUFD
> +=======
> +
> +:Author: Jason Gunthorpe
> +:Author: Kevin Tian
> +
> +Overview
> +========
> +
> +IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
> +IO page tables that point at user space memory. It intends to be general and
> +consumable by any driver that wants to DMA to userspace. Those drivers are

s/Those/These/

> +expected to deprecate any proprietary IOMMU logic, if existing (e.g.

I don't think "proprietary" is an accurate description.  Maybe
"existing" or "bespoke"?

> +vfio_iommu_type1.c).
> +
> +At minimum iommufd provides a universal support of managing I/O address spaces
> +and I/O page tables for all IOMMUs, with room in the design to add non-generic
> +features to cater to specific hardware functionality.
> +
> +In this context the capital letter (IOMMUFD) refers to the subsystem while the
> +small letter (iommufd) refers to the file descriptors created via /dev/iommu to
> +run the user API over.
> +
> +Key Concepts
> +============
> +
> +User Visible Objects
> +--------------------
> +
> +Following IOMMUFD objects are exposed to userspace:
> +
> +- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS) allowing map/unmap
> +  of user space memory into ranges of I/O Virtual Address (IOVA).
> +
> +  The IOAS is a functional replacement for the VFIO container, and like the VFIO
> +  container copies its IOVA map to a list of iommu_domains held within it.
> +
> +- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
> +  external driver.
> +
> +- IOMMUFD_OBJ_HW_PAGETABLE, wrapping an actual hardware I/O page table (i.e. a

s/wrapping/representing/ for consistency.

> +  single struct iommu_domain) managed by the iommu driver.
> +
> +  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and the
> +  IOAS will synchronize its mapping with each member HW_PAGETABLE.
> +
> +All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
> +
> +Linkage between user-visible objects and external kernel datastructures are
> +reflected by dotted line arrows below, with numbers referring to certain

I'm a little bit confused by the reference to "dotted line arrows": I
only see one arrow style in the diagram.

> +operations creating the objects and links::
> +
> +  _________________________________________________________
> + |                         iommufd                         |
> + |       [1]                                               |
> + |  _________________                                      |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |                                     |
> + | |                 |        [3]                 [2]      |
> + | |                 |    ____________         __________  |
> + | |      IOAS       |<--|            |<------|          | |
> + | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
> + | |                 |   |____________|       |__________| |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |                 |         |                   |       |
> + | |_________________|         |                   |       |
> + |         |                   |                   |       |
> + |_________|___________________|___________________|_______|
> +           |                   |                   |
> +           |              _____v______      _______v_____
> +           | PFN storage |            |    |             |
> +           |------------>|iommu_domain|    |struct device|
> +                         |____________|    |_____________|
> +
> +1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. One iommufd can
> +   hold multiple IOAS objects. IOAS is the most generic object and does not
> +   expose interfaces that are specific to single IOMMU drivers. All operations
> +   on the IOAS must operate equally on each of the iommu_domains that are inside
> +   it.
> +
> +2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
> +   to bind a device to an iommufd. The external driver is expected to implement
> +   proper uAPI for userspace to initiate the binding operation. Successful
> +   completion of this operation establishes the desired DMA ownership over the
> +   device. The external driver must set driver_managed_dma flag and must not
> +   touch the device until this operation succeeds.
> +
> +3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
> +   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
> +   allows userspace to initiate the attaching operation. If a compatible
> +   pagetable already exists then it is reused for the attachment. Otherwise a
> +   new pagetable object (and a new iommu_domain) is created. Successful
> +   completion of this operation sets up the linkages among an IOAS, a device and
> +   an iommu_domain. Once this completes the device could do DMA.
> +
> +   Every iommu_domain inside the IOAS is also represented to userspace as a
> +   HW_PAGETABLE object.
> +
> +   NOTE: Future additions to IOMMUFD will provide an API to create and
> +   manipulate the HW_PAGETABLE directly.
> +
> +One device can only bind to one iommufd (due to DMA ownership claim) and attach
> +to at most one IOAS object (no support of PASID yet).
> +
> +Currently only PCI device is allowed.
> +
> +Kernel Datastructure
> +--------------------
> +
> +User visible objects are backed by following datastructures:
> +
> +- iommufd_ioas for IOMMUFD_OBJ_IOAS.
> +- iommufd_device for IOMMUFD_OBJ_DEVICE.
> +- iommufd_hw_pagetable for IOMMUFD_OBJ_HW_PAGETABLE.
> +
> +Several terminologies when looking at these datastructures:
> +
> +- Automatic domain, referring to an iommu domain created automatically when
> +  attaching a device to an IOAS object. This is compatible to the semantics of
> +  VFIO type1.
> +
> +- Manual domain, referring to an iommu domain designated by the user as the
> +  target pagetable to be attached to by a device. Though currently no user API
> +  for userspace to directly create such domain, the datastructure and algorithms
> +  are ready for that usage.
> +
> +- In-kernel user, referring to something like a VFIO mdev that is accessing the
> +  IOAS and using a 'struct page \*' for CPU based access. Such users require an
> +  isolation granularity smaller than what an iommu domain can afford. They must
> +  manually enforce the IOAS constraints on DMA buffers before those buffers can
> +  be accessed by mdev. Though no kernel API for an external driver to bind a
> +  mdev, the datastructure and algorithms are ready for such usage.
> +
> +iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are
> +mapped to memory pages, composed of:
> +
> +- struct io_pagetable holding the IOVA map
> +- struct iopt_areas representing populated portions of IOVA
> +- struct iopt_pages representing the storage of PFNs
> +- struct iommu_domain representing the IO page table in the IOMMU
> +- struct iopt_pages_user representing in-kernel users of PFNs
> +- struct xarray pinned_pfns holding a list of pages pinned by
> +   in-kernel Users
> +
> +The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
> +represents a logical linear array of full PFNs. PFNs are stored in a tiered
> +scheme:
> +
> + 1) iopt_pages::pinned_pfns xarray
> + 2) An iommu_domain
> + 3) The origin of the PFNs, i.e. the userspace pointer

I can't follow what this "tiered scheme" is describing.

> +PFN have to be copied between all combinations of tiers, depending on the
> +configuration (i.e. attached domains and in-kernel users).
> +
> +An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a
> +list of iommu_domains that mirror the IOVA to PFN map.
> +
> +Multiple io_pagetable's, through their iopt_area's, can share a single
> +iopt_pages which avoids multi-pinning and double accounting of page consumption.
> +
> +iommufd_ioas is sharable between subsystems, e.g. VFIO and VDPA, as long as
> +devices managed by different subsystems are bound to the same iommufd.
> +
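To visualize the relationships described above, a heavily simplified,
conceptual sketch (these are not the actual definitions in io_pagetable.h;
the field names are illustrative):

  struct iopt_pages {                   /* the PFN storage */
          void __user *uptr;            /* tier 3: origin, a userspace VA */
          struct xarray pinned_pfns;    /* tier 1: pinned for in-kernel users */
          /* counts of areas/users/domains referencing this storage */
  };

  struct iopt_area {                    /* one populated IOVA range */
          struct interval_tree_node node;  /* indexed by IOVA */
          struct iopt_pages *pages;        /* storage backing this range */
  };

  struct io_pagetable {                 /* the IOVA map of one iommufd_ioas */
          struct rb_root_cached area_itree;  /* all iopt_areas */
          struct xarray domains;   /* tier 2: iommu_domains mirroring the map */
  };
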
> +IOMMUFD User API
> +================
> +
> +.. kernel-doc:: include/uapi/linux/iommufd.h
> +
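Not part of the patch, but a minimal userspace sketch of the uAPI referenced
above: allocate an IOAS and map a buffer at a kernel-chosen IOVA. The ioctls
and field names are taken from include/uapi/linux/iommufd.h in this series
and may differ in detail; the char device is opened as /dev/iommu here while
text elsewhere in this doc calls it /dev/iommufd; error handling is trimmed:

  #include <fcntl.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  static int ioas_map_buffer(void *buf, size_t len, uint64_t *iova_out)
  {
          struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
          struct iommu_ioas_map map = { .size = sizeof(map) };
          int fd = open("/dev/iommu", O_RDWR); /* the iommufd char device */

          if (fd < 0 || ioctl(fd, IOMMU_IOAS_ALLOC, &alloc))
                  return -1;

          map.ioas_id = alloc.out_ioas_id;
          map.user_va = (uintptr_t)buf;
          map.length = len;
          map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
          if (ioctl(fd, IOMMU_IOAS_MAP, &map))
                  return -1;

          *iova_out = map.iova; /* kernel-chosen, FIXED_IOVA was not set */
          return fd;            /* closing the fd destroys the IOAS */
  }
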
> +IOMMUFD Kernel API
> +==================
> +
> +The IOMMUFD kAPI is device-centric, with group-related tricks managed behind the
> +scenes. This allows an external driver calling this kAPI to implement a simple
> +device-centric uAPI for connecting its device to an iommufd, instead of
> +explicitly imposing the group semantics in its uAPI (as VFIO does).
> +
> +.. kernel-doc:: drivers/iommu/iommufd/device.c
> +   :export:
> +
> +VFIO and IOMMUFD
> +----------------
> +
> +Connecting VFIO device to iommufd can be done in two approaches.

s/approaches/ways/

> +
> +The first is a VFIO-compatible way: directly implement the /dev/vfio/vfio
> +container IOCTLs by mapping them into io_pagetable operations. Doing so allows
> +the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to
> +/dev/iommufd or extending VFIO to SET_CONTAINER using an iommufd instead of a
> +container fd.
> +
> +The second approach directly extends VFIO to support a new set of device-centric
> +user APIs based on the aforementioned IOMMUFD kernel API. It requires userspace
> +changes, but it matches the IOMMUFD API semantics better and makes new iommufd
> +features easier to support than the first approach does.
> +
> +Currently both approaches are still work-in-progress.
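For reference, the first approach means a legacy type1 sequence like the
following keeps working unmodified, only the container fd behind it is
served by iommufd's compat layer ("/dev/vfio/26" is just an example group,
error handling trimmed):

  #include <fcntl.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int legacy_type1_map(void *buf, size_t len)
  {
          int container = open("/dev/vfio/vfio", O_RDWR); /* or an iommufd */
          int group = open("/dev/vfio/26", O_RDWR);
          struct vfio_iommu_type1_dma_map dma_map = {
                  .argsz = sizeof(dma_map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = (uintptr_t)buf,
                  .iova = 0,
                  .size = len,
          };

          if (container < 0 || group < 0 ||
              ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) ||
              ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU) ||
              ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map))
                  return -1;
          return container;
  }
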
> +
> +There are still a few gaps to be resolved to catch up with VFIO type1, as
> +documented in iommufd_vfio_check_extension().
> +
> +Future TODOs
> +============
> +
> +Currently IOMMUFD supports only kernel-managed I/O page tables, similar to VFIO
> +type1. New features on the radar include:
> +
> + - Binding iommu_domain's to PASID/SSID
> + - Userspace page tables, for ARM, x86 and S390
> + - Kernel bypass'd invalidation of user page tables
> + - Re-use of the KVM page table in the IOMMU
> + - Dirty page tracking in the IOMMU
> + - Runtime Increase/Decrease of IOPTE size
> + - PRI support with faults resolved in userspace

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 03/13] iommufd: File descriptor, context, kconfig and makefiles
  2022-09-04  8:19   ` Baolu Lu
@ 2022-09-09 18:46     ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-09 18:46 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Alex Williamson, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Sun, Sep 04, 2022 at 04:19:04PM +0800, Baolu Lu wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 589517372408ca..abd041f5e00f4c 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -10609,6 +10609,16 @@ L:	linux-mips@vger.kernel.org
> >   S:	Maintained
> >   F:	drivers/net/ethernet/sgi/ioc3-eth.c
> > +IOMMU FD
> > +M:	Jason Gunthorpe <jgg@nvidia.com>
> > +M:	Kevin Tian <kevin.tian@intel.com>
> > +L:	iommu@lists.linux-foundation.org
> 
> This mailing list has already been replaced with iommu@lists.linux.dev.

It is also not sorted.. I fixed both

> > +/**
> > + * iommufd_put_object_keep_user() - Release part of the refcount on obj
> > + * @obj - Object to release
> > + *
> > + * Objects have two protections to ensure that userspace has a consistent
> > + * experience with destruction. Normally objects are locked so that destroy will
> > + * block while there are concurrent users, and wait for the object to be
> > + * unlocked.
> > + *
> > + * However, destroy can also be blocked by holding users reference counts on the
> > + * objects, in that case destroy will immediately return EBUSY and will not wait
> > + * for reference counts to go to zero.
> > + *
> > + * This function releases the destroy lock and destroy will return EBUSY.
> 
> This reads oddly. Does it release or acquire the destroy lock?

I changed this line to
  This function switches from blocking userspace to returning EBUSY.
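
In case code says it better, a simplified sketch of the two protections
(this is not the actual iommufd implementation, just the idea):

  struct object {
          struct rw_semaphore destroy_rwsem; /* short-term users hold this */
          refcount_t users;                  /* 1 == no long-held references */
  };

  int object_destroy(struct object *obj)
  {
          /* a held users reference makes destroy fail instead of wait */
          if (!refcount_dec_if_one(&obj->users))
                  return -EBUSY;

          /* otherwise wait out any users still holding the lock */
          down_write(&obj->destroy_rwsem);
          /* ... tear the object down ... */
          up_write(&obj->destroy_rwsem);
          return 0;
  }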

And the rest too, thanks

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 02/13] iommufd: Overview documentation
  2022-09-07  1:39   ` David Gibson
@ 2022-09-09 18:52     ` Jason Gunthorpe
  2022-09-12 10:40       ` David Gibson
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-09 18:52 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, Eric Farman, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Wed, Sep 07, 2022 at 11:39:51AM +1000, David Gibson wrote:

> > +expected to deprecate any proprietary IOMMU logic, if existing (e.g.
> 
> I don't think "proprietary" is an accurate description.  Maybe
> "existing" or "bespoke"?

How about "internal"

 These drivers are eventually expected to deprecate any internal IOMMU
 logic, if existing (e.g. vfio_iommu_type1.c).

> > +All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
> > +
> > +Linkage between user-visible objects and external kernel datastructures are
> > +reflected by dotted line arrows below, with numbers referring to certain
> 
> I'm a little bit confused by the reference to "dotted line arrows": I
> only see one arrow style in the diagram.

I think this means all the "dashed lines with arrows"

How about "by the directed lines below"

> > +The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
> > +represents a logical linear array of full PFNs. PFNs are stored in a tiered
> > +scheme:
> > +
> > + 1) iopt_pages::pinned_pfns xarray
> > + 2) An iommu_domain
> > + 3) The origin of the PFNs, i.e. the userspace pointer
> 
> I can't follow what this "tiered scheme" is describing.

Hum, I'm not sure how to address this.

Is this better?

 1) PFNs that have been "software accessed" stored in the
    iopt_pages::pinned_pfns xarray
 2) PFNs stored inside the IOPTEs accessed through an iommu_domain
 3) The origin of the PFNs, i.e. the userspace VA in a mm_struct
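
Or, as pseudo-code with made-up helper names, the lookup order when a PFN
for some index in an iopt_pages is needed:

  static unsigned long get_pfn(struct iopt_pages *pages, unsigned long index)
  {
          unsigned long pfn;

          /* 1) already pinned for a software/in-kernel access? */
          if (lookup_pinned_pfns_xarray(pages, index, &pfn))
                  return pfn;

          /* 2) already present in the IOPTEs of an attached iommu_domain? */
          if (read_back_from_iommu_domain(pages, index, &pfn))
                  return pfn;

          /* 3) the origin: pin_user_pages() on the userspace VA in the mm */
          return pin_from_userspace_va(pages, index);
  }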

Thanks,
Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 02/13] iommufd: Overview documentation
  2022-09-09 18:52     ` Jason Gunthorpe
@ 2022-09-12 10:40       ` David Gibson
  2022-09-27 17:33         ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: David Gibson @ 2022-09-12 10:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, Eric Farman, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu


On Fri, Sep 09, 2022 at 03:52:27PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 11:39:51AM +1000, David Gibson wrote:
> 
> > > +expected to deprecate any proprietary IOMMU logic, if existing (e.g.
> > 
> > I don't think "proprietary" is an accurate description.  Maybe
> > "existing" or "bespoke"?
> 
> How about "internal"

>  These drivers are eventually expected to deprecate any internal IOMMU
>  logic, if existing (e.g. vfio_iommu_type1.c).

That works.

> > > +All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
> > > +
> > > +Linkage between user-visible objects and external kernel datastructures are
> > > +reflected by dotted line arrows below, with numbers referring to certain
> > 
> > I'm a little bit confused by the reference to "dotted line arrows": I
> > only see one arrow style in the diagram.
> 
> I think this means all the "dashed lines with arrows"
> 
> How about "by the directed lines below"

Or simply "reflected by arrows below".

> > > +The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
> > > +represents a logical linear array of full PFNs. PFNs are stored in a tiered
> > > +scheme:
> > > +
> > > + 1) iopt_pages::pinned_pfns xarray
> > > + 2) An iommu_domain
> > > + 3) The origin of the PFNs, i.e. the userspace pointer
> > 
> > I can't follow what this "tiered scheme" is describing.
> 
> Hum, I'm not sure how to address this.
> 
> Is this better?
> 
>  1) PFNs that have been "software accessed" stored in the
>     iopt_pages::pinned_pfns xarray
>  2) PFNs stored inside the IOPTEs accessed through an iommu_domain
>  3) The origin of the PFNs, i.e. the userspace VA in a mm_struct

Hmm.. only slightly.  What about:

   Each iopt_pages represents a logical linear array of full PFNs.  The
   PFNs are ultimately derived from userspace VAs via an mm_struct.
   They are cached in .. <describe the pinned_pfns and iommu_domain
   data structures>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-02 19:59 ` Jason Gunthorpe
                   ` (13 preceding siblings ...)
  (?)
@ 2022-09-13  1:55 ` Tian, Kevin
  2022-09-13  7:28   ` Eric Auger
  -1 siblings, 1 reply; 78+ messages in thread
From: Tian, Kevin @ 2022-09-13  1:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Rodel, Jorg
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

We didn't close the open question of how to get this merged at LPC due to
the audio issue. So let's use mail.

Overall there are three options on the table:

1) Require vfio-compat to be 100% compatible with vfio-type1

   Probably not a good choice given the amount of work to fix the remaining
   gaps. And this will block support of new IOMMU features for a longer time.

2) Leave vfio-compat as what it is in this series

   Treat it as a vehicle to validate the iommufd logic instead of immediately
   replacing vfio-type1. Functionally most vfio applications can work w/o
   change if putting aside the difference on locked mm accounting, p2p, etc.

   Then work on new features and 100% vfio-type1 compat. in parallel.

3) Focus on iommufd native uAPI first

   Require vfio_device cdev and adoption in Qemu. Only for new vfio app.

   Then work on new features and vfio-compat in parallel.

I'm fine with either 2) or 3). Per a quick chat with Alex, he prefers 3).

Jason, how about your opinion?

Thanks
Kevin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, September 3, 2022 3:59 AM
> 
> iommufd is the user API to control the IOMMU subsystem as it relates to
> managing IO page tables that point at user space memory.
> 
> It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
> container) which is the VFIO specific interface for a similar idea.
> 
> We see a broad need for extended features, some being highly IOMMU
> device
> specific:
>  - Binding iommu_domain's to PASID/SSID
>  - Userspace page tables, for ARM, x86 and S390
>  - Kernel bypass'd invalidation of user page tables
>  - Re-use of the KVM page table in the IOMMU
>  - Dirty page tracking in the IOMMU
>  - Runtime Increase/Decrease of IOPTE size
>  - PRI support with faults resolved in userspace
> 
> As well as a need to access these features beyond just VFIO, from VDPA for
> instance. Other classes of accelerator HW are touching on these areas now
> too.
> 
> The pre-v1 series proposed re-using the VFIO type 1 data structure,
> however it was suggested that if we are doing this big update then we
> should also come with an improved data structure that solves the
> limitations that VFIO type1 has. Notably this addresses:
> 
>  - Multiple IOAS/'containers' and multiple domains inside a single FD
> 
>  - Single-pin operation no matter how many domains and containers use
>    a page
> 
>  - A fine grained locking scheme supporting user managed concurrency for
>    multi-threaded map/unmap
> 
>  - A pre-registration mechanism to optimize vIOMMU use cases by
>    pre-pinning pages
> 
>  - Extended ioctl API that can manage these new objects and exposes
>    domains directly to user space
> 
>  - domains are sharable between subsystems, eg VFIO and VDPA
> 
> The bulk of this code is a new data structure design to track how the
> IOVAs are mapped to PFNs.
> 
> iommufd intends to be general and consumable by any driver that wants to
> DMA to userspace. From a driver perspective it can largely be dropped in
> in-place of iommu_attach_device() and provides a uniform full feature set
> to all consumers.
> 
> As this is a larger project this series is the first step. This series
> provides the iommfd "generic interface" which is designed to be suitable
> for applications like DPDK and VMM flows that are not optimized to
> specific HW scenarios. It is close to being a drop in replacement for the
> existing VFIO type 1.
> 
> Several follow-on series are being prepared:
> 
> - Patches integrating with qemu in native mode:
>   https://github.com/yiliu1765/qemu/commits/qemu-iommufd-6.0-rc2
> 
> - A completed integration with VFIO now exists that covers "emulated" mdev
>   use cases now, and can pass testing with qemu/etc in compatability mode:
>   https://github.com/jgunthorpe/linux/commits/vfio_iommufd
> 
> - A draft providing system iommu dirty tracking on top of iommufd,
>   including iommu driver implementations:
>   https://github.com/jpemartins/linux/commits/x86-iommufd
> 
>   This pairs with patches for providing a similar API to support VFIO-device
>   tracking to give a complete vfio solution:
>   https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com/
> 
> - Userspace page tables aka 'nested translation' for ARM and Intel iommu
>   drivers:
>   https://github.com/nicolinc/iommufd/commits/iommufd_nesting
> 
> - "device centric" vfio series to expose the vfio_device FD directly as a
>   normal cdev, and provide an extended API allowing dynamically changing
>   the IOAS binding:
>   https://github.com/yiliu1765/iommufd/commits/iommufd-v6.0-rc2-
> nesting-0901
> 
> - Drafts for PASID and PRI interfaces are included above as well
> 
> Overall enough work is done now to show the merit of the new API design
> and at least draft solutions to many of the main problems.
> 
> Several people have contributed directly to this work: Eric Auger, Joao
> Martins, Kevin Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have
> participated in the discussions that lead here, and provided ideas. Thanks
> to all!
> 
> The v1 iommufd series has been used to guide a large amount of preparatory
> work that has now been merged. The general theme is to organize things in
> a way that makes injecting iommufd natural:
> 
>  - VFIO live migration support with mlx5 and hisi_acc drivers.
>    These series need a dirty tracking solution to be really usable.
>    https://lore.kernel.org/kvm/20220224142024.147653-1-
> yishaih@nvidia.com/
>    https://lore.kernel.org/kvm/20220308184902.2242-1-
> shameerali.kolothum.thodi@huawei.com/
> 
>  - Significantly rework the VFIO gvt mdev and remove struct
>    mdev_parent_ops
>    https://lore.kernel.org/lkml/20220411141403.86980-1-hch@lst.de/
> 
>  - Rework how PCIe no-snoop blocking works
>    https://lore.kernel.org/kvm/0-v3-2cf356649677+a32-
> intel_no_snoop_jgg@nvidia.com/
> 
>  - Consolidate dma ownership into the iommu core code
>    https://lore.kernel.org/linux-iommu/20220418005000.897664-1-
> baolu.lu@linux.intel.com/
> 
>  - Make all vfio driver interfaces use struct vfio_device consistently
>    https://lore.kernel.org/kvm/0-v4-8045e76bf00b+13d-
> vfio_mdev_no_group_jgg@nvidia.com/
> 
>  - Remove the vfio_group from the kvm/vfio interface
>    https://lore.kernel.org/kvm/0-v3-f7729924a7ea+25e33-
> vfio_kvm_no_group_jgg@nvidia.com/
> 
>  - Simplify locking in vfio
>    https://lore.kernel.org/kvm/0-v2-d035a1842d81+1bf-
> vfio_group_locking_jgg@nvidia.com/
> 
>  - Remove the vfio notifiter scheme that faces drivers
>    https://lore.kernel.org/kvm/0-v4-681e038e30fd+78-
> vfio_unmap_notif_jgg@nvidia.com/
> 
>  - Improve the driver facing API for vfio pin/unpin pages to make the
>    presence of struct page clear
>    https://lore.kernel.org/kvm/20220723020256.30081-1-
> nicolinc@nvidia.com/
> 
>  - Clean up in the Intel IOMMU driver
>    https://lore.kernel.org/linux-iommu/20220301020159.633356-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220510023407.2759143-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220514014322.2927339-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220706025524.2904370-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220702015610.2849494-1-
> baolu.lu@linux.intel.com/
> 
>  - Rework s390 vfio drivers
>    https://lore.kernel.org/kvm/20220707135737.720765-1-
> farman@linux.ibm.com/
> 
>  - Normalize vfio ioctl handling
>    https://lore.kernel.org/kvm/0-v2-0f9e632d54fb+d6-
> vfio_ioctl_split_jgg@nvidia.com/
> 
> This is about 168 patches applied since March, thank you to everyone
> involved in all this work!
> 
> Currently there are a number of supporting series still in progress:
>  - Simplify and consolidate iommu_domain/device compatability checking
>    https://lore.kernel.org/linux-iommu/20220815181437.28127-1-
> nicolinc@nvidia.com/
> 
>  - Align iommu SVA support with the domain-centric model
>    https://lore.kernel.org/linux-iommu/20220826121141.50743-1-
> baolu.lu@linux.intel.com/
> 
>  - VFIO API for dirty tracking (aka dma logging) managed inside a PCI
>    device, with mlx5 implementation
>    https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com
> 
>  - Introduce a struct device sysfs presence for struct vfio_device
>    https://lore.kernel.org/kvm/20220901143747.32858-1-
> kevin.tian@intel.com/
> 
>  - Complete restructuring the vfio mdev model
>    https://lore.kernel.org/kvm/20220822062208.152745-1-hch@lst.de/
> 
>  - DMABUF exporter support for VFIO to allow PCI P2P with VFIO
>    https://lore.kernel.org/r/0-v2-472615b3877e+28f7-
> vfio_dma_buf_jgg@nvidia.com
> 
>  - Isolate VFIO container code in preperation for iommufd to provide an
>    alternative implementation of it all
>    https://lore.kernel.org/kvm/0-v1-a805b607f1fb+17b-
> vfio_container_split_jgg@nvidia.com
> 
>  - Start to provide iommu_domain ops for power
>    https://lore.kernel.org/all/20220714081822.3717693-1-aik@ozlabs.ru/
> 
> Right now there is no more preperatory work sketched out, so this is the
> last of it.
> 
> This series remains RFC as there are still several important FIXME's to
> deal with first, but things are on track for non-RFC in the near future.
> 
> This is on github: https://github.com/jgunthorpe/linux/commits/iommufd
> 
> v2:
>  - Rebase to v6.0-rc3
>  - Improve comments
>  - Change to an iterative destruction approach to avoid cycles
>  - Near rewrite of the vfio facing implementation, supported by a complete
>    implementation on the vfio side
>  - New IOMMU_IOAS_ALLOW_IOVAS API as discussed. Allows userspace to
>    assert that ranges of IOVA must always be mappable. To be used by a
> VMM
>    that has promised a guest a certain availability of IOVA. May help
>    guide PPC's multi-window implementation.
>  - Rework how unmap_iova works, user can unmap the whole ioas now
>  - The no-snoop / wbinvd support is implemented
>  - Bug fixes
>  - Test suite improvements
>  - Lots of smaller changes (the interdiff is 3k lines)
> v1: https://lore.kernel.org/r/0-v1-e79cd8d168e8+6-
> iommufd_jgg@nvidia.com
> 
> # S390 in-kernel page table walker
> Cc: Niklas Schnelle <schnelle@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> # AMD Dirty page tracking
> Cc: Joao Martins <joao.m.martins@oracle.com>
> # ARM SMMU Dirty page tracking
> Cc: Keqian Zhu <zhukeqian1@huawei.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> # ARM SMMU nesting
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> # Map/unmap performance
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> # VDPA
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> # Power
> Cc: David Gibson <david@gibson.dropbear.id.au>
> # vfio
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: kvm@vger.kernel.org
> # iommu
> Cc: iommu@lists.linux.dev
> # Collaborators
> Cc: "Chaitanya Kulkarni" <chaitanyak@nvidia.com>
> Cc: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Yi Liu <yi.l.liu@intel.com>
> # s390
> Cc: Eric Farman <farman@linux.ibm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason Gunthorpe (12):
>   interval-tree: Add a utility to iterate over spans in an interval tree
>   iommufd: File descriptor, context, kconfig and makefiles
>   kernel/user: Allow user::locked_vm to be usable for iommufd
>   iommufd: PFN handling for iopt_pages
>   iommufd: Algorithms for PFN storage
>   iommufd: Data structure to provide IOVA to PFN mapping
>   iommufd: IOCTLs for the io_pagetable
>   iommufd: Add a HW pagetable object
>   iommufd: Add kAPI toward external drivers for physical devices
>   iommufd: Add kAPI toward external drivers for kernel access
>   iommufd: vfio container FD ioctl compatibility
>   iommufd: Add a selftest
> 
> Kevin Tian (1):
>   iommufd: Overview documentation
> 
>  .clang-format                                 |    1 +
>  Documentation/userspace-api/index.rst         |    1 +
>  .../userspace-api/ioctl/ioctl-number.rst      |    1 +
>  Documentation/userspace-api/iommufd.rst       |  224 +++
>  MAINTAINERS                                   |   10 +
>  drivers/iommu/Kconfig                         |    1 +
>  drivers/iommu/Makefile                        |    2 +-
>  drivers/iommu/iommufd/Kconfig                 |   22 +
>  drivers/iommu/iommufd/Makefile                |   13 +
>  drivers/iommu/iommufd/device.c                |  580 +++++++
>  drivers/iommu/iommufd/hw_pagetable.c          |   68 +
>  drivers/iommu/iommufd/io_pagetable.c          |  984 ++++++++++++
>  drivers/iommu/iommufd/io_pagetable.h          |  186 +++
>  drivers/iommu/iommufd/ioas.c                  |  338 ++++
>  drivers/iommu/iommufd/iommufd_private.h       |  266 ++++
>  drivers/iommu/iommufd/iommufd_test.h          |   74 +
>  drivers/iommu/iommufd/main.c                  |  392 +++++
>  drivers/iommu/iommufd/pages.c                 | 1301 +++++++++++++++
>  drivers/iommu/iommufd/selftest.c              |  626 ++++++++
>  drivers/iommu/iommufd/vfio_compat.c           |  423 +++++
>  include/linux/interval_tree.h                 |   47 +
>  include/linux/iommufd.h                       |  101 ++
>  include/linux/sched/user.h                    |    2 +-
>  include/uapi/linux/iommufd.h                  |  279 ++++
>  kernel/user.c                                 |    1 +
>  lib/interval_tree.c                           |   98 ++
>  tools/testing/selftests/Makefile              |    1 +
>  tools/testing/selftests/iommu/.gitignore      |    2 +
>  tools/testing/selftests/iommu/Makefile        |   11 +
>  tools/testing/selftests/iommu/config          |    2 +
>  tools/testing/selftests/iommu/iommufd.c       | 1396 +++++++++++++++++
>  31 files changed, 7451 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/userspace-api/iommufd.rst
>  create mode 100644 drivers/iommu/iommufd/Kconfig
>  create mode 100644 drivers/iommu/iommufd/Makefile
>  create mode 100644 drivers/iommu/iommufd/device.c
>  create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.c
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>  create mode 100644 drivers/iommu/iommufd/ioas.c
>  create mode 100644 drivers/iommu/iommufd/iommufd_private.h
>  create mode 100644 drivers/iommu/iommufd/iommufd_test.h
>  create mode 100644 drivers/iommu/iommufd/main.c
>  create mode 100644 drivers/iommu/iommufd/pages.c
>  create mode 100644 drivers/iommu/iommufd/selftest.c
>  create mode 100644 drivers/iommu/iommufd/vfio_compat.c
>  create mode 100644 include/linux/iommufd.h
>  create mode 100644 include/uapi/linux/iommufd.h
>  create mode 100644 tools/testing/selftests/iommu/.gitignore
>  create mode 100644 tools/testing/selftests/iommu/Makefile
>  create mode 100644 tools/testing/selftests/iommu/config
>  create mode 100644 tools/testing/selftests/iommu/iommufd.c
> 
> 
> base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
> --
> 2.37.3


^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-02 19:59 ` Jason Gunthorpe
                   ` (14 preceding siblings ...)
  (?)
@ 2022-09-13  2:05 ` Tian, Kevin
  2022-09-20 20:07   ` Jason Gunthorpe
  -1 siblings, 1 reply; 78+ messages in thread
From: Tian, Kevin @ 2022-09-13  2:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Rodel, Jorg
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, David Gibson, Eric Auger, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

A side open question is about the maintenance model of iommufd.

This series proposes to put its files under drivers/iommu/, while the
logic is relatively self-contained compared to other files in that directory.

Joerg, do you plan to do the same level of review on this series as you did
for other iommu patches, or do you prefer a lighter model that trusts the
existing reviewers in this area (mostly VFIO folks; moving forward this will
also include vdpa, uacces, etc.)?

Thanks
Kevin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, September 3, 2022 3:59 AM
> 
> iommufd is the user API to control the IOMMU subsystem as it relates to
> managing IO page tables that point at user space memory.
> 
> It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
> container) which is the VFIO specific interface for a similar idea.
> 
> We see a broad need for extended features, some being highly IOMMU
> device
> specific:
>  - Binding iommu_domain's to PASID/SSID
>  - Userspace page tables, for ARM, x86 and S390
>  - Kernel bypass'd invalidation of user page tables
>  - Re-use of the KVM page table in the IOMMU
>  - Dirty page tracking in the IOMMU
>  - Runtime Increase/Decrease of IOPTE size
>  - PRI support with faults resolved in userspace
> 
> As well as a need to access these features beyond just VFIO, from VDPA for
> instance. Other classes of accelerator HW are touching on these areas now
> too.
> 
> The pre-v1 series proposed re-using the VFIO type 1 data structure,
> however it was suggested that if we are doing this big update then we
> should also come with an improved data structure that solves the
> limitations that VFIO type1 has. Notably this addresses:
> 
>  - Multiple IOAS/'containers' and multiple domains inside a single FD
> 
>  - Single-pin operation no matter how many domains and containers use
>    a page
> 
>  - A fine grained locking scheme supporting user managed concurrency for
>    multi-threaded map/unmap
> 
>  - A pre-registration mechanism to optimize vIOMMU use cases by
>    pre-pinning pages
> 
>  - Extended ioctl API that can manage these new objects and exposes
>    domains directly to user space
> 
>  - domains are sharable between subsystems, eg VFIO and VDPA
> 
> The bulk of this code is a new data structure design to track how the
> IOVAs are mapped to PFNs.
> 
> iommufd intends to be general and consumable by any driver that wants to
> DMA to userspace. From a driver perspective it can largely be dropped in
> in-place of iommu_attach_device() and provides a uniform full feature set
> to all consumers.
> 
> As this is a larger project this series is the first step. This series
> provides the iommfd "generic interface" which is designed to be suitable
> for applications like DPDK and VMM flows that are not optimized to
> specific HW scenarios. It is close to being a drop in replacement for the
> existing VFIO type 1.
> 
> Several follow-on series are being prepared:
> 
> - Patches integrating with qemu in native mode:
>   https://github.com/yiliu1765/qemu/commits/qemu-iommufd-6.0-rc2
> 
> - A completed integration with VFIO now exists that covers "emulated" mdev
>   use cases now, and can pass testing with qemu/etc in compatability mode:
>   https://github.com/jgunthorpe/linux/commits/vfio_iommufd
> 
> - A draft providing system iommu dirty tracking on top of iommufd,
>   including iommu driver implementations:
>   https://github.com/jpemartins/linux/commits/x86-iommufd
> 
>   This pairs with patches for providing a similar API to support VFIO-device
>   tracking to give a complete vfio solution:
>   https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com/
> 
> - Userspace page tables aka 'nested translation' for ARM and Intel iommu
>   drivers:
>   https://github.com/nicolinc/iommufd/commits/iommufd_nesting
> 
> - "device centric" vfio series to expose the vfio_device FD directly as a
>   normal cdev, and provide an extended API allowing dynamically changing
>   the IOAS binding:
>   https://github.com/yiliu1765/iommufd/commits/iommufd-v6.0-rc2-
> nesting-0901
> 
> - Drafts for PASID and PRI interfaces are included above as well
> 
> Overall enough work is done now to show the merit of the new API design
> and at least draft solutions to many of the main problems.
> 
> Several people have contributed directly to this work: Eric Auger, Joao
> Martins, Kevin Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have
> participated in the discussions that lead here, and provided ideas. Thanks
> to all!
> 
> The v1 iommufd series has been used to guide a large amount of preparatory
> work that has now been merged. The general theme is to organize things in
> a way that makes injecting iommufd natural:
> 
>  - VFIO live migration support with mlx5 and hisi_acc drivers.
>    These series need a dirty tracking solution to be really usable.
>    https://lore.kernel.org/kvm/20220224142024.147653-1-
> yishaih@nvidia.com/
>    https://lore.kernel.org/kvm/20220308184902.2242-1-
> shameerali.kolothum.thodi@huawei.com/
> 
>  - Significantly rework the VFIO gvt mdev and remove struct
>    mdev_parent_ops
>    https://lore.kernel.org/lkml/20220411141403.86980-1-hch@lst.de/
> 
>  - Rework how PCIe no-snoop blocking works
>    https://lore.kernel.org/kvm/0-v3-2cf356649677+a32-
> intel_no_snoop_jgg@nvidia.com/
> 
>  - Consolidate dma ownership into the iommu core code
>    https://lore.kernel.org/linux-iommu/20220418005000.897664-1-
> baolu.lu@linux.intel.com/
> 
>  - Make all vfio driver interfaces use struct vfio_device consistently
>    https://lore.kernel.org/kvm/0-v4-8045e76bf00b+13d-
> vfio_mdev_no_group_jgg@nvidia.com/
> 
>  - Remove the vfio_group from the kvm/vfio interface
>    https://lore.kernel.org/kvm/0-v3-f7729924a7ea+25e33-
> vfio_kvm_no_group_jgg@nvidia.com/
> 
>  - Simplify locking in vfio
>    https://lore.kernel.org/kvm/0-v2-d035a1842d81+1bf-
> vfio_group_locking_jgg@nvidia.com/
> 
>  - Remove the vfio notifiter scheme that faces drivers
>    https://lore.kernel.org/kvm/0-v4-681e038e30fd+78-
> vfio_unmap_notif_jgg@nvidia.com/
> 
>  - Improve the driver facing API for vfio pin/unpin pages to make the
>    presence of struct page clear
>    https://lore.kernel.org/kvm/20220723020256.30081-1-
> nicolinc@nvidia.com/
> 
>  - Clean up in the Intel IOMMU driver
>    https://lore.kernel.org/linux-iommu/20220301020159.633356-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220510023407.2759143-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220514014322.2927339-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220706025524.2904370-1-
> baolu.lu@linux.intel.com/
>    https://lore.kernel.org/linux-iommu/20220702015610.2849494-1-
> baolu.lu@linux.intel.com/
> 
>  - Rework s390 vfio drivers
>    https://lore.kernel.org/kvm/20220707135737.720765-1-
> farman@linux.ibm.com/
> 
>  - Normalize vfio ioctl handling
>    https://lore.kernel.org/kvm/0-v2-0f9e632d54fb+d6-
> vfio_ioctl_split_jgg@nvidia.com/
> 
> This is about 168 patches applied since March, thank you to everyone
> involved in all this work!
> 
> Currently there are a number of supporting series still in progress:
>  - Simplify and consolidate iommu_domain/device compatability checking
>    https://lore.kernel.org/linux-iommu/20220815181437.28127-1-
> nicolinc@nvidia.com/
> 
>  - Align iommu SVA support with the domain-centric model
>    https://lore.kernel.org/linux-iommu/20220826121141.50743-1-
> baolu.lu@linux.intel.com/
> 
>  - VFIO API for dirty tracking (aka dma logging) managed inside a PCI
>    device, with mlx5 implementation
>    https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com
> 
>  - Introduce a struct device sysfs presence for struct vfio_device
>    https://lore.kernel.org/kvm/20220901143747.32858-1-
> kevin.tian@intel.com/
> 
>  - Complete restructuring the vfio mdev model
>    https://lore.kernel.org/kvm/20220822062208.152745-1-hch@lst.de/
> 
>  - DMABUF exporter support for VFIO to allow PCI P2P with VFIO
>    https://lore.kernel.org/r/0-v2-472615b3877e+28f7-
> vfio_dma_buf_jgg@nvidia.com
> 
>  - Isolate VFIO container code in preperation for iommufd to provide an
>    alternative implementation of it all
>    https://lore.kernel.org/kvm/0-v1-a805b607f1fb+17b-
> vfio_container_split_jgg@nvidia.com
> 
>  - Start to provide iommu_domain ops for power
>    https://lore.kernel.org/all/20220714081822.3717693-1-aik@ozlabs.ru/
> 
> Right now there is no more preperatory work sketched out, so this is the
> last of it.
> 
> This series remains RFC as there are still several important FIXME's to
> deal with first, but things are on track for non-RFC in the near future.
> 
> This is on github: https://github.com/jgunthorpe/linux/commits/iommufd
> 
> v2:
>  - Rebase to v6.0-rc3
>  - Improve comments
>  - Change to an iterative destruction approach to avoid cycles
>  - Near rewrite of the vfio facing implementation, supported by a complete
>    implementation on the vfio side
>  - New IOMMU_IOAS_ALLOW_IOVAS API as discussed. Allows userspace to
>    assert that ranges of IOVA must always be mappable. To be used by a
> VMM
>    that has promised a guest a certain availability of IOVA. May help
>    guide PPC's multi-window implementation.
>  - Rework how unmap_iova works, user can unmap the whole ioas now
>  - The no-snoop / wbinvd support is implemented
>  - Bug fixes
>  - Test suite improvements
>  - Lots of smaller changes (the interdiff is 3k lines)
> v1: https://lore.kernel.org/r/0-v1-e79cd8d168e8+6-
> iommufd_jgg@nvidia.com
> 
> # S390 in-kernel page table walker
> Cc: Niklas Schnelle <schnelle@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> # AMD Dirty page tracking
> Cc: Joao Martins <joao.m.martins@oracle.com>
> # ARM SMMU Dirty page tracking
> Cc: Keqian Zhu <zhukeqian1@huawei.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> # ARM SMMU nesting
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> # Map/unmap performance
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> # VDPA
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> # Power
> Cc: David Gibson <david@gibson.dropbear.id.au>
> # vfio
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: kvm@vger.kernel.org
> # iommu
> Cc: iommu@lists.linux.dev
> # Collaborators
> Cc: "Chaitanya Kulkarni" <chaitanyak@nvidia.com>
> Cc: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Yi Liu <yi.l.liu@intel.com>
> # s390
> Cc: Eric Farman <farman@linux.ibm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason Gunthorpe (12):
>   interval-tree: Add a utility to iterate over spans in an interval tree
>   iommufd: File descriptor, context, kconfig and makefiles
>   kernel/user: Allow user::locked_vm to be usable for iommufd
>   iommufd: PFN handling for iopt_pages
>   iommufd: Algorithms for PFN storage
>   iommufd: Data structure to provide IOVA to PFN mapping
>   iommufd: IOCTLs for the io_pagetable
>   iommufd: Add a HW pagetable object
>   iommufd: Add kAPI toward external drivers for physical devices
>   iommufd: Add kAPI toward external drivers for kernel access
>   iommufd: vfio container FD ioctl compatibility
>   iommufd: Add a selftest
> 
> Kevin Tian (1):
>   iommufd: Overview documentation
> 
>  .clang-format                                 |    1 +
>  Documentation/userspace-api/index.rst         |    1 +
>  .../userspace-api/ioctl/ioctl-number.rst      |    1 +
>  Documentation/userspace-api/iommufd.rst       |  224 +++
>  MAINTAINERS                                   |   10 +
>  drivers/iommu/Kconfig                         |    1 +
>  drivers/iommu/Makefile                        |    2 +-
>  drivers/iommu/iommufd/Kconfig                 |   22 +
>  drivers/iommu/iommufd/Makefile                |   13 +
>  drivers/iommu/iommufd/device.c                |  580 +++++++
>  drivers/iommu/iommufd/hw_pagetable.c          |   68 +
>  drivers/iommu/iommufd/io_pagetable.c          |  984 ++++++++++++
>  drivers/iommu/iommufd/io_pagetable.h          |  186 +++
>  drivers/iommu/iommufd/ioas.c                  |  338 ++++
>  drivers/iommu/iommufd/iommufd_private.h       |  266 ++++
>  drivers/iommu/iommufd/iommufd_test.h          |   74 +
>  drivers/iommu/iommufd/main.c                  |  392 +++++
>  drivers/iommu/iommufd/pages.c                 | 1301 +++++++++++++++
>  drivers/iommu/iommufd/selftest.c              |  626 ++++++++
>  drivers/iommu/iommufd/vfio_compat.c           |  423 +++++
>  include/linux/interval_tree.h                 |   47 +
>  include/linux/iommufd.h                       |  101 ++
>  include/linux/sched/user.h                    |    2 +-
>  include/uapi/linux/iommufd.h                  |  279 ++++
>  kernel/user.c                                 |    1 +
>  lib/interval_tree.c                           |   98 ++
>  tools/testing/selftests/Makefile              |    1 +
>  tools/testing/selftests/iommu/.gitignore      |    2 +
>  tools/testing/selftests/iommu/Makefile        |   11 +
>  tools/testing/selftests/iommu/config          |    2 +
>  tools/testing/selftests/iommu/iommufd.c       | 1396 +++++++++++++++++
>  31 files changed, 7451 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/userspace-api/iommufd.rst
>  create mode 100644 drivers/iommu/iommufd/Kconfig
>  create mode 100644 drivers/iommu/iommufd/Makefile
>  create mode 100644 drivers/iommu/iommufd/device.c
>  create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.c
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>  create mode 100644 drivers/iommu/iommufd/ioas.c
>  create mode 100644 drivers/iommu/iommufd/iommufd_private.h
>  create mode 100644 drivers/iommu/iommufd/iommufd_test.h
>  create mode 100644 drivers/iommu/iommufd/main.c
>  create mode 100644 drivers/iommu/iommufd/pages.c
>  create mode 100644 drivers/iommu/iommufd/selftest.c
>  create mode 100644 drivers/iommu/iommufd/vfio_compat.c
>  create mode 100644 include/linux/iommufd.h
>  create mode 100644 include/uapi/linux/iommufd.h
>  create mode 100644 tools/testing/selftests/iommu/.gitignore
>  create mode 100644 tools/testing/selftests/iommu/Makefile
>  create mode 100644 tools/testing/selftests/iommu/config
>  create mode 100644 tools/testing/selftests/iommu/iommufd.c
> 
> 
> base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
> --
> 2.37.3


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-13  1:55 ` [PATCH RFC v2 00/13] IOMMUFD Generic interface Tian, Kevin
@ 2022-09-13  7:28   ` Eric Auger
  2022-09-20 19:56     ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Auger @ 2022-09-13  7:28 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Alex Williamson, Rodel, Jorg
  Cc: Lu Baolu, Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan,
	David Gibson, Eric Farman, iommu, Jason Wang,
	Jean-Philippe Brucker, Martins, Joao, kvm, Matthew Rosato,
	Michael S. Tsirkin, Nicolin Chen, Niklas Schnelle,
	Shameerali Kolothum Thodi, Liu, Yi L, Keqian Zhu

Hi,

On 9/13/22 03:55, Tian, Kevin wrote:
> We didn't close the open question of how to get this merged at LPC due to
> the audio issue. So let's use mail.
>
> Overall there are three options on the table:
>
> 1) Require vfio-compat to be 100% compatible with vfio-type1
>
>    Probably not a good choice given the amount of work to fix the remaining
>    gaps. And this will block support of new IOMMU features for a longer time.
>
> 2) Leave vfio-compat as what it is in this series
>
>    Treat it as a vehicle to validate the iommufd logic instead of immediately
>    replacing vfio-type1. Functionally most vfio applications can work w/o
>    change if putting aside the difference on locked mm accounting, p2p, etc.
>
>    Then work on new features and 100% vfio-type1 compat. in parallel.
>
> 3) Focus on iommufd native uAPI first
>
>    Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
>
>    Then work on new features and vfio-compat in parallel.
>
> I'm fine with either 2) or 3). Per a quick chat with Alex, he prefers 3).

I am also inclined to pursue 3) as this was Jason's initial guidance and a
prerequisite to integrating new features. In the past we concluded
vfio-compat would mostly be used for testing purposes. Our QEMU
integration is fully based on the device-based API.

Thanks

Eric
>
> Jason, how about your opinion?
>
> Thanks
> Kevin
>
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Saturday, September 3, 2022 3:59 AM
>>
>> iommufd is the user API to control the IOMMU subsystem as it relates to
>> managing IO page tables that point at user space memory.
>>
>> It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
>> container) which is the VFIO specific interface for a similar idea.
>>
>> We see a broad need for extended features, some being highly IOMMU
>> device
>> specific:
>>  - Binding iommu_domain's to PASID/SSID
>>  - Userspace page tables, for ARM, x86 and S390
>>  - Kernel bypass'd invalidation of user page tables
>>  - Re-use of the KVM page table in the IOMMU
>>  - Dirty page tracking in the IOMMU
>>  - Runtime Increase/Decrease of IOPTE size
>>  - PRI support with faults resolved in userspace
>>
>> As well as a need to access these features beyond just VFIO, from VDPA for
>> instance. Other classes of accelerator HW are touching on these areas now
>> too.
>>
>> The pre-v1 series proposed re-using the VFIO type 1 data structure,
>> however it was suggested that if we are doing this big update then we
>> should also come with an improved data structure that solves the
>> limitations that VFIO type1 has. Notably this addresses:
>>
>>  - Multiple IOAS/'containers' and multiple domains inside a single FD
>>
>>  - Single-pin operation no matter how many domains and containers use
>>    a page
>>
>>  - A fine grained locking scheme supporting user managed concurrency for
>>    multi-threaded map/unmap
>>
>>  - A pre-registration mechanism to optimize vIOMMU use cases by
>>    pre-pinning pages
>>
>>  - Extended ioctl API that can manage these new objects and exposes
>>    domains directly to user space
>>
>>  - domains are sharable between subsystems, eg VFIO and VDPA
>>
>> The bulk of this code is a new data structure design to track how the
>> IOVAs are mapped to PFNs.
>>
>> iommufd intends to be general and consumable by any driver that wants to
>> DMA to userspace. From a driver perspective it can largely be dropped in
>> in-place of iommu_attach_device() and provides a uniform full feature set
>> to all consumers.
>>
>> As this is a larger project this series is the first step. This series
>> provides the iommfd "generic interface" which is designed to be suitable
>> for applications like DPDK and VMM flows that are not optimized to
>> specific HW scenarios. It is close to being a drop in replacement for the
>> existing VFIO type 1.
>>
>> Several follow-on series are being prepared:
>>
>> - Patches integrating with qemu in native mode:
>>   https://github.com/yiliu1765/qemu/commits/qemu-iommufd-6.0-rc2
>>
>> - A completed integration with VFIO now exists that covers "emulated" mdev
>>   use cases now, and can pass testing with qemu/etc in compatability mode:
>>   https://github.com/jgunthorpe/linux/commits/vfio_iommufd
>>
>> - A draft providing system iommu dirty tracking on top of iommufd,
>>   including iommu driver implementations:
>>   https://github.com/jpemartins/linux/commits/x86-iommufd
>>
>>   This pairs with patches for providing a similar API to support VFIO-device
>>   tracking to give a complete vfio solution:
>>   https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com/
>>
>> - Userspace page tables aka 'nested translation' for ARM and Intel iommu
>>   drivers:
>>   https://github.com/nicolinc/iommufd/commits/iommufd_nesting
>>
>> - "device centric" vfio series to expose the vfio_device FD directly as a
>>   normal cdev, and provide an extended API allowing dynamically changing
>>   the IOAS binding:
>>   https://github.com/yiliu1765/iommufd/commits/iommufd-v6.0-rc2-
>> nesting-0901
>>
>> - Drafts for PASID and PRI interfaces are included above as well
>>
>> Overall enough work is done now to show the merit of the new API design
>> and at least draft solutions to many of the main problems.
>>
>> Several people have contributed directly to this work: Eric Auger, Joao
>> Martins, Kevin Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have
>> participated in the discussions that lead here, and provided ideas. Thanks
>> to all!
>>
>> The v1 iommufd series has been used to guide a large amount of preparatory
>> work that has now been merged. The general theme is to organize things in
>> a way that makes injecting iommufd natural:
>>
>>  - VFIO live migration support with mlx5 and hisi_acc drivers.
>>    These series need a dirty tracking solution to be really usable.
>>    https://lore.kernel.org/kvm/20220224142024.147653-1-
>> yishaih@nvidia.com/
>>    https://lore.kernel.org/kvm/20220308184902.2242-1-
>> shameerali.kolothum.thodi@huawei.com/
>>
>>  - Significantly rework the VFIO gvt mdev and remove struct
>>    mdev_parent_ops
>>    https://lore.kernel.org/lkml/20220411141403.86980-1-hch@lst.de/
>>
>>  - Rework how PCIe no-snoop blocking works
>>    https://lore.kernel.org/kvm/0-v3-2cf356649677+a32-
>> intel_no_snoop_jgg@nvidia.com/
>>
>>  - Consolidate dma ownership into the iommu core code
>>    https://lore.kernel.org/linux-iommu/20220418005000.897664-1-
>> baolu.lu@linux.intel.com/
>>
>>  - Make all vfio driver interfaces use struct vfio_device consistently
>>    https://lore.kernel.org/kvm/0-v4-8045e76bf00b+13d-
>> vfio_mdev_no_group_jgg@nvidia.com/
>>
>>  - Remove the vfio_group from the kvm/vfio interface
>>    https://lore.kernel.org/kvm/0-v3-f7729924a7ea+25e33-
>> vfio_kvm_no_group_jgg@nvidia.com/
>>
>>  - Simplify locking in vfio
>>    https://lore.kernel.org/kvm/0-v2-d035a1842d81+1bf-
>> vfio_group_locking_jgg@nvidia.com/
>>
>>  - Remove the vfio notifiter scheme that faces drivers
>>    https://lore.kernel.org/kvm/0-v4-681e038e30fd+78-
>> vfio_unmap_notif_jgg@nvidia.com/
>>
>>  - Improve the driver facing API for vfio pin/unpin pages to make the
>>    presence of struct page clear
>>    https://lore.kernel.org/kvm/20220723020256.30081-1-
>> nicolinc@nvidia.com/
>>
>>  - Clean up in the Intel IOMMU driver
>>    https://lore.kernel.org/linux-iommu/20220301020159.633356-1-
>> baolu.lu@linux.intel.com/
>>    https://lore.kernel.org/linux-iommu/20220510023407.2759143-1-
>> baolu.lu@linux.intel.com/
>>    https://lore.kernel.org/linux-iommu/20220514014322.2927339-1-
>> baolu.lu@linux.intel.com/
>>    https://lore.kernel.org/linux-iommu/20220706025524.2904370-1-
>> baolu.lu@linux.intel.com/
>>    https://lore.kernel.org/linux-iommu/20220702015610.2849494-1-
>> baolu.lu@linux.intel.com/
>>
>>  - Rework s390 vfio drivers
>>    https://lore.kernel.org/kvm/20220707135737.720765-1-
>> farman@linux.ibm.com/
>>
>>  - Normalize vfio ioctl handling
>>    https://lore.kernel.org/kvm/0-v2-0f9e632d54fb+d6-
>> vfio_ioctl_split_jgg@nvidia.com/
>>
>> This is about 168 patches applied since March, thank you to everyone
>> involved in all this work!
>>
>> Currently there are a number of supporting series still in progress:
>>  - Simplify and consolidate iommu_domain/device compatability checking
>>    https://lore.kernel.org/linux-iommu/20220815181437.28127-1-
>> nicolinc@nvidia.com/
>>
>>  - Align iommu SVA support with the domain-centric model
>>    https://lore.kernel.org/linux-iommu/20220826121141.50743-1-
>> baolu.lu@linux.intel.com/
>>
>>  - VFIO API for dirty tracking (aka dma logging) managed inside a PCI
>>    device, with mlx5 implementation
>>    https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com
>>
>>  - Introduce a struct device sysfs presence for struct vfio_device
>>    https://lore.kernel.org/kvm/20220901143747.32858-1-
>> kevin.tian@intel.com/
>>
>>  - Complete restructuring the vfio mdev model
>>    https://lore.kernel.org/kvm/20220822062208.152745-1-hch@lst.de/
>>
>>  - DMABUF exporter support for VFIO to allow PCI P2P with VFIO
>>    https://lore.kernel.org/r/0-v2-472615b3877e+28f7-
>> vfio_dma_buf_jgg@nvidia.com
>>
>>  - Isolate VFIO container code in preperation for iommufd to provide an
>>    alternative implementation of it all
>>    https://lore.kernel.org/kvm/0-v1-a805b607f1fb+17b-
>> vfio_container_split_jgg@nvidia.com
>>
>>  - Start to provide iommu_domain ops for power
>>    https://lore.kernel.org/all/20220714081822.3717693-1-aik@ozlabs.ru/
>>
>> Right now there is no more preperatory work sketched out, so this is the
>> last of it.
>>
>> This series remains RFC as there are still several important FIXME's to
>> deal with first, but things are on track for non-RFC in the near future.
>>
>> This is on github: https://github.com/jgunthorpe/linux/commits/iommufd
>>
>> v2:
>>  - Rebase to v6.0-rc3
>>  - Improve comments
>>  - Change to an iterative destruction approach to avoid cycles
>>  - Near rewrite of the vfio facing implementation, supported by a complete
>>    implementation on the vfio side
>>  - New IOMMU_IOAS_ALLOW_IOVAS API as discussed. Allows userspace to
>>    assert that ranges of IOVA must always be mappable. To be used by a
>> VMM
>>    that has promised a guest a certain availability of IOVA. May help
>>    guide PPC's multi-window implementation.
>>  - Rework how unmap_iova works, user can unmap the whole ioas now
>>  - The no-snoop / wbinvd support is implemented
>>  - Bug fixes
>>  - Test suite improvements
>>  - Lots of smaller changes (the interdiff is 3k lines)
>> v1: https://lore.kernel.org/r/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com
>>
>> # S390 in-kernel page table walker
>> Cc: Niklas Schnelle <schnelle@linux.ibm.com>
>> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
>> # AMD Dirty page tracking
>> Cc: Joao Martins <joao.m.martins@oracle.com>
>> # ARM SMMU Dirty page tracking
>> Cc: Keqian Zhu <zhukeqian1@huawei.com>
>> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>> # ARM SMMU nesting
>> Cc: Eric Auger <eric.auger@redhat.com>
>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> # Map/unmap performance
>> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
>> # VDPA
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Jason Wang <jasowang@redhat.com>
>> # Power
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> # vfio
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Cornelia Huck <cohuck@redhat.com>
>> Cc: kvm@vger.kernel.org
>> # iommu
>> Cc: iommu@lists.linux.dev
>> # Collaborators
>> Cc: "Chaitanya Kulkarni" <chaitanyak@nvidia.com>
>> Cc: Nicolin Chen <nicolinc@nvidia.com>
>> Cc: Lu Baolu <baolu.lu@linux.intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Yi Liu <yi.l.liu@intel.com>
>> # s390
>> Cc: Eric Farman <farman@linux.ibm.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>
>> Jason Gunthorpe (12):
>>   interval-tree: Add a utility to iterate over spans in an interval tree
>>   iommufd: File descriptor, context, kconfig and makefiles
>>   kernel/user: Allow user::locked_vm to be usable for iommufd
>>   iommufd: PFN handling for iopt_pages
>>   iommufd: Algorithms for PFN storage
>>   iommufd: Data structure to provide IOVA to PFN mapping
>>   iommufd: IOCTLs for the io_pagetable
>>   iommufd: Add a HW pagetable object
>>   iommufd: Add kAPI toward external drivers for physical devices
>>   iommufd: Add kAPI toward external drivers for kernel access
>>   iommufd: vfio container FD ioctl compatibility
>>   iommufd: Add a selftest
>>
>> Kevin Tian (1):
>>   iommufd: Overview documentation
>>
>>  .clang-format                                 |    1 +
>>  Documentation/userspace-api/index.rst         |    1 +
>>  .../userspace-api/ioctl/ioctl-number.rst      |    1 +
>>  Documentation/userspace-api/iommufd.rst       |  224 +++
>>  MAINTAINERS                                   |   10 +
>>  drivers/iommu/Kconfig                         |    1 +
>>  drivers/iommu/Makefile                        |    2 +-
>>  drivers/iommu/iommufd/Kconfig                 |   22 +
>>  drivers/iommu/iommufd/Makefile                |   13 +
>>  drivers/iommu/iommufd/device.c                |  580 +++++++
>>  drivers/iommu/iommufd/hw_pagetable.c          |   68 +
>>  drivers/iommu/iommufd/io_pagetable.c          |  984 ++++++++++++
>>  drivers/iommu/iommufd/io_pagetable.h          |  186 +++
>>  drivers/iommu/iommufd/ioas.c                  |  338 ++++
>>  drivers/iommu/iommufd/iommufd_private.h       |  266 ++++
>>  drivers/iommu/iommufd/iommufd_test.h          |   74 +
>>  drivers/iommu/iommufd/main.c                  |  392 +++++
>>  drivers/iommu/iommufd/pages.c                 | 1301 +++++++++++++++
>>  drivers/iommu/iommufd/selftest.c              |  626 ++++++++
>>  drivers/iommu/iommufd/vfio_compat.c           |  423 +++++
>>  include/linux/interval_tree.h                 |   47 +
>>  include/linux/iommufd.h                       |  101 ++
>>  include/linux/sched/user.h                    |    2 +-
>>  include/uapi/linux/iommufd.h                  |  279 ++++
>>  kernel/user.c                                 |    1 +
>>  lib/interval_tree.c                           |   98 ++
>>  tools/testing/selftests/Makefile              |    1 +
>>  tools/testing/selftests/iommu/.gitignore      |    2 +
>>  tools/testing/selftests/iommu/Makefile        |   11 +
>>  tools/testing/selftests/iommu/config          |    2 +
>>  tools/testing/selftests/iommu/iommufd.c       | 1396 +++++++++++++++++
>>  31 files changed, 7451 insertions(+), 2 deletions(-)
>>  create mode 100644 Documentation/userspace-api/iommufd.rst
>>  create mode 100644 drivers/iommu/iommufd/Kconfig
>>  create mode 100644 drivers/iommu/iommufd/Makefile
>>  create mode 100644 drivers/iommu/iommufd/device.c
>>  create mode 100644 drivers/iommu/iommufd/hw_pagetable.c
>>  create mode 100644 drivers/iommu/iommufd/io_pagetable.c
>>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>>  create mode 100644 drivers/iommu/iommufd/ioas.c
>>  create mode 100644 drivers/iommu/iommufd/iommufd_private.h
>>  create mode 100644 drivers/iommu/iommufd/iommufd_test.h
>>  create mode 100644 drivers/iommu/iommufd/main.c
>>  create mode 100644 drivers/iommu/iommufd/pages.c
>>  create mode 100644 drivers/iommu/iommufd/selftest.c
>>  create mode 100644 drivers/iommu/iommufd/vfio_compat.c
>>  create mode 100644 include/linux/iommufd.h
>>  create mode 100644 include/uapi/linux/iommufd.h
>>  create mode 100644 tools/testing/selftests/iommu/.gitignore
>>  create mode 100644 tools/testing/selftests/iommu/Makefile
>>  create mode 100644 tools/testing/selftests/iommu/config
>>  create mode 100644 tools/testing/selftests/iommu/iommufd.c
>>
>>
>> base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
>> --
>> 2.37.3


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-13  7:28   ` Eric Auger
@ 2022-09-20 19:56     ` Jason Gunthorpe
  2022-09-21  3:48       ` Tian, Kevin
  2022-09-21 18:06       ` Alex Williamson
  0 siblings, 2 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-20 19:56 UTC (permalink / raw)
  To: Eric Auger
  Cc: Tian, Kevin, Alex Williamson, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
> Hi,
> 
> On 9/13/22 03:55, Tian, Kevin wrote:
> > We didn't close the open of how to get this merged in LPC due to the
> > audio issue. Then let's use mails.
> >
> > Overall there are three options on the table:
> >
> > 1) Require vfio-compat to be 100% compatible with vfio-type1
> >
> >    Probably not a good choice given the amount of work to fix the remaining
> >    gaps. And this will block support of new IOMMU features for a longer time.
> >
> > 2) Leave vfio-compat as what it is in this series
> >
> >    Treat it as a vehicle to validate the iommufd logic instead of immediately
> >    replacing vfio-type1. Functionally most vfio applications can work w/o
> >    change if putting aside the difference on locked mm accounting, p2p, etc.
> >
> >    Then work on new features and 100% vfio-type1 compat. in parallel.
> >
> > 3) Focus on iommufd native uAPI first
> >
> >    Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
> >
> >    Then work on new features and vfio-compat in parallel.
> >
> > I'm fine with either 2) or 3). Per a quick chat with Alex he prefers to 3).
> 
> I am also inclined to pursue 3) as this was the initial Jason's guidance
> and pre-requisite to integrate new features. In the past we concluded
> vfio-compat would mostly be used for testing purpose. Our QEMU
> integration fully is based on device based API.

There are some poor chicken and egg problems here.

I had some assumptions:
 a - the vfio cdev model is going to be iommufd only
 b - any uAPI we add as we go along should be generally useful going
     forward
 c - we should try to minimize the 'minimally viable iommufd' series

The compat as it stands now (eg #2) is threading this needle. Since it
can exist without cdev it means (c) is made smaller, to two series.

Since we add something useful to some use cases, eg DPDK is deployable
that way, (b) is OK.

If we focus on a strict path with 3, and avoid adding non-useful code,
then we have to have two more (unwritten!) series beyond where we are
now - vfio group compartmentalization and cdev integration - and the
initial (c) will increase.

3 also has us merging something that currently has no usable
userspace, which I also dislike a lot.

I still think the compat gaps are small. I've realized that
VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
can deadlock the kernel I propose we purge it completely.

P2P is ongoing.

That really just leaves the accounting, and I'm still not convinced that
this must be a critical thing. Linus's latest remarks on tracepoints/BPF
as ABI, reported in LWN from the maintainers summit, seem to support
this. Let's see an actual deployed production configuration that would
be impacted, and we won't find that unless we move forward.

So, I still like 2 because it yields the smallest next step before we
can bring all the parallel work onto the list, and it makes testing
and converting non-qemu stuff easier even going forward.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-13  2:05 ` Tian, Kevin
@ 2022-09-20 20:07   ` Jason Gunthorpe
  2022-09-21  3:40     ` Tian, Kevin
  2022-09-26 13:48     ` Rodel, Jorg
  0 siblings, 2 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-20 20:07 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Rodel, Jorg, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, David Gibson, Eric Auger,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

On Tue, Sep 13, 2022 at 02:05:13AM +0000, Tian, Kevin wrote:
> A side open is about the maintenance model of iommufd.
> 
> This series proposes to put its files under drivers/iommu/, while the
> logic is relatively self-contained compared to other files in that directory.
> 
> Joerg, do you plan to do same level of review on this series as you did
> for other iommu patches or prefer to a lighter model with trust on the
> existing reviewers in this area (mostly VFIO folks, moving forward also
> include vdpa, uacces, etc.)?

From my view, I don't get the sense that Joerg is interested in
maintaining this, so I was expecting to have to PR this to Linus on
its own (with the VFIO bits) and a new group would carry it through
the initial phases.

However, I'm completely dead set against repeating past mistakes of
merging half-finished code through one tree expecting some other tree
will finish the work.

This means new features like, say, dirty tracking, will need to come
in one unit with: the iommufd uAPI, any new iommu_domain ops/api, at
least one driver implementation and a functional selftest.

Which means we will need to put in some work to avoid/manage
conflicts inside the iommu drivers.

Regards,
Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-20 20:07   ` Jason Gunthorpe
@ 2022-09-21  3:40     ` Tian, Kevin
  2022-09-21 16:19       ` Jason Gunthorpe
  2022-09-26 13:48     ` Rodel, Jorg
  1 sibling, 1 reply; 78+ messages in thread
From: Tian, Kevin @ 2022-09-21  3:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Rodel, Jorg, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, David Gibson, Eric Auger,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 21, 2022 4:08 AM
> 
> On Tue, Sep 13, 2022 at 02:05:13AM +0000, Tian, Kevin wrote:
> > A side open is about the maintenance model of iommufd.
> >
> > This series proposes to put its files under drivers/iommu/, while the
> > logic is relatively self-contained compared to other files in that directory.
> >
> > Joerg, do you plan to do same level of review on this series as you did
> > for other iommu patches or prefer to a lighter model with trust on the
> > existing reviewers in this area (mostly VFIO folks, moving forward also
> > include vdpa, uacces, etc.)?
> 
> From my view, I don't get the sense the Joerg is interested in
> maintaining this, so I was expecting to have to PR this to Linus on
> its own (with the VFIO bits) and a new group would carry it through
> the initial phases.

I'm fine with this model if it also matches Joerg's thought.

Then we need to add an "X: drivers/iommu/iommufd" line under IOMMU
SUBSYSTEM in the MAINTAINERS file.

> 
> However, I'm completely dead set against repeating past mistakes of
> merging half-finished code through one tree expecting some other tree
> will finish the work.
> 
> This means new features like, say, dirty tracking, will need to come
> in one unit with: the iommufd uAPI, any new iommu_domain ops/api, at
> least one driver implementation and a functional selftest.
> 
> Which means we will need to put in some work to avoid/manage
> conflicts inside the iommu drivers.
> 

Completely agree.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-20 19:56     ` Jason Gunthorpe
@ 2022-09-21  3:48       ` Tian, Kevin
  2022-09-21 18:06       ` Alex Williamson
  1 sibling, 0 replies; 78+ messages in thread
From: Tian, Kevin @ 2022-09-21  3:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Eric Auger
  Cc: Alex Williamson, Rodel, Jorg, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, David Gibson, Eric Farman, iommu,
	Jason Wang, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 21, 2022 3:57 AM
> 
> On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
> > Hi,
> >
> > On 9/13/22 03:55, Tian, Kevin wrote:
> > > We didn't close the open of how to get this merged in LPC due to the
> > > audio issue. Then let's use mails.
> > >
> > > Overall there are three options on the table:
> > >
> > > 1) Require vfio-compat to be 100% compatible with vfio-type1
> > >
> > >    Probably not a good choice given the amount of work to fix the
> remaining
> > >    gaps. And this will block support of new IOMMU features for a longer
> time.
> > >
> > > 2) Leave vfio-compat as what it is in this series
> > >
> > >    Treat it as a vehicle to validate the iommufd logic instead of
> immediately
> > >    replacing vfio-type1. Functionally most vfio applications can work w/o
> > >    change if putting aside the difference on locked mm accounting, p2p,
> etc.
> > >
> > >    Then work on new features and 100% vfio-type1 compat. in parallel.
> > >
> > > 3) Focus on iommufd native uAPI first
> > >
> > >    Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
> > >
> > >    Then work on new features and vfio-compat in parallel.
> > >
> > > I'm fine with either 2) or 3). Per a quick chat with Alex he prefers to 3).
> >
> > I am also inclined to pursue 3) as this was the initial Jason's guidance
> > and pre-requisite to integrate new features. In the past we concluded
> > vfio-compat would mostly be used for testing purpose. Our QEMU
> > integration fully is based on device based API.
> 
> There are some poor chicken and egg problems here.
> 
> I had some assumptions:
>  a - the vfio cdev model is going to be iommufd only
>  b - any uAPI we add as we go along should be generally useful going
>      forward
>  c - we should try to minimize the 'minimally viable iommufd' series
> 
> The compat as it stands now (eg #2) is threading this needle. Since it
> can exist without cdev it means (c) is made smaller, to two series.
> 
> Since we add something useful to some use cases, eg DPDK is deployable
> that way, (b) is OK.
> 
> If we focus on a strict path with 3, and avoid adding non-useful code,
> then we have to have two more (unwritten!) series beyond where we are
> now - vfio group compartmentalization, and cdev integration, and the
> initial (c) will increase.

We are working on splitting vfio group now. cdev integration was there
but needs update based on the former part.

Once ready we'll send out in case people want to see the actual
material impact for #3.

> 
> 3 also has us merging something that currently has no usable
> userspace, which I also do dislike alot.
> 
> I still think the compat gaps are small. I've realized that
> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and
> since it
> can deadlock the kernel I propose we purge it completely.
> 
> P2P is ongoing.
> 
> That really just leaves the accounting, and I'm still not convinced at
> this must be a critical thing. Linus's latest remarks reported in lwn
> at the maintainer summit on tracepoints/BPF as ABI seem to support
> this. Let's see an actual deployed production configuration that would
> be impacted, and we won't find that unless we move forward.
> 
> So, I still like 2 because it yields the smallest next step before we
> can bring all the parallel work onto the list, and it makes testing
> and converting non-qemu stuff easier even going forward.
> 
> Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21  3:40     ` Tian, Kevin
@ 2022-09-21 16:19       ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 16:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Rodel, Jorg, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, David Gibson, Eric Auger,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

On Wed, Sep 21, 2022 at 03:40:44AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 21, 2022 4:08 AM
> > 
> > On Tue, Sep 13, 2022 at 02:05:13AM +0000, Tian, Kevin wrote:
> > > A side open is about the maintenance model of iommufd.
> > >
> > > This series proposes to put its files under drivers/iommu/, while the
> > > logic is relatively self-contained compared to other files in that directory.
> > >
> > > Joerg, do you plan to do same level of review on this series as you did
> > > for other iommu patches or prefer to a lighter model with trust on the
> > > existing reviewers in this area (mostly VFIO folks, moving forward also
> > > include vdpa, uacces, etc.)?
> > 
> > From my view, I don't get the sense the Joerg is interested in
> > maintaining this, so I was expecting to have to PR this to Linus on
> > its own (with the VFIO bits) and a new group would carry it through
> > the initial phases.
> 
> I'm fine with this model if it also matches Joerg's thought.
> 
> Then we need add a "X: drivers/iommu/iommufd" line under IOMMU
> SUBSYSTEM in the MAINTAINERS file.

The maintainers file is fine to have a new stanza:

+IOMMU FD
+M:     Jason Gunthorpe <jgg@nvidia.com>
+M:     Kevin Tian <kevin.tian@intel.com>
+L:     iommu@lists.linux.dev
+S:     Maintained
+F:     Documentation/userspace-api/iommufd.rst
+F:     drivers/iommu/iommufd/
+F:     include/uapi/linux/iommufd.h
+F:     include/linux/iommufd.h

It says who to send patches to for review..

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-20 19:56     ` Jason Gunthorpe
  2022-09-21  3:48       ` Tian, Kevin
@ 2022-09-21 18:06       ` Alex Williamson
  2022-09-21 18:44         ` Jason Gunthorpe
                           ` (2 more replies)
  1 sibling, 3 replies; 78+ messages in thread
From: Alex Williamson @ 2022-09-21 18:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Daniel P. Berrangé,
	Laine Stump

[Cc+ Steve, libvirt, Daniel, Laine]

On Tue, 20 Sep 2022 16:56:42 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
> > Hi,
> > 
> > On 9/13/22 03:55, Tian, Kevin wrote:  
> > > We didn't close the open of how to get this merged in LPC due to the
> > > audio issue. Then let's use mails.
> > >
> > > Overall there are three options on the table:
> > >
> > > 1) Require vfio-compat to be 100% compatible with vfio-type1
> > >
> > >    Probably not a good choice given the amount of work to fix the remaining
> > >    gaps. And this will block support of new IOMMU features for a longer time.
> > >
> > > 2) Leave vfio-compat as what it is in this series
> > >
> > >    Treat it as a vehicle to validate the iommufd logic instead of immediately
> > >    replacing vfio-type1. Functionally most vfio applications can work w/o
> > >    change if putting aside the difference on locked mm accounting, p2p, etc.
> > >
> > >    Then work on new features and 100% vfio-type1 compat. in parallel.
> > >
> > > 3) Focus on iommufd native uAPI first
> > >
> > >    Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
> > >
> > >    Then work on new features and vfio-compat in parallel.
> > >
> > > I'm fine with either 2) or 3). Per a quick chat with Alex he prefers to 3).  
> > 
> > I am also inclined to pursue 3) as this was the initial Jason's guidance
> > and pre-requisite to integrate new features. In the past we concluded
> > vfio-compat would mostly be used for testing purpose. Our QEMU
> > integration fully is based on device based API.  
> 
> There are some poor chicken and egg problems here.
> 
> I had some assumptions:
>  a - the vfio cdev model is going to be iommufd only
>  b - any uAPI we add as we go along should be generally useful going
>      forward
>  c - we should try to minimize the 'minimally viable iommufd' series
> 
> The compat as it stands now (eg #2) is threading this needle. Since it
> can exist without cdev it means (c) is made smaller, to two series.
> 
> Since we add something useful to some use cases, eg DPDK is deployable
> that way, (b) is OK.
> 
> If we focus on a strict path with 3, and avoid adding non-useful code,
> then we have to have two more (unwritten!) series beyond where we are
> now - vfio group compartmentalization, and cdev integration, and the
> initial (c) will increase.
> 
> 3 also has us merging something that currently has no usable
> userspace, which I also do dislike alot.
> 
> I still think the compat gaps are small. I've realized that
> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
> can deadlock the kernel I propose we purge it completely.

Steve won't be happy to hear that, QEMU support exists but isn't yet
merged.
 
> P2P is ongoing.
> 
> That really just leaves the accounting, and I'm still not convinced at
> this must be a critical thing. Linus's latest remarks reported in lwn
> at the maintainer summit on tracepoints/BPF as ABI seem to support
> this. Let's see an actual deployed production configuration that would
> be impacted, and we won't find that unless we move forward.

I'll try to summarize the proposed change so that we can get better
advice from libvirt folks, or potentially anyone else managing locked
memory limits for device assignment VMs.

Background: when a DMA range, ex. guest RAM, is mapped to a vfio device,
we use the system IOMMU to provide GPA to HPA translation for assigned
devices. Unlike CPU page tables, we don't generally have a means to
demand fault these translations, therefore the memory target of the
translation is pinned so that it cannot be swapped or
relocated, ie. to guarantee the translation is always valid.

The issue is where we account these pinned pages, where accounting is
necessary such that a user cannot lock an arbitrary number of pages
into RAM to generate a DoS attack.  Duplicate accounting should be
resolved by iommufd, but is outside the scope of this discussion.

Currently, vfio tests against the mm_struct.locked_vm relative to
rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur,
where task is the current process.  This is the same limit set via the
setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.

Note that in both cases above we're dealing with a task (process)
limit, and both the prlimit and ulimit man pages describe them as such.

iommufd supposes instead, and references existing kernel
implementations, that despite the descriptions above these limits are
actually meant to be user limits and therefore instead charges pinned
pages against user_struct.locked_vm and also marks them in
mm_struct.pinned_vm.

The proposed algorithm is to read the _task_ locked memory limit, then
attempt to charge the _user_ locked_vm, such that user_struct.locked_vm
cannot exceed the task locked memory limit.
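
Roughly, the contrast looks like this (a sketch only, not the actual
vfio or iommufd code; helper names are made up and locking is elided):

/* today: charge the current task's mm against its own limit */
static int vfio_style_charge(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (mm->locked_vm + npages > limit && !capable(CAP_IPC_LOCK))
		return -ENOMEM;
	mm->locked_vm += npages;
	return 0;
}

/* proposed: read the task limit, but charge the user */
static int iommufd_style_charge(struct user_struct *user,
				struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (atomic_long_add_return(npages, &user->locked_vm) > limit &&
	    !capable(CAP_IPC_LOCK)) {
		atomic_long_sub(npages, &user->locked_vm);
		return -ENOMEM;
	}
	atomic64_add(npages, &mm->pinned_vm);
	return 0;
}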

This obviously has implications.  AFAICT, any management tool that
doesn't instantiate assigned device VMs under separate users are
essentially untenable.  For example, if we launch VM1 under userA and
set a locked memory limit of 4GB via prlimit to account for an assigned
device, that works fine, until we launch VM2 from userA as well.  In
that case we can't simply set a 4GB limit on the VM2 task because
there's already 4GB charged against user_struct.locked_vm for VM1.  So
we'd need to set the VM2 task limit to 8GB to be able to launch VM2.
But not only that, we'd need to go back and also set VM1's task limit
to 8GB or else it will fail if a DMA mapped memory region is transient
and needs to be re-mapped.

Effectively any task under the same user and requiring pinned memory
needs to have a locked memory limit set, and updated, to account for
all tasks using pinned memory by that user.

How does this affect known current use cases of locked memory
management for assigned device VMs?

Does qemu:///system by default sandbox into per-VM uids, or do they all
use the qemu user by default?  I imagine qemu:///session mode is pretty
screwed by this, but I also don't know who/where locked limits are
lifted for such VMs.  Boxes, which I think now supports assigned device
VMs, could also be affected.
 
> So, I still like 2 because it yields the smallest next step before we
> can bring all the parallel work onto the list, and it makes testing
> and converting non-qemu stuff easier even going forward.

If a vfio compatible interface isn't transparently compatible, then I
have a hard time understanding its value.  Please correct my above
description and implications, but I suspect these are not just
theoretical ABI compat issues.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 18:06       ` Alex Williamson
@ 2022-09-21 18:44         ` Jason Gunthorpe
  2022-09-21 19:30           ` Steven Sistare
                             ` (2 more replies)
  2022-09-21 22:36         ` Laine Stump
  2022-09-22 11:06         ` Daniel P. Berrangé
  2 siblings, 3 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 18:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:

> > I still think the compat gaps are small. I've realized that
> > VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
> > can deadlock the kernel I propose we purge it completely.
> 
> Steve won't be happy to hear that, QEMU support exists but isn't yet
> merged.

If Steve wants to keep it then someone needs to fix the deadlock in
the vfio implementation before any userspace starts to appear. 

I can fix the deadlock in iommufd in a terribly expensive way, but
would rather we design a better interface if nobody is using it yet. I
advocate for passing the memfd to the kernel and use that as the page
provider, not a mm_struct.

> The issue is where we account these pinned pages, where accounting is
> necessary such that a user cannot lock an arbitrary number of pages
> into RAM to generate a DoS attack.  

It is worth pointing out that preventing a DOS attack doesn't actually
work because a *task* limit is trivially bypassed by just spawning
more tasks. So, as a security feature, this is already very
questionable.

What we've done here is make the security feature work to actually
prevent DOS attacks, which then gives you this problem:

> This obviously has implications.  AFAICT, any management tool that
> doesn't instantiate assigned device VMs under separate users are
> essentially untenable.

Because now that the security feature works properly it detects the
DOS created by spawning multiple tasks :(

Somehow I was under the impression there was no user sharing in the
common cases, but I guess I don't know that for sure.

> > So, I still like 2 because it yields the smallest next step before we
> > can bring all the parallel work onto the list, and it makes testing
> > and converting non-qemu stuff easier even going forward.
> 
> If a vfio compatible interface isn't transparently compatible, then I
> have a hard time understanding its value.  Please correct my above
> description and implications, but I suspect these are not just
> theoretical ABI compat issues.  Thanks,

Because it is just fine for everything that doesn't use the ulimit
feature, which is still a lot of use cases!

Remember, at this point we are not replacing /dev/vfio/vfio, this is
just providing the general compat in a form that has to be opted
into. I think if you open the /dev/iommu device node then you should
get secured accounting.

If /dev/vfio/vfio is provided by iommufd it may well have to trigger a
different ulimit tracking - if that is the only sticking point it
seems minor and should be addressed in some later series that adds
/dev/vfio/vfio support to iommufd..

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 18:44         ` Jason Gunthorpe
@ 2022-09-21 19:30           ` Steven Sistare
  2022-09-21 23:09             ` Jason Gunthorpe
  2022-09-21 23:20           ` Jason Gunthorpe
  2022-09-22 11:20           ` Daniel P. Berrangé
  2 siblings, 1 reply; 78+ messages in thread
From: Steven Sistare @ 2022-09-21 19:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On 9/21/2022 2:44 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> 
>>> I still think the compat gaps are small. I've realized that
>>> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
>>> can deadlock the kernel I propose we purge it completely.
>>
>> Steve won't be happy to hear that, QEMU support exists but isn't yet
>> merged.

"unhappy" barely scratches the surface of my feelings!

Live update is a great feature that solves a real problem, and lots of 
people have spent lots of time providing thorough feedback on the qemu
patches I have submitted.  We *will* cross the finish line.  In the
mean time, I maintain a patched qemu for use in my company, and I have
heard from others who do the same.

> If Steve wants to keep it then someone needs to fix the deadlock in
> the vfio implementation before any userspace starts to appear. 

The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
across exec, which can result in mm->locked_vm becoming negative. I have several 
fixes, but none result in limits being reached at exactly the same time as before --
the same general issue being discussed for iommufd.  I am still thinking about it.

I am not aware of a deadlock problem.  Please elaborate or point me to an
email thread.

> I can fix the deadlock in iommufd in a terrible expensive way, but
> would rather we design a better interface if nobody is using it yet. I
> advocate for passing the memfd to the kernel and use that as the page
> provider, not a mm_struct.

memfd support alone is not sufficient.  Live update also supports guest ram
backed by named shared memory.

- Steve

>> The issue is where we account these pinned pages, where accounting is
>> necessary such that a user cannot lock an arbitrary number of pages
>> into RAM to generate a DoS attack.  
> 
> It is worth pointing out that preventing a DOS attack doesn't actually
> work because a *task* limit is trivially bypassed by just spawning
> more tasks. So, as a security feature, this is already very
> questionable.
> 
> What we've done here is make the security feature work to actually
> prevent DOS attacks, which then gives you this problem:
> 
>> This obviously has implications.  AFAICT, any management tool that
>> doesn't instantiate assigned device VMs under separate users are
>> essentially untenable.
> 
> Because now that the security feature works properly it detects the
> DOS created by spawning multiple tasks :(
> 
> Somehow I was under the impression there was not user sharing in the
> common cases, but I guess I don't know that for sure.
> 
>>> So, I still like 2 because it yields the smallest next step before we
>>> can bring all the parallel work onto the list, and it makes testing
>>> and converting non-qemu stuff easier even going forward.
>>
>> If a vfio compatible interface isn't transparently compatible, then I
>> have a hard time understanding its value.  Please correct my above
>> description and implications, but I suspect these are not just
>> theoretical ABI compat issues.  Thanks,
> 
> Because it is just fine for everything that doesn't use the ulimit
> feature, which is still a lot of use cases!
> 
> Remember, at this point we are not replacing /dev/vfio/vfio, this is
> just providing the general compat in a form that has to be opted
> into. I think if you open the /dev/iommu device node then you should
> get secured accounting.
> 
> If /dev/vfio/vfio is provided by iommufd it may well have to trigger a
> different ulimit tracking - if that is the only sticking point it
> seems minor and should be addressed in some later series that adds
> /dev/vfio/vfio support to iommufd..
> 
> Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 18:06       ` Alex Williamson
  2022-09-21 18:44         ` Jason Gunthorpe
@ 2022-09-21 22:36         ` Laine Stump
  2022-09-22 11:06         ` Daniel P. Berrangé
  2 siblings, 0 replies; 78+ messages in thread
From: Laine Stump @ 2022-09-21 22:36 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Daniel P. Berrangé

On 9/21/22 2:06 PM, Alex Williamson wrote:
> [Cc+ Steve, libvirt, Daniel, Laine]
> 
> On Tue, 20 Sep 2022 16:56:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
>>> Hi,
>>>
>>> On 9/13/22 03:55, Tian, Kevin wrote:
>>>> We didn't close the open of how to get this merged in LPC due to the
>>>> audio issue. Then let's use mails.
>>>>
>>>> Overall there are three options on the table:
>>>>
>>>> 1) Require vfio-compat to be 100% compatible with vfio-type1
>>>>
>>>>     Probably not a good choice given the amount of work to fix the remaining
>>>>     gaps. And this will block support of new IOMMU features for a longer time.
>>>>
>>>> 2) Leave vfio-compat as what it is in this series
>>>>
>>>>     Treat it as a vehicle to validate the iommufd logic instead of immediately
>>>>     replacing vfio-type1. Functionally most vfio applications can work w/o
>>>>     change if putting aside the difference on locked mm accounting, p2p, etc.
>>>>
>>>>     Then work on new features and 100% vfio-type1 compat. in parallel.
>>>>
>>>> 3) Focus on iommufd native uAPI first
>>>>
>>>>     Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
>>>>
>>>>     Then work on new features and vfio-compat in parallel.
>>>>
>>>> I'm fine with either 2) or 3). Per a quick chat with Alex he prefers to 3).
>>>
>>> I am also inclined to pursue 3) as this was the initial Jason's guidance
>>> and pre-requisite to integrate new features. In the past we concluded
>>> vfio-compat would mostly be used for testing purpose. Our QEMU
>>> integration fully is based on device based API.
>>
>> There are some poor chicken and egg problems here.
>>
>> I had some assumptions:
>>   a - the vfio cdev model is going to be iommufd only
>>   b - any uAPI we add as we go along should be generally useful going
>>       forward
>>   c - we should try to minimize the 'minimally viable iommufd' series
>>
>> The compat as it stands now (eg #2) is threading this needle. Since it
>> can exist without cdev it means (c) is made smaller, to two series.
>>
>> Since we add something useful to some use cases, eg DPDK is deployable
>> that way, (b) is OK.
>>
>> If we focus on a strict path with 3, and avoid adding non-useful code,
>> then we have to have two more (unwritten!) series beyond where we are
>> now - vfio group compartmentalization, and cdev integration, and the
>> initial (c) will increase.
>>
>> 3 also has us merging something that currently has no usable
>> userspace, which I also do dislike alot.
>>
>> I still think the compat gaps are small. I've realized that
>> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
>> can deadlock the kernel I propose we purge it completely.
> 
> Steve won't be happy to hear that, QEMU support exists but isn't yet
> merged.
>   
>> P2P is ongoing.
>>
>> That really just leaves the accounting, and I'm still not convinced at
>> this must be a critical thing. Linus's latest remarks reported in lwn
>> at the maintainer summit on tracepoints/BPF as ABI seem to support
>> this. Let's see an actual deployed production configuration that would
>> be impacted, and we won't find that unless we move forward.
> 
> I'll try to summarize the proposed change so that we can get better
> advice from libvirt folks, or potentially anyone else managing locked
> memory limits for device assignment VMs.
> 
> Background: when a DMA range, ex. guest RAM, is mapped to a vfio device,
> we use the system IOMMU to provide GPA to HPA translation for assigned
> devices. Unlike CPU page tables, we don't generally have a means to
> demand fault these translations, therefore the memory target of the
> translation is pinned to prevent that it cannot be swapped or
> relocated, ie. to guarantee the translation is always valid.
> 
> The issue is where we account these pinned pages, where accounting is
> necessary such that a user cannot lock an arbitrary number of pages
> into RAM to generate a DoS attack.  Duplicate accounting should be
> resolved by iommufd, but is outside the scope of this discussion.
> 
> Currently, vfio tests against the mm_struct.locked_vm relative to
> rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur,
> where task is the current process.  This is the same limit set via the
> setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.
> 
> Note that in both cases above, we're dealing with a task, or process
> limit and both prlimit and ulimit man pages describe them as such.
> 
> iommufd supposes instead, and references existing kernel
> implementations, that despite the descriptions above these limits are
> actually meant to be user limits and therefore instead charges pinned
> pages against user_struct.locked_vm and also marks them in
> mm_struct.pinned_vm.
> 
> The proposed algorithm is to read the _task_ locked memory limit, then
> attempt to charge the _user_ locked_vm, such that user_struct.locked_vm
> cannot exceed the task locked memory limit.
> 
> This obviously has implications.  AFAICT, any management tool that
> doesn't instantiate assigned device VMs under separate users are
> essentially untenable.  For example, if we launch VM1 under userA and
> set a locked memory limit of 4GB via prlimit to account for an assigned
> device, that works fine, until we launch VM2 from userA as well.  In
> that case we can't simply set a 4GB limit on the VM2 task because
> there's already 4GB charged against user_struct.locked_vm for VM1.  So
> we'd need to set the VM2 task limit to 8GB to be able to launch VM2.
> But not only that, we'd need to go back and also set VM1's task limit
> to 8GB or else it will fail if a DMA mapped memory region is transient
> and needs to be re-mapped.
> 
> Effectively any task under the same user and requiring pinned memory
> needs to have a locked memory limit set, and updated, to account for
> all tasks using pinned memory by that user.
> 
> How does this affect known current use cases of locked memory
> management for assigned device VMs?
> 
> Does qemu://system by default sandbox into per VM uids or do they all
> use the qemu user by default.

Unless it is told otherwise in the XML for the VMs, each qemu process 
uses the same uid (which is usually "qemu", but can be changed in 
systemwide config).

>  I imagine qemu://session mode is pretty
> screwed by this, but I also don't know who/where locked limits are
> lifted for such VMs.  Boxes, who I think now supports assigned device
> VMs, could also be affected.

Because qemu:///session runs an unprivileged libvirt (i.e. unable to
raise the limits), Boxes sets the limits elsewhere beforehand (not sure
where, as I'm not familiar with the Boxes source).

>   
>> So, I still like 2 because it yields the smallest next step before we
>> can bring all the parallel work onto the list, and it makes testing
>> and converting non-qemu stuff easier even going forward.
> 
> If a vfio compatible interface isn't transparently compatible, then I
> have a hard time understanding its value.  Please correct my above
> description and implications, but I suspect these are not just
> theoretical ABI compat issues.  Thanks,
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 19:30           ` Steven Sistare
@ 2022-09-21 23:09             ` Jason Gunthorpe
  2022-10-06 16:01               ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 23:09 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:

> > If Steve wants to keep it then someone needs to fix the deadlock in
> > the vfio implementation before any userspace starts to appear. 
> 
> The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
> across exec, which can result in mm->locked_vm becoming negative. I have several 
> fixes, but none result in limits being reached at exactly the same time as before --
> the same general issue being discussed for iommufd.  I am still thinking about it.

Oh, yeah, I noticed this was all busted up too.

> I am not aware of a deadlock problem.  Please elaborate or point me to an
> email thread.

VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
userspace can trigger the lock to be taken and then returns to
userspace with the lock held.

Any scenario where a kernel thread hits that open-coded lock and then
userspace does-the-wrong-thing will deadlock the kernel.

For instance, consider an mdev driver. We assert
VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
and becomes blocked on the now locked lock. Userspace then tries to
close the device FD.

FD closure will trigger device close and the VFIO core code
requirement is that mdev driver device teardown must halt all
concurrent threads touching vfio_device. Thus the mdev will try to
fence its workqueue and then deadlock - unable to flush/cancel a work
that is currently blocked on a lock held by userspace that will never
be unlocked.
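
In code, the shape of the problem is roughly this (illustrative only,
not the actual vfio or mdev implementation; both structures are
invented for the example):

struct dma_range {
	bool vaddr_invalid;		/* the open-coded "lock" */
	wait_queue_head_t wq;
};

struct mdev_state {
	struct dma_range *dma;
	struct work_struct work;
};

/* The unmap ioctl asserts the flag and returns to userspace */
static void unmap_vaddr(struct dma_range *dma)
{
	dma->vaddr_invalid = true;	/* kernel waiters now block */
}

/* An mdev work item doing DMA blocks until userspace clears it */
static void mdev_dma_work(struct work_struct *work)
{
	struct mdev_state *mdev = container_of(work, struct mdev_state, work);

	wait_event(mdev->dma->wq, !mdev->dma->vaddr_invalid);
	/* ... perform the DMA ... */
}

/* Device close must fence all concurrent users of the vfio_device */
static void mdev_close_device(struct mdev_state *mdev)
{
	/*
	 * Deadlock: the work only completes after userspace issues
	 * another ioctl to restore the vaddr, but userspace is sitting
	 * in close() waiting for this flush.
	 */
	flush_work(&mdev->work);
}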

This is just the first scenario that comes to mind. The approach to
give userspace control of a lock that kernel threads can become
blocked on is so completely sketchy it is a complete no-go in my
opinion. If I had seen it when it was posted I would have hard NAK'd
it.

My "full" solution in mind for iommufd is to pin all the memory upon
VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue to satisfy DMA requests
while the mm_struct is not available. But IMHO this is basically
useless for any actual user of mdevs.

The other option is to just exclude mdevs and fail the
VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
becoming present while it is asserted. In this way we don't need to do
anything beyond a simple check as the iommu_domain is already fully
populated and pinned.

> > I can fix the deadlock in iommufd in a terrible expensive way, but
> > would rather we design a better interface if nobody is using it yet. I
> > advocate for passing the memfd to the kernel and use that as the page
> > provider, not a mm_struct.
> 
> memfd support alone is not sufficient.  Live update also supports guest ram
> backed by named shared memory.

By "memfd" I broadly mean whatever FD based storage you want to use:
shmem, hugetlbfs, whatever. Just not a mm_struct.

The point is we satisfy the page requests through the fd based object,
not through a vma in the mm_struct.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 18:44         ` Jason Gunthorpe
  2022-09-21 19:30           ` Steven Sistare
@ 2022-09-21 23:20           ` Jason Gunthorpe
  2022-09-22 11:20           ` Daniel P. Berrangé
  2 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 23:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:

> If /dev/vfio/vfio is provided by iommufd it may well have to trigger a
> different ulimit tracking - if that is the only sticking point it
> seems minor and should be addressed in some later series that adds
> /dev/vfio/vfio support to iommufd..

And I have come up with a nice idea for this that feels OK

- Add a 'pin accounting compat' flag to struct iommufd_ctx (eg per FD)
  The flag is set to 1 if /dev/vfio/vfio was the cdev that opened the
  ctx
  An IOCTL issued by cap sysadmin can set the flag

- If the flag is set we do not do pin accounting in the user.
  Instead we account for pins in the FD. The single FD cannot pass the
  rlimit.

This nicely emulates the desired behavior from virtualization without
creating all the problems with exec/fork/etc that per-task tracking
has.
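
As a sketch (the field and helper names here are invented, not what any
posted series uses):

struct iommufd_ctx {
	/* ... existing members ... */
	bool compat_pin_accounting;	/* set when opened via /dev/vfio/vfio,
					 * or via a CAP_SYS_ADMIN-only ioctl */
	atomic_long_t fd_pinned;	/* pages charged to this FD */
};

static int iommufd_charge_pins(struct iommufd_ctx *ictx, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (!ictx->compat_pin_accounting)
		return iommufd_charge_user(ictx, npages); /* per-user path */

	/* compat mode: a single FD cannot pass the rlimit */
	if (atomic_long_add_return(npages, &ictx->fd_pinned) > limit) {
		atomic_long_sub(npages, &ictx->fd_pinned);
		return -ENOMEM;
	}
	return 0;
}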

Even in iommufd native mode a privileged virtualization layer can use
the ioctl to enter the old mode and pass the fd to qemu under a shared
user. This should ease migration I guess.

It can still be oversubscribed but it is now limited to the number of
iommufd_ctx's *with devices* that the userspace can create. Since each
device can be attached to only 1 iommufd this is a stronger limit than
the task limit. 1 device given to the qemu will mean a perfect
enforcement. (ignoring that a hostile qemu can still blow past the
rlimit using concurrent rdma or io_uring)

It is a small incremental step - does this suitably address the concern?

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 18:06       ` Alex Williamson
  2022-09-21 18:44         ` Jason Gunthorpe
  2022-09-21 22:36         ` Laine Stump
@ 2022-09-22 11:06         ` Daniel P. Berrangé
  2022-09-22 14:13           ` Jason Gunthorpe
  2 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-22 11:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> [Cc+ Steve, libvirt, Daniel, Laine]
> 
> On Tue, 20 Sep 2022 16:56:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> > That really just leaves the accounting, and I'm still not convinced at
> > this must be a critical thing. Linus's latest remarks reported in lwn
> > at the maintainer summit on tracepoints/BPF as ABI seem to support
> > this. Let's see an actual deployed production configuration that would
> > be impacted, and we won't find that unless we move forward.
> 
> I'll try to summarize the proposed change so that we can get better
> advice from libvirt folks, or potentially anyone else managing locked
> memory limits for device assignment VMs.
> 
> Background: when a DMA range, ex. guest RAM, is mapped to a vfio device,
> we use the system IOMMU to provide GPA to HPA translation for assigned
> devices. Unlike CPU page tables, we don't generally have a means to
> demand fault these translations, therefore the memory target of the
> translation is pinned to prevent that it cannot be swapped or
> relocated, ie. to guarantee the translation is always valid.
> 
> The issue is where we account these pinned pages, where accounting is
> necessary such that a user cannot lock an arbitrary number of pages
> into RAM to generate a DoS attack.  Duplicate accounting should be
> resolved by iommufd, but is outside the scope of this discussion.
> 
> Currently, vfio tests against the mm_struct.locked_vm relative to
> rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur,
> where task is the current process.  This is the same limit set via the
> setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.
> 
> Note that in both cases above, we're dealing with a task, or process
> limit and both prlimit and ulimit man pages describe them as such.
> 
> iommufd supposes instead, and references existing kernel
> implementations, that despite the descriptions above these limits are
> actually meant to be user limits and therefore instead charges pinned
> pages against user_struct.locked_vm and also marks them in
> mm_struct.pinned_vm.
> 
> The proposed algorithm is to read the _task_ locked memory limit, then
> attempt to charge the _user_ locked_vm, such that user_struct.locked_vm
> cannot exceed the task locked memory limit.
> 
> This obviously has implications.  AFAICT, any management tool that
> doesn't instantiate assigned device VMs under separate users are
> essentially untenable.  For example, if we launch VM1 under userA and
> set a locked memory limit of 4GB via prlimit to account for an assigned
> device, that works fine, until we launch VM2 from userA as well.  In
> that case we can't simply set a 4GB limit on the VM2 task because
> there's already 4GB charged against user_struct.locked_vm for VM1.  So
> we'd need to set the VM2 task limit to 8GB to be able to launch VM2.
> But not only that, we'd need to go back and also set VM1's task limit
> to 8GB or else it will fail if a DMA mapped memory region is transient
> and needs to be re-mapped.
> 
> Effectively any task under the same user and requiring pinned memory
> needs to have a locked memory limit set, and updated, to account for
> all tasks using pinned memory by that user.

That is pretty unpleasant. Having to update all existing VMs, when
starting a new VM (or hotplugging a VFIO device to an existing
VM), is something we would never want to do.

Charging this against the user weakens the DoS protection that
we have today from the POV of individual VMs.

Our primary risk is that a single QEMU is compromised and attempts
to impact the host in some way. We want to limit the damage that
an individual QEMU can cause.

Consider 4 VMs each locked with 4 GB. Any single compromised VM
is constrained to only use 4 GB of locked memory.

With the per-user accounting, now any single compromised VM can
use the cumulative 16 GB of locked memory, even though we only
want that VM to be able to use 4 GB.

For a cumulative memory limit, we would expect cgroups to be
the enforcement mechanism. E.g. consider a machine with 64 GB of
RAM, where we want to reserve 12 GB for host OS use, and all
other RAM is for VM usage. A mgmt app wanting such protection
would set a limit on /machines.slice, at the 52 GB mark, to
reserve the 12 GB for non-VM usage.

Also, the mgmt app accounts for how many VMs it has started on
a host and will not try to launch more VMs than there is RAM
available to support them. Accounting at the user level instead
of the task level is effectively trying to protect against the
bad mgmt app trying to overcommit the host. That is not really
important, as the mgmt app is so privileged it is already
assumed to be a trusted host component akin to a root account.

So per-user locked mem accounting looks like a regression in
our VM isolation abilities compared to the per-task accounting.

> How does this affect known current use cases of locked memory
> management for assigned device VMs?

It will affect every single application using libvirt today, with
the possible exception of KubeVirt. KubeVirt puts each VM in a
separate container, and if they have userns enabled, those will
each get distinct UIDs for accounting purposes.

I expect every other usage of libvirt to be affected, unless the
mgmt app has gone out of its way to configure a dedicated UID for
each QEMU - I'm not aware of any which do this.

> Does qemu://system by default sandbox into per VM uids or do they all
> use the qemu user by default.  I imagine qemu://session mode is pretty
> screwed by this, but I also don't know who/where locked limits are
> lifted for such VMs.  Boxes, who I think now supports assigned device
> VMs, could also be affected.

In a out of the box config qemu:///system will always run all VMs
under the user:group pairing  qemu:qemu.

Mgmt apps can request a dedicated UID per VM in the XML config,
but I'm not aware of any doing this. Someone might, but we should
assume the majority will not.

IOW, it affects essentially all libvirt usage of VFIO.

> > So, I still like 2 because it yields the smallest next step before we
> > can bring all the parallel work onto the list, and it makes testing
> > and converting non-qemu stuff easier even going forward.
> 
> If a vfio compatible interface isn't transparently compatible, then I
> have a hard time understanding its value.  Please correct my above
> description and implications, but I suspect these are not just
> theoretical ABI compat issues.  Thanks,



With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 18:44         ` Jason Gunthorpe
  2022-09-21 19:30           ` Steven Sistare
  2022-09-21 23:20           ` Jason Gunthorpe
@ 2022-09-22 11:20           ` Daniel P. Berrangé
  2022-09-22 14:08             ` Jason Gunthorpe
  2 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-22 11:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > The issue is where we account these pinned pages, where accounting is
> > necessary such that a user cannot lock an arbitrary number of pages
> > into RAM to generate a DoS attack.  
> 
> It is worth pointing out that preventing a DOS attack doesn't actually
> work because a *task* limit is trivially bypassed by just spawning
> more tasks. So, as a security feature, this is already very
> questionable.

The malicious party on host VM hosts is generally the QEMU process.
QEMU is normally prevented from spawning more tasks, both by SELinux
controls and by the seccomp sandbox blocking clone() (except for
thread creation).  We need to constrain what any individual QEMU can
do to the host, and the per-task mem locking limits can do that.

The mgmt app is what spawns more tasks (QEMU instances) and they
are generally a trusted party on the host, or they are already
constrained in other ways such as cgroups or namespaces. The
mgmt apps would be expected to not intentionally overcommit the
host with VMs needing too much cumulative locked RAM.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 11:20           ` Daniel P. Berrangé
@ 2022-09-22 14:08             ` Jason Gunthorpe
  2022-09-22 14:49               ` Daniel P. Berrangé
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 14:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > The issue is where we account these pinned pages, where accounting is
> > > necessary such that a user cannot lock an arbitrary number of pages
> > > into RAM to generate a DoS attack.  
> > 
> > It is worth pointing out that preventing a DOS attack doesn't actually
> > work because a *task* limit is trivially bypassed by just spawning
> > more tasks. So, as a security feature, this is already very
> > questionable.
> 
> The malicious party on host VM hosts is generally the QEMU process.
> QEMU is normally prevented from spawning more tasks, both by SELinux
> controls and be the seccomp sandbox blocking clone() (except for
> thread creation).  We need to constrain what any individual QEMU can
> do to the host, and the per-task mem locking limits can do that.

Even with syscall limits simple things like execve (enabled eg for
qemu self-upgrade) can corrupt the kernel task-based accounting to the
point that the limits don't work.

Also, you are skipping the fact that, since every subsystem does this
differently and wrongly, a qemu can still go at least 3x over the
allocation using just normally allowed functionality.

Again, as a security feature this fundamentally does not work. We
cannot account for a FD owned resource inside the task based
mm_struct. There are always going to be exploitable holes.

What you really want is a cgroup based limit that is consistently
applied in the kernel.

Regardless, since this seems pretty well entrenched I continue to
suggest my simpler alternative of making it fd based instead of user
based. At least that doesn't have the unsolvable bugs related to task
accounting.
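
To illustrate what "FD based" means here, a sketch of the accounting
pattern (plain C with hypothetical names, not an actual iommufd patch):
the pin counter lives in the per-FD context object, so exec or extra
tasks in the owning process cannot desynchronise it.

#include <errno.h>
#include <stdatomic.h>

/* Hypothetical per-FD context: the charge follows this object, not
 * whichever task happened to issue the ioctl. */
struct fd_pin_account {
	atomic_long pinned_pages;	/* pages charged to this FD */
	long limit;			/* fixed when the FD is created */
};

static int fd_charge_pins(struct fd_pin_account *a, long npages)
{
	long cur = atomic_fetch_add(&a->pinned_pages, npages) + npages;

	if (cur > a->limit) {
		atomic_fetch_sub(&a->pinned_pages, npages);
		return -ENOMEM;		/* over this FD's budget */
	}
	return 0;
}

static void fd_uncharge_pins(struct fd_pin_account *a, long npages)
{
	atomic_fetch_sub(&a->pinned_pages, npages);
}

Task exit, fork or exec then no longer matter; the uncharge happens when
the FD itself is finally released.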

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 11:06         ` Daniel P. Berrangé
@ 2022-09-22 14:13           ` Jason Gunthorpe
  2022-09-22 14:46             ` Daniel P. Berrangé
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 14:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 12:06:33PM +0100, Daniel P. Berrangé wrote:

> So per-user locked mem accounting looks like a regression in
> our VM isolation abilities compared to the per-task accounting.

For this kind of API the management app needs to put each VM in its
own user, which I'm a bit surprised it doesn't already do as a further
protection against cross-process concerns.

The question here is how we provide enough compatibility for this
existing methodology while still closing the security holes and
inconsistencies that exist in the kernel implementation.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 14:13           ` Jason Gunthorpe
@ 2022-09-22 14:46             ` Daniel P. Berrangé
  0 siblings, 0 replies; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-22 14:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 11:13:42AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 12:06:33PM +0100, Daniel P. Berrangé wrote:
> 
> > So per-user locked mem accounting looks like a regression in
> > our VM isolation abilities compared to the per-task accounting.
> 
> For this kind of API the management app needs to put each VM in its
> own user, which I'm a bit surprised it doesn't already do as a further
> protection against cross-process concerns.

Putting VMs in dedicated users is not practical to do automatically
on a general-purpose OS install, because there is no arbitrator of
which UID ranges can be safely used without conflicting with other
usage on the OS.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 14:08             ` Jason Gunthorpe
@ 2022-09-22 14:49               ` Daniel P. Berrangé
  2022-09-22 14:51                 ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-22 14:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > The issue is where we account these pinned pages, where accounting is
> > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > into RAM to generate a DoS attack.  
> > > 
> > > It is worth pointing out that preventing a DOS attack doesn't actually
> > > work because a *task* limit is trivially bypassed by just spawning
> > > more tasks. So, as a security feature, this is already very
> > > questionable.
> > 
> > The malicious party on host VM hosts is generally the QEMU process.
> > QEMU is normally prevented from spawning more tasks, both by SELinux
> > controls and be the seccomp sandbox blocking clone() (except for
> > thread creation).  We need to constrain what any individual QEMU can
> > do to the host, and the per-task mem locking limits can do that.
> 
> Even with syscall limits simple things like execve (enabled eg for
> qemu self-upgrade) can corrupt the kernel task-based accounting to the
> point that the limits don't work.

Note, execve is currently also blocked by default by the seccomp
sandbox used with libvirt, as well as by the SELinux policy.
Self-upgrade isn't a feature that exists (yet).
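
For reference, the shape of such a rule with libseccomp looks roughly
like the following (illustrative only; the actual qemu -sandbox/libvirt
filter is built differently and covers far more than this):

#include <errno.h>
#include <seccomp.h>	/* link with -lseccomp */

int deny_execve(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);	/* default: allow */
	int rc;

	if (!ctx)
		return -1;
	/* Make execve()/execveat() fail with EPERM instead of running. */
	rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0);
	if (!rc)
		rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
				      SCMP_SYS(execveat), 0);
	if (!rc)
		rc = seccomp_load(ctx);	/* applies to the calling process */
	seccomp_release(ctx);
	return rc;
}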

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 14:49               ` Daniel P. Berrangé
@ 2022-09-22 14:51                 ` Jason Gunthorpe
  2022-09-22 15:00                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 14:51 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
> On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > > The issue is where we account these pinned pages, where accounting is
> > > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > > into RAM to generate a DoS attack.  
> > > > 
> > > > It is worth pointing out that preventing a DOS attack doesn't actually
> > > > work because a *task* limit is trivially bypassed by just spawning
> > > > more tasks. So, as a security feature, this is already very
> > > > questionable.
> > > 
> > > The malicious party on host VM hosts is generally the QEMU process.
> > > QEMU is normally prevented from spawning more tasks, both by SELinux
> > > controls and be the seccomp sandbox blocking clone() (except for
> > > thread creation).  We need to constrain what any individual QEMU can
> > > do to the host, and the per-task mem locking limits can do that.
> > 
> > Even with syscall limits simple things like execve (enabled eg for
> > qemu self-upgrade) can corrupt the kernel task-based accounting to the
> > point that the limits don't work.
> 
> Note, execve is currently blocked by default too by the default
> seccomp sandbox used with libvirt, as well as by the SELinux
> policy again.  self-upgrade isn't a feature that exists (yet).

That userspace has disabled half the kernel isn't an excuse for the
kernel to be insecure by design :( This needs to be fixed to enable
features we know are coming, so:

What would libvirt land like to see, given that task-based tracking
cannot be fixed in the kernel?

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 14:51                 ` Jason Gunthorpe
@ 2022-09-22 15:00                   ` Daniel P. Berrangé
  2022-09-22 15:31                     ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-22 15:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
> > On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > > > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > > > The issue is where we account these pinned pages, where accounting is
> > > > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > > > into RAM to generate a DoS attack.  
> > > > > 
> > > > > It is worth pointing out that preventing a DOS attack doesn't actually
> > > > > work because a *task* limit is trivially bypassed by just spawning
> > > > > more tasks. So, as a security feature, this is already very
> > > > > questionable.
> > > > 
> > > > The malicious party on host VM hosts is generally the QEMU process.
> > > > QEMU is normally prevented from spawning more tasks, both by SELinux
> > > > controls and be the seccomp sandbox blocking clone() (except for
> > > > thread creation).  We need to constrain what any individual QEMU can
> > > > do to the host, and the per-task mem locking limits can do that.
> > > 
> > > Even with syscall limits simple things like execve (enabled eg for
> > > qemu self-upgrade) can corrupt the kernel task-based accounting to the
> > > point that the limits don't work.
> > 
> > Note, execve is currently blocked by default too by the default
> > seccomp sandbox used with libvirt, as well as by the SELinux
> > policy again.  self-upgrade isn't a feature that exists (yet).
> 
> That userspace has disabled half the kernel isn't an excuse for the
> kernel to be insecure by design :( This needs to be fixed to enable
> features we know are coming so..
> 
> What would libvirt land like to see given task based tracking cannot
> be fixed in the kernel?

There needs to be a mechanism to control individual VMs, whether by
task or by cgroup. User-based limits are not suited to what we need
to achieve.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 15:00                   ` Daniel P. Berrangé
@ 2022-09-22 15:31                     ` Jason Gunthorpe
  2022-09-23  8:54                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 15:31 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 04:00:00PM +0100, Daniel P. Berrangé wrote:
> On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > > > > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > > > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > > > > The issue is where we account these pinned pages, where accounting is
> > > > > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > > > > into RAM to generate a DoS attack.  
> > > > > > 
> > > > > > It is worth pointing out that preventing a DOS attack doesn't actually
> > > > > > work because a *task* limit is trivially bypassed by just spawning
> > > > > > more tasks. So, as a security feature, this is already very
> > > > > > questionable.
> > > > > 
> > > > > The malicious party on host VM hosts is generally the QEMU process.
> > > > > QEMU is normally prevented from spawning more tasks, both by SELinux
> > > > > controls and be the seccomp sandbox blocking clone() (except for
> > > > > thread creation).  We need to constrain what any individual QEMU can
> > > > > do to the host, and the per-task mem locking limits can do that.
> > > > 
> > > > Even with syscall limits simple things like execve (enabled eg for
> > > > qemu self-upgrade) can corrupt the kernel task-based accounting to the
> > > > point that the limits don't work.
> > > 
> > > Note, execve is currently blocked by default too by the default
> > > seccomp sandbox used with libvirt, as well as by the SELinux
> > > policy again.  self-upgrade isn't a feature that exists (yet).
> > 
> > That userspace has disabled half the kernel isn't an excuse for the
> > kernel to be insecure by design :( This needs to be fixed to enable
> > features we know are coming so..
> > 
> > What would libvirt land like to see given task based tracking cannot
> > be fixed in the kernel?
> 
> There needs to be a mechanism to control individual VMs, whether by
> task or by cgroup. User based limits are not suited to what we need
> to achieve.

The kernel has already standardized on user-based limits here for
other subsystems - libvirt and qemu cannot ignore that it exists. It
is only a matter of time before qemu starts using these other
subsystems' features (e.g. io_uring) and has problems.

So, IMHO, the future must be that libvirt/etc sets an unlimited
rlimit, because the user approach is not going away in the kernel and
it sounds like libvirt cannot accommodate it at all.

This means we need to provide a new mechanism for future libvirt to
use. Are you happy with cgroups?

Once those points are decided, we need to figure out how best to
continue to support historical libvirt and still meet the kernel
security needs going forward. This is where I'm thinking about storing
the tracking in the FD instead of the user.

IMHO, task-based tracking is something that cannot be made to work properly.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-22 15:31                     ` Jason Gunthorpe
@ 2022-09-23  8:54                       ` Daniel P. Berrangé
  2022-09-23 13:29                         ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-23  8:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Thu, Sep 22, 2022 at 12:31:20PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 04:00:00PM +0100, Daniel P. Berrangé wrote:
> > On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
> > > > On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > > > > > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > > > > > The issue is where we account these pinned pages, where accounting is
> > > > > > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > > > > > into RAM to generate a DoS attack.  
> > > > > > > 
> > > > > > > It is worth pointing out that preventing a DOS attack doesn't actually
> > > > > > > work because a *task* limit is trivially bypassed by just spawning
> > > > > > > more tasks. So, as a security feature, this is already very
> > > > > > > questionable.
> > > > > > 
> > > > > > The malicious party on host VM hosts is generally the QEMU process.
> > > > > > QEMU is normally prevented from spawning more tasks, both by SELinux
> > > > > > controls and be the seccomp sandbox blocking clone() (except for
> > > > > > thread creation).  We need to constrain what any individual QEMU can
> > > > > > do to the host, and the per-task mem locking limits can do that.
> > > > > 
> > > > > Even with syscall limits simple things like execve (enabled eg for
> > > > > qemu self-upgrade) can corrupt the kernel task-based accounting to the
> > > > > point that the limits don't work.
> > > > 
> > > > Note, execve is currently blocked by default too by the default
> > > > seccomp sandbox used with libvirt, as well as by the SELinux
> > > > policy again.  self-upgrade isn't a feature that exists (yet).
> > > 
> > > That userspace has disabled half the kernel isn't an excuse for the
> > > kernel to be insecure by design :( This needs to be fixed to enable
> > > features we know are coming so..
> > > 
> > > What would libvirt land like to see given task based tracking cannot
> > > be fixed in the kernel?
> > 
> > There needs to be a mechanism to control individual VMs, whether by
> > task or by cgroup. User based limits are not suited to what we need
> > to achieve.
> 
> The kernel has already standardized on user based limits here for
> other subsystems - libvirt and qemu cannot ignore that it exists. It
> is only a matter of time before qemu starts using these other
> subsystem features (eg io_uring) and has problems.
> 
> So, IMHO, the future must be that libvirt/etc sets an unlimited
> rlimit, because the user approach is not going away in the kernel and
> it sounds like libvirt cannot accommodate it at all.
> 
> This means we need to provide a new mechanism for future libvirt to
> use. Are you happy with cgroups?

Yes, we use cgroups extensively already.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23  8:54                       ` Daniel P. Berrangé
@ 2022-09-23 13:29                         ` Jason Gunthorpe
  2022-09-23 13:35                           ` Daniel P. Berrangé
  2022-09-23 14:03                           ` Alex Williamson
  0 siblings, 2 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 13:29 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:

> Yes, we use cgroups extensively already.

Ok, I will try to see about this.

Can you also tell me if the selinux/seccomp will prevent qemu from
opening more than one /dev/vfio/vfio? I suppose the answer is no?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 13:29                         ` Jason Gunthorpe
@ 2022-09-23 13:35                           ` Daniel P. Berrangé
  2022-09-23 13:46                             ` Jason Gunthorpe
  2022-09-23 14:03                           ` Alex Williamson
  1 sibling, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-23 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
> 
> > Yes, we use cgroups extensively already.
> 
> Ok, I will try to see about this
> 
> Can you also tell me if the selinux/seccomp will prevent qemu from
> opening more than one /dev/vfio/vfio ? I suppose the answer is no?

I don't believe there's any restriction on the number of open attempts,
it's just a case of allowed or denied globally for the VM.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 13:35                           ` Daniel P. Berrangé
@ 2022-09-23 13:46                             ` Jason Gunthorpe
  2022-09-23 14:00                               ` Daniel P. Berrangé
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 13:46 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Fri, Sep 23, 2022 at 02:35:20PM +0100, Daniel P. Berrangé wrote:
> On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
> > 
> > > Yes, we use cgroups extensively already.
> > 
> > Ok, I will try to see about this
> > 
> > Can you also tell me if the selinux/seccomp will prevent qemu from
> > opening more than one /dev/vfio/vfio ? I suppose the answer is no?
> 
> I don't believe there's any restriction on the nubmer of open attempts,
> its just a case of allowed or denied globally for the VM.

Ok

For iommufd we plan to have qemu accept a single already-opened FD of
/dev/iommu, and so the selinux/etc would block all access to the
chardev.

Can you tell me if the thing invoking qemu that will open /dev/iommu
will have CAP_SYS_RESOURCE? I assume yes if it is already touching
ulimits...
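
To sketch that model (illustrative only; the "iommufd-fd=" argument is
made up and just stands for however qemu would be told which inherited
descriptor to use):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char arg[32];
	/* The launcher (libvirtd or similar) opens the chardev... */
	int fd = open("/dev/iommu", O_RDWR);

	if (fd < 0) {
		perror("open /dev/iommu");
		return 1;
	}
	/* ...and leaves it open across exec (no O_CLOEXEC), so the child
	 * inherits it without ever being allowed to open() the chardev. */
	snprintf(arg, sizeof(arg), "iommufd-fd=%d", fd);
	execlp("qemu-system-x86_64", "qemu-system-x86_64", arg, (char *)NULL);
	perror("execlp");
	return 1;
}

qemu would parse the descriptor number from its configuration instead of
opening the chardev itself, so the security policy can deny open() on
/dev/iommu for the qemu process outright.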

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 13:46                             ` Jason Gunthorpe
@ 2022-09-23 14:00                               ` Daniel P. Berrangé
  2022-09-23 15:40                                 ` Laine Stump
  0 siblings, 1 reply; 78+ messages in thread
From: Daniel P. Berrangé @ 2022-09-23 14:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Fri, Sep 23, 2022 at 10:46:21AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 02:35:20PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
> > > 
> > > > Yes, we use cgroups extensively already.
> > > 
> > > Ok, I will try to see about this
> > > 
> > > Can you also tell me if the selinux/seccomp will prevent qemu from
> > > opening more than one /dev/vfio/vfio ? I suppose the answer is no?
> > 
> > I don't believe there's any restriction on the nubmer of open attempts,
> > its just a case of allowed or denied globally for the VM.
> 
> Ok
> 
> For iommufd we plan to have qemu accept a single already opened FD of
> /dev/iommu and so the selinux/etc would block all access to the
> chardev.

A selinux policy update would be needed to allow read()/write() for the
inherited FD, while keeping open() blocked.

> Can you tell me if the thing invoking qmeu that will open /dev/iommu
> will have CAP_SYS_RESOURCE ? I assume yes if it is already touching
> ulimits..

The privileged libvirtd runs with privileges equivalent to root, so all
capabilities are present.

The unprivileged libvirtd runs with the same privileges as your user
account, so no capabilities. I vaguely recall there was some way to
enable use of PCI passthrough for unprivileged libvirtd, but it needed
a bunch of admin setup steps ahead of time.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 13:29                         ` Jason Gunthorpe
  2022-09-23 13:35                           ` Daniel P. Berrangé
@ 2022-09-23 14:03                           ` Alex Williamson
  2022-09-26  6:34                             ` David Gibson
  1 sibling, 1 reply; 78+ messages in thread
From: Alex Williamson @ 2022-09-23 14:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel P. Berrangé,
	Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

On Fri, 23 Sep 2022 10:29:41 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
> 
> > Yes, we use cgroups extensively already.  
> 
> Ok, I will try to see about this
> 
> Can you also tell me if the selinux/seccomp will prevent qemu from
> opening more than one /dev/vfio/vfio ? I suppose the answer is no?

QEMU manages the container:group association with legacy vfio, so it
can't be restricted from creating multiple containers.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 14:00                               ` Daniel P. Berrangé
@ 2022-09-23 15:40                                 ` Laine Stump
  2022-10-21 19:56                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Laine Stump @ 2022-09-23 15:40 UTC (permalink / raw)
  To: Daniel P. Berrangé, Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list

On 9/23/22 10:00 AM, Daniel P. Berrangé wrote:
> On Fri, Sep 23, 2022 at 10:46:21AM -0300, Jason Gunthorpe wrote:
>> On Fri, Sep 23, 2022 at 02:35:20PM +0100, Daniel P. Berrangé wrote:
>>> On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
>>>>
>>>>> Yes, we use cgroups extensively already.
>>>>
>>>> Ok, I will try to see about this
>>>>
>>>> Can you also tell me if the selinux/seccomp will prevent qemu from
>>>> opening more than one /dev/vfio/vfio ? I suppose the answer is no?
>>>
>>> I don't believe there's any restriction on the nubmer of open attempts,
>>> its just a case of allowed or denied globally for the VM.
>>
>> Ok
>>
>> For iommufd we plan to have qemu accept a single already opened FD of
>> /dev/iommu and so the selinux/etc would block all access to the
>> chardev.
> 
> A selinux policy update would be needed to allow read()/write() for the
> inherited FD, whle keeping open() blocked
> 
>> Can you tell me if the thing invoking qmeu that will open /dev/iommu
>> will have CAP_SYS_RESOURCE ? I assume yes if it is already touching
>> ulimits..
> 
> The privileged libvirtd runs with privs equiv to root, so all
> capabilities are present.
> 
> The unprivileged libvirtd runs with same privs as your user account,
> so no capabilities. I vaguely recall there was some way to enable
> use of PCI passthrough for unpriv libvirtd, but needed a bunch of
> admin setup steps ahead of time.

It's been a few years, but my recollection is that before starting a 
libvirtd that will run a guest with a vfio device, a privileged process 
needs to

1) increase the locked memory limit for the user that will be running 
qemu (eg. by adding a file with the increased limit to 
/etc/security/limits.d)

2) bind the device to the vfio-pci driver, and

3) chown /dev/vfio/$iommu_group to the user running qemu.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 14:03                           ` Alex Williamson
@ 2022-09-26  6:34                             ` David Gibson
  0 siblings, 0 replies; 78+ messages in thread
From: David Gibson @ 2022-09-26  6:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Daniel P. Berrangé,
	Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, Eric Farman,
	iommu, Jason Wang, Jean-Philippe Brucker, Martins, Joao, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Laine Stump

[-- Attachment #1: Type: text/plain, Size: 1066 bytes --]

On Fri, Sep 23, 2022 at 08:03:07AM -0600, Alex Williamson wrote:
> On Fri, 23 Sep 2022 10:29:41 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
> > 
> > > Yes, we use cgroups extensively already.  
> > 
> > Ok, I will try to see about this
> > 
> > Can you also tell me if the selinux/seccomp will prevent qemu from
> > opening more than one /dev/vfio/vfio ? I suppose the answer is no?
> 
> QEMU manages the container:group association with legacy vfio, so it
> can't be restricted from creating multiple containers.  Thanks,

.. and it absolutely will create multiple containers (i.e. open
/dev/vfio/vfio multiple times) if there are multiple guest-side vIOMMU
domains.

It can, however, open each /dev/vfio/NN group file only once,
since they require exclusive access.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-20 20:07   ` Jason Gunthorpe
  2022-09-21  3:40     ` Tian, Kevin
@ 2022-09-26 13:48     ` Rodel, Jorg
  1 sibling, 0 replies; 78+ messages in thread
From: Rodel, Jorg @ 2022-09-26 13:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Lu Baolu, Chaitanya Kulkarni,
	Cornelia Huck, Daniel Jordan, David Gibson, Eric Auger,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu

On Tue, Sep 20, 2022 at 05:07:56PM -0300, Jason Gunthorpe wrote:
> From my view, I don't get the sense the Joerg is interested in
> maintaining this, so I was expecting to have to PR this to Linus on
> its own (with the VFIO bits) and a new group would carry it through
> the initial phases.

Well, I am interested in maintaining the parts related to the IOMMU-API
and making sure future updates don't break anything. I am happy to trust
you all with the other details, as you all better understand the
use-cases and interactions with other sub-systems.

So I am fine with you sending the PR to get iommufd upstream together with
the VFIO changes (with my acks for the iommu-parts), but further updates
should still go through my tree to avoid any conflicts with other IOMMU
changes.

Regards,

-- 
Jörg Rödel
jroedel@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 02/13] iommufd: Overview documentation
  2022-09-12 10:40       ` David Gibson
@ 2022-09-27 17:33         ` Jason Gunthorpe
  2022-09-29  3:47           ` David Gibson
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-09-27 17:33 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, Eric Farman, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

On Mon, Sep 12, 2022 at 08:40:20PM +1000, David Gibson wrote:

> > > > +The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
> > > > +represents a logical linear array of full PFNs. PFNs are stored in a tiered
> > > > +scheme:
> > > > +
> > > > + 1) iopt_pages::pinned_pfns xarray
> > > > + 2) An iommu_domain
> > > > + 3) The origin of the PFNs, i.e. the userspace pointer
> > > 
> > > I can't follow what this "tiered scheme" is describing.
> > 
> > Hum, I'm not sure how to address this.
> > 
> > Is this better?
> > 
> >  1) PFNs that have been "software accessed" stored in theiopt_pages::pinned_pfns
> >     xarray
> >  2) PFNs stored inside the IOPTEs accessed through an iommu_domain
> >  3) The origin of the PFNs, i.e. the userspace VA in a mm_struct
> 
> Hmm.. only slightly.  What about:
> 
>    Each opt_pages represents a logical linear array of full PFNs.  The
>    PFNs are ultimately derived from userspave VAs via an mm_struct.
>    They are cached in .. <describe the pined_pfns and iommu_domain
>    data structures>

Ok, I have this now:

Each iopt_pages represents a logical linear array of full PFNs.  The PFNs are
ultimately derived from userspace VAs via an mm_struct. Once they have been
pinned, the PFN is stored in an iommu_domain's IOPTEs or inside the pinned_pfns
xarray if they are being "software accessed".

PFNs have to be copied between all combinations of storage locations, depending
on what domains are present and what kinds of in-kernel "software access" users
exist. The mechanism ensures that a page is pinned only once.
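
If it helps, the tiering can be pictured with plain C stand-ins (an array
in place of the xarray, a stub for the iommu_domain; this mirrors the text
above, it is not the actual iommufd structs):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pfn_t;		/* 0 means "not present" in this sketch */

struct iommu_domain_stub {
	pfn_t *ioptes;		/* PFNs currently programmed into IOPTEs */
};

struct iopt_pages_sketch {
	void *uptr;		/* origin: start of the userspace VA range */
	size_t npages;
	pfn_t *pinned_pfns;	/* PFNs held for "software access" users */
	struct iommu_domain_stub *domain;
};

/* Return a PFN from whichever storage location already holds it; only
 * if neither does must the caller pin it again from uptr's mm. */
static bool lookup_pfn(const struct iopt_pages_sketch *p, size_t i,
		       pfn_t *out)
{
	if (p->pinned_pfns && p->pinned_pfns[i]) {
		*out = p->pinned_pfns[i];
		return true;
	}
	if (p->domain && p->domain->ioptes[i]) {
		*out = p->domain->ioptes[i];
		return true;
	}
	return false;
}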

Thanks
Jason 



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 02/13] iommufd: Overview documentation
  2022-09-27 17:33         ` Jason Gunthorpe
@ 2022-09-29  3:47           ` David Gibson
  0 siblings, 0 replies; 78+ messages in thread
From: David Gibson @ 2022-09-29  3:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Lu Baolu, Chaitanya Kulkarni, Cornelia Huck,
	Daniel Jordan, Eric Auger, Eric Farman, iommu, Jason Wang,
	Jean-Philippe Brucker, Joao Martins, Kevin Tian, kvm,
	Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Yi Liu, Keqian Zhu

[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

On Tue, Sep 27, 2022 at 02:33:31PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 12, 2022 at 08:40:20PM +1000, David Gibson wrote:
> 
> > > > > +The iopt_pages is the center of the storage and motion of PFNs. Each iopt_pages
> > > > > +represents a logical linear array of full PFNs. PFNs are stored in a tiered
> > > > > +scheme:
> > > > > +
> > > > > + 1) iopt_pages::pinned_pfns xarray
> > > > > + 2) An iommu_domain
> > > > > + 3) The origin of the PFNs, i.e. the userspace pointer
> > > > 
> > > > I can't follow what this "tiered scheme" is describing.
> > > 
> > > Hum, I'm not sure how to address this.
> > > 
> > > Is this better?
> > > 
> > >  1) PFNs that have been "software accessed" stored in theiopt_pages::pinned_pfns
> > >     xarray
> > >  2) PFNs stored inside the IOPTEs accessed through an iommu_domain
> > >  3) The origin of the PFNs, i.e. the userspace VA in a mm_struct
> > 
> > Hmm.. only slightly.  What about:
> > 
> >    Each opt_pages represents a logical linear array of full PFNs.  The
> >    PFNs are ultimately derived from userspave VAs via an mm_struct.
> >    They are cached in .. <describe the pined_pfns and iommu_domain
> >    data structures>
> 
> Ok, I have this now:
> 
> Each iopt_pages represents a logical linear array of full PFNs.  The PFNs are
> ultimately derived from userspave VAs via an mm_struct. Once they have been
> pinned the PFN is stored in an iommu_domain's IOPTEs or inside the pinned_pages
> xarray if they are being "software accessed".
> 
> PFN have to be copied between all combinations of storage locations, depending
> on what domains are present and what kinds of in-kernel "software access" users
> exists. The mechanism ensures that a page is pinned only once.

LGTM, thanks.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-21 23:09             ` Jason Gunthorpe
@ 2022-10-06 16:01               ` Jason Gunthorpe
  2022-10-06 22:57                 ` Steven Sistare
  2022-10-10 20:54                 ` Steven Sistare
  0 siblings, 2 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-10-06 16:01 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
> 
> > > If Steve wants to keep it then someone needs to fix the deadlock in
> > > the vfio implementation before any userspace starts to appear. 
> > 
> > The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
> > across exec, which can result in mm->locked_vm becoming negative. I have several 
> > fixes, but none result in limits being reached at exactly the same time as before --
> > the same general issue being discussed for iommufd.  I am still thinking about it.
> 
> Oh, yeah, I noticed this was all busted up too.
> 
> > I am not aware of a deadlock problem.  Please elaborate or point me to an
> > email thread.
> 
> VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
> userspace can tigger the lock to be taken and then returns to
> userspace with the lock held.
> 
> Any scenario where a kernel thread hits that open-coded lock and then
> userspace does-the-wrong-thing will deadlock the kernel.
> 
> For instance consider a mdev driver. We assert
> VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
> and becomes blocked on the now locked lock. Userspace then tries to
> close the device FD.
> 
> FD closure will trigger device close and the VFIO core code
> requirement is that mdev driver device teardown must halt all
> concurrent threads touching vfio_device. Thus the mdev will try to
> fence its workqeue and then deadlock - unable to flush/cancel a work
> that is currently blocked on a lock held by userspace that will never
> be unlocked.
> 
> This is just the first scenario that comes to mind. The approach to
> give userspace control of a lock that kernel threads can become
> blocked on is so completely sketchy it is a complete no-go in my
> opinion. If I had seen it when it was posted I would have hard NAK'd
> it.
> 
> My "full" solution in mind for iommufd is to pin all the memory upon
> VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfy DMA requests
> while the mm_struct is not available. But IMHO this is basically
> useless for any actual user of mdevs.
> 
> The other option is to just exclude mdevs and fail the
> VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
> becoming present while it is asserted. In this way we don't need to do
> anything beyond a simple check as the iommu_domain is already fully
> populated and pinned.

Do we have a solution to this?

If not, I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR.

Aside from the approach of using the FD, another idea is to just use
fork.

qemu would do something like

 .. stop all container ioctl activity ..
 fork()
    ioctl(CHANGE_MM) // switch all maps to this mm
    .. signal parent.. 
    .. wait parent..
    exit(0)
 .. wait child ..
 exec()
 ioctl(CHANGE_MM) // switch all maps to this mm
 ..signal child..
 waitpid(childpid)

This way the kernel is never left without a page provider for the
maps; the dummy mm_struct belonging to the fork will serve that role
for the gap.

And the above is only required if we have mdevs, so we could imagine
userspace optimizing it away for, e.g., vfio-pci-only cases.

It is not as efficient as using an FD backing, but this is super easy
to implement in the kernel.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-06 16:01               ` Jason Gunthorpe
@ 2022-10-06 22:57                 ` Steven Sistare
  2022-10-10 20:54                 ` Steven Sistare
  1 sibling, 0 replies; 78+ messages in thread
From: Steven Sistare @ 2022-10-06 22:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
>>
>>>> If Steve wants to keep it then someone needs to fix the deadlock in
>>>> the vfio implementation before any userspace starts to appear. 
>>>
>>> The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
>>> across exec, which can result in mm->locked_vm becoming negative. I have several 
>>> fixes, but none result in limits being reached at exactly the same time as before --
>>> the same general issue being discussed for iommufd.  I am still thinking about it.
>>
>> Oh, yeah, I noticed this was all busted up too.
>>
>>> I am not aware of a deadlock problem.  Please elaborate or point me to an
>>> email thread.
>>
>> VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
>> userspace can tigger the lock to be taken and then returns to
>> userspace with the lock held.
>>
>> Any scenario where a kernel thread hits that open-coded lock and then
>> userspace does-the-wrong-thing will deadlock the kernel.
>>
>> For instance consider a mdev driver. We assert
>> VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
>> and becomes blocked on the now locked lock. Userspace then tries to
>> close the device FD.
>>
>> FD closure will trigger device close and the VFIO core code
>> requirement is that mdev driver device teardown must halt all
>> concurrent threads touching vfio_device. Thus the mdev will try to
>> fence its workqeue and then deadlock - unable to flush/cancel a work
>> that is currently blocked on a lock held by userspace that will never
>> be unlocked.
>>
>> This is just the first scenario that comes to mind. The approach to
>> give userspace control of a lock that kernel threads can become
>> blocked on is so completely sketchy it is a complete no-go in my
>> opinion. If I had seen it when it was posted I would have hard NAK'd
>> it.
>>
>> My "full" solution in mind for iommufd is to pin all the memory upon
>> VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfy DMA requests
>> while the mm_struct is not available. But IMHO this is basically
>> useless for any actual user of mdevs.
>>
>> The other option is to just exclude mdevs and fail the
>> VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
>> becoming present while it is asserted. In this way we don't need to do
>> anything beyond a simple check as the iommu_domain is already fully
>> populated and pinned.
> 
> Do we have a solution to this?

Not yet, but I have not had time until now.  Let me try some things tomorrow 
and get back to you.  Thanks for thinking about it.

- Steve

> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
> 
> Aside from the approach to use the FD, another idea is to just use
> fork.
> 
> qemu would do something like
> 
>  .. stop all container ioctl activity ..
>  fork()
>     ioctl(CHANGE_MM) // switch all maps to this mm
>     .. signal parent.. 
>     .. wait parent..
>     exit(0)
>  .. wait child ..
>  exec()
>  ioctl(CHANGE_MM) // switch all maps to this mm
>  ..signal child..
>  waitpid(childpid)
> 
> This way the kernel is never left without a page provider for the
> maps, the dummy mm_struct belonging to the fork will serve that role
> for the gap.
> 
> And the above is only required if we have mdevs, so we could imagine
> userspace optimizing it away for, eg vfio-pci only cases.
> 
> It is not as efficient as using a FD backing, but this is super easy
> to implement in the kernel.
> 
> Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-06 16:01               ` Jason Gunthorpe
  2022-10-06 22:57                 ` Steven Sistare
@ 2022-10-10 20:54                 ` Steven Sistare
  2022-10-11 12:30                   ` Jason Gunthorpe
  1 sibling, 1 reply; 78+ messages in thread
From: Steven Sistare @ 2022-10-10 20:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
>>
>>>> If Steve wants to keep it then someone needs to fix the deadlock in
>>>> the vfio implementation before any userspace starts to appear. 
>>>
>>> The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
>>> across exec, which can result in mm->locked_vm becoming negative. I have several 
>>> fixes, but none result in limits being reached at exactly the same time as before --
>>> the same general issue being discussed for iommufd.  I am still thinking about it.
>>
>> Oh, yeah, I noticed this was all busted up too.
>>
>>> I am not aware of a deadlock problem.  Please elaborate or point me to an
>>> email thread.
>>
>> VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
>> userspace can tigger the lock to be taken and then returns to
>> userspace with the lock held.
>>
>> Any scenario where a kernel thread hits that open-coded lock and then
>> userspace does-the-wrong-thing will deadlock the kernel.
>>
>> For instance consider a mdev driver. We assert
>> VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
>> and becomes blocked on the now locked lock. Userspace then tries to
>> close the device FD.
>>
>> FD closure will trigger device close and the VFIO core code
>> requirement is that mdev driver device teardown must halt all
>> concurrent threads touching vfio_device. Thus the mdev will try to
>> fence its workqeue and then deadlock - unable to flush/cancel a work
>> that is currently blocked on a lock held by userspace that will never
>> be unlocked.
>>
>> This is just the first scenario that comes to mind. The approach to
>> give userspace control of a lock that kernel threads can become
>> blocked on is so completely sketchy it is a complete no-go in my
>> opinion. If I had seen it when it was posted I would have hard NAK'd
>> it.
>>
>> My "full" solution in mind for iommufd is to pin all the memory upon
>> VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfy DMA requests
>> while the mm_struct is not available. But IMHO this is basically
>> useless for any actual user of mdevs.
>>
>> The other option is to just exclude mdevs and fail the
>> VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
>> becoming present while it is asserted. In this way we don't need to do
>> anything beyond a simple check as the iommu_domain is already fully
>> populated and pinned.
> 
> Do we have a solution to this?
> 
> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
> 
> Aside from the approach to use the FD, another idea is to just use
> fork.
> 
> qemu would do something like
> 
>  .. stop all container ioctl activity ..
>  fork()
>     ioctl(CHANGE_MM) // switch all maps to this mm
>     .. signal parent.. 
>     .. wait parent..
>     exit(0)
>  .. wait child ..
>  exec()
>  ioctl(CHANGE_MM) // switch all maps to this mm
>  ..signal child..
>  waitpid(childpid)
> 
> This way the kernel is never left without a page provider for the
> maps, the dummy mm_struct belonging to the fork will serve that role
> for the gap.
> 
> And the above is only required if we have mdevs, so we could imagine
> userspace optimizing it away for, eg vfio-pci only cases.
> 
> It is not as efficient as using a FD backing, but this is super easy
> to implement in the kernel.

I propose to avoid deadlock for mediated devices as follows.  Currently, an
mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
is asserted.

  * In vfio_wait, I will maintain a list of waiters, each list element
    consisting of (task, mdev, close_flag=false).

  * When the vfio device descriptor is closed, vfio_device_fops_release
    will notify the vfio_iommu driver, which will find the mdev on the waiters
    list, set elem->close_flag=true, and call wake_up_process for the task.

  * The task will wake in vfio_wait, see close_flag=true, and return EFAULT
    to the mdev caller.

This requires a little new plumbing.  I will work out the details, but if you
see a problem with the overall approach, please let me know.
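
A userspace model of that structure, just to make the shape concrete
(pthreads standing in for the kernel's wait/wake machinery; all names
are made up, this is not the actual vfio code):

#include <pthread.h>
#include <stdbool.h>

struct waiter {
	const void *mdev;	/* which device this blocked pin belongs to */
	bool close_flag;	/* set when that device's FD is released */
	struct waiter *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static struct waiter *waiters;
static bool vaddr_valid;	/* stands in for !iommu->vaddr_invalid_count */

/* Pin path: block until vaddrs are valid again, or until our device is
 * closed, in which case fail the pin instead of deadlocking. */
static int wait_for_valid_vaddr(const void *mdev)
{
	struct waiter w = { .mdev = mdev };
	struct waiter **pw;
	int ret = 0;

	pthread_mutex_lock(&lock);
	w.next = waiters;
	waiters = &w;
	while (!vaddr_valid && !w.close_flag)
		pthread_cond_wait(&wake, &lock);
	if (w.close_flag)
		ret = -1;	/* the kernel version would return -EFAULT */
	for (pw = &waiters; *pw != &w; pw = &(*pw)->next)
		;		/* unlink our element before returning */
	*pw = w.next;
	pthread_mutex_unlock(&lock);
	return ret;
}

/* Device release path: mark and wake every waiter for this mdev. */
static void on_device_release(const void *mdev)
{
	struct waiter *w;

	pthread_mutex_lock(&lock);
	for (w = waiters; w; w = w->next)
		if (w->mdev == mdev)
			w->close_flag = true;
	pthread_cond_broadcast(&wake);
	pthread_mutex_unlock(&lock);
}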

- Steve

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-10 20:54                 ` Steven Sistare
@ 2022-10-11 12:30                   ` Jason Gunthorpe
  2022-10-11 20:30                     ` Steven Sistare
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-10-11 12:30 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
> > Do we have a solution to this?
> > 
> > If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
> > 
> > Aside from the approach to use the FD, another idea is to just use
> > fork.
> > 
> > qemu would do something like
> > 
> >  .. stop all container ioctl activity ..
> >  fork()
> >     ioctl(CHANGE_MM) // switch all maps to this mm
> >     .. signal parent.. 
> >     .. wait parent..
> >     exit(0)
> >  .. wait child ..
> >  exec()
> >  ioctl(CHANGE_MM) // switch all maps to this mm
> >  ..signal child..
> >  waitpid(childpid)
> > 
> > This way the kernel is never left without a page provider for the
> > maps, the dummy mm_struct belonging to the fork will serve that role
> > for the gap.
> > 
> > And the above is only required if we have mdevs, so we could imagine
> > userspace optimizing it away for, eg vfio-pci only cases.
> > 
> > It is not as efficient as using a FD backing, but this is super easy
> > to implement in the kernel.
> 
> I propose to avoid deadlock for mediated devices as follows.  Currently, an
> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
> is asserted.
> 
>   * In vfio_wait, I will maintain a list of waiters, each list element
>     consisting of (task, mdev, close_flag=false).
> 
>   * When the vfio device descriptor is closed, vfio_device_fops_release
>     will notify the vfio_iommu driver, which will find the mdev on the waiters
>     list, set elem->close_flag=true, and call wake_up_process for the task.

This alone is not sufficient; the mdev driver can continue to
establish new mappings until its close_device function
returns. Killing only existing mappings is racy.

I think you are focusing on the one issue I pointed at; as I said, I'm
sure there are more ways than just close to abuse this functionality
to deadlock the kernel.

I continue to prefer we remove it completely and do something more
robust. I suggested two options.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-11 12:30                   ` Jason Gunthorpe
@ 2022-10-11 20:30                     ` Steven Sistare
  2022-10-12 12:32                       ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Steven Sistare @ 2022-10-11 20:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
> On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
>>> Do we have a solution to this?
>>>
>>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
>>>
>>> Aside from the approach to use the FD, another idea is to just use
>>> fork.
>>>
>>> qemu would do something like
>>>
>>>  .. stop all container ioctl activity ..
>>>  fork()
>>>     ioctl(CHANGE_MM) // switch all maps to this mm
>>>     .. signal parent.. 
>>>     .. wait parent..
>>>     exit(0)
>>>  .. wait child ..
>>>  exec()
>>>  ioctl(CHANGE_MM) // switch all maps to this mm
>>>  ..signal child..
>>>  waitpid(childpid)
>>>
>>> This way the kernel is never left without a page provider for the
>>> maps, the dummy mm_struct belonging to the fork will serve that role
>>> for the gap.
>>>
>>> And the above is only required if we have mdevs, so we could imagine
>>> userspace optimizing it away for, eg vfio-pci only cases.
>>>
>>> It is not as efficient as using a FD backing, but this is super easy
>>> to implement in the kernel.
>>
>> I propose to avoid deadlock for mediated devices as follows.  Currently, an
>> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
>> is asserted.
>>
>>   * In vfio_wait, I will maintain a list of waiters, each list element
>>     consisting of (task, mdev, close_flag=false).
>>
>>   * When the vfio device descriptor is closed, vfio_device_fops_release
>>     will notify the vfio_iommu driver, which will find the mdev on the waiters
>>     list, set elem->close_flag=true, and call wake_up_process for the task.
> 
> This alone is not sufficient: the mdev driver can continue to
> establish new mappings until its close_device function
> returns. Killing only existing mappings is racy.
> 
> I think you are focusing on the one issue I pointed at; as I said, I'm
> sure there are more ways than just close to abuse this functionality
> to deadlock the kernel.
> 
> I continue to prefer we remove it completely and do something more
> robust. I suggested two options.

It's not racy.  New pin requests also land in vfio_wait if any vaddrs have
been invalidated in any vfio_dma in the iommu.  See
  vfio_iommu_type1_pin_pages()
    if (iommu->vaddr_invalid_count)
      vfio_find_dma_valid()
        vfio_wait()

However, I will investigate saving a reference to the file object in the vfio_dma
(for mappings backed by a file) and using that to translate IOVAs.  I think that
will be easier to use than fork/CHANGE_MM/exec, and may even be easier to use
than VFIO_DMA_UNMAP_FLAG_VADDR.  To be continued.
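
For illustration only, a minimal sketch of that direction; the struct and
helper names here are hypothetical, not existing vfio code, and a real
implementation would still have to fault pages into the cache (e.g. for
memfd or hugetlbfs backed guest RAM) and handle accounting and teardown:

  #include <linux/fs.h>           /* struct file */
  #include <linux/mm.h>
  #include <linux/pagemap.h>      /* find_get_page() */

  /* hypothetical extra state hung off a vfio_dma */
  struct vfio_dma_file {
          struct file *file;      /* reference taken with get_file() at MAP_DMA */
          pgoff_t      pgoff;     /* file offset of the start of the mapping   */
  };

  /* Translate an offset inside the vfio_dma to a page via the file's page
   * cache instead of via the mapping task's mm. */
  static struct page *vfio_dma_file_get_page(struct vfio_dma_file *df,
                                             unsigned long dma_off)
  {
          pgoff_t index = df->pgoff + (dma_off >> PAGE_SHIFT);

          /* find_get_page() only returns pages already in the cache;
           * a complete version needs to fault missing pages in first. */
          return find_get_page(df->file->f_mapping, index);
  }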

- Steve

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-11 20:30                     ` Steven Sistare
@ 2022-10-12 12:32                       ` Jason Gunthorpe
  2022-10-12 13:50                         ` Steven Sistare
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-10-12 12:32 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
> On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
> > On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
> >>> Do we have a solution to this?
> >>>
> >>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
> >>>
> >>> Aside from the approach to use the FD, another idea is to just use
> >>> fork.
> >>>
> >>> qemu would do something like
> >>>
> >>>  .. stop all container ioctl activity ..
> >>>  fork()
> >>>     ioctl(CHANGE_MM) // switch all maps to this mm
> >>>     .. signal parent.. 
> >>>     .. wait parent..
> >>>     exit(0)
> >>>  .. wait child ..
> >>>  exec()
> >>>  ioctl(CHANGE_MM) // switch all maps to this mm
> >>>  ..signal child..
> >>>  waitpid(childpid)
> >>>
> >>> This way the kernel is never left without a page provider for the
> >>> maps, the dummy mm_struct belonging to the fork will serve that role
> >>> for the gap.
> >>>
> >>> And the above is only required if we have mdevs, so we could imagine
> >>> userspace optimizing it away for, eg vfio-pci only cases.
> >>>
> >>> It is not as efficient as using a FD backing, but this is super easy
> >>> to implement in the kernel.
> >>
> >> I propose to avoid deadlock for mediated devices as follows.  Currently, an
> >> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
> >> is asserted.
> >>
> >>   * In vfio_wait, I will maintain a list of waiters, each list element
> >>     consisting of (task, mdev, close_flag=false).
> >>
> >>   * When the vfio device descriptor is closed, vfio_device_fops_release
> >>     will notify the vfio_iommu driver, which will find the mdev on the waiters
> >>     list, set elem->close_flag=true, and call wake_up_process for the task.
> > 
> > This alone is not sufficient: the mdev driver can continue to
> > establish new mappings until its close_device function
> > returns. Killing only existing mappings is racy.
> > 
> > I think you are focusing on the one issue I pointed at; as I said, I'm
> > sure there are more ways than just close to abuse this functionality
> > to deadlock the kernel.
> > 
> > I continue to prefer we remove it completely and do something more
> > robust. I suggested two options.
> 
> It's not racy.  New pin requests also land in vfio_wait if any vaddrs have
> been invalidated in any vfio_dma in the iommu.  See
>   vfio_iommu_type1_pin_pages()
>     if (iommu->vaddr_invalid_count)
>       vfio_find_dma_valid()
>         vfio_wait()

I mean you can't do a one-shot wakeup of only existing waiters, and
you can't corrupt the container to wake up waiters for other devices,
so I don't see how this can be made to work safely...

It also doesn't solve any flow that doesn't trigger a file close, like a
process thread being stuck on the wait in the kernel, e.g. because a
trapped MMIO triggered an access or something.

So it doesn't seem like a workable direction to me.

> However, I will investigate saving a reference to the file object in
> the vfio_dma (for mappings backed by a file) and using that to
> translate IOVAs.

It is certainly the best flow, but it may be difficult. E.g. the memfd
work for KVM to do something similar is quite involved.

> I think that will be easier to use than fork/CHANGE_MM/exec, and may
> even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR.  To be
> continued.

Yes, certainly easier to use. I suggested CHANGE_MM because the kernel
implementation is very easy; I could probably send you something to
test with iommufd in a few hours of effort.
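
For illustration, here is a rough userspace sketch of the fork/CHANGE_MM/exec
flow quoted above. VFIO_CHANGE_MM is only a placeholder for the proposed
ioctl (it does not exist), and error handling is elided:

  #include <signal.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  #define VFIO_CHANGE_MM 0xbeef   /* placeholder: proposed ioctl, not real */

  static void live_update(int container_fd, char *const new_argv[])
  {
          sigset_t set;
          int sig;

          sigemptyset(&set);
          sigaddset(&set, SIGUSR1);
          sigprocmask(SIG_BLOCK, &set, NULL);

          /* .. stop all container ioctl activity before this point .. */

          if (fork() == 0) {
                  /* child: its mm temporarily backs the maps */
                  ioctl(container_fd, VFIO_CHANGE_MM);
                  kill(getppid(), SIGUSR1);       /* .. signal parent .. */
                  sigwait(&set, &sig);            /* .. wait parent ..   */
                  _exit(0);
          }

          sigwait(&set, &sig);                    /* .. wait child ..    */

          /* container_fd must survive exec (no O_CLOEXEC); the re-exec'ed
           * process re-runs ioctl(container_fd, VFIO_CHANGE_MM), signals
           * the child and reaps it with waitpid(). */
          execv(new_argv[0], new_argv);
  }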

Anyhow, I think this conversation has convinced me there is no way to
fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
it being a security bug, basically.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-12 12:32                       ` Jason Gunthorpe
@ 2022-10-12 13:50                         ` Steven Sistare
  2022-10-12 14:40                           ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Steven Sistare @ 2022-10-12 13:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On 10/12/2022 8:32 AM, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
>> On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
>>> On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
>>>>> Do we have a solution to this?
>>>>>
>>>>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
>>>>>
>>>>> Aside from the approach to use the FD, another idea is to just use
>>>>> fork.
>>>>>
>>>>> qemu would do something like
>>>>>
>>>>>  .. stop all container ioctl activity ..
>>>>>  fork()
>>>>>     ioctl(CHANGE_MM) // switch all maps to this mm
>>>>>     .. signal parent.. 
>>>>>     .. wait parent..
>>>>>     exit(0)
>>>>>  .. wait child ..
>>>>>  exec()
>>>>>  ioctl(CHANGE_MM) // switch all maps to this mm
>>>>>  ..signal child..
>>>>>  waitpid(childpid)
>>>>>
>>>>> This way the kernel is never left without a page provider for the
>>>>> maps, the dummy mm_struct belonging to the fork will serve that role
>>>>> for the gap.
>>>>>
>>>>> And the above is only required if we have mdevs, so we could imagine
>>>>> userspace optimizing it away for, eg vfio-pci only cases.
>>>>>
>>>>> It is not as efficient as using a FD backing, but this is super easy
>>>>> to implement in the kernel.
>>>>
>>>> I propose to avoid deadlock for mediated devices as follows.  Currently, an
>>>> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
>>>> is asserted.
>>>>
>>>>   * In vfio_wait, I will maintain a list of waiters, each list element
>>>>     consisting of (task, mdev, close_flag=false).
>>>>
>>>>   * When the vfio device descriptor is closed, vfio_device_fops_release
>>>>     will notify the vfio_iommu driver, which will find the mdev on the waiters
>>>>     list, set elem->close_flag=true, and call wake_up_process for the task.
>>>
>>> This alone is not sufficient: the mdev driver can continue to
>>> establish new mappings until its close_device function
>>> returns. Killing only existing mappings is racy.
>>>
>>> I think you are focusing on the one issue I pointed at; as I said, I'm
>>> sure there are more ways than just close to abuse this functionality
>>> to deadlock the kernel.
>>>
>>> I continue to prefer we remove it completely and do something more
>>> robust. I suggested two options.
>>
>> It's not racy.  New pin requests also land in vfio_wait if any vaddrs have
>> been invalidated in any vfio_dma in the iommu.  See
>>   vfio_iommu_type1_pin_pages()
>>     if (iommu->vaddr_invalid_count)
>>       vfio_find_dma_valid()
>>         vfio_wait()
> 
> I mean you can't do a one shot wakeup of only existing waiters, and
> you can't corrupt the container to wake up waiters for other devices,
> so I don't see how this can be made to work safely...
> 
> It also doesn't solve any flow that doesn't trigger file close, like a
> process thread being stuck on the wait in the kernel. eg because a
> trapped mmio triggered an access or something.
> 
> So it doesn't seem like a workable direction to me.
> 
>> However, I will investigate saving a reference to the file object in
>> the vfio_dma (for mappings backed by a file) and using that to
>> translate IOVAs.
> 
> It is certainly the best flow, but it may be difficult. E.g. the memfd
> work for KVM to do something similar is quite involved.
> 
>> I think that will be easier to use than fork/CHANGE_MM/exec, and may
>> even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR.  To be
>> continued.
> 
> Yes, certainly easier to use. I suggested CHANGE_MM because the kernel
> implementation is very easy; I could probably send you something to
> test with iommufd in a few hours of effort.
> 
> Anyhow, I think this conversation has convinced me there is no way to
> fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
> it being a security bug, basically.

Please do not.  Please give me the courtesy of time to develop a replacement 
before we delete it. Surely you can make progress on other open areas of iommufd
without needing to delete this immediately.

- Steve

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-12 13:50                         ` Steven Sistare
@ 2022-10-12 14:40                           ` Jason Gunthorpe
  2022-10-12 14:55                             ` Steven Sistare
  0 siblings, 1 reply; 78+ messages in thread
From: Jason Gunthorpe @ 2022-10-12 14:40 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Wed, Oct 12, 2022 at 09:50:53AM -0400, Steven Sistare wrote:

> > Anyhow, I think this conversation has convinced me there is no way to
> > fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
> > it being a security bug, basically.
> 
> Please do not.  Please give me the courtesy of time to develop a replacement 
> before we delete it. Surely you can make progress on other open areas of iommufd
> without needing to delete this immediately.

I'm not worried about iommufd; I'm worried about shipping kernels with
a significant security problem baked into them.

As we cannot salvage this interface, it should be quickly deleted so
that it doesn't cause any incidents.

It will not affect your ability to create a replacement.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-12 14:40                           ` Jason Gunthorpe
@ 2022-10-12 14:55                             ` Steven Sistare
  2022-10-12 14:59                               ` Jason Gunthorpe
  0 siblings, 1 reply; 78+ messages in thread
From: Steven Sistare @ 2022-10-12 14:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On 10/12/2022 10:40 AM, Jason Gunthorpe wrote:
> On Wed, Oct 12, 2022 at 09:50:53AM -0400, Steven Sistare wrote:
> 
>>> Anyhow, I think this conversation has convinced me there is no way to
>>> fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
>>> it being a security bug, basically.
>>
>> Please do not.  Please give me the courtesy of time to develop a replacement 
>> before we delete it. Surely you can make progress on other open areas of iommufd
>> without needing to delete this immediately.
> 
> I'm not worried about iommufd; I'm worried about shipping kernels with
> a significant security problem baked into them.
> 
> As we cannot salvage this interface, it should be quickly deleted so
> that it doesn't cause any incidents.
> 
> It will not affect your ability to create a replacement.

I am not convinced we cannot salvage the interface, and indeed I might want to
reuse parts of it. You are overstating the risk of a feature that has already
been in millions of kernels for years. Deleting it all before having a
replacement hurts people like myself who are continuing to develop and test
live update in qemu on the latest kernels.

- Steve

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-10-12 14:55                             ` Steven Sistare
@ 2022-10-12 14:59                               ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-10-12 14:59 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, libvir-list, Daniel P. Berrangé,
	Laine Stump

On Wed, Oct 12, 2022 at 10:55:57AM -0400, Steven Sistare wrote:
> On 10/12/2022 10:40 AM, Jason Gunthorpe wrote:
> > On Wed, Oct 12, 2022 at 09:50:53AM -0400, Steven Sistare wrote:
> > 
> >>> Anyhow, I think this conversation has convinced me there is no way to
> >>> fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
> >>> it being a security bug, basically.
> >>
> >> Please do not.  Please give me the courtesy of time to develop a replacement 
> >> before we delete it. Surely you can make progress on other open areas of iommufd
> >> without needing to delete this immediately.
> > 
> > I'm not worried about iommufd; I'm worried about shipping kernels with
> > a significant security problem baked into them.
> > 
> > As we cannot salvage this interface, it should be quickly deleted so
> > that it doesn't cause any incidents.
> > 
> > It will not affect your ability to create a replacement.
> 
> I am not convinced we cannot salvage the interface, and indeed I might want to
> reuse parts of it. You are overstating the risk of a feature that has already
> been in millions of kernels for years. Deleting it all before having a
> replacement hurts people like myself who are continuing to develop and test
> live update in qemu on the latest kernels.

I think this is a mistake, as I'm convinced it cannot be salvaged, but
we could instead guard it with a config symbol and CONFIG_EXPERIMENTAL.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface
  2022-09-23 15:40                                 ` Laine Stump
@ 2022-10-21 19:56                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 78+ messages in thread
From: Jason Gunthorpe @ 2022-10-21 19:56 UTC (permalink / raw)
  To: Laine Stump
  Cc: Daniel P. Berrangé,
	Alex Williamson, Eric Auger, Tian, Kevin, Rodel, Jorg, Lu Baolu,
	Chaitanya Kulkarni, Cornelia Huck, Daniel Jordan, David Gibson,
	Eric Farman, iommu, Jason Wang, Jean-Philippe Brucker, Martins,
	Joao, kvm, Matthew Rosato, Michael S. Tsirkin, Nicolin Chen,
	Niklas Schnelle, Shameerali Kolothum Thodi, Liu, Yi L,
	Keqian Zhu, Steve Sistare, libvir-list, Alistair Popple

On Fri, Sep 23, 2022 at 11:40:51AM -0400, Laine Stump wrote:
> It's been a few years, but my recollection is that before starting a
> libvirtd that will run a guest with a vfio device, a privileged process
> needs to
> 
> 1) increase the locked memory limit for the user that will be running qemu
> (eg. by adding a file with the increased limit to /etc/security/limits.d)
> 
> 2) bind the device to the vfio-pci driver, and
> 
> 3) chown /dev/vfio/$iommu_group to the user running qemu.

Here is what is going on to resolve this:

1) iommufd internally supports two ways to account ulimits, the vfio
   way and the io_uring way. Each FD operates in its own mode.
 
   When /dev/iommu is opened the FD defaults to the io_uring way; when
   /dev/vfio/vfio is opened it uses the VFIO way. This means
   /dev/vfio/vfio is not a symlink; there is now a new kconfig option
   to make iommufd directly provide a miscdev.

2) There is an ioctl IOMMU_OPTION_RLIMIT_MODE which allows a
   privileged user to query/set which mode the FD will run in.

   The idea is that libvirt will open iommufd, set the vfio compat mode
   as its first action, and then FD-pass the fd to qemu so that qemu
   operates in the correct sandbox (a rough sketch of this flow follows
   after this list).

3) We are working on a cgroup for FOLL_LONGTERM; it is a big job, but
   it should prove to be a comprehensive resolution to this problem
   across the kernel and improve the qemu sandbox security.

   Still TBD, but most likely when the cgroup supports this, libvirt
   would set the rlimit to unlimited, then set new mlock and
   FOLL_LONGTERM cgroup limits to create the sandbox.
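
For illustration, a rough sketch of the libvirt-side flow from 1) and 2); the
struct and constant names below are assumptions based on this description and
may differ from the final uAPI headers:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/iommufd.h>      /* assumed location of the iommufd uAPI */

  /* Open /dev/iommu, switch the FD to VFIO-style rlimit accounting while
   * still privileged, then hand the FD to qemu (e.g. over SCM_RIGHTS). */
  static int open_iommufd_for_qemu(void)
  {
          struct iommu_option opt = {
                  .size = sizeof(opt),
                  .option_id = IOMMU_OPTION_RLIMIT_MODE,
                  .op = IOMMU_OPTION_OP_SET,
                  .val64 = 1,     /* assumed: 1 == VFIO-compatible accounting */
          };
          int fd = open("/dev/iommu", O_RDWR | O_CLOEXEC);

          if (fd < 0)
                  return -1;
          if (ioctl(fd, IOMMU_OPTION, &opt)) {
                  close(fd);
                  return -1;
          }
          return fd;              /* FD-pass this to the unprivileged qemu */
  }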

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2022-10-21 19:56 UTC | newest]

Thread overview: 78+ messages
-- links below jump to the message on this page --
2022-09-02 19:59 [PATCH RFC v2 00/13] IOMMUFD Generic interface Jason Gunthorpe
2022-09-02 19:59 ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 01/13] interval-tree: Add a utility to iterate over spans in an interval tree Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 02/13] iommufd: Overview documentation Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-07  1:39   ` David Gibson
2022-09-09 18:52     ` Jason Gunthorpe
2022-09-12 10:40       ` David Gibson
2022-09-27 17:33         ` Jason Gunthorpe
2022-09-29  3:47           ` David Gibson
2022-09-02 19:59 ` [PATCH RFC v2 03/13] iommufd: File descriptor, context, kconfig and makefiles Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-04  8:19   ` Baolu Lu
2022-09-09 18:46     ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 04/13] kernel/user: Allow user::locked_vm to be usable for iommufd Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 05/13] iommufd: PFN handling for iopt_pages Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 06/13] iommufd: Algorithms for PFN storage Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 07/13] iommufd: Data structure to provide IOVA to PFN mapping Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 08/13] iommufd: IOCTLs for the io_pagetable Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 09/13] iommufd: Add a HW pagetable object Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 10/13] iommufd: Add kAPI toward external drivers for physical devices Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 11/13] iommufd: Add kAPI toward external drivers for kernel access Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 12/13] iommufd: vfio container FD ioctl compatibility Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-02 19:59 ` [PATCH RFC v2 13/13] iommufd: Add a selftest Jason Gunthorpe
2022-09-02 19:59   ` Jason Gunthorpe
2022-09-13  1:55 ` [PATCH RFC v2 00/13] IOMMUFD Generic interface Tian, Kevin
2022-09-13  7:28   ` Eric Auger
2022-09-20 19:56     ` Jason Gunthorpe
2022-09-21  3:48       ` Tian, Kevin
2022-09-21 18:06       ` Alex Williamson
2022-09-21 18:44         ` Jason Gunthorpe
2022-09-21 19:30           ` Steven Sistare
2022-09-21 23:09             ` Jason Gunthorpe
2022-10-06 16:01               ` Jason Gunthorpe
2022-10-06 22:57                 ` Steven Sistare
2022-10-10 20:54                 ` Steven Sistare
2022-10-11 12:30                   ` Jason Gunthorpe
2022-10-11 20:30                     ` Steven Sistare
2022-10-12 12:32                       ` Jason Gunthorpe
2022-10-12 13:50                         ` Steven Sistare
2022-10-12 14:40                           ` Jason Gunthorpe
2022-10-12 14:55                             ` Steven Sistare
2022-10-12 14:59                               ` Jason Gunthorpe
2022-09-21 23:20           ` Jason Gunthorpe
2022-09-22 11:20           ` Daniel P. Berrangé
2022-09-22 14:08             ` Jason Gunthorpe
2022-09-22 14:49               ` Daniel P. Berrangé
2022-09-22 14:51                 ` Jason Gunthorpe
2022-09-22 15:00                   ` Daniel P. Berrangé
2022-09-22 15:31                     ` Jason Gunthorpe
2022-09-23  8:54                       ` Daniel P. Berrangé
2022-09-23 13:29                         ` Jason Gunthorpe
2022-09-23 13:35                           ` Daniel P. Berrangé
2022-09-23 13:46                             ` Jason Gunthorpe
2022-09-23 14:00                               ` Daniel P. Berrangé
2022-09-23 15:40                                 ` Laine Stump
2022-10-21 19:56                                   ` Jason Gunthorpe
2022-09-23 14:03                           ` Alex Williamson
2022-09-26  6:34                             ` David Gibson
2022-09-21 22:36         ` Laine Stump
2022-09-22 11:06         ` Daniel P. Berrangé
2022-09-22 14:13           ` Jason Gunthorpe
2022-09-22 14:46             ` Daniel P. Berrangé
2022-09-13  2:05 ` Tian, Kevin
2022-09-20 20:07   ` Jason Gunthorpe
2022-09-21  3:40     ` Tian, Kevin
2022-09-21 16:19       ` Jason Gunthorpe
2022-09-26 13:48     ` Rodel, Jorg
