* [PATCH v4 00/18] IOMMUFD Dirty Tracking
@ 2023-10-18 20:26 Joao Martins
  2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
                   ` (17 more replies)
  0 siblings, 18 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:26 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Presented herewith is a series that extends IOMMUFD with IOMMU hardware
support for the dirty bit in the IOPTEs.

Today, AMD Milan (or more recent) supports it, as do ARM SMMUv3.2 and
Intel VT-d rev3.x.  One intended use-case (but not restricted to it!) is
Live Migration with SR-IOV, especially useful for live-migratable PCI
devices that can't supply their own dirty tracking hardware blocks.

At a quick glance, IOMMUFD lets userspace create an IOAS with a set of
IOVA ranges mapped to some physical memory, composing an IO pagetable.
A hw_pagetable is then created via HWPT_ALLOC and attached to a
particular device, consequently creating the IOMMU domain and sharing a
common IO page table representing the endpoint DMA-addressable guest
address space.  Since v2 of the series, IOMMUFD dirty tracking requires
the HWPT_ALLOC model only, as opposed to the simpler autodomains model.

The result is a hw_pagetable which represents the iommu_domain that
will be directly manipulated.  The IOMMUFD UAPI and the iommu/iommufd
kAPI are then extended to provide:

1) Enforcement that only devices with dirty tracking support are attached
to an IOMMU domain, to cover the case where support isn't homogeneous
across the platform.  Initially this is aimed at the possibly heterogeneous
nature of ARM, while x86 gets future-proofed, should any such occasion
occur.

The device dirty tracking enforcement on attach_dev is based on whether
the dirty_ops are set or not.  Given that attach always checks for
dirty_ops and IOMMU_CAP_DIRTY, I was tempted to move this to an upper
layer, but semantically the IOMMU driver should do the checking.
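
As a rough illustration of the driver-side check described above (a sketch
only; the mydrv_*() names are hypothetical and just show the shape of it):

static int mydrv_attach_dev(struct iommu_domain *domain, struct device *dev)
{
	/*
	 * Domain was allocated with dirty tracking enforced: refuse any
	 * device whose IOMMU instance cannot track dirty bits.
	 */
	if (domain->dirty_ops && !mydrv_dev_supports_dirty(dev))
		return -EINVAL;

	/* ... the usual attach work ... */
	return 0;
}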

2) Toggling of dirty tracking on the iommu_domain.  We model it as the most
common case of changing hardware translation control structures dynamically
(x86), while making it easy to have an always-enabled mode.  In RFCv1, the
ARM-specific case was suggested to be always enabled instead of having to
enable the per-PTE DBM control bit (what I previously called
"range tracking").  There, setting/clearing tracking means just clearing
the dirty bits at start.  The 'real' state of whether dirty tracking is
enabled is stored in the IOMMU driver, hence no new fields are added to
iommufd pagetable structures, except for the iommu_domain dirty ops part
via adding a dirty_ops field to iommu_domain.  IOMMUFD also uses that to
know whether dirty tracking is supported and toggleable, without having
IOMMU drivers replicate said checks.

3) Capability probing for dirty tracking, leveraging the per-device
iommu_capable() and adding an IOMMU_CAP_DIRTY.  It extends the
GET_HW_INFO ioctl, which takes a device ID, to *additionally* return some
generic capabilities.  Possible values are enumerated by `enum
iommufd_hw_capabilities`.

4) Reading the I/O PTEs and marshalling their dirtiness into a bitmap.  The
bitmap indexes, on a page_size basis, the IOVAs that got written by the
device.  While performing the marshalling, drivers also need to clear the
dirty bits from the IOPTEs and allow the kAPI caller to batch the much
needed IOTLB flush.  There's no copy of bitmaps to userspace-backed memory;
everything is zerocopy based, so as not to add more cost to the iommu
driver IOPT walker.  This shares functionality with VFIO device dirty
tracking via the IOVA bitmap APIs.  So far this is a test-and-clear kind of
interface, given that the IOPT walk is going to be expensive.  In addition,
this also adds the ability to read dirty bit info without clearing it in
the PTE.  This is meant to cover the unmap-and-read-dirty use-case and
avoid a second IOTLB flush.
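
For reference, checking whether a given IOVA was reported dirty in the user
bitmap boils down to the sketch below (assuming data is the __u64 bitmap
array and bitmap_iova is the base IOVA the bitmap was requested for):

	unsigned long bit = (iova - bitmap_iova) / page_size;
	bool dirty = data[bit / 64] & (1ULL << (bit % 64));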

The only dependency is:
* Have domain_alloc_user() API with flags [2] already queued (iommufd/for-next).

The series is organized as follows:

* Patches 1-4: Take care of the iommu domain operations to be added.
The idea is to abstract iommu drivers away from any notion of how bitmaps
are stored or propagated back to the caller, as well as to allow
control/batching over the IOTLB flush.  So there's a data structure and a
helper that only tell the upper layer that an IOVA range got dirty.
This logic is shared with VFIO and is meant to walk the bitmap user
memory, kmap-ing it and setting bits as needed.  The IOMMU driver just
has a notion of 'dirty bitmap state' and records an IOVA as dirty.

* Patches 5-9, 13-18: Add the UAPIs for IOMMUFD, and selftests.  The
selftests cover some corner cases in boundary handling of the bitmap and
exercise various bitmap sizes.  I haven't included huge IOVA ranges, to
avoid risking the selftests failing to execute due to OOM issues from
mmap-ing big buffers.

* Patches 10-11: AMD IOMMU implementation, particularly on those having
HDSup support.  Tested with a QEMU amd-iommu with HDSup emulated[0], and
tested with live migration of VFs (using IOMMU dirty tracking).

* Patch 12: Intel IOMMU rev3.x+ implementation.  Tested with a QEMU-based
intel-iommu vIOMMU with SSADS emulation support[0].

On AMD/Intel I have tested this with emulation, and then with live
migration on AMD hardware.

The QEMU IOMMU emulation bits are there to increase coverage of this code
and hopefully make it more broadly available to fellow contributors/devs
(an older version is at [1]); it uses Yi's 2 commits to have hw_info()
supported (still needs a bit of cleanup) on top of a recent Zhenzhong
series of IOMMUFD QEMU bringup work: see here[0].  It includes IOMMUFD
dirty tracking for live migration, and live migration has been tested with
it.  I won't be following up with a v2 of the QEMU patches until IOMMUFD
dirty tracking lands.

Feedback or any comments are very much appreciated.

Thanks!
        Joao

[0] https://github.com/jpemartins/qemu/commits/iommufd-v3
[1] https://lore.kernel.org/qemu-devel/20220428211351.3897-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-iommu/20230919092523.39286-1-yi.l.liu@intel.com/
[3] https://github.com/jpemartins/linux/commits/iommufd-v3
[4] https://lore.kernel.org/linux-iommu/20230518204650.14541-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/kvm/20220428210933.3583-1-joao.m.martins@oracle.com/
[6] https://github.com/jpemartins/linux/commits/smmu-iommufd-v3
[7] https://lore.kernel.org/linux-iommu/20230923012511.10379-1-joao.m.martins@oracle.com/

Changes since v3[7]:
* Consolidate previous patches 9 and 10 into a single patch,
while removing the iommufd_dirty_bitmap structure to instead use
the UAPI-defined structure iommu_hwpt_get_dirty_iova and
pass it around internally in iommufd.
* Iterate over areas from within the IOVA bitmap iterator
* Drop check for specific flags in hw_pagetable
* Drop assignment that calculates iopt_alignment, to instead
use iopt_alignment directly
* Move IOVA bitmap into iommufd and introduce IOMMUFD_DRIVER
bool kconfig which designates the usage of dirty tracking related
bitmaps (i.e. VFIO and IOMMU drivers right now).
* Update IOVA bitmap header files to account for IOMMUFD_DRIVER
being disabled
* Sort new IOMMUFD ioctls accordingly
* Move IOVA bitmap symbols to IOMMUFD namespace and update its
users by importing new namespace (new patch 3)
* Remove AMD IOMMU kernel log feature printing from series
* Remove the struct amd_iommu argument from do_iommu_domain_alloc()
and derive it from within do_iommu_domain_alloc().
* Consolidate pte_test_dirty() and pte_test_and_clear_dirty()
by passing flags and deciding whether to test or test_and_clear.
* Add a comment on top of the -EINVAL attach_device() failure when
dirty tracking is enforced but the IOMMU does not support dirty tracking
* Add Reviewed-by given by Suravee
* Select IOMMUFD_DRIVER if IOMMUFD is enabled on supported IOMMU drivers
* Remove spurious rcu_read_{,un}lock() from Intel/AMD iommus
* Fix unwinding in dirty tracking set failure case in intel-iommu
* Avoid unnecessary locking when checking dmar_domain::dirty_tracking
* Rename intel_iommu_slads_supported() to a slads_supported() macro
following intel-iommu style
* Change the XOR check into a == in set_dirty_tracking iommu op
* Consolidate PTE test or test-and-clear into single helper
* Improve intel_pasid_setup_dirty_tracking() to: use rate limited printk;
avoid unnecessary update if state is already the desired one;
do a clflush on non-coherent devices; remove the qi_flush_piotlb(); and
fail on unsupported pgtts.
* Remove the first_level support and always fail domain_alloc_user in those
cases, as there is no use case for it ATM.  Doing so lets us remove some
code handling it in set_dirty_tracking
* Error out the pasid device attach when dirty tracking is enforced
as we don't support that either.
* Reorganize the series to have selftests at the end, and core/driver
enablement first.

Changes since RFCv2[4]:
* Testing has always occurred on the new code, but now it has seen
Live Migration coverage with extra QEMU work on AMD hardware.
* General commit message improvements
* Remove spurious headers in selftests
* Exported some symbols to actually allow things to build when IOMMUFD
is built as a module. (Alex Williamson)
* Switch the enforcing to be done on IOMMU domain allocation via
domain_alloc_user (Jason, Robin, Lu Baolu)
* Removed RCU series from Lu Baolu (Jason)
* Switch set_dirty/read_dirty/clear_dirty to down_read() (Jason)
* Make sure it checks for area::pages (Jason)
* Move the clearing of dirty bits before set-dirty into a helper (Jason)
* Avoid breaking IOMMUFD selftests UAPI (Jason)
* General improvements to testing
* Add coverage to new out_capabilities support in HW_INFO.
* Address Shameer/Robin comments in smmu-v3 (code is on a branch[6])
  - Properly check for FEAT_HD together with COHERENCY
  - Remove the pgsize_bitmap check
  - Limit the quirk set to s1 pgtbl_cfg.
  - Fix commit message on dubious sentence on DBM usecase

Changes since RFCv1[5]:
Too many changes but the major items were:
* Majority of the changes were from Jason/Kevin/Baolu/Suravee:
- Improve structure and rework most commit messages
- Drop all of the VFIO-compat series
- Drop the unmap-get-dirty API
- Tie this to HWPT only, no more autodomains handling;
- Rework the UAPI widely by:
  - Having an IOMMU_DEVICE_GET_CAPS which allows fetching capabilities
    of devices, specifically to test dirty tracking support for an
    individual device
  - Add an enforce-dirty flag to the IOMMU domain via HWPT_ALLOC
  - SET_DIRTY now clears dirty tracking before asking iommu driver to do so;
  - New GET_DIRTY_IOVA flag that does not clear dirty bits
  - Add coverage for all added APIs
  - Expand GET_DIRTY_IOVA tests to cover the IOVA bitmap corner-case tests
  that I had separately; I only excluded the terabyte IOVA range
  use cases (which test bitmaps of 2M+) because those will most likely fail
  to run as selftests (not sure yet how I can include those). I am
  not exactly sure how to cover those, unless I do 'fake IOVA maps'
  *somehow* which do not necessarily require real buffers.
- Handle most comments in intel-iommu. Only remaining one for v3 is the
  PTE walker which will be done better.
- Handle all comments in amd-iommu, most of which regarding locking.
  Only one remaining is v3 same as Intel;
- Reflect the UAPI changes in the iommu driver implementations, including
persisting dirty tracking enablement across new attach_dev calls, as well
as ensuring attach_dev enforces the requested domain flags;
* Comments from Yi Sun on making sure that dirty tracking isn't
restricted to SS only, so relax the check for FL support because it's
always enabled. (Yi Sun)
* Most of the code that was in v1 for dirty bitmaps got rewritten and
repurposed to also cover the VFIO case; so reuse that infra here too for
both. (Jason)
* Take Robin's suggestion of always enabling dirty tracking and set_dirty
just clearing bits on 'activation', and make that a generic property to
ensure we always get accurate results between starting and stopping
tracking. (Robin Murphy)
* Address all comments from SMMUv3 into how we enable/test the DBM, or the
bits in the context descriptor with io-pgtable::quirks, etc
(Robin, Shameerali)

Joao Martins (18):
  vfio/iova_bitmap: Export more API symbols
  vfio: Move iova_bitmap into iommufd
  iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
  iommu: Add iommu_domain ops for dirty tracking
  iommufd: Add a flag to enforce dirty tracking on attach
  iommufd: Add IOMMU_HWPT_SET_DIRTY
  iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  iommufd: Add capabilities to IOMMU_GET_HW_INFO
  iommufd: Add a flag to skip clearing of IOPTE dirty
  iommu/amd: Add domain_alloc_user based domain allocation
  iommu/amd: Access/Dirty bit support in IOPTEs
  iommu/intel: Access/Dirty bit support for SL domains
  iommufd/selftest: Expand mock_domain with dev_flags
  iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY
  iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY
  iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_IOVA
  iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO
  iommufd/selftest: Test IOMMU_GET_DIRTY_IOVA_NO_CLEAR flag

 drivers/iommu/amd/Kconfig                     |   1 +
 drivers/iommu/amd/amd_iommu_types.h           |  12 +
 drivers/iommu/amd/io_pgtable.c                |  69 ++++++
 drivers/iommu/amd/iommu.c                     | 143 +++++++++++-
 drivers/iommu/intel/Kconfig                   |   1 +
 drivers/iommu/intel/iommu.c                   | 104 ++++++++-
 drivers/iommu/intel/iommu.h                   |  17 ++
 drivers/iommu/intel/pasid.c                   | 109 +++++++++
 drivers/iommu/intel/pasid.h                   |   4 +
 drivers/iommu/iommufd/Kconfig                 |   4 +
 drivers/iommu/iommufd/Makefile                |   1 +
 drivers/iommu/iommufd/device.c                |   4 +
 drivers/iommu/iommufd/hw_pagetable.c          |  80 ++++++-
 drivers/iommu/iommufd/io_pagetable.c          | 151 ++++++++++++
 drivers/iommu/iommufd/iommufd_private.h       |  22 ++
 drivers/iommu/iommufd/iommufd_test.h          |  21 ++
 drivers/{vfio => iommu/iommufd}/iova_bitmap.c |   5 +-
 drivers/iommu/iommufd/main.c                  |   7 +
 drivers/iommu/iommufd/selftest.c              | 171 +++++++++++++-
 drivers/vfio/Makefile                         |   3 +-
 drivers/vfio/pci/mlx5/Kconfig                 |   1 +
 drivers/vfio/pci/mlx5/main.c                  |   1 +
 drivers/vfio/pci/pds/Kconfig                  |   1 +
 drivers/vfio/pci/pds/pci_drv.c                |   1 +
 drivers/vfio/vfio_main.c                      |   1 +
 include/linux/io-pgtable.h                    |   4 +
 include/linux/iommu.h                         |  56 +++++
 include/linux/iova_bitmap.h                   |  26 +++
 include/uapi/linux/iommufd.h                  |  79 +++++++
 tools/testing/selftests/iommu/iommufd.c       | 216 ++++++++++++++++++
 .../selftests/iommu/iommufd_fail_nth.c        |   2 +-
 tools/testing/selftests/iommu/iommufd_utils.h | 184 ++++++++++++++-
 32 files changed, 1482 insertions(+), 19 deletions(-)
 rename drivers/{vfio => iommu/iommufd}/iova_bitmap.c (98%)

-- 
2.17.2


^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
@ 2023-10-18 20:26 ` Joao Martins
  2023-10-18 22:14   ` Jason Gunthorpe
                     ` (2 more replies)
  2023-10-18 20:26 ` [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd Joao Martins
                   ` (16 subsequent siblings)
  17 siblings, 3 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:26 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

In preparation for moving iova_bitmap into iommufd, export the rest of the
API symbols that could be used by modules, namely:

	iova_bitmap_alloc
	iova_bitmap_free
	iova_bitmap_for_each

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/vfio/iova_bitmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
index 0848f920efb7..f54b56388e00 100644
--- a/drivers/vfio/iova_bitmap.c
+++ b/drivers/vfio/iova_bitmap.c
@@ -268,6 +268,7 @@ struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
 	iova_bitmap_free(bitmap);
 	return ERR_PTR(rc);
 }
+EXPORT_SYMBOL_GPL(iova_bitmap_alloc);
 
 /**
  * iova_bitmap_free() - Frees an IOVA bitmap object
@@ -289,6 +290,7 @@ void iova_bitmap_free(struct iova_bitmap *bitmap)
 
 	kfree(bitmap);
 }
+EXPORT_SYMBOL_GPL(iova_bitmap_free);
 
 /*
  * Returns the remaining bitmap indexes from mapped_total_index to process for
@@ -387,6 +389,7 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(iova_bitmap_for_each);
 
 /**
  * iova_bitmap_set() - Records an IOVA range in bitmap
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
  2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
@ 2023-10-18 20:26 ` Joao Martins
  2023-10-18 22:14   ` Jason Gunthorpe
                     ` (2 more replies)
  2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
                   ` (15 subsequent siblings)
  17 siblings, 3 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:26 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins, Brett Creeley, Yishai Hadas

Both VFIO and IOMMUFD will need the iova bitmap for storing dirties and
walking the user bitmaps, so move this common dependency into IOMMUFD.  In
doing so, create the symbol IOMMUFD_DRIVER, which designates the builtin
code that will be used by drivers when selected.  Today this means
MLX5_VFIO_PCI and PDS_VFIO_PCI.  IOMMU drivers will do the same (in future
patches) when supporting dirty tracking, and select IOMMUFD_DRIVER
accordingly.

Given that the symbol may be disabled, add header definitions in
iova_bitmap.h for when IOMMUFD_DRIVER=n.
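
For an IOMMU driver gaining dirty tracking support, that later amounts to a
Kconfig hunk along these lines (a sketch for illustration only):

config INTEL_IOMMU
	...
	select IOMMUFD_DRIVER if IOMMUFD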

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/Kconfig                 |  4 +++
 drivers/iommu/iommufd/Makefile                |  1 +
 drivers/{vfio => iommu/iommufd}/iova_bitmap.c |  0
 drivers/vfio/Makefile                         |  3 +--
 drivers/vfio/pci/mlx5/Kconfig                 |  1 +
 drivers/vfio/pci/pds/Kconfig                  |  1 +
 include/linux/iova_bitmap.h                   | 26 +++++++++++++++++++
 7 files changed, 34 insertions(+), 2 deletions(-)
 rename drivers/{vfio => iommu/iommufd}/iova_bitmap.c (100%)

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index 99d4b075df49..1fa543204e89 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -11,6 +11,10 @@ config IOMMUFD
 
 	  If you don't know what to do here, say N.
 
+config IOMMUFD_DRIVER
+	bool
+	default n
+
 if IOMMUFD
 config IOMMUFD_VFIO_CONTAINER
 	bool "IOMMUFD provides the VFIO container /dev/vfio/vfio"
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 8aeba81800c5..34b446146961 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -11,3 +11,4 @@ iommufd-y := \
 iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
 
 obj-$(CONFIG_IOMMUFD) += iommufd.o
+obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
diff --git a/drivers/vfio/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
similarity index 100%
rename from drivers/vfio/iova_bitmap.c
rename to drivers/iommu/iommufd/iova_bitmap.c
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index c82ea032d352..68c05705200f 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,8 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VFIO) += vfio.o
 
-vfio-y += vfio_main.o \
-	  iova_bitmap.o
+vfio-y += vfio_main.o
 vfio-$(CONFIG_VFIO_DEVICE_CDEV) += device_cdev.o
 vfio-$(CONFIG_VFIO_GROUP) += group.o
 vfio-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
index 7088edc4fb28..c3ced56b7787 100644
--- a/drivers/vfio/pci/mlx5/Kconfig
+++ b/drivers/vfio/pci/mlx5/Kconfig
@@ -3,6 +3,7 @@ config MLX5_VFIO_PCI
 	tristate "VFIO support for MLX5 PCI devices"
 	depends on MLX5_CORE
 	select VFIO_PCI_CORE
+	select IOMMUFD_DRIVER
 	help
 	  This provides migration support for MLX5 devices using the VFIO
 	  framework.
diff --git a/drivers/vfio/pci/pds/Kconfig b/drivers/vfio/pci/pds/Kconfig
index 407b3fd32733..fff368a8183b 100644
--- a/drivers/vfio/pci/pds/Kconfig
+++ b/drivers/vfio/pci/pds/Kconfig
@@ -5,6 +5,7 @@ config PDS_VFIO_PCI
 	tristate "VFIO support for PDS PCI devices"
 	depends on PDS_CORE
 	select VFIO_PCI_CORE
+	select IOMMUFD_DRIVER
 	help
 	  This provides generic PCI support for PDS devices using the VFIO
 	  framework.
diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
index c006cf0a25f3..1c338f5e5b7a 100644
--- a/include/linux/iova_bitmap.h
+++ b/include/linux/iova_bitmap.h
@@ -7,6 +7,7 @@
 #define _IOVA_BITMAP_H_
 
 #include <linux/types.h>
+#include <linux/errno.h>
 
 struct iova_bitmap;
 
@@ -14,6 +15,7 @@ typedef int (*iova_bitmap_fn_t)(struct iova_bitmap *bitmap,
 				unsigned long iova, size_t length,
 				void *opaque);
 
+#if IS_ENABLED(CONFIG_IOMMUFD_DRIVER)
 struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
 				      unsigned long page_size,
 				      u64 __user *data);
@@ -22,5 +24,29 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
 			 iova_bitmap_fn_t fn);
 void iova_bitmap_set(struct iova_bitmap *bitmap,
 		     unsigned long iova, size_t length);
+#else
+static inline struct iova_bitmap *iova_bitmap_alloc(unsigned long iova,
+						    size_t length,
+						    unsigned long page_size,
+						    u64 __user *data)
+{
+	return NULL;
+}
+
+static inline void iova_bitmap_free(struct iova_bitmap *bitmap)
+{
+}
+
+static inline int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
+				       iova_bitmap_fn_t fn)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void iova_bitmap_set(struct iova_bitmap *bitmap,
+				   unsigned long iova, size_t length)
+{
+}
+#endif
 
 #endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
  2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
  2023-10-18 20:26 ` [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:16   ` Jason Gunthorpe
                     ` (3 more replies)
  2023-10-18 20:27 ` [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking Joao Martins
                   ` (14 subsequent siblings)
  17 siblings, 4 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins, Brett Creeley, Yishai Hadas

Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
export convention, i.e. using the IOMMUFD namespace.  In doing so,
import the namespace in its current users.  This means VFIO and the
vfio-pci drivers that use iova_bitmap_set().

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iova_bitmap.c | 8 ++++----
 drivers/vfio/pci/mlx5/main.c        | 1 +
 drivers/vfio/pci/pds/pci_drv.c      | 1 +
 drivers/vfio/vfio_main.c            | 1 +
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
index f54b56388e00..0a92c9eeaf7f 100644
--- a/drivers/iommu/iommufd/iova_bitmap.c
+++ b/drivers/iommu/iommufd/iova_bitmap.c
@@ -268,7 +268,7 @@ struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
 	iova_bitmap_free(bitmap);
 	return ERR_PTR(rc);
 }
-EXPORT_SYMBOL_GPL(iova_bitmap_alloc);
+EXPORT_SYMBOL_NS_GPL(iova_bitmap_alloc, IOMMUFD);
 
 /**
  * iova_bitmap_free() - Frees an IOVA bitmap object
@@ -290,7 +290,7 @@ void iova_bitmap_free(struct iova_bitmap *bitmap)
 
 	kfree(bitmap);
 }
-EXPORT_SYMBOL_GPL(iova_bitmap_free);
+EXPORT_SYMBOL_NS_GPL(iova_bitmap_free, IOMMUFD);
 
 /*
  * Returns the remaining bitmap indexes from mapped_total_index to process for
@@ -389,7 +389,7 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(iova_bitmap_for_each);
+EXPORT_SYMBOL_NS_GPL(iova_bitmap_for_each, IOMMUFD);
 
 /**
  * iova_bitmap_set() - Records an IOVA range in bitmap
@@ -423,4 +423,4 @@ void iova_bitmap_set(struct iova_bitmap *bitmap,
 		cur_bit += nbits;
 	} while (cur_bit <= last_bit);
 }
-EXPORT_SYMBOL_GPL(iova_bitmap_set);
+EXPORT_SYMBOL_NS_GPL(iova_bitmap_set, IOMMUFD);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 42ec574a8622..5cf2b491d15a 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -1376,6 +1376,7 @@ static struct pci_driver mlx5vf_pci_driver = {
 
 module_pci_driver(mlx5vf_pci_driver);
 
+MODULE_IMPORT_NS(IOMMUFD);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
 MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
index ab4b5958e413..dd8c00c895a2 100644
--- a/drivers/vfio/pci/pds/pci_drv.c
+++ b/drivers/vfio/pci/pds/pci_drv.c
@@ -204,6 +204,7 @@ static struct pci_driver pds_vfio_pci_driver = {
 
 module_pci_driver(pds_vfio_pci_driver);
 
+MODULE_IMPORT_NS(IOMMUFD);
 MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
 MODULE_AUTHOR("Brett Creeley <brett.creeley@amd.com>");
 MODULE_LICENSE("GPL");
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 40732e8ed4c6..a96d97da367d 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1693,6 +1693,7 @@ static void __exit vfio_cleanup(void)
 module_init(vfio_init);
 module_exit(vfio_cleanup);
 
+MODULE_IMPORT_NS(IOMMUFD);
 MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR(DRIVER_AUTHOR);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (2 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:26   ` Jason Gunthorpe
                     ` (2 more replies)
  2023-10-18 20:27 ` [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach Joao Martins
                   ` (13 subsequent siblings)
  17 siblings, 3 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Add to the iommu domain operations a set of callbacks to perform dirty
tracking, particularly to start and stop tracking, and to read and clear
the dirty data.

Drivers are generally expected to dynamically change their translation
structures to toggle the tracking and to flush some form of control state
structure that stands in the IOVA translation path.  This is not mandatory,
though, as drivers can also enable dirty tracking at boot and just clear
the dirty bits before dirty tracking is set.  For each of the newly added
IOMMU core APIs:

iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
capabilities of the device.

.set_dirty_tracking(): an iommu driver is expected to change its
translation structures and enable dirty tracking for the devices in the
iommu_domain. For drivers making dirty tracking always-enabled, it should
just return 0.

.read_and_clear_dirty(): an iommu driver is expected to walk the pagetables
for the iova range passed in and use iommu_dirty_bitmap_record() to record
dirty info per IOVA.  When detecting that a given IOVA is dirty, it should
also clear its dirty state from the PTE, *unless* the IOMMU_DIRTY_NO_CLEAR
flag is passed in -- flushing is steered from the caller of the domain op
via iotlb_gather.  The iommu core APIs use the same data structure already
in use for VFIO device dirty tracking (struct iova_bitmap), abstracted by
the iommu_dirty_bitmap_record() helper function.

domain::dirty_ops: IOMMU domains will store the dirty ops depending on
whether the iommu device supports dirty tracking or not.  iommu drivers can
then use this field to figure out whether dirty tracking is
supported+enforced on attach.  The enforcement is enabled via
domain_alloc_user(), which is done via an IOMMUFD hwpt flag introduced
later.
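
For illustration, a driver-side sketch of how these ops could be wired up
(hypothetical driver; the mydrv_*() helpers are made up, real
implementations walk their own pagetable formats):

static int mydrv_set_dirty_tracking(struct iommu_domain *domain, bool enabled)
{
	/* Flip whatever control-structure bit gates dirty tracking */
	return mydrv_toggle_hd(domain, enabled);
}

static int mydrv_read_and_clear_dirty(struct iommu_domain *domain,
				       unsigned long iova, size_t size,
				       unsigned long flags,
				       struct iommu_dirty_bitmap *dirty)
{
	unsigned long end = iova + size;

	for (; iova < end; iova += PAGE_SIZE) {
		if (!mydrv_pte_test_dirty(domain, iova))
			continue;
		if (!(flags & IOMMU_DIRTY_NO_CLEAR))
			mydrv_pte_clear_dirty(domain, iova);
		/* Record the dirty IOVA and accumulate the pending flush */
		iommu_dirty_bitmap_record(dirty, iova, PAGE_SIZE);
	}
	return 0;
}

static const struct iommu_dirty_ops mydrv_dirty_ops = {
	.set_dirty_tracking	= mydrv_set_dirty_tracking,
	.read_and_clear_dirty	= mydrv_read_and_clear_dirty,
};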

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/io-pgtable.h |  4 +++
 include/linux/iommu.h      | 56 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 1b7a44b35616..25142a0e2fc2 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -166,6 +166,10 @@ struct io_pgtable_ops {
 			      struct iommu_iotlb_gather *gather);
 	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
 				    unsigned long iova);
+	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
+				    unsigned long iova, size_t size,
+				    unsigned long flags,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3861d66b65c1..dada7875a98c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <linux/iova_bitmap.h>
 #include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
@@ -37,6 +38,7 @@ struct bus_type;
 struct device;
 struct iommu_domain;
 struct iommu_domain_ops;
+struct iommu_dirty_ops;
 struct notifier_block;
 struct iommu_sva;
 struct iommu_fault_event;
@@ -95,6 +97,8 @@ struct iommu_domain_geometry {
 struct iommu_domain {
 	unsigned type;
 	const struct iommu_domain_ops *ops;
+	const struct iommu_dirty_ops *dirty_ops;
+
 	unsigned long pgsize_bitmap;	/* Bitmap of page sizes in use */
 	struct iommu_domain_geometry geometry;
 	struct iommu_dma_cookie *iova_cookie;
@@ -133,6 +137,7 @@ enum iommu_cap {
 	 * usefully support the non-strict DMA flush queue.
 	 */
 	IOMMU_CAP_DEFERRED_FLUSH,
+	IOMMU_CAP_DIRTY,		/* IOMMU supports dirty tracking */
 };
 
 /* These are the possible reserved region types */
@@ -227,6 +232,32 @@ struct iommu_iotlb_gather {
 	bool			queued;
 };
 
+/**
+ * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
+ * @bitmap: IOVA bitmap
+ * @gather: Range information for a pending IOTLB flush
+ */
+struct iommu_dirty_bitmap {
+	struct iova_bitmap *bitmap;
+	struct iommu_iotlb_gather *gather;
+};
+
+/**
+ * struct iommu_dirty_ops - domain specific dirty tracking operations
+ * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
+ * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
+ *                        into a bitmap, with a bit represented as a page.
+ *                        Reads the dirty PTE bits and clears it from IO
+ *                        pagetables.
+ */
+struct iommu_dirty_ops {
+	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
+	int (*read_and_clear_dirty)(struct iommu_domain *domain,
+				    unsigned long iova, size_t size,
+				    unsigned long flags,
+				    struct iommu_dirty_bitmap *dirty);
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -641,6 +672,28 @@ static inline bool iommu_iotlb_gather_queued(struct iommu_iotlb_gather *gather)
 	return gather && gather->queued;
 }
 
+static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
+					   struct iova_bitmap *bitmap,
+					   struct iommu_iotlb_gather *gather)
+{
+	if (gather)
+		iommu_iotlb_gather_init(gather);
+
+	dirty->bitmap = bitmap;
+	dirty->gather = gather;
+}
+
+static inline void
+iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty, unsigned long iova,
+			  unsigned long length)
+{
+	if (dirty->bitmap)
+		iova_bitmap_set(dirty->bitmap, iova, length);
+
+	if (dirty->gather)
+		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
+}
+
 /* PCI device grouping function */
 extern struct iommu_group *pci_device_group(struct device *dev);
 /* Generic device grouping function */
@@ -671,6 +724,9 @@ struct iommu_fwspec {
 /* ATS is supported */
 #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
 
+/* Read but do not clear any dirty bits */
+#define IOMMU_DIRTY_NO_CLEAR			(1 << 0)
+
 /**
  * struct iommu_sva - handle to a device-mm bond
  */
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (3 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:26   ` Jason Gunthorpe
  2023-10-18 22:38   ` Jason Gunthorpe
  2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Throughout the lifetime of an IOMMU domain that wants to use dirty
tracking, some guarantees are needed such that any device attached to the
iommu_domain supports dirty tracking.

The idea is to handle the case where IOMMUs in the system are asymmetric
feature-wise, and thus the capability may not be supported by all devices.
The enforcement is done by adding a flag to HWPT_ALLOC, namely:

	IOMMU_HWPT_ALLOC_ENFORCE_DIRTY

.. passed in the HWPT_ALLOC ioctl() flags.  The enforcement is done by
creating an iommu_domain via domain_alloc_user() and validating the
requested flags against what the device IOMMU supports (and failing
accordingly).  Advertising the new IOMMU domain feature flag requires that
the individual iommu driver capability is supported when a future device
attachment happens.
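
From userspace the enforcement is requested at allocation time, roughly
like the sketch below (assuming the existing struct iommu_hwpt_alloc
layout with dev_id/pt_id/out_hwpt_id fields and an already-bound device):

	struct iommu_hwpt_alloc alloc = {
		.size = sizeof(alloc),
		.flags = IOMMU_HWPT_ALLOC_ENFORCE_DIRTY,
		.dev_id = dev_id,
		.pt_id = ioas_id,
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
		err(1, "HWPT_ALLOC with enforced dirty tracking failed");
	hwpt_id = alloc.out_hwpt_id;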

Link: https://lore.kernel.org/kvm/20220721142421.GB4609@nvidia.com/
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c | 4 +++-
 include/uapi/linux/iommufd.h         | 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 8b3d2875d642..2b9ff3850bb8 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -157,7 +157,9 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	struct iommufd_ioas *ioas;
 	int rc;
 
-	if ((cmd->flags & (~IOMMU_HWPT_ALLOC_NEST_PARENT)) || cmd->__reserved)
+	if ((cmd->flags &
+	    ~(IOMMU_HWPT_ALLOC_NEST_PARENT|IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)) ||
+	    cmd->__reserved)
 		return -EOPNOTSUPP;
 
 	idev = iommufd_get_device(ucmd, cmd->dev_id);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4a7c5c8fdbb4..cd94a9d8ce66 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -352,9 +352,12 @@ struct iommu_vfio_ioas {
  * @IOMMU_HWPT_ALLOC_NEST_PARENT: If set, allocate a domain which can serve
  *                                as the parent domain in the nesting
  *                                configuration.
+ * @IOMMU_HWPT_ALLOC_ENFORCE_DIRTY: Dirty tracking support for device IOMMU is
+ *                                enforced on device attachment
  */
 enum iommufd_hwpt_alloc_flags {
 	IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
+	IOMMU_HWPT_ALLOC_ENFORCE_DIRTY = 1 << 1,
 };
 
 /**
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (4 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:28   ` Jason Gunthorpe
                     ` (3 more replies)
  2023-10-18 20:27 ` [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
                   ` (11 subsequent siblings)
  17 siblings, 4 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Every IOMMU driver should be able to implement the needed iommu domain ops
to control dirty tracking.

Connect a hw_pagetable to the IOMMU core dirty tracking ops, specifically
the ability to enable/disable dirty tracking on an IOMMU domain
(hw_pagetable id).  To that end, add an io_pagetable kernel API to toggle
dirty tracking:

* iopt_set_dirty_tracking(iopt, [domain], state)

The intended caller of this is via the hw_pagetable object that is created.

Internally it will ensure the leftover dirty state is cleared /right
before/ dirty tracking starts.  This is also useful for iommu drivers which
may decide that dirty tracking is always enabled at boot, without wanting
to toggle it dynamically via the corresponding iommu domain op.
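
From userspace, toggling then looks roughly like this sketch (hwpt_id
refers to a previously allocated HW pagetable; pass flags = 0 to disable):

	struct iommu_hwpt_set_dirty set_dirty = {
		.size = sizeof(set_dirty),
		.flags = IOMMU_DIRTY_TRACKING_ENABLE,
		.hwpt_id = hwpt_id,
	};

	if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &set_dirty))
		err(1, "IOMMU_HWPT_SET_DIRTY failed");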

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 24 +++++++++++
 drivers/iommu/iommufd/io_pagetable.c    | 55 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h | 12 ++++++
 drivers/iommu/iommufd/main.c            |  3 ++
 include/uapi/linux/iommufd.h            | 25 +++++++++++
 5 files changed, 119 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 2b9ff3850bb8..85a0f696c744 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -196,3 +196,27 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&idev->obj);
 	return rc;
 }
+
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_set_dirty *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = -EOPNOTSUPP;
+	bool enable;
+
+	if (cmd->flags & ~IOMMU_DIRTY_TRACKING_ENABLE)
+		return rc;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	ioas = hwpt->ioas;
+	enable = cmd->flags & IOMMU_DIRTY_TRACKING_ENABLE;
+
+	rc = iopt_set_dirty_tracking(&ioas->iopt, hwpt->domain, enable);
+
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 3a598182b761..535d73466e15 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -412,6 +412,61 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 	return 0;
 }
 
+static int iopt_clear_dirty_data(struct io_pagetable *iopt,
+				 struct iommu_domain *domain)
+{
+	const struct iommu_dirty_ops *ops = domain->dirty_ops;
+	struct iommu_iotlb_gather gather;
+	struct iommu_dirty_bitmap dirty;
+	struct iopt_area *area;
+	int ret = 0;
+
+	lockdep_assert_held_read(&iopt->iova_rwsem);
+
+	iommu_dirty_bitmap_init(&dirty, NULL, &gather);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		if (!area->pages)
+			continue;
+
+		ret = ops->read_and_clear_dirty(domain,
+						iopt_area_iova(area),
+						iopt_area_length(area), 0,
+						&dirty);
+		if (ret)
+			break;
+	}
+
+	iommu_iotlb_sync(domain, &gather);
+	return ret;
+}
+
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+			    struct iommu_domain *domain, bool enable)
+{
+	const struct iommu_dirty_ops *ops = domain->dirty_ops;
+	int ret = 0;
+
+	if (!ops)
+		return -EOPNOTSUPP;
+
+	down_read(&iopt->iova_rwsem);
+
+	/* Clear dirty bits from PTEs to ensure a clean snapshot */
+	if (enable) {
+		ret = iopt_clear_dirty_data(iopt, domain);
+		if (ret)
+			goto out_unlock;
+	}
+
+	ret = ops->set_dirty_tracking(domain, enable);
+
+out_unlock:
+	up_read(&iopt->iova_rwsem);
+	return ret;
+}
+
 int iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 		   unsigned long length, struct list_head *pages_list)
 {
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3064997a0181..d42e01cc1105 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -8,6 +8,7 @@
 #include <linux/xarray.h>
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
+#include <uapi/linux/iommufd.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -70,6 +71,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length, unsigned long *unmapped);
 int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
 
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+			    struct iommu_domain *domain, bool enable);
+
 void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
 				 unsigned long length);
 int iopt_table_add_domain(struct io_pagetable *iopt,
@@ -240,6 +244,14 @@ struct iommufd_hw_pagetable {
 	struct list_head hwpt_item;
 };
 
+static inline struct iommufd_hw_pagetable *iommufd_get_hwpt(
+					struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_HW_PAGETABLE),
+			    struct iommufd_hw_pagetable, obj);
+}
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd);
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct iommufd_device *idev, u32 flags,
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index e71523cbd0de..2e625b280d61 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -307,6 +307,7 @@ union ucmd_buffer {
 	struct iommu_destroy destroy;
 	struct iommu_hw_info info;
 	struct iommu_hwpt_alloc hwpt;
+	struct iommu_hwpt_set_dirty set_dirty;
 	struct iommu_ioas_alloc alloc;
 	struct iommu_ioas_allow_iovas allow_iovas;
 	struct iommu_ioas_copy ioas_copy;
@@ -342,6 +343,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
 		 __reserved),
+	IOCTL_OP(IOMMU_HWPT_SET_DIRTY, iommufd_hwpt_set_dirty,
+		 struct iommu_hwpt_set_dirty, __reserved),
 	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
 		 struct iommu_ioas_alloc, out_ioas_id),
 	IOCTL_OP(IOMMU_IOAS_ALLOW_IOVAS, iommufd_ioas_allow_iovas,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index cd94a9d8ce66..9e1721e38819 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -47,6 +47,7 @@ enum {
 	IOMMUFD_CMD_VFIO_IOAS,
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
+	IOMMUFD_CMD_HWPT_SET_DIRTY,
 };
 
 /**
@@ -454,4 +455,28 @@ struct iommu_hw_info {
 	__u32 __reserved;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+
+/*
+ * enum iommufd_hwpt_set_dirty_flags - Flags for steering dirty tracking
+ * @IOMMU_DIRTY_TRACKING_ENABLE: Enables dirty tracking
+ */
+enum iommufd_hwpt_set_dirty_flags {
+	IOMMU_DIRTY_TRACKING_ENABLE = 1,
+};
+
+/**
+ * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)
+ * @size: sizeof(struct iommu_hwpt_set_dirty)
+ * @flags: Flags to control dirty tracking status.
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ *
+ * Toggle dirty tracking on an HW pagetable.
+ */
+struct iommu_hwpt_set_dirty {
+	__u32 size;
+	__u32 flags;
+	__u32 hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
 #endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (5 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:39   ` Jason Gunthorpe
                     ` (2 more replies)
  2023-10-18 20:27 ` [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO Joao Martins
                   ` (10 subsequent siblings)
  17 siblings, 3 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Connect a hw_pagetable to the IOMMU core dirty tracking
read_and_clear_dirty iommu domain op.  It exposes all of the functionality
for the UAPI that reads the dirtied IOVAs while clearing the dirty bits
from the PTEs.

In doing so, add an IO pagetable API iopt_read_and_clear_dirty_data() that
performs the reading of dirty IOPTEs for a given IOVA range and then
copies them back to the userspace bitmap.

Underneath it uses the IOMMU domain kernel API which reads the dirty
bits, as well as atomically clearing the IOPTE dirty bit and flushing the
IOTLB at the end.  The IOVA bitmap usage takes care of iterating over the
bitmap's user pages efficiently and without copies.  Within the iterator
function we iterate over the io-pagetable contiguous areas that have been
mapped.

Contrary to a past incarnation of a similar interface in VFIO, the IOVA
range to be scanned is tied to the bitmap size, thus the application needs
to pass an appropriately sized bitmap address, taking into account the IOVA
range being passed *and* the page size ... as opposed to allowing
bitmap-iova != iova.
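
A rough userspace sketch of that sizing rule and the call (error handling
and alignment checks elided; iova/length/page_size as described above):

	uint64_t nbits = length / page_size;	/* one bit per page_size unit */
	uint64_t *bitmap = calloc((nbits + 63) / 64, sizeof(*bitmap));

	struct iommu_hwpt_get_dirty_iova get_dirty = {
		.size = sizeof(get_dirty),
		.hwpt_id = hwpt_id,
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = (__aligned_u64 *)bitmap,
	};

	if (ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty))
		err(1, "IOMMU_HWPT_GET_DIRTY_IOVA failed");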

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 51 ++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    | 91 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h | 10 +++
 drivers/iommu/iommufd/main.c            |  4 ++
 include/uapi/linux/iommufd.h            | 28 ++++++++
 5 files changed, 184 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 85a0f696c744..c954f91c3b7b 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -220,3 +220,54 @@ int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&hwpt->obj);
 	return rc;
 }
+
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+			     struct iommu_hwpt_get_dirty_iova *bitmap)
+{
+	unsigned long pgshift, npages;
+	size_t iommu_pgsize;
+	int rc = -EINVAL;
+
+	pgshift = __ffs(bitmap->page_size);
+	npages = bitmap->length >> pgshift;
+
+	if (!npages || (npages > ULONG_MAX))
+		return rc;
+
+	iommu_pgsize = ioas->iopt.iova_alignment;
+
+	if (bitmap->iova & (iommu_pgsize - 1))
+		return rc;
+
+	if (!bitmap->length || bitmap->length & (iommu_pgsize - 1))
+		return rc;
+
+	return 0;
+}
+
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_get_dirty_iova *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = -EOPNOTSUPP;
+
+	if ((cmd->flags || cmd->__reserved))
+		return -EOPNOTSUPP;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	ioas = hwpt->ioas;
+	rc = iommufd_check_iova_range(ioas, cmd);
+	if (rc)
+		goto out_put;
+
+	rc = iopt_read_and_clear_dirty_data(&ioas->iopt, hwpt->domain,
+					    cmd->flags, cmd);
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 535d73466e15..0c08b3df1b6f 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -15,6 +15,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
+#include <uapi/linux/iommufd.h>
 
 #include "io_pagetable.h"
 #include "double_span.h"
@@ -412,6 +413,96 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 	return 0;
 }
 
+struct iova_bitmap_fn_arg {
+	struct io_pagetable *iopt;
+	struct iommu_domain *domain;
+	struct iommu_dirty_bitmap *dirty;
+};
+
+static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
+					unsigned long iova, size_t length,
+					void *opaque)
+{
+	struct iopt_area *area;
+	struct iopt_area_contig_iter iter;
+	struct iova_bitmap_fn_arg *arg = opaque;
+	struct iommu_domain *domain = arg->domain;
+	struct iommu_dirty_bitmap *dirty = arg->dirty;
+	const struct iommu_dirty_ops *ops = domain->dirty_ops;
+	unsigned long last_iova = iova + length - 1;
+	int ret = -EINVAL;
+
+	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
+		unsigned long last = min(last_iova, iopt_area_last_iova(area));
+
+		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
+						last - iter.cur_iova + 1,
+						0, dirty);
+		if (ret)
+			break;
+	}
+
+	if (!iopt_area_contig_done(&iter))
+		ret = -EINVAL;
+
+	return ret;
+}
+
+static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+				      struct io_pagetable *iopt,
+				      unsigned long flags,
+				      struct iommu_hwpt_get_dirty_iova *bitmap)
+{
+	const struct iommu_dirty_ops *ops = domain->dirty_ops;
+	struct iommu_iotlb_gather gather;
+	struct iommu_dirty_bitmap dirty;
+	struct iova_bitmap_fn_arg arg;
+	struct iova_bitmap *iter;
+	int ret = 0;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -EOPNOTSUPP;
+
+	iter = iova_bitmap_alloc(bitmap->iova, bitmap->length,
+			     bitmap->page_size, bitmap->data);
+	if (IS_ERR(iter))
+		return -ENOMEM;
+
+	iommu_dirty_bitmap_init(&dirty, iter, &gather);
+
+	arg.iopt = iopt;
+	arg.domain = domain;
+	arg.dirty = &dirty;
+	iova_bitmap_for_each(iter, &arg, __iommu_read_and_clear_dirty);
+
+	iommu_iotlb_sync(domain, &gather);
+	iova_bitmap_free(iter);
+
+	return ret;
+}
+
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   unsigned long flags,
+				   struct iommu_hwpt_get_dirty_iova *bitmap)
+{
+	unsigned long last_iova, iova = bitmap->iova;
+	unsigned long length = bitmap->length;
+	int ret = -EINVAL;
+
+	if ((iova & (iopt->iova_alignment - 1)))
+		return -EINVAL;
+
+	if (check_add_overflow(iova, length - 1, &last_iova))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	ret = iommu_read_and_clear_dirty(domain, iopt, flags, bitmap);
+	up_read(&iopt->iova_rwsem);
+
+	return ret;
+}
+
 static int iopt_clear_dirty_data(struct io_pagetable *iopt,
 				 struct iommu_domain *domain)
 {
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d42e01cc1105..daceedcc91ec 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -8,6 +8,8 @@
 #include <linux/xarray.h>
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
+#include <linux/iommu.h>
+#include <linux/iova_bitmap.h>
 #include <uapi/linux/iommufd.h>
 
 struct iommu_domain;
@@ -71,6 +73,10 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length, unsigned long *unmapped);
 int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped);
 
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   unsigned long flags,
+				   struct iommu_hwpt_get_dirty_iova *bitmap);
 int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 			    struct iommu_domain *domain, bool enable);
 
@@ -226,6 +232,8 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
 			       struct iommufd_ctx *ictx);
 
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+			     struct iommu_hwpt_get_dirty_iova *bitmap);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
@@ -252,6 +260,8 @@ static inline struct iommufd_hw_pagetable *iommufd_get_hwpt(
 			    struct iommufd_hw_pagetable, obj);
 }
 int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd);
+
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			   struct iommufd_device *idev, u32 flags,
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 2e625b280d61..30f1656ac5da 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -307,6 +307,7 @@ union ucmd_buffer {
 	struct iommu_destroy destroy;
 	struct iommu_hw_info info;
 	struct iommu_hwpt_alloc hwpt;
+	struct iommu_hwpt_get_dirty_iova get_dirty_iova;
 	struct iommu_hwpt_set_dirty set_dirty;
 	struct iommu_ioas_alloc alloc;
 	struct iommu_ioas_allow_iovas allow_iovas;
@@ -343,6 +344,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
 		 __reserved),
+	IOCTL_OP(IOMMU_HWPT_GET_DIRTY_IOVA, iommufd_hwpt_get_dirty_iova,
+		 struct iommu_hwpt_get_dirty_iova, data),
 	IOCTL_OP(IOMMU_HWPT_SET_DIRTY, iommufd_hwpt_set_dirty,
 		 struct iommu_hwpt_set_dirty, __reserved),
 	IOCTL_OP(IOMMU_IOAS_ALLOC, iommufd_ioas_alloc_ioctl,
@@ -555,5 +558,6 @@ MODULE_ALIAS_MISCDEV(VFIO_MINOR);
 MODULE_ALIAS("devname:vfio/vfio");
 #endif
 MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
+MODULE_IMPORT_NS(IOMMUFD);
 MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
 MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 9e1721e38819..efeb12c1aaeb 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -48,6 +48,7 @@ enum {
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
 	IOMMUFD_CMD_HWPT_SET_DIRTY,
+	IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA,
 };
 
 /**
@@ -479,4 +480,31 @@ struct iommu_hwpt_set_dirty {
 	__u32 __reserved;
 };
 #define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
+
+/**
+ * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
+ * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ * @flags: Flags to control dirty tracking status.
+ * @iova: base IOVA of the bitmap first bit
+ * @length: IOVA range size
+ * @page_size: page size granularity of each bit in the bitmap
+ * @data: bitmap where to set the dirty bits. Each bit in the bitmap
+ * represents one page_size unit of IOVA, starting at @iova.
+ * Checking whether a given IOVA is dirty:
+ *
+ *  data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))
+ */
+struct iommu_hwpt_get_dirty_iova {
+	__u32 size;
+	__u32 hwpt_id;
+	__u32 flags;
+	__u32 __reserved;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 *data;
+};
+#define IOMMU_HWPT_GET_DIRTY_IOVA _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA)
+
 #endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (6 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:44   ` Jason Gunthorpe
  2023-10-20  6:46   ` Tian, Kevin
  2023-10-18 20:27 ` [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty Joao Martins
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
given device.

Capabilities are IOMMU agnostic and use the device_iommu_capable() API,
passing one of the IOMMU_CAP_*. For now, enumerate IOMMU_HW_CAP_DIRTY_TRACKING
(backed by IOMMU_CAP_DIRTY) in the out_capabilities field returned to
userspace.
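
As an illustration only (not part of this patch; 'iommufd' and 'idev_id'
are placeholders for an iommufd file descriptor and a previously bound
device ID, error handling elided), userspace consumption could look
roughly like:

    struct iommu_hw_info info = {
        .size = sizeof(info),
        .dev_id = idev_id,
    };

    ioctl(iommufd, IOMMU_GET_HW_INFO, &info);
    if (info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING)
        printf("dirty tracking supported\n");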

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/device.c |  4 ++++
 include/uapi/linux/iommufd.h   | 11 +++++++++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index e88fa73a45e6..71ee22dc1a85 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1185,6 +1185,10 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
 	 */
 	cmd->data_len = data_len;
 
+	cmd->out_capabilities = 0;
+	if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY))
+		cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
+
 	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
 out_free:
 	kfree(data);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index efeb12c1aaeb..91de0043e73f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -419,6 +419,14 @@ enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_INTEL_VTD,
 };
 
+/**
+ * enum iommufd_hw_capabilities
+ * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
+ */
+enum iommufd_hw_capabilities {
+	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
+};
+
 /**
  * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
  * @size: sizeof(struct iommu_hw_info)
@@ -430,6 +438,8 @@ enum iommu_hw_info_type {
  *             the iommu type specific hardware information data
  * @out_data_type: Output the iommu hardware info type as defined in the enum
  *                 iommu_hw_info_type.
+ * @out_capabilities: Output the generic iommu capabilities as defined in
+ *                    enum iommufd_hw_capabilities.
  * @__reserved: Must be 0
  *
  * Query an iommu type specific hardware information data from an iommu behind
@@ -454,6 +464,7 @@ struct iommu_hw_info {
 	__aligned_u64 data_uptr;
 	__u32 out_data_type;
 	__u32 __reserved;
+	__aligned_u64 out_capabilities;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (7 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:54   ` Jason Gunthorpe
  2023-10-20  6:52   ` Tian, Kevin
  2023-10-18 20:27 ` [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation Joao Martins
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

VFIO has an operation where it unmaps an IOVA while returning a bitmap with
the dirty data. In reality the operation doesn't quite query the IO
pagetables for whether each PTE was dirty or not. Instead it marks as
dirty anything that was mapped, and does so in one syscall.

In IOMMUFD the equivalent is done in two operations by querying with
GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
flushes given that after clearing dirty bits IOMMU implementations require
invalidating their IOTLB, plus another invalidation needed for the UNMAP.
To allow dirty bits to be queried faster, add a flag
(IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests to not clear the dirty bits
from the PTE (but just reading them), under the expectation that the next
operation is the unmap. An alternative is to unmap and just perpetually
mark as dirty, as that's the same behaviour as today. So the equivalent
functionality can be provided with unmap alone, and if real dirty info is
required the flag amortizes the cost of the query.
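
For illustration, the intended unmap-time flow from userspace is roughly
the following (sketch only; 'iommufd', 'ioas_id', 'hwpt_id', 'iova',
'length' and 'bitmap' are placeholders, error handling elided):

    struct iommu_hwpt_get_dirty_iova get = {
        .size = sizeof(get),
        .hwpt_id = hwpt_id,
        .flags = IOMMU_GET_DIRTY_IOVA_NO_CLEAR,
        .iova = iova,
        .length = length,
        .page_size = 4096,
        .data = bitmap,    /* __aligned_u64 array sized for length/page_size bits */
    };
    struct iommu_ioas_unmap unmap = {
        .size = sizeof(unmap),
        .ioas_id = ioas_id,
        .iova = iova,
        .length = length,
    };

    /* Read-only readout: dirty bits are not cleared, no extra IOTLB flush */
    ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get);
    /* The unmap performs the single IOTLB invalidation */
    ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);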

There's still a race against DMA where in theory the unmap of the IOVA
(when the guest invalidates the IOTLB via emulated iommu) would race
against the VF performing DMA on the same IOVA. As discussed in [0], we are
accepting to resolve this race as throwing away the DMA and it doesn't
matter if it hit physical DRAM or not, the VM can't tell if we threw it
away because the DMA was blocked or because we failed to copy the DRAM.

[0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c |  3 ++-
 drivers/iommu/iommufd/io_pagetable.c |  9 +++++++--
 include/uapi/linux/iommufd.h         | 12 ++++++++++++
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index c954f91c3b7b..23a5e52b4755 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -252,7 +252,8 @@ int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
 	struct iommufd_ioas *ioas;
 	int rc = -EOPNOTSUPP;
 
-	if ((cmd->flags || cmd->__reserved))
+	if ((cmd->flags & ~(IOMMU_GET_DIRTY_IOVA_NO_CLEAR)) ||
+	    cmd->__reserved)
 		return -EOPNOTSUPP;
 
 	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 0c08b3df1b6f..835d54876b45 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -414,6 +414,7 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 }
 
 struct iova_bitmap_fn_arg {
+	unsigned long flags;
 	struct io_pagetable *iopt;
 	struct iommu_domain *domain;
 	struct iommu_dirty_bitmap *dirty;
@@ -430,6 +431,7 @@ static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
 	struct iommu_dirty_bitmap *dirty = arg->dirty;
 	const struct iommu_dirty_ops *ops = domain->dirty_ops;
 	unsigned long last_iova = iova + length - 1;
+	unsigned long flags = arg->flags;
 	int ret = -EINVAL;
 
 	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
@@ -437,7 +439,7 @@ static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
 
 		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
 						last - iter.cur_iova + 1,
-						0, dirty);
+						flags, dirty);
 		if (ret)
 			break;
 	}
@@ -470,12 +472,15 @@ static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
 
 	iommu_dirty_bitmap_init(&dirty, iter, &gather);
 
+	arg.flags = flags;
 	arg.iopt = iopt;
 	arg.domain = domain;
 	arg.dirty = &dirty;
 	iova_bitmap_for_each(iter, &arg, __iommu_read_and_clear_dirty);
 
-	iommu_iotlb_sync(domain, &gather);
+	if (!(flags & IOMMU_DIRTY_NO_CLEAR))
+		iommu_iotlb_sync(domain, &gather);
+
 	iova_bitmap_free(iter);
 
 	return ret;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 91de0043e73f..8b372b43ffc0 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -492,6 +492,18 @@ struct iommu_hwpt_set_dirty {
 };
 #define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
 
+/**
+ * enum iommufd_get_dirty_iova_flags - Flags for getting dirty bits
+ * @IOMMU_GET_DIRTY_IOVA_NO_CLEAR: Just read the PTEs without clearing any dirty
+ *                                 bits metadata. This flag can be passed
+ *                                 under the expectation that the next
+ *                                 operation is an unmap of the same IOVA range.
+ *
+ */
+enum iommufd_hwpt_get_dirty_iova_flags {
+	IOMMU_GET_DIRTY_IOVA_NO_CLEAR = 1,
+};
+
 /**
  * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
  * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (8 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 22:58   ` Jason Gunthorpe
  2023-10-18 20:27 ` [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Add the domain_alloc_user op implementation. To that end, refactor
amd_iommu_domain_alloc() to receive a dev pointer and flags, while renaming
it too, such that it becomes a common function shared with
domain_alloc_user() implementation. The sole difference for the
domain_alloc_user() path is that it also initializes the fields that
iommu_domain_alloc() would otherwise set, so the iommu domain is returned
correctly initialized from a single function.

This is in preparation for adding dirty tracking enforcement to the AMD
implementation of domain_alloc_user().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 drivers/iommu/amd/iommu.c | 47 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 95bd7c25ba6f..292a09b2fbbf 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -37,6 +37,7 @@
 #include <asm/iommu.h>
 #include <asm/gart.h>
 #include <asm/dma.h>
+#include <uapi/linux/iommufd.h>
 
 #include "amd_iommu.h"
 #include "../dma-iommu.h"
@@ -2155,28 +2156,67 @@ static inline u64 dma_max_address(void)
 	return ((1ULL << PM_LEVEL_SHIFT(amd_iommu_gpt_level)) - 1);
 }
 
-static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
+						  struct device *dev,
+						  u32 flags)
 {
 	struct protection_domain *domain;
+	struct amd_iommu *iommu = NULL;
+
+	if (dev) {
+		iommu = rlookup_amd_iommu(dev);
+		if (!iommu)
+			return ERR_PTR(-ENODEV);
+	}
 
 	/*
 	 * Since DTE[Mode]=0 is prohibited on SNP-enabled system,
 	 * default to use IOMMU_DOMAIN_DMA[_FQ].
 	 */
 	if (amd_iommu_snp_en && (type == IOMMU_DOMAIN_IDENTITY))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	domain = protection_domain_alloc(type);
 	if (!domain)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	domain->domain.geometry.aperture_start = 0;
 	domain->domain.geometry.aperture_end   = dma_max_address();
 	domain->domain.geometry.force_aperture = true;
 
+	if (iommu) {
+		domain->domain.type = type;
+		domain->domain.pgsize_bitmap =
+			iommu->iommu.ops->pgsize_bitmap;
+		domain->domain.ops =
+			iommu->iommu.ops->default_domain_ops;
+	}
+
 	return &domain->domain;
 }
 
+static struct iommu_domain *amd_iommu_domain_alloc(unsigned int type)
+{
+	struct iommu_domain *domain;
+
+	domain = do_iommu_domain_alloc(type, NULL, 0);
+	if (IS_ERR(domain))
+		return NULL;
+
+	return domain;
+}
+
+static struct iommu_domain *amd_iommu_domain_alloc_user(struct device *dev,
+							u32 flags)
+{
+	unsigned int type = IOMMU_DOMAIN_UNMANAGED;
+
+	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	return do_iommu_domain_alloc(type, dev, flags);
+}
+
 static void amd_iommu_domain_free(struct iommu_domain *dom)
 {
 	struct protection_domain *domain;
@@ -2464,6 +2504,7 @@ static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 const struct iommu_ops amd_iommu_ops = {
 	.capable = amd_iommu_capable,
 	.domain_alloc = amd_iommu_domain_alloc,
+	.domain_alloc_user = amd_iommu_domain_alloc_user,
 	.probe_device = amd_iommu_probe_device,
 	.release_device = amd_iommu_release_device,
 	.probe_finalize = amd_iommu_probe_finalize,
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (9 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 23:11   ` Jason Gunthorpe
  2023-10-20 18:57   ` Joao Martins
  2023-10-18 20:27 ` [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

The IOMMU advertises Access/Dirty bits if the extended feature register
reports it. Relevant AMD IOMMU SDM ref [0], "1.3.8 Enhanced Support for
Access and Dirty Bits".

To enable it set the DTE flag in bits 7 and 8 to enable access, or
access+dirty. With that, the IOMMU starts marking the D and A flags on
every Memory Request or ATS translation request. It is up to the VMM to
decide whether to enable dirty tracking or not, rather than the IOMMU
driver doing so unconditionally. Relevant AMD IOMMU SDM ref [0], "Table 7.
Device Table Entry (DTE) Field Definitions", particularly the entry "HAD".

Toggling it on and off is relatively simple, as it amounts to setting 2
bits in the DTE and flushing the device DTE cache.

To get what's dirtied, use the existing AMD io-pgtable support, walking the
pagetables over each IOVA with fetch_pte(). The IOTLB flushing is left to
the caller (much like unmap), and iommu_dirty_bitmap_record() is the one
adding page-ranges to invalidate. This allows the caller to batch the flush
over a big span of IOVA space, without the iommu driver having to guess
when to flush.

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how the IOMMU hardware updates the dirty bit, and what it
expects from its consequent clearing by the CPU, see:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

Quoting the SDM:

"The setting of accessed and dirty status bits in the page tables is
visible to both the CPU and the peripheral when sharing guest page tables.
The IOMMU interlocked operations to update A and D bits must be 64-bit
operations and naturally aligned on a 64-bit boundary"

... and for the IOMMU update sequence of the Dirty bit, it essentially
states:

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write
   intent in the request.
4. If the A or D bits need to be updated in the descriptor:
   * Start atomic operation.
   * Read the descriptor as a 64-bit access.
   * If the descriptor no longer appears to require an update, release the
     atomic lock with no further action and continue to step 5.
   * Calculate the new A & D bits.
   * Write the descriptor as a 64-bit access.
   * End atomic operation.
5. Continue to the next stage of translation or to the memory access.

Access/Dirty bits readout also needs to consider the non-default page
sizes (aka replicated PTEs, as mentioned by the manual), as AMD supports
all power-of-two page sizes (except 512G).
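
For example, a non-default 16K mapping is backed by 4 replicated 4K-level
PTEs and must be reported dirty if any of them has the HD bit set; the
helper added below effectively does (illustrative sketch of the read-only
case only):

    count = PAGE_SIZE_PTE_COUNT(size);    /* e.g. 4 for a 16K page */
    for (i = 0; i < count; i++)
        dirty |= !!(READ_ONCE(ptep[i]) & IOMMU_PTE_HD);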

Select IOMMUFD_DRIVER only if IOMMUFD is enabled considering that IOMMU
dirty tracking requires IOMMUFD.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 drivers/iommu/amd/Kconfig           |  1 +
 drivers/iommu/amd/amd_iommu_types.h | 12 ++++
 drivers/iommu/amd/io_pgtable.c      | 69 +++++++++++++++++++++
 drivers/iommu/amd/iommu.c           | 96 +++++++++++++++++++++++++++++
 4 files changed, 178 insertions(+)

diff --git a/drivers/iommu/amd/Kconfig b/drivers/iommu/amd/Kconfig
index 9b5fc3356bf2..8bd4c3b183ec 100644
--- a/drivers/iommu/amd/Kconfig
+++ b/drivers/iommu/amd/Kconfig
@@ -10,6 +10,7 @@ config AMD_IOMMU
 	select IOMMU_API
 	select IOMMU_IOVA
 	select IOMMU_IO_PGTABLE
+	select IOMMUFD_DRIVER if IOMMUFD
 	depends on X86_64 && PCI && ACPI && HAVE_CMPXCHG_DOUBLE
 	help
 	  With this option you can enable support for AMD IOMMU hardware in
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 7dc30c2b56b3..dec4e5c2b66b 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -97,7 +97,9 @@
 #define FEATURE_GATS_MASK	(3ULL)
 #define FEATURE_GAM_VAPIC	BIT_ULL(21)
 #define FEATURE_GIOSUP		BIT_ULL(48)
+#define FEATURE_HASUP		BIT_ULL(49)
 #define FEATURE_EPHSUP		BIT_ULL(50)
+#define FEATURE_HDSUP		BIT_ULL(52)
 #define FEATURE_SNP		BIT_ULL(63)
 
 #define FEATURE_PASID_SHIFT	32
@@ -212,6 +214,7 @@
 /* macros and definitions for device table entries */
 #define DEV_ENTRY_VALID         0x00
 #define DEV_ENTRY_TRANSLATION   0x01
+#define DEV_ENTRY_HAD           0x07
 #define DEV_ENTRY_PPR           0x34
 #define DEV_ENTRY_IR            0x3d
 #define DEV_ENTRY_IW            0x3e
@@ -370,10 +373,16 @@
 #define PTE_LEVEL_PAGE_SIZE(level)			\
 	(1ULL << (12 + (9 * (level))))
 
+/*
+ * The IOPTE dirty bit
+ */
+#define IOMMU_PTE_HD_BIT (6)
+
 /*
  * Bit value definition for I/O PTE fields
  */
 #define IOMMU_PTE_PR	BIT_ULL(0)
+#define IOMMU_PTE_HD	BIT_ULL(IOMMU_PTE_HD_BIT)
 #define IOMMU_PTE_U	BIT_ULL(59)
 #define IOMMU_PTE_FC	BIT_ULL(60)
 #define IOMMU_PTE_IR	BIT_ULL(61)
@@ -384,6 +393,7 @@
  */
 #define DTE_FLAG_V	BIT_ULL(0)
 #define DTE_FLAG_TV	BIT_ULL(1)
+#define DTE_FLAG_HAD	(3ULL << 7)
 #define DTE_FLAG_GIOV	BIT_ULL(54)
 #define DTE_FLAG_GV	BIT_ULL(55)
 #define DTE_GLX_SHIFT	(56)
@@ -413,6 +423,7 @@
 
 #define IOMMU_PAGE_MASK (((1ULL << 52) - 1) & ~0xfffULL)
 #define IOMMU_PTE_PRESENT(pte) ((pte) & IOMMU_PTE_PR)
+#define IOMMU_PTE_DIRTY(pte) ((pte) & IOMMU_PTE_HD)
 #define IOMMU_PTE_PAGE(pte) (iommu_phys_to_virt((pte) & IOMMU_PAGE_MASK))
 #define IOMMU_PTE_MODE(pte) (((pte) >> 9) & 0x07)
 
@@ -563,6 +574,7 @@ struct protection_domain {
 	int nid;		/* Node ID */
 	u64 *gcr3_tbl;		/* Guest CR3 table */
 	unsigned long flags;	/* flags to find out type of domain */
+	bool dirty_tracking;	/* dirty tracking is enabled in the domain */
 	unsigned dev_cnt;	/* devices assigned to this domain */
 	unsigned dev_iommu[MAX_IOMMUS]; /* per-IOMMU reference count */
 };
diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 2892aa1b4dc1..953f867b4943 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -486,6 +486,74 @@ static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned lo
 	return (__pte & ~offset_mask) | (iova & offset_mask);
 }
 
+static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size,
+				     unsigned long flags)
+{
+	bool test_only = flags & IOMMU_DIRTY_NO_CLEAR;
+	bool dirty = false;
+	int i, count;
+
+	/*
+	 * 2.2.3.2 Host Dirty Support
+	 * When a non-default page size is used, software must OR the
+	 * Dirty bits in all of the replicated host PTEs used to map
+	 * the page. The IOMMU does not guarantee the Dirty bits are
+	 * set in all of the replicated PTEs. Any portion of the page
+	 * may have been written even if the Dirty bit is set in only
+	 * one of the replicated PTEs.
+	 */
+	count = PAGE_SIZE_PTE_COUNT(size);
+	for (i = 0; i < count && test_only; i++) {
+		if (test_bit(IOMMU_PTE_HD_BIT,
+			     (unsigned long *) &ptep[i])) {
+			dirty = true;
+			break;
+		}
+	}
+
+	for (i = 0; i < count && !test_only; i++) {
+		if (test_and_clear_bit(IOMMU_PTE_HD_BIT,
+				       (unsigned long *) &ptep[i])) {
+			dirty = true;
+		}
+	}
+
+	return dirty;
+}
+
+static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
+					 unsigned long iova, size_t size,
+					 unsigned long flags,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
+	unsigned long end = iova + size - 1;
+
+	do {
+		unsigned long pgsize = 0;
+		u64 *ptep, pte;
+
+		ptep = fetch_pte(pgtable, iova, &pgsize);
+		if (ptep)
+			pte = READ_ONCE(*ptep);
+		if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
+			pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
+			iova += pgsize;
+			continue;
+		}
+
+		/*
+		 * Mark the whole IOVA range as dirty even if only one of
+		 * the replicated PTEs were marked dirty.
+		 */
+		if (pte_test_and_clear_dirty(ptep, pgsize, flags))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
 /*
  * ----------------------------------------------------
  */
@@ -527,6 +595,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	pgtable->iop.ops.map_pages    = iommu_v1_map_pages;
 	pgtable->iop.ops.unmap_pages  = iommu_v1_unmap_pages;
 	pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
+	pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
 
 	return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 292a09b2fbbf..e7e9982fdad6 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -66,6 +66,7 @@ LIST_HEAD(hpet_map);
 LIST_HEAD(acpihid_map);
 
 const struct iommu_ops amd_iommu_ops;
+const struct iommu_dirty_ops amd_dirty_ops;
 
 static ATOMIC_NOTIFIER_HEAD(ppr_notifier);
 int amd_iommu_max_glx_val = -1;
@@ -1611,6 +1612,9 @@ static void set_dte_entry(struct amd_iommu *iommu, u16 devid,
 			pte_root |= 1ULL << DEV_ENTRY_PPR;
 	}
 
+	if (domain->dirty_tracking)
+		pte_root |= DTE_FLAG_HAD;
+
 	if (domain->flags & PD_IOMMUV2_MASK) {
 		u64 gcr3 = iommu_virt_to_phys(domain->gcr3_tbl);
 		u64 glx  = domain->glx;
@@ -2156,10 +2160,16 @@ static inline u64 dma_max_address(void)
 	return ((1ULL << PM_LEVEL_SHIFT(amd_iommu_gpt_level)) - 1);
 }
 
+static bool amd_iommu_hd_support(struct amd_iommu *iommu)
+{
+	return iommu && (iommu->features & FEATURE_HDSUP);
+}
+
 static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
 						  struct device *dev,
 						  u32 flags)
 {
+	bool enforce_dirty = flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY;
 	struct protection_domain *domain;
 	struct amd_iommu *iommu = NULL;
 
@@ -2176,6 +2186,9 @@ static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
 	if (amd_iommu_snp_en && (type == IOMMU_DOMAIN_IDENTITY))
 		return ERR_PTR(-EINVAL);
 
+	if (enforce_dirty && !amd_iommu_hd_support(iommu))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	domain = protection_domain_alloc(type);
 	if (!domain)
 		return ERR_PTR(-ENOMEM);
@@ -2190,6 +2203,9 @@ static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
 			iommu->iommu.ops->pgsize_bitmap;
 		domain->domain.ops =
 			iommu->iommu.ops->default_domain_ops;
+
+		if (enforce_dirty)
+			domain->domain.dirty_ops = &amd_dirty_ops;
 	}
 
 	return &domain->domain;
@@ -2254,6 +2270,13 @@ static int amd_iommu_attach_device(struct iommu_domain *dom,
 
 	dev_data->defer_attach = false;
 
+	/*
+	 * Restrict to devices with compatible IOMMU hardware support
+	 * when enforcement of dirty tracking is enabled.
+	 */
+	if (dom->dirty_ops && !amd_iommu_hd_support(iommu))
+		return -EINVAL;
+
 	if (dev_data->domain)
 		detach_device(dev);
 
@@ -2372,6 +2395,11 @@ static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
 		return true;
 	case IOMMU_CAP_DEFERRED_FLUSH:
 		return true;
+	case IOMMU_CAP_DIRTY: {
+		struct amd_iommu *iommu = rlookup_amd_iommu(dev);
+
+		return amd_iommu_hd_support(iommu);
+	}
 	default:
 		break;
 	}
@@ -2379,6 +2407,69 @@ static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
 	return false;
 }
 
+static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					bool enable)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct dev_table_entry *dev_table;
+	struct iommu_dev_data *dev_data;
+	struct amd_iommu *iommu;
+	unsigned long flags;
+	u64 pte_root;
+
+	spin_lock_irqsave(&pdomain->lock, flags);
+	if (!(pdomain->dirty_tracking ^ enable)) {
+		spin_unlock_irqrestore(&pdomain->lock, flags);
+		return 0;
+	}
+
+	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+		iommu = rlookup_amd_iommu(dev_data->dev);
+		if (!iommu)
+			continue;
+
+		dev_table = get_dev_table(iommu);
+		pte_root = dev_table[dev_data->devid].data[0];
+
+		pte_root = (enable ?
+			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
+
+		/* Flush device DTE */
+		dev_table[dev_data->devid].data[0] = pte_root;
+		device_flush_dte(dev_data);
+	}
+
+	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
+	amd_iommu_domain_flush_tlb_pde(pdomain);
+	amd_iommu_domain_flush_complete(pdomain);
+	pdomain->dirty_tracking = enable;
+	spin_unlock_irqrestore(&pdomain->lock, flags);
+
+	return 0;
+}
+
+static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					  unsigned long iova, size_t size,
+					  unsigned long flags,
+					  struct iommu_dirty_bitmap *dirty)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
+	unsigned long lflags;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -EOPNOTSUPP;
+
+	spin_lock_irqsave(&pdomain->lock, lflags);
+	if (!pdomain->dirty_tracking && dirty->bitmap) {
+		spin_unlock_irqrestore(&pdomain->lock, lflags);
+		return -EINVAL;
+	}
+	spin_unlock_irqrestore(&pdomain->lock, lflags);
+
+	return ops->read_and_clear_dirty(ops, iova, size, flags, dirty);
+}
+
 static void amd_iommu_get_resv_regions(struct device *dev,
 				       struct list_head *head)
 {
@@ -2501,6 +2592,11 @@ static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 	return true;
 }
 
+const struct iommu_dirty_ops amd_dirty_ops = {
+	.set_dirty_tracking = amd_iommu_set_dirty_tracking,
+	.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
+};
+
 const struct iommu_ops amd_iommu_ops = {
 	.capable = amd_iommu_capable,
 	.domain_alloc = amd_iommu_domain_alloc,
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (10 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-19  3:04   ` Baolu Lu
  2023-10-20  7:53   ` Tian, Kevin
  2023-10-18 20:27 ` [PATCH v4 13/18] iommufd/selftest: Expand mock_domain with dev_flags Joao Martins
                   ` (5 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

IOMMU advertises Access/Dirty bits for second-stage page table if the
extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS).
The first stage table is compatible with the CPU page table, thus A/D bits are
implicitly supported. Relevant Intel IOMMU SDM ref for first stage table
"3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table
"3.7.2 Accessed and Dirty Flags".

The first stage page table is enabled by default, so setting dirty tracking
there is allowed with no control bits needed; it just returns 0. To use
SSADS, set bit 9 (SSADE) in the scalable-mode PASID table entry and flush
the IOTLB via pasid_flush_caches() following the manual. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

The PTE dirty bit is located in bit 9 and is cached in the IOTLB, so flush
the IOTLB to make sure the IOMMU attempts to set the dirty bit again. Note
that iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus
the caller of the iommu op will flush the IOTLB. The relevant manual
chapter for the hardware translation is chapter 6, with special mention to:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

Select IOMMUFD_DRIVER only if IOMMUFD is enabled, given that IOMMU dirty
tracking requires IOMMUFD.
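
For reference, at the PTE level the readout added in this patch amounts to
testing (and, unless IOMMU_DIRTY_NO_CLEAR is passed, clearing) bit 9 of
the second-stage PTE, roughly (illustrative sketch of the read-only path):

    if (pte->val & DMA_SL_PTE_DIRTY)    /* BIT_ULL(9) */
        iommu_dirty_bitmap_record(dirty, iova, pgsize);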

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/intel/Kconfig |   1 +
 drivers/iommu/intel/iommu.c | 104 +++++++++++++++++++++++++++++++++-
 drivers/iommu/intel/iommu.h |  17 ++++++
 drivers/iommu/intel/pasid.c | 109 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |   4 ++
 5 files changed, 234 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 2e56bd79f589..f5348b80652b 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -15,6 +15,7 @@ config INTEL_IOMMU
 	select DMA_OPS
 	select IOMMU_API
 	select IOMMU_IOVA
+	select IOMMUFD_DRIVER if IOMMUFD
 	select NEED_DMA_MAP_STATE
 	select DMAR_TABLE
 	select SWIOTLB
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 017aed5813d8..405b459416d5 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
 #define IDENTMAP_AZALIA		4
 
 const struct iommu_ops intel_iommu_ops;
+const struct iommu_dirty_ops intel_dirty_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
 {
@@ -4077,10 +4078,12 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
 static struct iommu_domain *
 intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 {
+	bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
 	struct iommu_domain *domain;
 	struct intel_iommu *iommu;
 
-	if (flags & (~IOMMU_HWPT_ALLOC_NEST_PARENT))
+	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT|
+		       IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)))
 		return ERR_PTR(-EOPNOTSUPP);
 
 	iommu = device_to_iommu(dev, NULL, NULL);
@@ -4090,6 +4093,9 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
 		return ERR_PTR(-EOPNOTSUPP);
 
+	if (enforce_dirty && !slads_supported(iommu))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	/*
 	 * domain_alloc_user op needs to fully initialize a domain
 	 * before return, so uses iommu_domain_alloc() here for
@@ -4098,6 +4104,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
 	domain = iommu_domain_alloc(dev->bus);
 	if (!domain)
 		domain = ERR_PTR(-ENOMEM);
+
+	if (!IS_ERR(domain) && enforce_dirty) {
+		if (to_dmar_domain(domain)->use_first_level) {
+			iommu_domain_free(domain);
+			return ERR_PTR(-EOPNOTSUPP);
+		}
+		domain->dirty_ops = &intel_dirty_ops;
+	}
+
 	return domain;
 }
 
@@ -4121,6 +4136,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
 	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
 		return -EINVAL;
 
+	if (domain->dirty_ops && !slads_supported(iommu))
+		return -EINVAL;
+
 	/* check if this iommu agaw is sufficient for max mapped address */
 	addr_width = agaw_to_width(iommu->agaw);
 	if (addr_width > cap_mgaw(iommu->cap))
@@ -4375,6 +4393,8 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
 		return dmar_platform_optin();
 	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
 		return ecap_sc_support(info->iommu->ecap);
+	case IOMMU_CAP_DIRTY:
+		return slads_supported(info->iommu);
 	default:
 		return false;
 	}
@@ -4772,6 +4792,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 	if (!pasid_supported(iommu) || dev_is_real_dma_subdevice(dev))
 		return -EOPNOTSUPP;
 
+	if (domain->dirty_ops)
+		return -EINVAL;
+
 	if (context_copied(iommu, info->bus, info->devfn))
 		return -EBUSY;
 
@@ -4830,6 +4853,85 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
 	return vtd;
 }
 
+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	int ret = -EINVAL;
+
+	spin_lock(&dmar_domain->lock);
+	if (dmar_domain->dirty_tracking == enable)
+		goto out_unlock;
+
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
+						     info->dev, IOMMU_NO_PASID,
+						     enable);
+		if (ret)
+			goto err_unwind;
+
+	}
+
+	if (!ret)
+		dmar_domain->dirty_tracking = enable;
+out_unlock:
+	spin_unlock(&dmar_domain->lock);
+
+	return 0;
+
+err_unwind:
+	list_for_each_entry(info, &dmar_domain->devices, link)
+		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
+					  info->dev, IOMMU_NO_PASID,
+					  dmar_domain->dirty_tracking);
+	spin_unlock(&dmar_domain->lock);
+	return ret;
+}
+
+static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    unsigned long flags,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long end = iova + size - 1;
+	unsigned long pgsize;
+
+	/*
+	 * IOMMUFD core calls into a dirty tracking disabled domain without an
+	 * IOVA bitmap set in order to clean dirty bits in all PTEs that might
+	 * have occurred when we stopped dirty tracking. This ensures that we
+	 * never inherit dirtied bits from a previous cycle.
+	 */
+	if (!dmar_domain->dirty_tracking && dirty->bitmap)
+		return -EINVAL;
+
+	do {
+		struct dma_pte *pte;
+		int lvl = 0;
+
+		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
+				     GFP_ATOMIC);
+		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
+		if (!pte || !dma_pte_present(pte)) {
+			iova += pgsize;
+			continue;
+		}
+
+		if (dma_sl_pte_test_and_clear_dirty(pte, flags))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
+const struct iommu_dirty_ops intel_dirty_ops = {
+	.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
+	.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
+};
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.hw_info		= intel_iommu_hw_info,
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index c18fb699c87a..27bcfd3bacdd 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -48,6 +48,9 @@
 #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
 #define DMA_FL_PTE_XD		BIT_ULL(63)
 
+#define DMA_SL_PTE_DIRTY_BIT	9
+#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
+
 #define ADDR_WIDTH_5LEVEL	(57)
 #define ADDR_WIDTH_4LEVEL	(48)
 
@@ -539,6 +542,9 @@ enum {
 #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
 				 ecap_pasid((iommu)->ecap))
+#define slads_supported(iommu) (sm_supported(iommu) &&                 \
+				ecap_slads((iommu)->ecap))
+
 
 struct pasid_entry;
 struct pasid_state_entry;
@@ -592,6 +598,7 @@ struct dmar_domain {
 					 * otherwise, goes through the second
 					 * level.
 					 */
+	u8 dirty_tracking:1;		/* Dirty tracking is enabled */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
@@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }
 
+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
+						   unsigned long flags)
+{
+	if (flags & IOMMU_DIRTY_NO_CLEAR)
+		return (pte->val & DMA_SL_PTE_DIRTY) != 0;
+
+	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
+				  (unsigned long *)&pte->val);
+}
+
 static inline bool dma_pte_superpage(struct dma_pte *pte)
 {
 	return (pte->val & DMA_PTE_LARGE_PAGE);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 8f92b92f3d2a..785384a59d55 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -277,6 +277,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
 }
 
+static inline u64 pasid_get_bits(u64 *ptr)
+{
+	return READ_ONCE(*ptr);
+}
+
 /*
  * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
  * PASID entry.
@@ -335,6 +340,36 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe)
 	pasid_set_bits(&pe->val[0], 1 << 1, 0);
 }
 
+/*
+ * Enable second level A/D bits by setting the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_set_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
+}
+
+/*
+ * Disable second level A/D bits by clearing the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_clear_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 0);
+}
+
+/*
+ * Checks if second level A/D bits are enabled, i.e. whether the SLADE
+ * (Second Level Access Dirty Enable) field (Bit 9) of a scalable mode
+ * PASID entry is set.
+ */
+static inline bool pasid_get_ssade(struct pasid_entry *pe)
+{
+	return pasid_get_bits(&pe->val[0]) & (1 << 9);
+}
+
 /*
  * Setup the WPE(Write Protect Enable) field (Bit 132) of a
  * scalable mode PASID entry.
@@ -627,6 +662,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
+	if (domain->dirty_tracking)
+		pasid_set_ssade(pte);
 
 	pasid_set_present(pte);
 	spin_unlock(&iommu->lock);
@@ -636,6 +673,78 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	return 0;
 }
 
+/*
+ * Set up dirty tracking on a second only translation type.
+ */
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled)
+{
+	struct pasid_entry *pte;
+	u16 did, pgtt;
+
+	spin_lock(&iommu->lock);
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		spin_unlock(&iommu->lock);
+		dev_err_ratelimited(dev,
+				    "Failed to get pasid entry of PASID %d\n",
+				    pasid);
+		return -ENODEV;
+	}
+
+	did = domain_id_iommu(domain, iommu);
+	pgtt = pasid_pte_get_pgtt(pte);
+	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
+		spin_unlock(&iommu->lock);
+		dev_err_ratelimited(dev,
+				    "Dirty tracking not supported on translation type %d\n",
+				    pgtt);
+		return -EOPNOTSUPP;
+	}
+
+	if (pasid_get_ssade(pte) == enabled) {
+		spin_unlock(&iommu->lock);
+		return 0;
+	}
+
+	if (enabled)
+		pasid_set_ssade(pte);
+	else
+		pasid_clear_ssade(pte);
+	spin_unlock(&iommu->lock);
+
+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pte, sizeof(*pte));
+
+	/*
+	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
+	 *
+	 * - PASID-selective-within-Domain PASID-cache invalidation
+	 *   If (PGTT=SS or Nested)
+	 *    - Domain-selective IOTLB invalidation
+	 *   Else
+	 *    - PASID-selective PASID-based IOTLB invalidation
+	 * - If (pasid is RID_PASID)
+	 *    - Global Device-TLB invalidation to affected functions
+	 *   Else
+	 *    - PASID-based Device-TLB invalidation (with S=1 and
+	 *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
+	 */
+	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
+
+	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
+		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+
+	/* Device IOTLB doesn't need to be flushed in caching mode. */
+	if (!cap_caching_mode(iommu->cap))
+		devtlb_invalidation_with_pasid(iommu, dev, pasid);
+
+	return 0;
+}
+
 /*
  * Set up the scalable mode pasid entry for passthrough translation type.
  */
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 4e9e68c3c388..958050b093aa 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -106,6 +106,10 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled);
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 13/18] iommufd/selftest: Expand mock_domain with dev_flags
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (11 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-20  7:57   ` Tian, Kevin
  2023-10-18 20:27 ` [PATCH v4 14/18] iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY Joao Martins
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Expand the mock_domain test to be able to manipulate the device
capabilities. This allows testing with a mockdev that does not advertise
dirty tracking support and thus makes sure the enforce_dirty test does
what is expected.

To avoid breaking the IOMMUFD_TEST UABI, replicate the mock_domain struct
and add an input dev_flags at the end.
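
A test can then create a mock device with dirty tracking support masked
out and expect that attaching it to a dirty-enforcing hwpt fails, e.g.
(usage sketch mirroring the selftest added later in this series):

    test_err_mock_domain_flags(EINVAL, hwpt_id, MOCK_FLAGS_DEVICE_NO_DIRTY,
                               &stddev_id, NULL);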

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iommufd_test.h          | 12 ++++++++
 drivers/iommu/iommufd/selftest.c              | 11 +++++--
 tools/testing/selftests/iommu/iommufd_utils.h | 30 +++++++++++++++++++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 3f3644375bf1..9817edcd8968 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -19,6 +19,7 @@ enum {
 	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
 	IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE,
 	IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
+	IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
 };
 
 enum {
@@ -40,6 +41,10 @@ enum {
 	MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES = 1 << 0,
 };
 
+enum {
+	MOCK_FLAGS_DEVICE_NO_DIRTY = 1 << 0,
+};
+
 struct iommu_test_cmd {
 	__u32 size;
 	__u32 op;
@@ -56,6 +61,13 @@ struct iommu_test_cmd {
 			/* out_idev_id is the standard iommufd_bind object */
 			__u32 out_idev_id;
 		} mock_domain;
+		struct {
+			__u32 out_stdev_id;
+			__u32 out_hwpt_id;
+			__u32 out_idev_id;
+			/* Expand mock_domain to set mock device flags */
+			__u32 dev_flags;
+		} mock_domain_flags;
 		struct {
 			__u32 pt_id;
 		} mock_domain_replace;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index fe7e3c7d933a..bd3704b28bfb 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -96,6 +96,7 @@ enum selftest_obj_type {
 
 struct mock_dev {
 	struct device dev;
+	unsigned long flags;
 };
 
 struct selftest_obj {
@@ -381,7 +382,7 @@ static void mock_dev_release(struct device *dev)
 	kfree(mdev);
 }
 
-static struct mock_dev *mock_dev_create(void)
+static struct mock_dev *mock_dev_create(unsigned long dev_flags)
 {
 	struct mock_dev *mdev;
 	int rc;
@@ -391,6 +392,7 @@ static struct mock_dev *mock_dev_create(void)
 		return ERR_PTR(-ENOMEM);
 
 	device_initialize(&mdev->dev);
+	mdev->flags = dev_flags;
 	mdev->dev.release = mock_dev_release;
 	mdev->dev.bus = &iommufd_mock_bus_type.bus;
 
@@ -426,6 +428,7 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
 	struct iommufd_device *idev;
 	struct selftest_obj *sobj;
 	u32 pt_id = cmd->id;
+	u32 dev_flags = 0;
 	u32 idev_id;
 	int rc;
 
@@ -436,7 +439,10 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
 	sobj->idev.ictx = ucmd->ictx;
 	sobj->type = TYPE_IDEV;
 
-	sobj->idev.mock_dev = mock_dev_create();
+	if (cmd->op == IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS)
+		dev_flags = cmd->mock_domain_flags.dev_flags;
+
+	sobj->idev.mock_dev = mock_dev_create(dev_flags);
 	if (IS_ERR(sobj->idev.mock_dev)) {
 		rc = PTR_ERR(sobj->idev.mock_dev);
 		goto out_sobj;
@@ -1019,6 +1025,7 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 						 cmd->add_reserved.start,
 						 cmd->add_reserved.length);
 	case IOMMU_TEST_OP_MOCK_DOMAIN:
+	case IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS:
 		return iommufd_test_mock_domain(ucmd, cmd);
 	case IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE:
 		return iommufd_test_mock_domain_replace(
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index be4970a84977..8e84d2592f2d 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -74,6 +74,36 @@ static int _test_cmd_mock_domain(int fd, unsigned int ioas_id, __u32 *stdev_id,
 	EXPECT_ERRNO(_errno, _test_cmd_mock_domain(self->fd, ioas_id, \
 						   stdev_id, hwpt_id, NULL))
 
+static int _test_cmd_mock_domain_flags(int fd, unsigned int ioas_id,
+				       __u32 stdev_flags,
+				       __u32 *stdev_id, __u32 *hwpt_id,
+				       __u32 *idev_id)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
+		.id = ioas_id,
+		.mock_domain_flags = { .dev_flags = stdev_flags },
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_TEST_CMD, &cmd);
+	if (ret)
+		return ret;
+	if (stdev_id)
+		*stdev_id = cmd.mock_domain_flags.out_stdev_id;
+	assert(cmd.id != 0);
+	if (hwpt_id)
+		*hwpt_id = cmd.mock_domain_flags.out_hwpt_id;
+	if (idev_id)
+		*idev_id = cmd.mock_domain_flags.out_idev_id;
+	return 0;
+}
+#define test_err_mock_domain_flags(_errno, ioas_id, flags, stdev_id, hwpt_id) \
+	EXPECT_ERRNO(_errno, _test_cmd_mock_domain_flags(self->fd, ioas_id, \
+							 flags, stdev_id, \
+							 hwpt_id, NULL))
+
 static int _test_cmd_mock_domain_replace(int fd, __u32 stdev_id, __u32 pt_id,
 					 __u32 *hwpt_id)
 {
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 14/18] iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (12 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 13/18] iommufd/selftest: Expand mock_domain with dev_flags Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-20  7:59   ` Tian, Kevin
  2023-10-18 20:27 ` [PATCH v4 15/18] iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY Joao Martins
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

In order to selftest the iommu domain dirty tracking enforcement,
implement the necessary mock_domain support and add a new dev_flags to
test that hwpt_alloc/attach_device fails as expected.

Expand the existing mock_domain fixture with an enforce_dirty test that
exercises the hwpt_alloc and device attachment.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/selftest.c              | 39 ++++++++++++++-
 tools/testing/selftests/iommu/iommufd.c       | 49 +++++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h |  3 ++
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index bd3704b28bfb..e5d421b54b1a 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -119,6 +119,12 @@ static void mock_domain_blocking_free(struct iommu_domain *domain)
 static int mock_domain_nop_attach(struct iommu_domain *domain,
 				  struct device *dev)
 {
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+
+	if (domain->dirty_ops &&
+	    (mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -147,6 +153,26 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length, u32 *type)
 	return info;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    unsigned long flags,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	return 0;
+}
+
+const struct iommu_dirty_ops dirty_ops = {
+	.set_dirty_tracking = mock_domain_set_dirty_tracking,
+	.read_and_clear_dirty = mock_domain_read_and_clear_dirty,
+};
+
+
 static const struct iommu_ops mock_ops;
 
 static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
@@ -174,12 +200,20 @@ static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
 static struct iommu_domain *
 mock_domain_alloc_user(struct device *dev, u32 flags)
 {
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
 	struct iommu_domain *domain;
 
-	if (flags & (~IOMMU_HWPT_ALLOC_NEST_PARENT))
+	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT|
+		       IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if ((flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY) &&
+	    (mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY))
 		return ERR_PTR(-EOPNOTSUPP);
 
 	domain = mock_domain_alloc(IOMMU_DOMAIN_UNMANAGED);
+	if (domain && !(mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY))
+		domain->dirty_ops = &dirty_ops;
 	if (!domain)
 		domain = ERR_PTR(-ENOMEM);
 	return domain;
@@ -387,6 +421,9 @@ static struct mock_dev *mock_dev_create(unsigned long dev_flags)
 	struct mock_dev *mdev;
 	int rc;
 
+	if (dev_flags & ~(MOCK_FLAGS_DEVICE_NO_DIRTY))
+		return ERR_PTR(-EINVAL);
+
 	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
 	if (!mdev)
 		return ERR_PTR(-ENOMEM);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 6323153d277b..a0ed712c810d 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1433,6 +1433,55 @@ TEST_F(iommufd_mock_domain, alloc_hwpt)
 	}
 }
 
+FIXTURE(iommufd_dirty_tracking)
+{
+	int fd;
+	uint32_t ioas_id;
+	uint32_t hwpt_id;
+	uint32_t stdev_id;
+	uint32_t idev_id;
+};
+
+FIXTURE_SETUP(iommufd_dirty_tracking)
+{
+	self->fd = open("/dev/iommu", O_RDWR);
+	ASSERT_NE(-1, self->fd);
+
+	test_ioctl_ioas_alloc(&self->ioas_id);
+	test_cmd_mock_domain(self->ioas_id, &self->stdev_id,
+			     &self->hwpt_id, &self->idev_id);
+}
+
+FIXTURE_TEARDOWN(iommufd_dirty_tracking)
+{
+	teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd_dirty_tracking, enforce_dirty)
+{
+	uint32_t ioas_id, stddev_id, idev_id;
+	uint32_t hwpt_id, _hwpt_id;
+	uint32_t dev_flags;
+
+	/* Regular case */
+	dev_flags = MOCK_FLAGS_DEVICE_NO_DIRTY;
+	test_cmd_hwpt_alloc(self->idev_id, self->ioas_id,
+			    IOMMU_HWPT_ALLOC_ENFORCE_DIRTY, &hwpt_id);
+	test_cmd_mock_domain(hwpt_id, &stddev_id, NULL, NULL);
+	test_err_mock_domain_flags(EINVAL, hwpt_id, dev_flags,
+				   &stddev_id, NULL);
+	test_ioctl_destroy(stddev_id);
+	test_ioctl_destroy(hwpt_id);
+
+	/* IOMMU device does not support dirty tracking */
+	test_ioctl_ioas_alloc(&ioas_id);
+	test_cmd_mock_domain_flags(ioas_id, dev_flags,
+				   &stddev_id, &_hwpt_id, &idev_id);
+	test_err_hwpt_alloc(EOPNOTSUPP, idev_id, ioas_id,
+			    IOMMU_HWPT_ALLOC_ENFORCE_DIRTY, &hwpt_id);
+	test_ioctl_destroy(stddev_id);
+}
+
 /* VFIO compatibility IOCTLs */
 
 TEST_F(iommufd, simple_ioctls)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 8e84d2592f2d..930edfe693c7 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -99,6 +99,9 @@ static int _test_cmd_mock_domain_flags(int fd, unsigned int ioas_id,
 		*idev_id = cmd.mock_domain_flags.out_idev_id;
 	return 0;
 }
+#define test_cmd_mock_domain_flags(ioas_id, flags, stdev_id, hwpt_id, idev_id)       \
+	ASSERT_EQ(0, _test_cmd_mock_domain_flags(self->fd, ioas_id, flags, \
+						 stdev_id, hwpt_id, idev_id))
 #define test_err_mock_domain_flags(_errno, ioas_id, flags, stdev_id, hwpt_id) \
 	EXPECT_ERRNO(_errno, _test_cmd_mock_domain_flags(self->fd, ioas_id, \
 							 flags, stdev_id, \
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v4 15/18] iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (13 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 14/18] iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-20  8:00   ` Tian, Kevin
  2023-10-18 20:27 ` [PATCH v4 16/18] iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Change mock_domain to support dirty tracking and add tests to exercise
the new SET_DIRTY API in the iommufd_dirty_tracking selftest fixture.
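
From userspace the toggle is a single ioctl on the hwpt, which the new
helper wraps (illustrative sketch mirroring _test_cmd_set_dirty() below):

    struct iommu_hwpt_set_dirty cmd = {
        .size = sizeof(cmd),
        .flags = IOMMU_DIRTY_TRACKING_ENABLE,
        .hwpt_id = hwpt_id,
    };

    ioctl(fd, IOMMU_HWPT_SET_DIRTY, &cmd);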

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/selftest.c              | 17 ++++++++++++++++
 tools/testing/selftests/iommu/iommufd.c       | 15 ++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 20 +++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index e5d421b54b1a..fcbb4e1d88d4 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -24,6 +24,7 @@ static struct platform_device *selftest_iommu_dev;
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+	MOCK_DIRTY_TRACK = 1,
 	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
 	/*
@@ -86,6 +87,7 @@ void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
 }
 
 struct mock_iommu_domain {
+	unsigned long flags;
 	struct iommu_domain domain;
 	struct xarray pfns;
 };
@@ -156,6 +158,21 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length, u32 *type)
 static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
 					  bool enable)
 {
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = mock->flags;
+
+	if (enable && !domain->dirty_ops)
+		return -EINVAL;
+
+	/* No change? */
+	if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+		return 0;
+
+	flags = (enable ?
+		 flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+	mock->flags = flags;
 	return 0;
 }
 
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index a0ed712c810d..ab1536d6b4db 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1482,6 +1482,21 @@ TEST_F(iommufd_dirty_tracking, enforce_dirty)
 	test_ioctl_destroy(stddev_id);
 }
 
+TEST_F(iommufd_dirty_tracking, set_dirty)
+{
+	uint32_t stddev_id;
+	uint32_t hwpt_id;
+
+	test_cmd_hwpt_alloc(self->idev_id, self->ioas_id,
+			    IOMMU_HWPT_ALLOC_ENFORCE_DIRTY, &hwpt_id);
+	test_cmd_mock_domain(hwpt_id, &stddev_id, NULL, NULL);
+	test_cmd_set_dirty(hwpt_id, true);
+	test_cmd_set_dirty(hwpt_id, false);
+
+	test_ioctl_destroy(stddev_id);
+	test_ioctl_destroy(hwpt_id);
+}
+
 /* VFIO compatibility IOCTLs */
 
 TEST_F(iommufd, simple_ioctls)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 930edfe693c7..5214ae17b19a 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -177,9 +177,29 @@ static int _test_cmd_access_replace_ioas(int fd, __u32 access_id,
 		return ret;
 	return 0;
 }
+
+
 #define test_cmd_access_replace_ioas(access_id, ioas_id) \
 	ASSERT_EQ(0, _test_cmd_access_replace_ioas(self->fd, access_id, ioas_id))
 
+static int _test_cmd_set_dirty(int fd, __u32 hwpt_id, bool enabled)
+{
+	struct iommu_hwpt_set_dirty cmd = {
+		.size = sizeof(cmd),
+		.flags = enabled ? IOMMU_DIRTY_TRACKING_ENABLE : 0,
+		.hwpt_id = hwpt_id,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_HWPT_SET_DIRTY, &cmd);
+	if (ret)
+		return -errno;
+	return 0;
+}
+
+#define test_cmd_set_dirty(hwpt_id, enabled) \
+	ASSERT_EQ(0, _test_cmd_set_dirty(self->fd, hwpt_id, enabled))
+
 static int _test_cmd_create_access(int fd, unsigned int ioas_id,
 				   __u32 *access_id, unsigned int flags)
 {
-- 
2.17.2



* [PATCH v4 16/18] iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (14 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 15/18] iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 20:27 ` [PATCH v4 17/18] iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO Joao Martins
  2023-10-18 20:27 ` [PATCH v4 18/18] iommufd/selftest: Test IOMMU_GET_DIRTY_IOVA_NO_CLEAR flag Joao Martins
  17 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Add a new test ioctl for simulating dirty IOVAs in the mock domain, and
implement the mock iommu domain ops that support dirty tracking.

The selftest exercises the main workflow of:

1) Setting dirty tracking on the iommu domain
2) Reading and clearing dirty IOPTEs

Different fixture variants test different IOVA range sizes that exercise
corner cases of the bitmaps.
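
The read side of that workflow at the UAPI level is the
IOMMU_HWPT_GET_DIRTY_IOVA ioctl added earlier in the series. A minimal
userspace sketch (error handling omitted; hwpt_id/iova/length/page_size are
set up beforehand and bitmap is a user buffer with one bit per page_size
unit of the queried range):

	struct iommu_hwpt_get_dirty_iova cmd = {
		.size = sizeof(cmd),
		.hwpt_id = hwpt_id,
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = bitmap,
	};

	/* Read the dirty bits for [iova, iova + length) and clear them */
	ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &cmd);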

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iommufd_test.h          |   9 ++
 drivers/iommu/iommufd/selftest.c              |  88 +++++++++++++-
 tools/testing/selftests/iommu/iommufd.c       |  99 ++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 109 ++++++++++++++++++
 4 files changed, 302 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 9817edcd8968..1f2e93d3d4e8 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -20,6 +20,7 @@ enum {
 	IOMMU_TEST_OP_MOCK_DOMAIN_REPLACE,
 	IOMMU_TEST_OP_ACCESS_REPLACE_IOAS,
 	IOMMU_TEST_OP_MOCK_DOMAIN_FLAGS,
+	IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -107,6 +108,14 @@ struct iommu_test_cmd {
 		struct {
 			__u32 ioas_id;
 		} access_replace_ioas;
+		struct {
+			__u32 flags;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 page_size;
+			__aligned_u64 uptr;
+			__aligned_u64 out_nr_dirty;
+		} dirty;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index fcbb4e1d88d4..4ba10dbc63d0 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -37,6 +37,7 @@ enum {
 	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
 	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
 	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 /*
@@ -181,6 +182,31 @@ static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
 					    unsigned long flags,
 					    struct iommu_dirty_bitmap *dirty)
 {
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+	void *ent, *old;
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK) && dirty->bitmap)
+		return -EINVAL;
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+		ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
+		if (ent &&
+		    (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)) {
+			unsigned long val;
+
+			/* Clear dirty */
+			val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			iommu_dirty_bitmap_record(dirty, cur, MOCK_IO_PAGE_SIZE);
+		}
+	}
+
 	return 0;
 }
 
@@ -313,7 +339,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 
 		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
 			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-			WARN_ON(!ent);
+
 			/*
 			 * iommufd generates unmaps that must be a strict
 			 * superset of the map's performend So every starting
@@ -323,12 +349,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 			 * passed to map_pages
 			 */
 			if (first) {
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_START_IOVA));
 				first = false;
 			}
 			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_LAST_IOVA));
 
 			iova += MOCK_IO_PAGE_SIZE;
@@ -1056,6 +1082,56 @@ static_assert((unsigned int)MOCK_ACCESS_RW_WRITE == IOMMUFD_ACCESS_RW_WRITE);
 static_assert((unsigned int)MOCK_ACCESS_RW_SLOW_PATH ==
 	      __IOMMUFD_ACCESS_RW_SLOW_PATH);
 
+static int iommufd_test_dirty(struct iommufd_ucmd *ucmd,
+			      unsigned int mockpt_id, unsigned long iova,
+			      size_t length, unsigned long page_size,
+			      void __user *uptr, u32 flags)
+{
+	unsigned long i, max = length / page_size;
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc, count = 0;
+
+	if (iova % page_size || length % page_size ||
+	    (uintptr_t)uptr % page_size)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * page_size;
+		void *ent, *old;
+
+		if (!test_bit(i, (unsigned long *) uptr))
+			continue;
+
+		ent = xa_load(&mock->pfns, cur / page_size);
+		if (ent) {
+			unsigned long val;
+
+			val = xa_to_value(ent) | MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / page_size,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			count++;
+		}
+	}
+
+	cmd->dirty.out_nr_dirty = count;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 void iommufd_selftest_destroy(struct iommufd_object *obj)
 {
 	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
@@ -1121,6 +1197,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 			return -EINVAL;
 		iommufd_test_memory_limit = cmd->memory_limit.limit;
 		return 0;
+	case IOMMU_TEST_OP_DIRTY:
+		return iommufd_test_dirty(
+			ucmd, cmd->id, cmd->dirty.iova,
+			cmd->dirty.length, cmd->dirty.page_size,
+			u64_to_user_ptr(cmd->dirty.uptr),
+			cmd->dirty.flags);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index ab1536d6b4db..e12e0731e414 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -12,6 +12,7 @@
 static unsigned long HUGEPAGE_SIZE;
 
 #define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
+#define BITS_PER_BYTE 8
 
 static unsigned long get_huge_page_size(void)
 {
@@ -1440,13 +1441,47 @@ FIXTURE(iommufd_dirty_tracking)
 	uint32_t hwpt_id;
 	uint32_t stdev_id;
 	uint32_t idev_id;
+	unsigned long page_size;
+	unsigned long bitmap_size;
+	void *bitmap;
+	void *buffer;
+};
+
+FIXTURE_VARIANT(iommufd_dirty_tracking)
+{
+	unsigned long buffer_size;
 };
 
 FIXTURE_SETUP(iommufd_dirty_tracking)
 {
+	void *vrc;
+	int rc;
+
 	self->fd = open("/dev/iommu", O_RDWR);
 	ASSERT_NE(-1, self->fd);
 
+	rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
+	if (rc || !self->buffer) {
+		SKIP(return, "Skipping buffer_size=%lu due to errno=%d",
+			     variant->buffer_size, rc);
+	}
+
+	assert((uintptr_t)self->buffer % HUGEPAGE_SIZE == 0);
+	vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
+	assert(vrc == self->buffer);
+
+	self->page_size = MOCK_PAGE_SIZE;
+	self->bitmap_size = variant->buffer_size /
+			     self->page_size / BITS_PER_BYTE;
+
+	/* Provision with an extra (MOCK_PAGE_SIZE) for the unaligned case */
+	rc = posix_memalign(&self->bitmap, PAGE_SIZE,
+			    self->bitmap_size + MOCK_PAGE_SIZE);
+	assert(!rc);
+	assert(self->bitmap);
+	assert((uintptr_t)self->bitmap % PAGE_SIZE == 0);
+
 	test_ioctl_ioas_alloc(&self->ioas_id);
 	test_cmd_mock_domain(self->ioas_id, &self->stdev_id,
 			     &self->hwpt_id, &self->idev_id);
@@ -1454,9 +1489,41 @@ FIXTURE_SETUP(iommufd_dirty_tracking)
 
 FIXTURE_TEARDOWN(iommufd_dirty_tracking)
 {
+	munmap(self->buffer, variant->buffer_size);
+	munmap(self->bitmap, self->bitmap_size);
 	teardown_iommufd(self->fd, _metadata);
 }
 
+FIXTURE_VARIANT_ADD(iommufd_dirty_tracking, domain_dirty128k)
+{
+	/* one u32 index bitmap */
+	.buffer_size = 128UL * 1024UL,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_dirty_tracking, domain_dirty256k)
+{
+	/* one u64 index bitmap */
+	.buffer_size = 256UL * 1024UL,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_dirty_tracking, domain_dirty640k)
+{
+	/* two u64 index and trailing end bitmap */
+	.buffer_size = 640UL * 1024UL,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_dirty_tracking, domain_dirty128M)
+{
+	/* 4K bitmap (128M IOVA range) */
+	.buffer_size = 128UL * 1024UL * 1024UL,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_dirty_tracking, domain_dirty256M)
+{
+	/* 8K bitmap (256M IOVA range) */
+	.buffer_size = 256UL * 1024UL * 1024UL,
+};
+
 TEST_F(iommufd_dirty_tracking, enforce_dirty)
 {
 	uint32_t ioas_id, stddev_id, idev_id;
@@ -1497,6 +1564,38 @@ TEST_F(iommufd_dirty_tracking, set_dirty)
 	test_ioctl_destroy(hwpt_id);
 }
 
+TEST_F(iommufd_dirty_tracking, get_dirty_iova)
+{
+	uint32_t stddev_id;
+	uint32_t hwpt_id;
+	uint32_t ioas_id;
+
+	test_ioctl_ioas_alloc(&ioas_id);
+	test_ioctl_ioas_map_fixed_id(ioas_id, self->buffer,
+				     variant->buffer_size,
+				     MOCK_APERTURE_START);
+
+	test_cmd_hwpt_alloc(self->idev_id, ioas_id,
+			    IOMMU_HWPT_ALLOC_ENFORCE_DIRTY, &hwpt_id);
+	test_cmd_mock_domain(hwpt_id, &stddev_id, NULL, NULL);
+
+	test_cmd_set_dirty(hwpt_id, true);
+
+	test_mock_dirty_bitmaps(hwpt_id, variant->buffer_size,
+				MOCK_APERTURE_START,
+				self->page_size, self->bitmap,
+				self->bitmap_size, _metadata);
+
+	/* PAGE_SIZE unaligned bitmap */
+	test_mock_dirty_bitmaps(hwpt_id, variant->buffer_size,
+				MOCK_APERTURE_START,
+				self->page_size, self->bitmap + MOCK_PAGE_SIZE,
+				self->bitmap_size, _metadata);
+
+	test_ioctl_destroy(stddev_id);
+	test_ioctl_destroy(hwpt_id);
+}
+
 /* VFIO compatibility IOCTLs */
 
 TEST_F(iommufd, simple_ioctls)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 5214ae17b19a..1ff93f812a32 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -9,6 +9,8 @@
 #include <sys/ioctl.h>
 #include <stdint.h>
 #include <assert.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 
 #include "../kselftest_harness.h"
 #include "../../../../drivers/iommu/iommufd/iommufd_test.h"
@@ -200,6 +202,102 @@ static int _test_cmd_set_dirty(int fd, __u32 hwpt_id, bool enabled)
 #define test_cmd_set_dirty(hwpt_id, enabled) \
 	ASSERT_EQ(0, _test_cmd_set_dirty(self->fd, hwpt_id, enabled))
 
+static int _test_cmd_get_dirty_iova(int fd, __u32 hwpt_id, size_t length,
+				    __u64 iova, size_t page_size, __u64 *bitmap)
+{
+	struct iommu_hwpt_get_dirty_iova cmd = {
+		.size = sizeof(cmd),
+		.hwpt_id = hwpt_id,
+		.iova = iova,
+		.length = length,
+		.page_size = page_size,
+		.data = bitmap,
+	};
+	int ret;
+
+	ret = ioctl(fd, IOMMU_HWPT_GET_DIRTY_IOVA, &cmd);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+#define test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap) \
+	ASSERT_EQ(0, _test_cmd_get_dirty_iova(fd, hwpt_id, length,            \
+					      iova, page_size, bitmap))
+
+static int _test_cmd_mock_domain_set_dirty(int fd, __u32 hwpt_id, size_t length,
+					   __u64 iova, size_t page_size,
+					   __u64 *bitmap, __u64 *dirty)
+{
+	struct iommu_test_cmd cmd = {
+		.size = sizeof(cmd),
+		.op = IOMMU_TEST_OP_DIRTY,
+		.id = hwpt_id,
+		.dirty = {
+			.iova = iova,
+			.length = length,
+			.page_size = page_size,
+			.uptr = (uintptr_t) bitmap,
+		}
+	};
+	int ret;
+
+	ret = ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY), &cmd);
+	if (ret)
+		return -ret;
+	if (dirty)
+		*dirty = cmd.dirty.out_nr_dirty;
+	return 0;
+}
+
+#define test_cmd_mock_domain_set_dirty(fd, hwpt_id, length, iova, page_size, bitmap, nr) \
+	ASSERT_EQ(0, _test_cmd_mock_domain_set_dirty(fd, hwpt_id,            \
+						     length, iova,           \
+						     page_size, bitmap,      \
+						     nr))
+
+static int _test_mock_dirty_bitmaps(int fd, __u32 hwpt_id, size_t length,
+				    __u64 iova, size_t page_size,
+				    __u64 *bitmap, __u64 bitmap_size,
+				    struct __test_metadata *_metadata)
+{
+	unsigned long i, count, nbits = bitmap_size * BITS_PER_BYTE;
+	unsigned long nr = nbits / 2;
+	__u64 out_dirty = 0;
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			__set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(nr, count);
+
+	test_cmd_mock_domain_set_dirty(fd, hwpt_id, length, iova, page_size,
+				       bitmap, &out_dirty);
+	ASSERT_EQ(nr, out_dirty);
+
+	/* Expect all even bits as dirty in the user bitmap */
+	memset(bitmap, 0, bitmap_size);
+	test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap);
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, out_dirty);
+
+	memset(bitmap, 0, bitmap_size);
+	test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap);
+
+	/* It was read already -- expect all zeroes */
+	for (i = 0; i < nbits; i++)
+		ASSERT_EQ(0, test_bit(i, (unsigned long *) bitmap));
+
+	return 0;
+}
+#define test_mock_dirty_bitmaps(hwpt_id, length, iova, page_size, bitmap, \
+				bitmap_size, _metadata) \
+	ASSERT_EQ(0, _test_mock_dirty_bitmaps(self->fd, hwpt_id,      \
+					      length, iova,           \
+					      page_size, bitmap,      \
+					      bitmap_size, _metadata))
+
 static int _test_cmd_create_access(int fd, unsigned int ioas_id,
 				   __u32 *access_id, unsigned int flags)
 {
@@ -324,6 +422,17 @@ static int _test_ioctl_ioas_map(int fd, unsigned int ioas_id, void *buffer,
 					     IOMMU_IOAS_MAP_READABLE));       \
 	})
 
+#define test_ioctl_ioas_map_fixed_id(ioas_id, buffer, length, iova)           \
+	({                                                                    \
+		__u64 __iova = iova;                                          \
+		ASSERT_EQ(0, _test_ioctl_ioas_map(                            \
+				     self->fd, ioas_id, buffer, length,       \
+				     &__iova,                                 \
+				     IOMMU_IOAS_MAP_FIXED_IOVA |              \
+					     IOMMU_IOAS_MAP_WRITEABLE |       \
+					     IOMMU_IOAS_MAP_READABLE));       \
+	})
+
 #define test_err_ioctl_ioas_map_fixed(_errno, buffer, length, iova)           \
 	({                                                                    \
 		__u64 __iova = iova;                                          \
-- 
2.17.2



* [PATCH v4 17/18] iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (15 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 16/18] iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  2023-10-18 20:27 ` [PATCH v4 18/18] iommufd/selftest: Test IOMMU_GET_DIRTY_IOVA_NO_CLEAR flag Joao Martins
  17 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Enumerate the capabilities from the mock device and test whether they are
advertised as expected. Include it as part of the iommufd_dirty_tracking
fixture.
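
The userspace side of that probe is IOMMU_GET_HW_INFO with the new
out_capabilities field; roughly (a sketch, error handling omitted, no
type-specific data requested):

	struct iommu_hw_info cmd = {
		.size = sizeof(cmd),
		.dev_id = idev_id,
	};

	ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd);

	/* Non-zero if the device's IOMMU supports dirty tracking */
	supported = cmd.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING;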

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/selftest.c                | 13 ++++++++++++-
 tools/testing/selftests/iommu/iommufd.c         | 17 +++++++++++++++++
 .../testing/selftests/iommu/iommufd_fail_nth.c  |  2 +-
 tools/testing/selftests/iommu/iommufd_utils.h   | 14 +++++++++++---
 4 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 4ba10dbc63d0..322254a2bbc1 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -379,7 +379,18 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
 
 static bool mock_domain_capable(struct device *dev, enum iommu_cap cap)
 {
-	return cap == IOMMU_CAP_CACHE_COHERENCY;
+	struct mock_dev *mdev = container_of(dev, struct mock_dev, dev);
+
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	case IOMMU_CAP_DIRTY:
+		return !(mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY);
+	default:
+		break;
+	}
+
+	return false;
 }
 
 static void mock_domain_set_plaform_dma_ops(struct device *dev)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index e12e0731e414..2865dc271a26 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1564,6 +1564,23 @@ TEST_F(iommufd_dirty_tracking, set_dirty)
 	test_ioctl_destroy(hwpt_id);
 }
 
+TEST_F(iommufd_dirty_tracking, device_dirty_capability)
+{
+	uint32_t caps = 0;
+	uint32_t stddev_id;
+	uint32_t hwpt_id;
+
+	test_cmd_hwpt_alloc(self->idev_id, self->ioas_id, 0, &hwpt_id);
+	test_cmd_mock_domain(hwpt_id, &stddev_id, NULL, NULL);
+	test_cmd_get_hw_capabilities(self->idev_id, caps,
+				     IOMMU_HW_CAP_DIRTY_TRACKING);
+	ASSERT_EQ(IOMMU_HW_CAP_DIRTY_TRACKING,
+		  caps & IOMMU_HW_CAP_DIRTY_TRACKING);
+
+	test_ioctl_destroy(stddev_id);
+	test_ioctl_destroy(hwpt_id);
+}
+
 TEST_F(iommufd_dirty_tracking, get_dirty_iova)
 {
 	uint32_t stddev_id;
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 31386be42439..ff735bdd833e 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -612,7 +612,7 @@ TEST_FAIL_NTH(basic_fail_nth, device)
 				  &idev_id))
 		return -1;
 
-	if (_test_cmd_get_hw_info(self->fd, idev_id, &info, sizeof(info)))
+	if (_test_cmd_get_hw_info(self->fd, idev_id, &info, sizeof(info), NULL))
 		return -1;
 
 	if (_test_cmd_hwpt_alloc(self->fd, idev_id, ioas_id, 0, &hwpt_id))
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 1ff93f812a32..790b5c0769cc 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -522,7 +522,8 @@ static void teardown_iommufd(int fd, struct __test_metadata *_metadata)
 
 /* @data can be NULL */
 static int _test_cmd_get_hw_info(int fd, __u32 device_id,
-				 void *data, size_t data_len)
+				 void *data, size_t data_len,
+				 uint32_t *capabilities)
 {
 	struct iommu_test_hw_info *info = (struct iommu_test_hw_info *)data;
 	struct iommu_hw_info cmd = {
@@ -530,6 +531,7 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id,
 		.dev_id = device_id,
 		.data_len = data_len,
 		.data_uptr = (uint64_t)data,
+		.out_capabilities = 0,
 	};
 	int ret;
 
@@ -566,14 +568,20 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id,
 			assert(!info->flags);
 	}
 
+	if (capabilities)
+		*capabilities = cmd.out_capabilities;
+
 	return 0;
 }
 
 #define test_cmd_get_hw_info(device_id, data, data_len)         \
 	ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, \
-					   data, data_len))
+					   data, data_len, NULL))
 
 #define test_err_get_hw_info(_errno, device_id, data, data_len) \
 	EXPECT_ERRNO(_errno,                                    \
 		     _test_cmd_get_hw_info(self->fd, device_id, \
-					   data, data_len))
+					   data, data_len, NULL))
+
+#define test_cmd_get_hw_capabilities(device_id, caps, mask)     \
+	ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, NULL, 0, &caps))
-- 
2.17.2



* [PATCH v4 18/18] iommufd/selftest: Test IOMMU_GET_DIRTY_IOVA_NO_CLEAR flag
  2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
                   ` (16 preceding siblings ...)
  2023-10-18 20:27 ` [PATCH v4 17/18] iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO Joao Martins
@ 2023-10-18 20:27 ` Joao Martins
  17 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 20:27 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Joao Martins

Change test_mock_dirty_bitmaps() to take an extra argument that specifies
the flag under test. The test does the same thing as the regular
GET_DIRTY_IOVA test, except that it checks whether the dirtied bits are
fetched all the same a second time, as opposed to observing them cleared.
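
In other words, with the new flag a second query is expected to report the
same dirty bits as the first, instead of finding them cleared. A sketch of
the expected UAPI behaviour (cmd is an iommu_hwpt_get_dirty_iova set up as
in the regular GET_DIRTY_IOVA test):

	cmd.flags = IOMMU_GET_DIRTY_IOVA_NO_CLEAR;

	/* Read the dirty bits without clearing them in the IOPTEs */
	ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &cmd);

	/* A second read still observes the same bits set */
	ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &cmd);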

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/selftest.c              | 15 ++++---
 tools/testing/selftests/iommu/iommufd.c       | 40 ++++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 26 +++++++-----
 3 files changed, 64 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 322254a2bbc1..7e2174efc927 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -196,13 +196,16 @@ static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
 		ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
 		if (ent &&
 		    (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)) {
-			unsigned long val;
-
 			/* Clear dirty */
-			val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
-			old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,
-				       xa_mk_value(val), GFP_KERNEL);
-			WARN_ON_ONCE(ent != old);
+			if (!(flags & IOMMU_GET_DIRTY_IOVA_NO_CLEAR)) {
+				unsigned long val;
+
+				val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
+				old = xa_store(&mock->pfns,
+					       cur / MOCK_IO_PAGE_SIZE,
+					       xa_mk_value(val), GFP_KERNEL);
+				WARN_ON_ONCE(ent != old);
+			}
 			iommu_dirty_bitmap_record(dirty, cur, MOCK_IO_PAGE_SIZE);
 		}
 	}
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 2865dc271a26..d94042d9b812 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1601,13 +1601,49 @@ TEST_F(iommufd_dirty_tracking, get_dirty_iova)
 	test_mock_dirty_bitmaps(hwpt_id, variant->buffer_size,
 				MOCK_APERTURE_START,
 				self->page_size, self->bitmap,
-				self->bitmap_size, _metadata);
+				self->bitmap_size, 0, _metadata);
 
 	/* PAGE_SIZE unaligned bitmap */
 	test_mock_dirty_bitmaps(hwpt_id, variant->buffer_size,
 				MOCK_APERTURE_START,
 				self->page_size, self->bitmap + MOCK_PAGE_SIZE,
-				self->bitmap_size, _metadata);
+				self->bitmap_size, 0, _metadata);
+
+	test_ioctl_destroy(stddev_id);
+	test_ioctl_destroy(hwpt_id);
+}
+
+TEST_F(iommufd_dirty_tracking, get_dirty_iova_no_clear)
+{
+	uint32_t stddev_id;
+	uint32_t hwpt_id;
+	uint32_t ioas_id;
+
+	test_ioctl_ioas_alloc(&ioas_id);
+	test_ioctl_ioas_map_fixed_id(ioas_id, self->buffer,
+				     variant->buffer_size,
+				     MOCK_APERTURE_START);
+
+	test_cmd_hwpt_alloc(self->idev_id, ioas_id,
+			    IOMMU_HWPT_ALLOC_ENFORCE_DIRTY, &hwpt_id);
+	test_cmd_mock_domain(hwpt_id, &stddev_id, NULL, NULL);
+
+	test_cmd_set_dirty(hwpt_id, true);
+
+	test_mock_dirty_bitmaps(hwpt_id, variant->buffer_size,
+				MOCK_APERTURE_START,
+				self->page_size, self->bitmap,
+				self->bitmap_size,
+				IOMMU_GET_DIRTY_IOVA_NO_CLEAR,
+				_metadata);
+
+	/* Unaligned bitmap */
+	test_mock_dirty_bitmaps(hwpt_id, variant->buffer_size,
+				MOCK_APERTURE_START,
+				self->page_size, self->bitmap + MOCK_PAGE_SIZE,
+				self->bitmap_size,
+				IOMMU_GET_DIRTY_IOVA_NO_CLEAR,
+				_metadata);
 
 	test_ioctl_destroy(stddev_id);
 	test_ioctl_destroy(hwpt_id);
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 790b5c0769cc..e66d6b2e7367 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -203,11 +203,13 @@ static int _test_cmd_set_dirty(int fd, __u32 hwpt_id, bool enabled)
 	ASSERT_EQ(0, _test_cmd_set_dirty(self->fd, hwpt_id, enabled))
 
 static int _test_cmd_get_dirty_iova(int fd, __u32 hwpt_id, size_t length,
-				    __u64 iova, size_t page_size, __u64 *bitmap)
+				    __u64 iova, size_t page_size, __u64 *bitmap,
+				    __u32 flags)
 {
 	struct iommu_hwpt_get_dirty_iova cmd = {
 		.size = sizeof(cmd),
 		.hwpt_id = hwpt_id,
+		.flags = flags,
 		.iova = iova,
 		.length = length,
 		.page_size = page_size,
@@ -221,9 +223,10 @@ static int _test_cmd_get_dirty_iova(int fd, __u32 hwpt_id, size_t length,
 	return 0;
 }
 
-#define test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap) \
+#define test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap, \
+				flags) \
 	ASSERT_EQ(0, _test_cmd_get_dirty_iova(fd, hwpt_id, length,            \
-					      iova, page_size, bitmap))
+					      iova, page_size, bitmap, flags))
 
 static int _test_cmd_mock_domain_set_dirty(int fd, __u32 hwpt_id, size_t length,
 					   __u64 iova, size_t page_size,
@@ -259,6 +262,7 @@ static int _test_cmd_mock_domain_set_dirty(int fd, __u32 hwpt_id, size_t length,
 static int _test_mock_dirty_bitmaps(int fd, __u32 hwpt_id, size_t length,
 				    __u64 iova, size_t page_size,
 				    __u64 *bitmap, __u64 bitmap_size,
+				    __u32 flags,
 				    struct __test_metadata *_metadata)
 {
 	unsigned long i, count, nbits = bitmap_size * BITS_PER_BYTE;
@@ -277,26 +281,30 @@ static int _test_mock_dirty_bitmaps(int fd, __u32 hwpt_id, size_t length,
 
 	/* Expect all even bits as dirty in the user bitmap */
 	memset(bitmap, 0, bitmap_size);
-	test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap);
+	test_cmd_get_dirty_iova(fd, hwpt_id, length, iova,
+				page_size, bitmap, flags);
 	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
 		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
 	ASSERT_EQ(count, out_dirty);
 
 	memset(bitmap, 0, bitmap_size);
-	test_cmd_get_dirty_iova(fd, hwpt_id, length, iova, page_size, bitmap);
+	test_cmd_get_dirty_iova(fd, hwpt_id, length, iova,
+				page_size, bitmap, flags);
 
 	/* It was read already -- expect all zeroes */
-	for (i = 0; i < nbits; i++)
-		ASSERT_EQ(0, test_bit(i, (unsigned long *) bitmap));
+	for (i = 0; i < nbits; i++) {
+		ASSERT_EQ(!(i % 2) && (flags & IOMMU_GET_DIRTY_IOVA_NO_CLEAR),
+			  test_bit(i, (unsigned long *) bitmap));
+	}
 
 	return 0;
 }
 #define test_mock_dirty_bitmaps(hwpt_id, length, iova, page_size, bitmap, \
-				bitmap_size, _metadata) \
+				bitmap_size, flags, _metadata) \
 	ASSERT_EQ(0, _test_mock_dirty_bitmaps(self->fd, hwpt_id,      \
 					      length, iova,           \
 					      page_size, bitmap,      \
-					      bitmap_size, _metadata))
+					      bitmap_size, flags, _metadata))
 
 static int _test_cmd_create_access(int fd, unsigned int ioas_id,
 				   __u32 *access_id, unsigned int flags)
-- 
2.17.2



* Re: [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols
  2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
@ 2023-10-18 22:14   ` Jason Gunthorpe
  2023-10-20  5:45   ` Tian, Kevin
  2023-10-20 16:44   ` Alex Williamson
  2 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:26:58PM +0100, Joao Martins wrote:
> In preparation to move iova_bitmap into iommufd, export the rest of API
> symbols that will be used in what could be used by modules, namely:
> 
> 	iova_bitmap_alloc
> 	iova_bitmap_free
> 	iova_bitmap_for_each
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/vfio/iova_bitmap.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd
  2023-10-18 20:26 ` [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd Joao Martins
@ 2023-10-18 22:14   ` Jason Gunthorpe
  2023-10-19 17:48     ` Brett Creeley
  2023-10-20  5:46   ` Tian, Kevin
  2023-10-20 16:44   ` Alex Williamson
  2 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm,
	Brett Creeley, Yishai Hadas

On Wed, Oct 18, 2023 at 09:26:59PM +0100, Joao Martins wrote:
> Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking
> the user bitmaps, so move to the common dependency into IOMMUFD.  In doing
> so, create the symbol IOMMUFD_DRIVER which designates the builtin code that
> will be used by drivers when selected. Today this means MLX5_VFIO_PCI and
> PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when
> supporting dirty tracking and select IOMMUFD_DRIVER accordingly.
> 
> Given that the symbol maybe be disabled, add header definitions in
> iova_bitmap.h for when IOMMUFD_DRIVER=n
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/Kconfig                 |  4 +++
>  drivers/iommu/iommufd/Makefile                |  1 +
>  drivers/{vfio => iommu/iommufd}/iova_bitmap.c |  0
>  drivers/vfio/Makefile                         |  3 +--
>  drivers/vfio/pci/mlx5/Kconfig                 |  1 +
>  drivers/vfio/pci/pds/Kconfig                  |  1 +
>  include/linux/iova_bitmap.h                   | 26 +++++++++++++++++++
>  7 files changed, 34 insertions(+), 2 deletions(-)
>  rename drivers/{vfio => iommu/iommufd}/iova_bitmap.c (100%)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
  2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
@ 2023-10-18 22:16   ` Jason Gunthorpe
  2023-10-19 17:48   ` Brett Creeley
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:16 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm,
	Brett Creeley, Yishai Hadas

On Wed, Oct 18, 2023 at 09:27:00PM +0100, Joao Martins wrote:
> Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
> export convention i.e. using the IOMMUFD namespace. In doing so,
> import the namespace in the current users. This means VFIO and the
> vfio-pci drivers that use iova_bitmap_set().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/iova_bitmap.c | 8 ++++----
>  drivers/vfio/pci/mlx5/main.c        | 1 +
>  drivers/vfio/pci/pds/pci_drv.c      | 1 +
>  drivers/vfio/vfio_main.c            | 1 +
>  4 files changed, 7 insertions(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach
  2023-10-18 20:27 ` [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach Joao Martins
@ 2023-10-18 22:26   ` Jason Gunthorpe
  2023-10-18 22:38   ` Jason Gunthorpe
  1 sibling, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:02PM +0100, Joao Martins wrote:
> Throughout IOMMU domain lifetime that wants to use dirty tracking, some
> guarantees are needed such that any device attached to the iommu_domain
> supports dirty tracking.
> 
> The idea is to handle a case where IOMMU in the system are assymetric
> feature-wise and thus the capability may not be supported for all devices.
> The enforcement is done by adding a flag into HWPT_ALLOC namely:
> 
> 	IOMMUFD_HWPT_ALLOC_ENFORCE_DIRTY
> 
> .. Passed in HWPT_ALLOC ioctl() flags. The enforcement is done by creating
> a iommu_domain via domain_alloc_user() and validating the requested flags
> with what the device IOMMU supports (and failing accordingly) advertised).
> Advertising the new IOMMU domain feature flag requires that the individual
> iommu driver capability is supported when a future device attachment
> happens.
> 
> Link: https://lore.kernel.org/kvm/20220721142421.GB4609@nvidia.com/
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/hw_pagetable.c | 4 +++-
>  include/uapi/linux/iommufd.h         | 3 +++
>  2 files changed, 6 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking
  2023-10-18 20:27 ` [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking Joao Martins
@ 2023-10-18 22:26   ` Jason Gunthorpe
  2023-10-19  1:45   ` Baolu Lu
  2023-10-20  5:54   ` Tian, Kevin
  2 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:01PM +0100, Joao Martins wrote:
> Add to iommu domain operations a set of callbacks to perform dirty
> tracking, particulary to start and stop tracking and to read and clear the
> dirty data.
> 
> Drivers are generally expected to dynamically change its translation
> structures to toggle the tracking and flush some form of control state
> structure that stands in the IOVA translation path. Though it's not
> mandatory, as drivers can also enable dirty tracking at boot, and just
> clear the dirty bits before setting dirty tracking. For each of the newly
> added IOMMU core APIs:
> 
> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
> capabilities of the device.
> 
> .set_dirty_tracking(): an iommu driver is expected to change its
> translation structures and enable dirty tracking for the devices in the
> iommu_domain. For drivers making dirty tracking always-enabled, it should
> just return 0.
> 
> .read_and_clear_dirty(): an iommu driver is expected to walk the pagetables
> for the iova range passed in and use iommu_dirty_bitmap_record() to record
> dirty info per IOVA. When detecting that a given IOVA is dirty it should
> also clear its dirty state from the PTE, *unless* the flag
> IOMMU_DIRTY_NO_CLEAR is passed in -- flushing is steered from the caller of
> the domain_op via iotlb_gather. The iommu core APIs use the same data
> structure in use for dirty tracking for VFIO device dirty (struct
> iova_bitmap) abstracted by iommu_dirty_bitmap_record() helper function.
> 
> domain::dirty_ops: IOMMU domains will store the dirty ops depending on
> whether the iommu device supports dirty tracking or not. iommu drivers can
> then use this field to figure if the dirty tracking is supported+enforced
> on attach. The enforcement is enable via domain_alloc_user() which is done
> via IOMMUFD hwpt flag introduced later.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/io-pgtable.h |  4 +++
>  include/linux/iommu.h      | 56 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 60 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
@ 2023-10-18 22:28   ` Jason Gunthorpe
  2023-10-20  6:09   ` Tian, Kevin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:28 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:03PM +0100, Joao Martins wrote:
> Every IOMMU driver should be able to implement the needed iommu domain ops
> to control dirty tracking.
> 
> Connect a hw_pagetable to the IOMMU core dirty tracking ops, specifically
> the ability to enable/disable dirty tracking on an IOMMU domain
> (hw_pagetable id). To that end add an io_pagetable kernel API to toggle
> dirty tracking:
> 
> * iopt_set_dirty_tracking(iopt, [domain], state)
> 
> The intended caller of this is via the hw_pagetable object that is created.
> 
> Internally it will ensure the leftover dirty state is cleared /right
> before/ dirty tracking starts. This is also useful for iommu drivers which
> may decide that dirty tracking is always-enabled at boot without wanting to
> toggle dynamically via corresponding iommu domain op.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/hw_pagetable.c    | 24 +++++++++++
>  drivers/iommu/iommufd/io_pagetable.c    | 55 +++++++++++++++++++++++++
>  drivers/iommu/iommufd/iommufd_private.h | 12 ++++++
>  drivers/iommu/iommufd/main.c            |  3 ++
>  include/uapi/linux/iommufd.h            | 25 +++++++++++
>  5 files changed, 119 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach
  2023-10-18 20:27 ` [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach Joao Martins
  2023-10-18 22:26   ` Jason Gunthorpe
@ 2023-10-18 22:38   ` Jason Gunthorpe
  2023-10-18 23:38     ` Joao Martins
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:38 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:02PM +0100, Joao Martins wrote:
> Throughout IOMMU domain lifetime that wants to use dirty tracking, some
> guarantees are needed such that any device attached to the iommu_domain
> supports dirty tracking.
> 
> The idea is to handle a case where IOMMU in the system are assymetric
> feature-wise and thus the capability may not be supported for all devices.
> The enforcement is done by adding a flag into HWPT_ALLOC namely:
> 
> 	IOMMUFD_HWPT_ALLOC_ENFORCE_DIRTY

Actually, can we change this name?

IOMMUFD_HWPT_ALLOC_DIRTY_TRACKING

?

There isn't really anything 'enforce' here, it just creates a domain
that will always allow the eventual IOMMU_DIRTY_TRACKING_ENABLE no
matter what it is attached to.

That is the same as the other domain flags like NEST_PARENT

Jason


* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 20:27 ` [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
@ 2023-10-18 22:39   ` Jason Gunthorpe
  2023-10-18 23:43     ` Joao Martins
  2023-10-19 10:01   ` Joao Martins
  2023-10-20  6:32   ` Tian, Kevin
  2 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:39 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:04PM +0100, Joao Martins wrote:

> +int iommufd_check_iova_range(struct iommufd_ioas *ioas,
> +			     struct iommu_hwpt_get_dirty_iova *bitmap)
> +{
> +	unsigned long pgshift, npages;
> +	size_t iommu_pgsize;
> +	int rc = -EINVAL;
> +
> +	pgshift = __ffs(bitmap->page_size);
> +	npages = bitmap->length >> pgshift;

npages = bitmap->length / bitmap->page_size;

? (if page_size is a bitmask it is badly named)

> +static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
> +					unsigned long iova, size_t length,
> +					void *opaque)
> +{
> +	struct iopt_area *area;
> +	struct iopt_area_contig_iter iter;
> +	struct iova_bitmap_fn_arg *arg = opaque;
> +	struct iommu_domain *domain = arg->domain;
> +	struct iommu_dirty_bitmap *dirty = arg->dirty;
> +	const struct iommu_dirty_ops *ops = domain->dirty_ops;
> +	unsigned long last_iova = iova + length - 1;
> +	int ret = -EINVAL;
> +
> +	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> +
> +		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
> +						last - iter.cur_iova + 1,
> +						0, dirty);

This seems like a lot of stuff going on with ret..

> +		if (ret)

return ret

> +			break;
> +	}
> +
> +	if (!iopt_area_contig_done(&iter))
> +		ret = -EINVAL;

return  -EINVAL

> +
> +	return ret;

return 0;

And remove the -EINVAL. iopt_area_contig_done() captures the case
where the iova range is not fully contained by areas, even the case
where there are no areas.
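
Putting those together, the tail of the function would read roughly:

	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
		unsigned long last = min(last_iova, iopt_area_last_iova(area));

		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
						last - iter.cur_iova + 1,
						0, dirty);
		if (ret)
			return ret;
	}

	if (!iopt_area_contig_done(&iter))
		return -EINVAL;

	return 0;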

But otherwise

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO
  2023-10-18 20:27 ` [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO Joao Martins
@ 2023-10-18 22:44   ` Jason Gunthorpe
  2023-10-19  9:55     ` Joao Martins
  2023-10-20  6:46   ` Tian, Kevin
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:05PM +0100, Joao Martins wrote:
> Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
> given device.
> 
> Capabilities are IOMMU agnostic and use device_iommu_capable() API passing
> one of the IOMMU_CAP_*. Enumerate IOMMU_CAP_DIRTY for now in the
> out_capabilities field returned back to userspace.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/device.c |  4 ++++
>  include/uapi/linux/iommufd.h   | 11 +++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index e88fa73a45e6..71ee22dc1a85 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -1185,6 +1185,10 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
>  	 */
>  	cmd->data_len = data_len;
>  
> +	cmd->out_capabilities = 0;
> +	if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY))
> +		cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
> +
>  	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
>  out_free:
>  	kfree(data);
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index efeb12c1aaeb..91de0043e73f 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -419,6 +419,14 @@ enum iommu_hw_info_type {
>  	IOMMU_HW_INFO_TYPE_INTEL_VTD,
>  };
>  
> +/**
> + * enum iommufd_hw_info_capabilities
> + * @IOMMU_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
> + */

Let's write more details here about which iommufd APIs this flag means
will work.

But otherwise

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty
  2023-10-18 20:27 ` [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty Joao Martins
@ 2023-10-18 22:54   ` Jason Gunthorpe
  2023-10-18 23:50     ` Joao Martins
  2023-10-20  6:52   ` Tian, Kevin
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:06PM +0100, Joao Martins wrote:
> VFIO has an operation where it unmaps an IOVA while returning a bitmap with
> the dirty data. In reality the operation doesn't quite query the IO
> pagetables that the PTE was dirty or not. Instead it marks as dirty on
> anything that was mapped, and doing so in one syscall.
> 
> In IOMMUFD the equivalent is done in two operations by querying with
> GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
> flushes given that after clearing dirty bits IOMMU implementations require
> invalidating their IOTLB, plus another invalidation needed for the UNMAP.
> To allow dirty bits to be queried faster, add a flag
> (IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests to not clear the dirty bits
> from the PTE (but just reading them), under the expectation that the next
> operation is the unmap. An alternative is to unmap and just perpectually
> mark as dirty as that's the same behaviour as today. So here equivalent
> functionally can be provided with unmap alone, and if real dirty info is
> required it will amortize the cost while querying.

It seems fine, but I wonder if it is really worthwhile? Did you measure
this? I suppose it is during the critical outage window

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation
  2023-10-18 20:27 ` [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation Joao Martins
@ 2023-10-18 22:58   ` Jason Gunthorpe
  2023-10-18 23:54     ` Joao Martins
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 22:58 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:07PM +0100, Joao Martins wrote:
> -static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
> +static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
> +						  struct device *dev,
> +						  u32 flags)
>  {
>  	struct protection_domain *domain;
> +	struct amd_iommu *iommu = NULL;
> +
> +	if (dev) {
> +		iommu = rlookup_amd_iommu(dev);
> +		if (!iommu)
> +			return ERR_PTR(-ENODEV);
> +	}
>  
>  	/*
>  	 * Since DTE[Mode]=0 is prohibited on SNP-enabled system,
>  	 * default to use IOMMU_DOMAIN_DMA[_FQ].
>  	 */
>  	if (amd_iommu_snp_en && (type == IOMMU_DOMAIN_IDENTITY))
> -		return NULL;
> +		return ERR_PTR(-EINVAL);
>  
>  	domain = protection_domain_alloc(type);
>  	if (!domain)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
>  
>  	domain->domain.geometry.aperture_start = 0;
>  	domain->domain.geometry.aperture_end   = dma_max_address();
>  	domain->domain.geometry.force_aperture = true;
>  
> +	if (iommu) {
> +		domain->domain.type = type;
> +		domain->domain.pgsize_bitmap =
> +			iommu->iommu.ops->pgsize_bitmap;
> +		domain->domain.ops =
> +			iommu->iommu.ops->default_domain_ops;
> +	}
> +
>  	return &domain->domain;
>  }

In the end this is probably not enough refactoring, but this driver
needs so much work we should just wait till the already written series
get merged.

eg domain_alloc_paging should just invoke domain_alloc_user with some
null arguments if the driver is constructed this way

> +static struct iommu_domain *amd_iommu_domain_alloc_user(struct device *dev,
> +							u32 flags)
> +{
> +	unsigned int type = IOMMU_DOMAIN_UNMANAGED;
> +
> +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT)
> +		return ERR_PTR(-EOPNOTSUPP);

This should be written as a list of flags the driver *supports* not
that it rejects

if (flags)
	return ERR_PTR(-EOPNOTSUPP);

Jason


* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-18 20:27 ` [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
@ 2023-10-18 23:11   ` Jason Gunthorpe
  2023-10-19  0:17     ` Joao Martins
  2023-10-20 18:57   ` Joao Martins
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 23:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
> +					 unsigned long iova, size_t size,
> +					 unsigned long flags,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> +	unsigned long end = iova + size - 1;
> +
> +	do {
> +		unsigned long pgsize = 0;
> +		u64 *ptep, pte;
> +
> +		ptep = fetch_pte(pgtable, iova, &pgsize);
> +		if (ptep)
> +			pte = READ_ONCE(*ptep);

It is fine for now, but this is so slow for something that is such a
fast path. We are optimizing away a TLB invalidation but leaving
this???

It is a radix tree, you walk trees by retaining your position at each
level as you go (eg in a function per-level call chain or something)
then ++ is cheap. Re-searching the entire tree every time is madness.
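
Illustrative sketch only (not the AMD io_pgtable code; lookup_leaf_table(),
visit_pte() and PTES_PER_TABLE are hypothetical): keep the position inside
the current leaf table so advancing to the next PTE is an increment rather
than a fresh top-down fetch_pte() per page:

	unsigned long pfn = iova >> PAGE_SHIFT;
	unsigned long last = (iova + size - 1) >> PAGE_SHIFT;

	while (pfn <= last) {
		/* One top-down walk per leaf table, not per PTE */
		u64 *table = lookup_leaf_table(pgtable, pfn);
		unsigned int idx = pfn & (PTES_PER_TABLE - 1);

		for (; idx < PTES_PER_TABLE && pfn <= last; idx++, pfn++)
			visit_pte(&table[idx], pfn << PAGE_SHIFT);
	}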

Jason


* Re: [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach
  2023-10-18 22:38   ` Jason Gunthorpe
@ 2023-10-18 23:38     ` Joao Martins
  2023-10-20  5:55       ` Tian, Kevin
  0 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-18 23:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 18/10/2023 23:38, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:02PM +0100, Joao Martins wrote:
>> Throughout IOMMU domain lifetime that wants to use dirty tracking, some
>> guarantees are needed such that any device attached to the iommu_domain
>> supports dirty tracking.
>>
>> The idea is to handle a case where IOMMU in the system are assymetric
>> feature-wise and thus the capability may not be supported for all devices.
>> The enforcement is done by adding a flag into HWPT_ALLOC namely:
>>
>> 	IOMMUFD_HWPT_ALLOC_ENFORCE_DIRTY
> 
> Actually, can we change this name?
> 
> IOMMUFD_HWPT_ALLOC_DIRTY_TRACKING
> 
> ?
>
Yeap.

> There isnt' really anything 'enforce' here, it just creates a domain
> that will always allow the eventual IOMMU_DIRTY_TRACKING_ENABLE no
> matter what it is attached to.
> 
The 'enforce' part comes from the fact that a device being attached in
domain_alloc_user() will require dirty tracking before the attach can succeed.

Still, your flag name looks better anyhow.

> That is the same as the other domain flags like NEST_PARENT
> 
> Jason


* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 22:39   ` Jason Gunthorpe
@ 2023-10-18 23:43     ` Joao Martins
  2023-10-19 12:01       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-18 23:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 18/10/2023 23:39, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:04PM +0100, Joao Martins wrote:
> 
>> +int iommufd_check_iova_range(struct iommufd_ioas *ioas,
>> +			     struct iommu_hwpt_get_dirty_iova *bitmap)
>> +{
>> +	unsigned long pgshift, npages;
>> +	size_t iommu_pgsize;
>> +	int rc = -EINVAL;
>> +
>> +	pgshift = __ffs(bitmap->page_size);
>> +	npages = bitmap->length >> pgshift;
> 
> npages = bitmap->length / bitmap->page_size;
> 
> ? (if page_size is a bitmask it is badly named)
> 

It was a way to avoid the divide by zero, but
I can switch to the above and check for bitmap->page_size
being non-zero. That should be less obscure.


>> +static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
>> +					unsigned long iova, size_t length,
>> +					void *opaque)
>> +{
>> +	struct iopt_area *area;
>> +	struct iopt_area_contig_iter iter;
>> +	struct iova_bitmap_fn_arg *arg = opaque;
>> +	struct iommu_domain *domain = arg->domain;
>> +	struct iommu_dirty_bitmap *dirty = arg->dirty;
>> +	const struct iommu_dirty_ops *ops = domain->dirty_ops;
>> +	unsigned long last_iova = iova + length - 1;
>> +	int ret = -EINVAL;
>> +
>> +	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
>> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
>> +
>> +		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
>> +						last - iter.cur_iova + 1,
>> +						0, dirty);
> 
> This seems like a lot of stuff going on with ret..
> 

All to have a single return exit point, given no different cleanup is required.
I thought it was generally the best way (when possible).

>> +		if (ret)
> 
> return ret
> 
>> +			break;
>> +	}
>> +
>> +	if (!iopt_area_contig_done(&iter))
>> +		ret = -EINVAL;
> 
> return  -EINVAL
> 
>> +
>> +	return ret;
> 
> return 0;
> 
> And remove the -EINVAL. 

OK
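
So the loop would end up roughly as (untested):

	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
		unsigned long last = min(last_iova, iopt_area_last_iova(area));
		int ret;

		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
						last - iter.cur_iova + 1,
						0, dirty);
		if (ret)
			return ret;
	}

	/* Covers ranges not fully contained by areas, including no areas */
	if (!iopt_area_contig_done(&iter))
		return -EINVAL;

	return 0;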

> iopt_area_contig_done() captures the case
> where the iova range is not fully contained by areas, even the case
> where there are no areas.
> 
> But otherwise
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty
  2023-10-18 22:54   ` Jason Gunthorpe
@ 2023-10-18 23:50     ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 23:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 18/10/2023 23:54, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:06PM +0100, Joao Martins wrote:
>> VFIO has an operation where it unmaps an IOVA while returning a bitmap with
>> the dirty data. In reality the operation doesn't quite query the IO
>> pagetables that the PTE was dirty or not. Instead it marks as dirty on
>> anything that was mapped, and doing so in one syscall.
>>
>> In IOMMUFD the equivalent is done in two operations by querying with
>> GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
>> flushes given that after clearing dirty bits IOMMU implementations require
>> invalidating their IOTLB, plus another invalidation needed for the UNMAP.
>> To allow dirty bits to be queried faster, add a flag
>> (IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests to not clear the dirty bits
>> from the PTE (but just reading them), under the expectation that the next
>> operation is the unmap. An alternative is to unmap and just perpectually
>> mark as dirty as that's the same behaviour as today. So here equivalent
>> functionally can be provided with unmap alone, and if real dirty info is
>> required it will amortize the cost while querying.
> 
> It seems fine, but I wonder is it really worthwhile? 
> Did you measure
> this? I suppose it is during the critical outage window
>

Design-wise we avoid an extra IOTLB invalidation in the emulated-vIOMMU case
with potentially many mappings being done ... which is where this is usually
more relevant. It bothers me a little that it may fall under over-optimization,
given that unmap itself is already expensive. But I didn't explicitly measure
the cost of the IOTLB invalidation that we are saving.
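
To make the intended flow concrete, the unmap path in userspace would be
roughly as below (illustrative fragment only; struct/ioctl/flag names per this
series, with the usual IOAS unmap as the follow-up step, and 'bitmap' being a
pre-allocated array of length/page_size bits):

	struct iommu_hwpt_get_dirty_iova get_dirty = {
		.size = sizeof(get_dirty),
		.hwpt_id = hwpt_id,
		.flags = IOMMU_GET_DIRTY_IOVA_NO_CLEAR,
		.iova = iova,
		.length = length,
		.page_size = 4096,
		.data = bitmap,
	};
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = iova,
		.length = length,
	};

	/* Read the dirty bits without clearing them, so no IOTLB flush here */
	ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty);

	/* ... and let the unmap do the only IOTLB invalidation */
	ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);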

> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation
  2023-10-18 22:58   ` Jason Gunthorpe
@ 2023-10-18 23:54     ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-18 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 18/10/2023 23:58, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:07PM +0100, Joao Martins wrote:
>> -static struct iommu_domain *amd_iommu_domain_alloc(unsigned type)
>> +static struct iommu_domain *do_iommu_domain_alloc(unsigned int type,
>> +						  struct device *dev,
>> +						  u32 flags)
>>  {
>>  	struct protection_domain *domain;
>> +	struct amd_iommu *iommu = NULL;
>> +
>> +	if (dev) {
>> +		iommu = rlookup_amd_iommu(dev);
>> +		if (!iommu)
>> +			return ERR_PTR(-ENODEV);
>> +	}
>>  
>>  	/*
>>  	 * Since DTE[Mode]=0 is prohibited on SNP-enabled system,
>>  	 * default to use IOMMU_DOMAIN_DMA[_FQ].
>>  	 */
>>  	if (amd_iommu_snp_en && (type == IOMMU_DOMAIN_IDENTITY))
>> -		return NULL;
>> +		return ERR_PTR(-EINVAL);
>>  
>>  	domain = protection_domain_alloc(type);
>>  	if (!domain)
>> -		return NULL;
>> +		return ERR_PTR(-ENOMEM);
>>  
>>  	domain->domain.geometry.aperture_start = 0;
>>  	domain->domain.geometry.aperture_end   = dma_max_address();
>>  	domain->domain.geometry.force_aperture = true;
>>  
>> +	if (iommu) {
>> +		domain->domain.type = type;
>> +		domain->domain.pgsize_bitmap =
>> +			iommu->iommu.ops->pgsize_bitmap;
>> +		domain->domain.ops =
>> +			iommu->iommu.ops->default_domain_ops;
>> +	}
>> +
>>  	return &domain->domain;
>>  }
> 
> In the end this is probably not enough refactoring, but this driver
> needs so much work we should just wait till the already written series
> get merged.
> 

That is quite a road ahead :(( -- and AMD IOMMU hardware is the only one I am
well equipped to test this on.

But I understand if we end up going this way.

> eg domain_alloc_paging should just invoke domain_alloc_user with some
> null arguments if the driver is constructed this way
> 
>> +static struct iommu_domain *amd_iommu_domain_alloc_user(struct device *dev,
>> +							u32 flags)
>> +{
>> +	unsigned int type = IOMMU_DOMAIN_UNMANAGED;
>> +
>> +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT)
>> +		return ERR_PTR(-EOPNOTSUPP);
> 
> This should be written as a list of flags the driver *supports* not
> that it rejects
> 
> if (flags)
> 	return ERR_PTR(-EOPNOTSUPP);

Will fix (this was a silly mistake, as I'm doing the right thing on Intel).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-18 23:11   ` Jason Gunthorpe
@ 2023-10-19  0:17     ` Joao Martins
  2023-10-19 11:58       ` Joao Martins
  0 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-19  0:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 19/10/2023 00:11, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>> +					 unsigned long iova, size_t size,
>> +					 unsigned long flags,
>> +					 struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>> +	unsigned long end = iova + size - 1;
>> +
>> +	do {
>> +		unsigned long pgsize = 0;
>> +		u64 *ptep, pte;
>> +
>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>> +		if (ptep)
>> +			pte = READ_ONCE(*ptep);
> 
> It is fine for now, but this is so slow for something that is such a
> fast path. We are optimizing away a TLB invalidation but leaving
> this???
> 

The more obvious reason is that I'm still working towards the 'faster' page
table walker. The map/unmap code needs to do similar lookups, so I thought of
reusing the same functions as map/unmap initially, and improving it afterwards
or when introducing the splitting.

> It is a radix tree, you walk trees by retaining your position at each
> level as you go (eg in a function per-level call chain or something)
> then ++ is cheap. Re-searching the entire tree every time is madness.

I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
still in the works), but in my experiments with huge IOVA ranges, the time to
walk the page tables ends up not making that much difference compared to the
size it needs to walk. However, none of this matters once we go up a level
(PMD): walking huge IOVA ranges is then super-cheap (and invisible with PUDs).
Which makes the dynamic-splitting/page-demotion important.

Furthermore, this is not quite yet easy for other people to test and see numbers
for themselves; so more and more I need to work on something like an
iommufd_log_perf tool under tools/testing, similar to gup_perf, to make all the
performance work obvious and 'standardized'.

------->8--------
[0] [hasn't been rebased into this version I sent]

commit 431de7e855ee8c1622663f8d81600f62fed0ed4a
Author: Joao Martins <joao.m.martins@oracle.com>
Date:   Sat Oct 7 18:17:33 2023 -0400

    iommu/amd: Improve dirty read io-pgtable walker

    The fetch_pte() based walk is a little inefficient for level-1 page-sizes.

    It walks all the levels to return a PTE, disregarding the potential
    batching that could be done at the previous level. Implement a
    page-table walker, based on the freeing functions, which recursively
    walks the next level.

    For each level it iterates on the non-default page sizes that the
    different mappings return, given that each level-7 PTE may account for
    the next power-of-2 per added PTE.

    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 29f5ab0ba14f..babb5fb5fd51 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -552,39 +552,63 @@ static bool pte_test_and_clear_dirty(u64 *ptep, unsigned
long size)
        return dirty;
 }

+static bool pte_is_large_or_base(u64 *ptep)
+{
+       return (PM_PTE_LEVEL(*ptep) == 0 || PM_PTE_LEVEL(*ptep) == 7);
+}
+
+static int walk_iova_range(u64 *pt, unsigned long iova, size_t size,
+                          int level, unsigned long flags,
+                          struct iommu_dirty_bitmap *dirty)
+{
+       unsigned long addr, isize, end = iova + size;
+       unsigned long page_size;
+       int i, next_level;
+       u64 *p, *ptep;
+
+       next_level = level - 1;
+       isize = page_size = PTE_LEVEL_PAGE_SIZE(next_level);
+
+       for (addr = iova; addr < end; addr += isize) {
+               i = PM_LEVEL_INDEX(next_level, addr);
+               ptep = &pt[i];
+
+               /* PTE present? */
+               if (!IOMMU_PTE_PRESENT(*ptep))
+                       continue;
+
+               if (level > 1 && !pte_is_large_or_base(ptep)) {
+                       p = IOMMU_PTE_PAGE(*ptep);
+                       isize = min(end - addr, page_size);
+                       walk_iova_range(p, addr, isize, next_level,
+                                       flags, dirty);
+               } else {
+                       isize = PM_PTE_LEVEL(*ptep) == 7 ?
+                                       PTE_PAGE_SIZE(*ptep) : page_size;
+
+                       /*
+                        * Mark the whole IOVA range as dirty even if only one
+                        * of the replicated PTEs were marked dirty.
+                        */
+                       if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
+                                       pte_test_dirty(ptep, isize)) ||
+                           pte_test_and_clear_dirty(ptep, isize))
+                               iommu_dirty_bitmap_record(dirty, addr, isize);
+               }
+       }
+
+       return 0;
+}
+
 static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
                                         unsigned long iova, size_t size,
                                         unsigned long flags,
                                         struct iommu_dirty_bitmap *dirty)
 {
        struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
-       unsigned long end = iova + size - 1;
-
-       do {
-               unsigned long pgsize = 0;
-               u64 *ptep, pte;
-
-               ptep = fetch_pte(pgtable, iova, &pgsize);
-               if (ptep)
-                       pte = READ_ONCE(*ptep);
-               if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
-                       pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
-                       iova += pgsize;
-                       continue;
-               }
-
-               /*
-                * Mark the whole IOVA range as dirty even if only one of
-                * the replicated PTEs were marked dirty.
-                */
-               if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
-                               pte_test_dirty(ptep, pgsize)) ||
-                   pte_test_and_clear_dirty(ptep, pgsize))
-                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
-               iova += pgsize;
-       } while (iova < end);

-       return 0;
+       return walk_iova_range(pgtable->root, iova, size,
+                              pgtable->mode, flags, dirty);
 }

 /*

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking
  2023-10-18 20:27 ` [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking Joao Martins
  2023-10-18 22:26   ` Jason Gunthorpe
@ 2023-10-19  1:45   ` Baolu Lu
  2023-10-20  5:54   ` Tian, Kevin
  2 siblings, 0 replies; 84+ messages in thread
From: Baolu Lu @ 2023-10-19  1:45 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: baolu.lu, Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm

On 10/19/23 4:27 AM, Joao Martins wrote:
> Add to iommu domain operations a set of callbacks to perform dirty
> tracking, particulary to start and stop tracking and to read and clear the
> dirty data.
> 
> Drivers are generally expected to dynamically change its translation
> structures to toggle the tracking and flush some form of control state
> structure that stands in the IOVA translation path. Though it's not
> mandatory, as drivers can also enable dirty tracking at boot, and just
> clear the dirty bits before setting dirty tracking. For each of the newly
> added IOMMU core APIs:
> 
> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing for
> capabilities of the device.
> 
> .set_dirty_tracking(): an iommu driver is expected to change its
> translation structures and enable dirty tracking for the devices in the
> iommu_domain. For drivers making dirty tracking always-enabled, it should
> just return 0.
> 
> .read_and_clear_dirty(): an iommu driver is expected to walk the pagetables
> for the iova range passed in and use iommu_dirty_bitmap_record() to record
> dirty info per IOVA. When detecting that a given IOVA is dirty it should
> also clear its dirty state from the PTE, *unless* the flag
> IOMMU_DIRTY_NO_CLEAR is passed in -- flushing is steered from the caller of
> the domain_op via iotlb_gather. The iommu core APIs use the same data
> structure in use for dirty tracking for VFIO device dirty (struct
> iova_bitmap) abstracted by iommu_dirty_bitmap_record() helper function.
> 
> domain::dirty_ops: IOMMU domains will store the dirty ops depending on
> whether the iommu device supports dirty tracking or not. iommu drivers can
> then use this field to figure if the dirty tracking is supported+enforced
> on attach. The enforcement is enable via domain_alloc_user() which is done
> via IOMMUFD hwpt flag introduced later.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/linux/io-pgtable.h |  4 +++
>   include/linux/iommu.h      | 56 ++++++++++++++++++++++++++++++++++++++
>   2 files changed, 60 insertions(+)

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-18 20:27 ` [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
@ 2023-10-19  3:04   ` Baolu Lu
  2023-10-19  9:14     ` Joao Martins
  2023-10-20  7:53   ` Tian, Kevin
  1 sibling, 1 reply; 84+ messages in thread
From: Baolu Lu @ 2023-10-19  3:04 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: baolu.lu, Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm

On 10/19/23 4:27 AM, Joao Martins wrote:
> IOMMU advertises Access/Dirty bits for second-stage page table if the
> extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS).
> The first stage table is compatible with CPU page table thus A/D bits are
> implicitly supported. Relevant Intel IOMMU SDM ref for first stage table
> "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table
> "3.7.2 Accessed and Dirty Flags".
> 
> First stage page table is enabled by default so it's allowed to set dirty
> tracking and no control bits needed, it just returns 0. To use SSADS, set
> bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB
> via pasid_flush_caches() following the manual. Relevant SDM refs:
> 
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>   Table 23. Guidance to Software for Invalidations"
> 
> PTE dirty bit is located in bit 9 and it's cached in the IOTLB so flush
> IOTLB to make sure IOMMU attempts to set the dirty bit again. Note that
> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus the
> caller of the iommu op will flush the IOTLB. Relevant manuals over the
> hardware translation is chapter 6 with some special mention to:
> 
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
> 
> Select IOMMUFD_DRIVER only if IOMMUFD is enabled, given that IOMMU dirty
> tracking requires IOMMUFD.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/intel/Kconfig |   1 +
>   drivers/iommu/intel/iommu.c | 104 +++++++++++++++++++++++++++++++++-
>   drivers/iommu/intel/iommu.h |  17 ++++++
>   drivers/iommu/intel/pasid.c | 109 ++++++++++++++++++++++++++++++++++++
>   drivers/iommu/intel/pasid.h |   4 ++
>   5 files changed, 234 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
> index 2e56bd79f589..f5348b80652b 100644
> --- a/drivers/iommu/intel/Kconfig
> +++ b/drivers/iommu/intel/Kconfig
> @@ -15,6 +15,7 @@ config INTEL_IOMMU
>   	select DMA_OPS
>   	select IOMMU_API
>   	select IOMMU_IOVA
> +	select IOMMUFD_DRIVER if IOMMUFD
>   	select NEED_DMA_MAP_STATE
>   	select DMAR_TABLE
>   	select SWIOTLB
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 017aed5813d8..405b459416d5 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
>   #define IDENTMAP_AZALIA		4
>   
>   const struct iommu_ops intel_iommu_ops;
> +const struct iommu_dirty_ops intel_dirty_ops;
>   
>   static bool translation_pre_enabled(struct intel_iommu *iommu)
>   {
> @@ -4077,10 +4078,12 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type)
>   static struct iommu_domain *
>   intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>   {
> +	bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
>   	struct iommu_domain *domain;
>   	struct intel_iommu *iommu;
>   
> -	if (flags & (~IOMMU_HWPT_ALLOC_NEST_PARENT))
> +	if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT|
> +		       IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)))
>   		return ERR_PTR(-EOPNOTSUPP);
>   
>   	iommu = device_to_iommu(dev, NULL, NULL);
> @@ -4090,6 +4093,9 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>   	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
>   		return ERR_PTR(-EOPNOTSUPP);
>   
> +	if (enforce_dirty && !slads_supported(iommu))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>   	/*
>   	 * domain_alloc_user op needs to fully initialize a domain
>   	 * before return, so uses iommu_domain_alloc() here for
> @@ -4098,6 +4104,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>   	domain = iommu_domain_alloc(dev->bus);
>   	if (!domain)
>   		domain = ERR_PTR(-ENOMEM);
> +
> +	if (!IS_ERR(domain) && enforce_dirty) {
> +		if (to_dmar_domain(domain)->use_first_level) {
> +			iommu_domain_free(domain);
> +			return ERR_PTR(-EOPNOTSUPP);
> +		}
> +		domain->dirty_ops = &intel_dirty_ops;
> +	}
> +
>   	return domain;
>   }
>   
> @@ -4121,6 +4136,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
>   	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
>   		return -EINVAL;
>   
> +	if (domain->dirty_ops && !slads_supported(iommu))
> +		return -EINVAL;
> +
>   	/* check if this iommu agaw is sufficient for max mapped address */
>   	addr_width = agaw_to_width(iommu->agaw);
>   	if (addr_width > cap_mgaw(iommu->cap))
> @@ -4375,6 +4393,8 @@ static bool intel_iommu_capable(struct device *dev, enum iommu_cap cap)
>   		return dmar_platform_optin();
>   	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>   		return ecap_sc_support(info->iommu->ecap);
> +	case IOMMU_CAP_DIRTY:
> +		return slads_supported(info->iommu);
>   	default:
>   		return false;
>   	}
> @@ -4772,6 +4792,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
>   	if (!pasid_supported(iommu) || dev_is_real_dma_subdevice(dev))
>   		return -EOPNOTSUPP;
>   
> +	if (domain->dirty_ops)
> +		return -EINVAL;
> +
>   	if (context_copied(iommu, info->bus, info->devfn))
>   		return -EBUSY;
>   
> @@ -4830,6 +4853,85 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
>   	return vtd;
>   }
>   
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	int ret = -EINVAL;
> +
> +	spin_lock(&dmar_domain->lock);
> +	if (dmar_domain->dirty_tracking == enable)
> +		goto out_unlock;
> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
> +						     info->dev, IOMMU_NO_PASID,
> +						     enable);
> +		if (ret)
> +			goto err_unwind;
> +
> +	}
> +
> +	if (!ret)
> +		dmar_domain->dirty_tracking = enable;

We should also support setting dirty tracking even if the domain has not
been attached to any device?

To achieve this, we can remove ret initialization and remove the above
check. Make the default path a successful one.

	int ret;

	[...]

	dmar_domain->dirty_tracking = enable;

> +out_unlock:
> +	spin_unlock(&dmar_domain->lock);
> +
> +	return 0;
> +
> +err_unwind:
> +	list_for_each_entry(info, &dmar_domain->devices, link)
> +		intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
> +					  info->dev, IOMMU_NO_PASID,
> +					  dmar_domain->dirty_tracking);
> +	spin_unlock(&dmar_domain->lock);
> +	return ret;
> +}
> +
> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +					    unsigned long iova, size_t size,
> +					    unsigned long flags,
> +					    struct iommu_dirty_bitmap *dirty)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	unsigned long end = iova + size - 1;
> +	unsigned long pgsize;
> +
> +	/*
> +	 * IOMMUFD core calls into a dirty tracking disabled domain without an
> +	 * IOVA bitmap set in order to clean dirty bits in all PTEs that might
> +	 * have occurred when we stopped dirty tracking. This ensures that we
> +	 * never inherit dirtied bits from a previous cycle.
> +	 */
> +	if (!dmar_domain->dirty_tracking && dirty->bitmap)
> +		return -EINVAL;
> +
> +	do {
> +		struct dma_pte *pte;
> +		int lvl = 0;
> +
> +		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
> +				     GFP_ATOMIC);
> +		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
> +		if (!pte || !dma_pte_present(pte)) {
> +			iova += pgsize;
> +			continue;
> +		}
> +
> +		if (dma_sl_pte_test_and_clear_dirty(pte, flags))
> +			iommu_dirty_bitmap_record(dirty, iova, pgsize);
> +		iova += pgsize;
> +	} while (iova < end);
> +
> +	return 0;
> +}
> +
> +const struct iommu_dirty_ops intel_dirty_ops = {
> +	.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
> +	.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
> +};
> +
>   const struct iommu_ops intel_iommu_ops = {
>   	.capable		= intel_iommu_capable,
>   	.hw_info		= intel_iommu_hw_info,
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index c18fb699c87a..27bcfd3bacdd 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -48,6 +48,9 @@
>   #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
>   #define DMA_FL_PTE_XD		BIT_ULL(63)
>   
> +#define DMA_SL_PTE_DIRTY_BIT	9
> +#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
> +
>   #define ADDR_WIDTH_5LEVEL	(57)
>   #define ADDR_WIDTH_4LEVEL	(48)
>   
> @@ -539,6 +542,9 @@ enum {
>   #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
>   #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
>   				 ecap_pasid((iommu)->ecap))
> +#define slads_supported(iommu) (sm_supported(iommu) &&                 \
> +				ecap_slads((iommu)->ecap))
> +
>   
>   struct pasid_entry;
>   struct pasid_state_entry;
> @@ -592,6 +598,7 @@ struct dmar_domain {
>   					 * otherwise, goes through the second
>   					 * level.
>   					 */
> +	u8 dirty_tracking:1;		/* Dirty tracking is enabled */
>   
>   	spinlock_t lock;		/* Protect device tracking lists */
>   	struct list_head devices;	/* all devices' list */
> @@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
>   	return (pte->val & 3) != 0;
>   }
>   
> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
> +						   unsigned long flags)
> +{
> +	if (flags & IOMMU_DIRTY_NO_CLEAR)
> +		return (pte->val & DMA_SL_PTE_DIRTY) != 0;
> +
> +	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
> +				  (unsigned long *)&pte->val);
> +}
> +
>   static inline bool dma_pte_superpage(struct dma_pte *pte)
>   {
>   	return (pte->val & DMA_PTE_LARGE_PAGE);
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 8f92b92f3d2a..785384a59d55 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -277,6 +277,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
>   	WRITE_ONCE(*ptr, (old & ~mask) | bits);
>   }
>   
> +static inline u64 pasid_get_bits(u64 *ptr)
> +{
> +	return READ_ONCE(*ptr);
> +}
> +
>   /*
>    * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
>    * PASID entry.
> @@ -335,6 +340,36 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe)
>   	pasid_set_bits(&pe->val[0], 1 << 1, 0);
>   }
>   
> +/*
> + * Enable second level A/D bits by setting the SLADE (Second Level
> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry.
> + */
> +static inline void pasid_set_ssade(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
> +}
> +
> +/*
> + * Enable second level A/D bits by setting the SLADE (Second Level

nit: Disable second level ....

> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry.
> + */
> +static inline void pasid_clear_ssade(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 9, 0);
> +}
> +
> +/*
> + * Checks if second level A/D bits by setting the SLADE (Second Level
> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry is enabled.
> + */
> +static inline bool pasid_get_ssade(struct pasid_entry *pe)
> +{
> +	return pasid_get_bits(&pe->val[0]) & (1 << 9);
> +}
> +
>   /*
>    * Setup the WPE(Write Protect Enable) field (Bit 132) of a
>    * scalable mode PASID entry.
> @@ -627,6 +662,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>   	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>   	pasid_set_fault_enable(pte);
>   	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> +	if (domain->dirty_tracking)
> +		pasid_set_ssade(pte);
>   
>   	pasid_set_present(pte);
>   	spin_unlock(&iommu->lock);
> @@ -636,6 +673,78 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>   	return 0;
>   }
>   
> +/*
> + * Set up dirty tracking on a second only translation type.

nit: ... on a second only or nested translation type.

> + */
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled)
> +{
> +	struct pasid_entry *pte;
> +	u16 did, pgtt;
> +
> +	spin_lock(&iommu->lock);
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		spin_unlock(&iommu->lock);
> +		dev_err_ratelimited(dev,
> +				    "Failed to get pasid entry of PASID %d\n",
> +				    pasid);
> +		return -ENODEV;
> +	}
> +
> +	did = domain_id_iommu(domain, iommu);
> +	pgtt = pasid_pte_get_pgtt(pte);
> +	if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
> +		spin_unlock(&iommu->lock);
> +		dev_err_ratelimited(dev,
> +				    "Dirty tracking not supported on translation type %d\n",
> +				    pgtt);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (pasid_get_ssade(pte) == enabled) {
> +		spin_unlock(&iommu->lock);
> +		return 0;
> +	}
> +
> +	if (enabled)
> +		pasid_set_ssade(pte);
> +	else
> +		pasid_clear_ssade(pte);
> +	spin_unlock(&iommu->lock);
> +
> +	if (!ecap_coherent(iommu->ecap))
> +		clflush_cache_range(pte, sizeof(*pte));
> +
> +	/*
> +	 * From VT-d spec table 25 "Guidance to Software for Invalidations":
> +	 *
> +	 * - PASID-selective-within-Domain PASID-cache invalidation
> +	 *   If (PGTT=SS or Nested)
> +	 *    - Domain-selective IOTLB invalidation
> +	 *   Else
> +	 *    - PASID-selective PASID-based IOTLB invalidation
> +	 * - If (pasid is RID_PASID)
> +	 *    - Global Device-TLB invalidation to affected functions
> +	 *   Else
> +	 *    - PASID-based Device-TLB invalidation (with S=1 and
> +	 *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
> +	 */
> +	pasid_cache_invalidation_with_pasid(iommu, did, pasid);
> +
> +	if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)

Above check is unnecessary.

> +		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> +
> +	/* Device IOTLB doesn't need to be flushed in caching mode. */
> +	if (!cap_caching_mode(iommu->cap))
> +		devtlb_invalidation_with_pasid(iommu, dev, pasid);
> +
> +	return 0;
> +}
> +
>   /*
>    * Set up the scalable mode pasid entry for passthrough translation type.
>    */
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index 4e9e68c3c388..958050b093aa 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -106,6 +106,10 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>   int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>   				   struct dmar_domain *domain,
>   				   struct device *dev, u32 pasid);
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled);
>   int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>   				   struct dmar_domain *domain,
>   				   struct device *dev, u32 pasid);

Others look good to me. Thank you very much!

With above addressed,

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-19  3:04   ` Baolu Lu
@ 2023-10-19  9:14     ` Joao Martins
  2023-10-19 10:33       ` Joao Martins
  2023-10-19 23:56       ` Jason Gunthorpe
  0 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-19  9:14 UTC (permalink / raw)
  To: Baolu Lu, iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 19/10/2023 04:04, Baolu Lu wrote:
> On 10/19/23 4:27 AM, Joao Martins wrote:
>> IOMMU advertises Access/Dirty bits for second-stage page table if the
>> extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS).
>> The first stage table is compatible with CPU page table thus A/D bits are
>> implicitly supported. Relevant Intel IOMMU SDM ref for first stage table
>> "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table
>> "3.7.2 Accessed and Dirty Flags".
>>
>> First stage page table is enabled by default so it's allowed to set dirty
>> tracking and no control bits needed, it just returns 0. To use SSADS, set
>> bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB
>> via pasid_flush_caches() following the manual. Relevant SDM refs:
>>
>> "3.7.2 Accessed and Dirty Flags"
>> "6.5.3.3 Guidance to Software for Invalidations,
>>   Table 23. Guidance to Software for Invalidations"
>>
>> PTE dirty bit is located in bit 9 and it's cached in the IOTLB so flush
>> IOTLB to make sure IOMMU attempts to set the dirty bit again. Note that
>> iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus the
>> caller of the iommu op will flush the IOTLB. Relevant manuals over the
>> hardware translation is chapter 6 with some special mention to:
>>
>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>> "6.2.4 IOTLB"
>>
>> Select IOMMUFD_DRIVER only if IOMMUFD is enabled, given that IOMMU dirty
>> tracking requires IOMMUFD.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/intel/Kconfig |   1 +
>>   drivers/iommu/intel/iommu.c | 104 +++++++++++++++++++++++++++++++++-
>>   drivers/iommu/intel/iommu.h |  17 ++++++
>>   drivers/iommu/intel/pasid.c | 109 ++++++++++++++++++++++++++++++++++++
>>   drivers/iommu/intel/pasid.h |   4 ++
>>   5 files changed, 234 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
>> index 2e56bd79f589..f5348b80652b 100644
>> --- a/drivers/iommu/intel/Kconfig
>> +++ b/drivers/iommu/intel/Kconfig
>> @@ -15,6 +15,7 @@ config INTEL_IOMMU
>>       select DMA_OPS
>>       select IOMMU_API
>>       select IOMMU_IOVA
>> +    select IOMMUFD_DRIVER if IOMMUFD
>>       select NEED_DMA_MAP_STATE
>>       select DMAR_TABLE
>>       select SWIOTLB
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 017aed5813d8..405b459416d5 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -300,6 +300,7 @@ static int iommu_skip_te_disable;
>>   #define IDENTMAP_AZALIA        4
>>     const struct iommu_ops intel_iommu_ops;
>> +const struct iommu_dirty_ops intel_dirty_ops;
>>     static bool translation_pre_enabled(struct intel_iommu *iommu)
>>   {
>> @@ -4077,10 +4078,12 @@ static struct iommu_domain
>> *intel_iommu_domain_alloc(unsigned type)
>>   static struct iommu_domain *
>>   intel_iommu_domain_alloc_user(struct device *dev, u32 flags)
>>   {
>> +    bool enforce_dirty = (flags & IOMMU_HWPT_ALLOC_ENFORCE_DIRTY);
>>       struct iommu_domain *domain;
>>       struct intel_iommu *iommu;
>>   -    if (flags & (~IOMMU_HWPT_ALLOC_NEST_PARENT))
>> +    if (flags & (~(IOMMU_HWPT_ALLOC_NEST_PARENT|
>> +               IOMMU_HWPT_ALLOC_ENFORCE_DIRTY)))
>>           return ERR_PTR(-EOPNOTSUPP);
>>         iommu = device_to_iommu(dev, NULL, NULL);
>> @@ -4090,6 +4093,9 @@ intel_iommu_domain_alloc_user(struct device *dev, u32
>> flags)
>>       if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) && !ecap_nest(iommu->ecap))
>>           return ERR_PTR(-EOPNOTSUPP);
>>   +    if (enforce_dirty && !slads_supported(iommu))
>> +        return ERR_PTR(-EOPNOTSUPP);
>> +
>>       /*
>>        * domain_alloc_user op needs to fully initialize a domain
>>        * before return, so uses iommu_domain_alloc() here for
>> @@ -4098,6 +4104,15 @@ intel_iommu_domain_alloc_user(struct device *dev, u32
>> flags)
>>       domain = iommu_domain_alloc(dev->bus);
>>       if (!domain)
>>           domain = ERR_PTR(-ENOMEM);
>> +
>> +    if (!IS_ERR(domain) && enforce_dirty) {
>> +        if (to_dmar_domain(domain)->use_first_level) {
>> +            iommu_domain_free(domain);
>> +            return ERR_PTR(-EOPNOTSUPP);
>> +        }
>> +        domain->dirty_ops = &intel_dirty_ops;
>> +    }
>> +
>>       return domain;
>>   }
>>   @@ -4121,6 +4136,9 @@ static int prepare_domain_attach_device(struct
>> iommu_domain *domain,
>>       if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
>>           return -EINVAL;
>>   +    if (domain->dirty_ops && !slads_supported(iommu))
>> +        return -EINVAL;
>> +
>>       /* check if this iommu agaw is sufficient for max mapped address */
>>       addr_width = agaw_to_width(iommu->agaw);
>>       if (addr_width > cap_mgaw(iommu->cap))
>> @@ -4375,6 +4393,8 @@ static bool intel_iommu_capable(struct device *dev, enum
>> iommu_cap cap)
>>           return dmar_platform_optin();
>>       case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
>>           return ecap_sc_support(info->iommu->ecap);
>> +    case IOMMU_CAP_DIRTY:
>> +        return slads_supported(info->iommu);
>>       default:
>>           return false;
>>       }
>> @@ -4772,6 +4792,9 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain
>> *domain,
>>       if (!pasid_supported(iommu) || dev_is_real_dma_subdevice(dev))
>>           return -EOPNOTSUPP;
>>   +    if (domain->dirty_ops)
>> +        return -EINVAL;
>> +
>>       if (context_copied(iommu, info->bus, info->devfn))
>>           return -EBUSY;
>>   @@ -4830,6 +4853,85 @@ static void *intel_iommu_hw_info(struct device *dev,
>> u32 *length, u32 *type)
>>       return vtd;
>>   }
>>   +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +                      bool enable)
>> +{
>> +    struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +    struct device_domain_info *info;
>> +    int ret = -EINVAL;
>> +
>> +    spin_lock(&dmar_domain->lock);
>> +    if (dmar_domain->dirty_tracking == enable)
>> +        goto out_unlock;
>> +
>> +    list_for_each_entry(info, &dmar_domain->devices, link) {
>> +        ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
>> +                             info->dev, IOMMU_NO_PASID,
>> +                             enable);
>> +        if (ret)
>> +            goto err_unwind;
>> +
>> +    }
>> +
>> +    if (!ret)
>> +        dmar_domain->dirty_tracking = enable;
> 
> We should also support setting dirty tracking even if the domain has not
> been attached to any device?
> 
Considering this rides on hwpt-alloc, which attaches a device on domain
allocation, this shouldn't be possible in practice. But I take it this is to
improve 'future' resilience, and there's nothing bad coming from the change you
suggest below.

> To achieve this, we can remove ret initialization and remove the above
> check. Make the default path a successful one.
> 
>     int ret;
> 
>     [...]
> 
>     dmar_domain->dirty_tracking = enable;
> 
OK, if the above makes sense.
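
i.e. applying your suggestion, it would become something like (untested):

	static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
						  bool enable)
	{
		struct dmar_domain *dmar_domain = to_dmar_domain(domain);
		struct device_domain_info *info;
		int ret;

		spin_lock(&dmar_domain->lock);
		if (dmar_domain->dirty_tracking == enable)
			goto out_unlock;

		list_for_each_entry(info, &dmar_domain->devices, link) {
			ret = intel_pasid_setup_dirty_tracking(info->iommu,
							       info->domain,
							       info->dev,
							       IOMMU_NO_PASID,
							       enable);
			if (ret)
				goto err_unwind;
		}

		/* Works even with no device attached; default path succeeds */
		dmar_domain->dirty_tracking = enable;
	out_unlock:
		spin_unlock(&dmar_domain->lock);
		return 0;

	err_unwind:
		list_for_each_entry(info, &dmar_domain->devices, link)
			intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
							 info->dev, IOMMU_NO_PASID,
							 dmar_domain->dirty_tracking);
		spin_unlock(&dmar_domain->lock);
		return ret;
	}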

>> +out_unlock:
>> +    spin_unlock(&dmar_domain->lock);
>> +
>> +    return 0;
>> +
>> +err_unwind:
>> +    list_for_each_entry(info, &dmar_domain->devices, link)
>> +        intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
>> +                      info->dev, IOMMU_NO_PASID,
>> +                      dmar_domain->dirty_tracking);
>> +    spin_unlock(&dmar_domain->lock);
>> +    return ret;
>> +}
>> +
>> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +                        unsigned long iova, size_t size,
>> +                        unsigned long flags,
>> +                        struct iommu_dirty_bitmap *dirty)
>> +{
>> +    struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +    unsigned long end = iova + size - 1;
>> +    unsigned long pgsize;
>> +
>> +    /*
>> +     * IOMMUFD core calls into a dirty tracking disabled domain without an
>> +     * IOVA bitmap set in order to clean dirty bits in all PTEs that might
>> +     * have occurred when we stopped dirty tracking. This ensures that we
>> +     * never inherit dirtied bits from a previous cycle.
>> +     */
>> +    if (!dmar_domain->dirty_tracking && dirty->bitmap)
>> +        return -EINVAL;
>> +
>> +    do {
>> +        struct dma_pte *pte;
>> +        int lvl = 0;
>> +
>> +        pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
>> +                     GFP_ATOMIC);
>> +        pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
>> +        if (!pte || !dma_pte_present(pte)) {
>> +            iova += pgsize;
>> +            continue;
>> +        }
>> +
>> +        if (dma_sl_pte_test_and_clear_dirty(pte, flags))
>> +            iommu_dirty_bitmap_record(dirty, iova, pgsize);
>> +        iova += pgsize;
>> +    } while (iova < end);
>> +
>> +    return 0;
>> +}
>> +
>> +const struct iommu_dirty_ops intel_dirty_ops = {
>> +    .set_dirty_tracking    = intel_iommu_set_dirty_tracking,
>> +    .read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
>> +};
>> +
>>   const struct iommu_ops intel_iommu_ops = {
>>       .capable        = intel_iommu_capable,
>>       .hw_info        = intel_iommu_hw_info,
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index c18fb699c87a..27bcfd3bacdd 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -48,6 +48,9 @@
>>   #define DMA_FL_PTE_DIRTY    BIT_ULL(6)
>>   #define DMA_FL_PTE_XD        BIT_ULL(63)
>>   +#define DMA_SL_PTE_DIRTY_BIT    9
>> +#define DMA_SL_PTE_DIRTY    BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
>> +
>>   #define ADDR_WIDTH_5LEVEL    (57)
>>   #define ADDR_WIDTH_4LEVEL    (48)
>>   @@ -539,6 +542,9 @@ enum {
>>   #define sm_supported(iommu)    (intel_iommu_sm && ecap_smts((iommu)->ecap))
>>   #define pasid_supported(iommu)    (sm_supported(iommu) &&            \
>>                    ecap_pasid((iommu)->ecap))
>> +#define slads_supported(iommu) (sm_supported(iommu) &&                 \
>> +                ecap_slads((iommu)->ecap))
>> +
>>     struct pasid_entry;
>>   struct pasid_state_entry;
>> @@ -592,6 +598,7 @@ struct dmar_domain {
>>                        * otherwise, goes through the second
>>                        * level.
>>                        */
>> +    u8 dirty_tracking:1;        /* Dirty tracking is enabled */
>>         spinlock_t lock;        /* Protect device tracking lists */
>>       struct list_head devices;    /* all devices' list */
>> @@ -781,6 +788,16 @@ static inline bool dma_pte_present(struct dma_pte *pte)
>>       return (pte->val & 3) != 0;
>>   }
>>   +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,
>> +                           unsigned long flags)
>> +{
>> +    if (flags & IOMMU_DIRTY_NO_CLEAR)
>> +        return (pte->val & DMA_SL_PTE_DIRTY) != 0;
>> +
>> +    return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
>> +                  (unsigned long *)&pte->val);
>> +}
>> +
>>   static inline bool dma_pte_superpage(struct dma_pte *pte)
>>   {
>>       return (pte->val & DMA_PTE_LARGE_PAGE);
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 8f92b92f3d2a..785384a59d55 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -277,6 +277,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64
>> bits)
>>       WRITE_ONCE(*ptr, (old & ~mask) | bits);
>>   }
>>   +static inline u64 pasid_get_bits(u64 *ptr)
>> +{
>> +    return READ_ONCE(*ptr);
>> +}
>> +
>>   /*
>>    * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
>>    * PASID entry.
>> @@ -335,6 +340,36 @@ static inline void pasid_set_fault_enable(struct
>> pasid_entry *pe)
>>       pasid_set_bits(&pe->val[0], 1 << 1, 0);
>>   }
>>   +/*
>> + * Enable second level A/D bits by setting the SLADE (Second Level
>> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
>> + * entry.
>> + */
>> +static inline void pasid_set_ssade(struct pasid_entry *pe)
>> +{
>> +    pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
>> +}
>> +
>> +/*
>> + * Enable second level A/D bits by setting the SLADE (Second Level
> 
> nit: Disable second level ....
> 
/me nods

>> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
>> + * entry.
>> + */
>> +static inline void pasid_clear_ssade(struct pasid_entry *pe)
>> +{
>> +    pasid_set_bits(&pe->val[0], 1 << 9, 0);
>> +}
>> +
>> +/*
>> + * Checks if second level A/D bits by setting the SLADE (Second Level
>> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
>> + * entry is enabled.
>> + */
>> +static inline bool pasid_get_ssade(struct pasid_entry *pe)
>> +{
>> +    return pasid_get_bits(&pe->val[0]) & (1 << 9);
>> +}
>> +
>>   /*
>>    * Setup the WPE(Write Protect Enable) field (Bit 132) of a
>>    * scalable mode PASID entry.
>> @@ -627,6 +662,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>>       pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
>>       pasid_set_fault_enable(pte);
>>       pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
>> +    if (domain->dirty_tracking)
>> +        pasid_set_ssade(pte);
>>         pasid_set_present(pte);
>>       spin_unlock(&iommu->lock);
>> @@ -636,6 +673,78 @@ int intel_pasid_setup_second_level(struct intel_iommu
>> *iommu,
>>       return 0;
>>   }
>>   +/*
>> + * Set up dirty tracking on a second only translation type.
> 
> nit: ... on a second only or nested translation type.
> 
OK -- had changed the check, but not the comment :/

>> + */
>> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>> +                     struct dmar_domain *domain,
>> +                     struct device *dev, u32 pasid,
>> +                     bool enabled)
>> +{
>> +    struct pasid_entry *pte;
>> +    u16 did, pgtt;
>> +
>> +    spin_lock(&iommu->lock);
>> +
>> +    pte = intel_pasid_get_entry(dev, pasid);
>> +    if (!pte) {
>> +        spin_unlock(&iommu->lock);
>> +        dev_err_ratelimited(dev,
>> +                    "Failed to get pasid entry of PASID %d\n",
>> +                    pasid);
>> +        return -ENODEV;
>> +    }
>> +
>> +    did = domain_id_iommu(domain, iommu);
>> +    pgtt = pasid_pte_get_pgtt(pte);
>> +    if (pgtt != PASID_ENTRY_PGTT_SL_ONLY && pgtt != PASID_ENTRY_PGTT_NESTED) {
>> +        spin_unlock(&iommu->lock);
>> +        dev_err_ratelimited(dev,
>> +                    "Dirty tracking not supported on translation type %d\n",
>> +                    pgtt);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    if (pasid_get_ssade(pte) == enabled) {
>> +        spin_unlock(&iommu->lock);
>> +        return 0;
>> +    }
>> +
>> +    if (enabled)
>> +        pasid_set_ssade(pte);
>> +    else
>> +        pasid_clear_ssade(pte);
>> +    spin_unlock(&iommu->lock);
>> +
>> +    if (!ecap_coherent(iommu->ecap))
>> +        clflush_cache_range(pte, sizeof(*pte));
>> +
>> +    /*
>> +     * From VT-d spec table 25 "Guidance to Software for Invalidations":
>> +     *
>> +     * - PASID-selective-within-Domain PASID-cache invalidation
>> +     *   If (PGTT=SS or Nested)
>> +     *    - Domain-selective IOTLB invalidation
>> +     *   Else
>> +     *    - PASID-selective PASID-based IOTLB invalidation
>> +     * - If (pasid is RID_PASID)
>> +     *    - Global Device-TLB invalidation to affected functions
>> +     *   Else
>> +     *    - PASID-based Device-TLB invalidation (with S=1 and
>> +     *      Addr[63:12]=0x7FFFFFFF_FFFFF) to affected functions
>> +     */
>> +    pasid_cache_invalidation_with_pasid(iommu, did, pasid);
>> +
>> +    if (pgtt == PASID_ENTRY_PGTT_SL_ONLY || pgtt == PASID_ENTRY_PGTT_NESTED)
> 
> Above check is unnecessary.
> 
Ah yes, we're validating beforehand above.

>> +        iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>> +
>> +    /* Device IOTLB doesn't need to be flushed in caching mode. */
>> +    if (!cap_caching_mode(iommu->cap))
>> +        devtlb_invalidation_with_pasid(iommu, dev, pasid);
>> +
>> +    return 0;
>> +}
>> +
>>   /*
>>    * Set up the scalable mode pasid entry for passthrough translation type.
>>    */
>> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
>> index 4e9e68c3c388..958050b093aa 100644
>> --- a/drivers/iommu/intel/pasid.h
>> +++ b/drivers/iommu/intel/pasid.h
>> @@ -106,6 +106,10 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>>   int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>>                      struct dmar_domain *domain,
>>                      struct device *dev, u32 pasid);
>> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
>> +                     struct dmar_domain *domain,
>> +                     struct device *dev, u32 pasid,
>> +                     bool enabled);
>>   int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>>                      struct dmar_domain *domain,
>>                      struct device *dev, u32 pasid);
> 
> Others look good to me. Thank you very much!
> 
> With above addressed,
> 
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> 
Thanks

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO
  2023-10-18 22:44   ` Jason Gunthorpe
@ 2023-10-19  9:55     ` Joao Martins
  2023-10-19 23:56       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-19  9:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 18/10/2023 23:44, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:27:05PM +0100, Joao Martins wrote:
>> Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
>> given device.
>>
>> Capabilities are IOMMU agnostic and use device_iommu_capable() API passing
>> one of the IOMMU_CAP_*. Enumerate IOMMU_CAP_DIRTY for now in the
>> out_capabilities field returned back to userspace.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/iommu/iommufd/device.c |  4 ++++
>>  include/uapi/linux/iommufd.h   | 11 +++++++++++
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
>> index e88fa73a45e6..71ee22dc1a85 100644
>> --- a/drivers/iommu/iommufd/device.c
>> +++ b/drivers/iommu/iommufd/device.c
>> @@ -1185,6 +1185,10 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
>>  	 */
>>  	cmd->data_len = data_len;
>>  
>> +	cmd->out_capabilities = 0;
>> +	if (device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY))
>> +		cmd->out_capabilities |= IOMMU_HW_CAP_DIRTY_TRACKING;
>> +
>>  	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
>>  out_free:
>>  	kfree(data);
>> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
>> index efeb12c1aaeb..91de0043e73f 100644
>> --- a/include/uapi/linux/iommufd.h
>> +++ b/include/uapi/linux/iommufd.h
>> @@ -419,6 +419,14 @@ enum iommu_hw_info_type {
>>  	IOMMU_HW_INFO_TYPE_INTEL_VTD,
>>  };
>>   
>> +/**
>> + * enum iommufd_hw_info_capabilities
>> + * @IOMMU_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>> + */
> 
> Lets write more details here, which iommufd APIs does this flag mean
> work.
> 

I added this below. It is relatively brief, as I expect people to read what
each of the APIs does. Unless I should expand it here?

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index ef8a1243eb57..43ed2f208503 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -422,6 +422,12 @@ enum iommu_hw_info_type {
 /**
  * enum iommufd_hw_info_capabilities
  * @IOMMU_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
+ *                            If available, it means the following APIs
+ *                            are supported:
+ *
+ *                                   IOMMU_HWPT_GET_DIRTY_IOVA
+ *                                   IOMMU_HWPT_SET_DIRTY
+ *
  */
 enum iommufd_hw_capabilities {
        IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
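
For completeness, userspace probing would then be along these lines
(illustrative fragment only; the iommu_hw_info struct and ioctl names are
assumed from the GET_HW_INFO uapi, out_capabilities being what this patch
adds):

	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
	};

	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return -errno;

	/* IOMMU_HWPT_SET_DIRTY / IOMMU_HWPT_GET_DIRTY_IOVA usable if set */
	if (info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING)
		dirty_tracking_supported = true;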

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 20:27 ` [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
  2023-10-18 22:39   ` Jason Gunthorpe
@ 2023-10-19 10:01   ` Joao Martins
  2023-10-20  6:32   ` Tian, Kevin
  2 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-19 10:01 UTC (permalink / raw)
  To: iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm

On 18/10/2023 21:27, Joao Martins wrote:
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 9e1721e38819..efeb12c1aaeb 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -48,6 +48,7 @@ enum {
>  	IOMMUFD_CMD_HWPT_ALLOC,
>  	IOMMUFD_CMD_GET_HW_INFO,
>  	IOMMUFD_CMD_HWPT_SET_DIRTY,
> +	IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA,
>  };
>  
>  /**
> @@ -479,4 +480,31 @@ struct iommu_hwpt_set_dirty {
>  	__u32 __reserved;
>  };
>  #define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
> +
> +/**
> + * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
> + * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
> + * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
> + * @flags: Flags to control dirty tracking status.
> + * @iova: base IOVA of the bitmap first bit
> + * @length: IOVA range size
> + * @page_size: page size granularity of each bit in the bitmap
> + * @data: bitmap where to set the dirty bits. The bitmap bits each
> + * represent a page_size which you deviate from an arbitrary iova.
> + * Checking a given IOVA is dirty:
> + *
> + *  data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> + */
> +struct iommu_hwpt_get_dirty_iova {
> +	__u32 size;
> +	__u32 hwpt_id;
> +	__u32 flags;
> +	__u32 __reserved;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +	__aligned_u64 page_size;
> +	__aligned_u64 *data;
> +};
> +#define IOMMU_HWPT_GET_DIRTY_IOVA _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA)
> +
>  #endif

I added this extra chunk:

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 6b26045f6577..3349347cb766 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -502,7 +502,7 @@ struct iommu_hwpt_set_dirty {
  * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
  * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
  * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
- * @flags: Flags to control dirty tracking status.
+ * @flags: Flags to control the fetching of dirty IOVAs.
  * @iova: base IOVA of the bitmap first bit
  * @length: IOVA range size
  * @page_size: page size granularity of each bit in the bitmap
@@ -511,6 +511,10 @@ struct iommu_hwpt_set_dirty {
  * Checking a given IOVA is dirty:
  *
  *  data[(iova / page_size) / 64] & (1ULL << (iova % 64))
+ *
+ * Walk the IOMMU pagetables for a given IOVA range to return a bitmap
+ * with the dirty IOVAs. In doing so it will also by default clear any
+ * dirty bit metadata set in the IOPTE.
  */
 struct iommu_hwpt_get_dirty_iova {
        __u32 size;

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-19  9:14     ` Joao Martins
@ 2023-10-19 10:33       ` Joao Martins
  2023-10-19 23:56       ` Jason Gunthorpe
  1 sibling, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-19 10:33 UTC (permalink / raw)
  To: Baolu Lu, iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 19/10/2023 10:14, Joao Martins wrote:
> On 19/10/2023 04:04, Baolu Lu wrote:
>> On 10/19/23 4:27 AM, Joao Martins wrote:
>>> +/*
>>> + * Enable second level A/D bits by setting the SLADE (Second Level
>>
>> nit: Disable second level ....
>>
> /me nods
> 
>>> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
>>> + * entry.
>>> + */
>>> +static inline void pasid_clear_ssade(struct pasid_entry *pe)
>>> +{
>>> +    pasid_set_bits(&pe->val[0], 1 << 9, 0);
>>> +}
>>> +
>>> +/*
>>> + * Checks if second level A/D bits by setting the SLADE (Second Level
>>> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
>>> + * entry is enabled.
>>> + */
>>> +static inline bool pasid_get_ssade(struct pasid_entry *pe)
>>> +{
>>> +    return pasid_get_bits(&pe->val[0]) & (1 << 9);
>>> +}
>>> +

Adjusted this part to read a little better:

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 785384a59d55..deb775d84499 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -351,7 +351,7 @@ static inline void pasid_set_ssade(struct pasid_entry *pe)
 }

 /*
- * Enable second level A/D bits by setting the SLADE (Second Level
+ * Disable second level A/D bits by clearing the SLADE (Second Level
  * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
  * entry.
  */
@@ -361,9 +361,9 @@ static inline void pasid_clear_ssade(struct pasid_entry *pe)
 }

 /*
- * Checks if second level A/D bits by setting the SLADE (Second Level
+ * Checks if second level A/D bits specifically the SLADE (Second Level
  * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
- * entry is enabled.
+ * entry is set.
  */
 static inline bool pasid_get_ssade(struct pasid_entry *pe)
 {

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-19  0:17     ` Joao Martins
@ 2023-10-19 11:58       ` Joao Martins
  2023-10-19 23:59         ` Jason Gunthorpe
  2023-10-20  2:21         ` Baolu Lu
  0 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-19 11:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 19/10/2023 01:17, Joao Martins wrote:
> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>> +					 unsigned long iova, size_t size,
>>> +					 unsigned long flags,
>>> +					 struct iommu_dirty_bitmap *dirty)
>>> +{
>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>> +	unsigned long end = iova + size - 1;
>>> +
>>> +	do {
>>> +		unsigned long pgsize = 0;
>>> +		u64 *ptep, pte;
>>> +
>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>>> +		if (ptep)
>>> +			pte = READ_ONCE(*ptep);
>>
>> It is fine for now, but this is so slow for something that is such a
>> fast path. We are optimizing away a TLB invalidation but leaving
>> this???
>>
> 
> More obvious reason is that I'm still working towards the 'faster' page table
> walker. Then map/unmap code needs to do similar lookups so thought of reusing
> the same functions as map/unmap initially. And improve it afterwards or when
> introducing the splitting.
> 
>> It is a radix tree, you walk trees by retaining your position at each
>> level as you go (eg in a function per-level call chain or something)
>> then ++ is cheap. Re-searching the entire tree every time is madness.
> 
> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
> still in the works), 

Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
map/unmap/iova_to_phys) does something a little off when it finds a non-present
PTE: it allocates a page table for it, which is not OK in this specific case (I
would argue it isn't for iova_to_phys either, but maybe I misunderstand the
expectation of that API).

AMD has no such behaviour, though that driver per your earlier suggestion might
need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
don't need to wait for the last 3 series of AMD driver refactoring (?) to be
done, as those look to be more SVA related; unless there's something more
specific you are looking for prior to introducing AMD's domain_alloc_user().

Anyhow, let me fix this and post an update. Perhaps it's best I target this for
-rc1 and have improved page-table walkers all at once [the iommufd_log_perf
thingie below is unlikely to be part of this set right away]. I have been
playing with the AMD driver a lot more on bare metal, so I am getting confident
in the snippet below (even with big IOVA ranges). I'm also checking in-house
whether there's now a rev3.0 Intel machine that I can post results from for -rc1
(last time, in v2, I didn't; but things could have changed).

> but in my experiments with huge IOVA ranges, the time to
> walk the page tables ends up making not that much difference, compared to the
> size it needs to walk. However, none of this matters once we move up
> a level (PMD): walking huge IOVA ranges is then super-cheap (and invisible with
> PUDs). Which makes the dynamic-splitting/page-demotion important.
> 
> Furthermore, this is not yet easy for other people to test and see numbers
> for themselves; so more and more I need to work on something like an
> iommufd_log_perf tool under tools/testing, similar to gup_perf, to make all
> performance work obvious and 'standardized'
> 
> ------->8--------
> [0] [hasn't been rebased into this version I sent]
> 
> commit 431de7e855ee8c1622663f8d81600f62fed0ed4a
> Author: Joao Martins <joao.m.martins@oracle.com>
> Date:   Sat Oct 7 18:17:33 2023 -0400
> 
>     iommu/amd: Improve dirty read io-pgtable walker
> 
>     The fetch_pte() based lookup is a little inefficient for level-1 page-sizes.
> 
>     It walks all the levels to return a PTE, disregarding the potential
>     batching that could be done at the previous level. Implement a
>     page-table walker based on the freeing functions which recursively walks
>     the next level.
> 
>     For each level it iterates on the non-default page sizes that the
>     different mappings return, given that each level-7 PTE may account for
>     the next power-of-2 per added PTE.
> 
>     Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> index 29f5ab0ba14f..babb5fb5fd51 100644
> --- a/drivers/iommu/amd/io_pgtable.c
> +++ b/drivers/iommu/amd/io_pgtable.c
> @@ -552,39 +552,63 @@ static bool pte_test_and_clear_dirty(u64 *ptep, unsigned
> long size)
>         return dirty;
>  }
> 
> +static bool pte_is_large_or_base(u64 *ptep)
> +{
> +       return (PM_PTE_LEVEL(*ptep) == 0 || PM_PTE_LEVEL(*ptep) == 7);
> +}
> +
> +static int walk_iova_range(u64 *pt, unsigned long iova, size_t size,
> +                          int level, unsigned long flags,
> +                          struct iommu_dirty_bitmap *dirty)
> +{
> +       unsigned long addr, isize, end = iova + size;
> +       unsigned long page_size;
> +       int i, next_level;
> +       u64 *p, *ptep;
> +
> +       next_level = level - 1;
> +       isize = page_size = PTE_LEVEL_PAGE_SIZE(next_level);
> +
> +       for (addr = iova; addr < end; addr += isize) {
> +               i = PM_LEVEL_INDEX(next_level, addr);
> +               ptep = &pt[i];
> +
> +               /* PTE present? */
> +               if (!IOMMU_PTE_PRESENT(*ptep))
> +                       continue;
> +
> +               if (level > 1 && !pte_is_large_or_base(ptep)) {
> +                       p = IOMMU_PTE_PAGE(*ptep);
> +                       isize = min(end - addr, page_size);
> +                       walk_iova_range(p, addr, isize, next_level,
> +                                       flags, dirty);
> +               } else {
> +                       isize = PM_PTE_LEVEL(*ptep) == 7 ?
> +                                       PTE_PAGE_SIZE(*ptep) : page_size;
> +
> +                       /*
> +                        * Mark the whole IOVA range as dirty even if only one
> +                        * of the replicated PTEs were marked dirty.
> +                        */
> +                       if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> +                                       pte_test_dirty(ptep, isize)) ||
> +                           pte_test_and_clear_dirty(ptep, isize))
> +                               iommu_dirty_bitmap_record(dirty, addr, isize);
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>                                          unsigned long iova, size_t size,
>                                          unsigned long flags,
>                                          struct iommu_dirty_bitmap *dirty)
>  {
>         struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> -       unsigned long end = iova + size - 1;
> -
> -       do {
> -               unsigned long pgsize = 0;
> -               u64 *ptep, pte;
> -
> -               ptep = fetch_pte(pgtable, iova, &pgsize);
> -               if (ptep)
> -                       pte = READ_ONCE(*ptep);
> -               if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
> -                       pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
> -                       iova += pgsize;
> -                       continue;
> -               }
> -
> -               /*
> -                * Mark the whole IOVA range as dirty even if only one of
> -                * the replicated PTEs were marked dirty.
> -                */
> -               if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> -                               pte_test_dirty(ptep, pgsize)) ||
> -                   pte_test_and_clear_dirty(ptep, pgsize))
> -                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
> -               iova += pgsize;
> -       } while (iova < end);
> 
> -       return 0;
> +       return walk_iova_range(pgtable->root, iova, size,
> +                              pgtable->mode, flags, dirty);
>  }
> 
>  /*

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 23:43     ` Joao Martins
@ 2023-10-19 12:01       ` Jason Gunthorpe
  2023-10-19 12:04         ` Joao Martins
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-19 12:01 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Thu, Oct 19, 2023 at 12:43:19AM +0100, Joao Martins wrote:
> On 18/10/2023 23:39, Jason Gunthorpe wrote:
> > On Wed, Oct 18, 2023 at 09:27:04PM +0100, Joao Martins wrote:
> > 
> >> +int iommufd_check_iova_range(struct iommufd_ioas *ioas,
> >> +			     struct iommu_hwpt_get_dirty_iova *bitmap)
> >> +{
> >> +	unsigned long pgshift, npages;
> >> +	size_t iommu_pgsize;
> >> +	int rc = -EINVAL;
> >> +
> >> +	pgshift = __ffs(bitmap->page_size);
> >> +	npages = bitmap->length >> pgshift;
> > 
> > npages = bitmap->length / bitmap->page_size;
> > 
> > ? (if page_size is a bitmask it is badly named)
> > 
> 
> It was a way to avoid a divide by zero, but
> I can switch to the above and check for bitmap->page_size
> being non-zero. Should be less obscure

Why would we get so far along with a 0 page size? Reject that when
bitmap is created??

> >> +static int __iommu_read_and_clear_dirty(struct iova_bitmap *bitmap,
> >> +					unsigned long iova, size_t length,
> >> +					void *opaque)
> >> +{
> >> +	struct iopt_area *area;
> >> +	struct iopt_area_contig_iter iter;
> >> +	struct iova_bitmap_fn_arg *arg = opaque;
> >> +	struct iommu_domain *domain = arg->domain;
> >> +	struct iommu_dirty_bitmap *dirty = arg->dirty;
> >> +	const struct iommu_dirty_ops *ops = domain->dirty_ops;
> >> +	unsigned long last_iova = iova + length - 1;
> >> +	int ret = -EINVAL;
> >> +
> >> +	iopt_for_each_contig_area(&iter, area, arg->iopt, iova, last_iova) {
> >> +		unsigned long last = min(last_iova, iopt_area_last_iova(area));
> >> +
> >> +		ret = ops->read_and_clear_dirty(domain, iter.cur_iova,
> >> +						last - iter.cur_iova + 1,
> >> +						0, dirty);
> > 
> > This seems like a lot of stuff going on with ret..
> 
> All to have a single return exit point, given no different cleanup is required.
> Thought it was the general best way (when possible)

Mm, at least I'm not a believer in single exit points :)

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-19 12:01       ` Jason Gunthorpe
@ 2023-10-19 12:04         ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-19 12:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 19/10/2023 13:01, Jason Gunthorpe wrote:
> On Thu, Oct 19, 2023 at 12:43:19AM +0100, Joao Martins wrote:
>> On 18/10/2023 23:39, Jason Gunthorpe wrote:
>>> On Wed, Oct 18, 2023 at 09:27:04PM +0100, Joao Martins wrote:
>>>
>>>> +int iommufd_check_iova_range(struct iommufd_ioas *ioas,
>>>> +			     struct iommu_hwpt_get_dirty_iova *bitmap)
>>>> +{
>>>> +	unsigned long pgshift, npages;
>>>> +	size_t iommu_pgsize;
>>>> +	int rc = -EINVAL;
>>>> +
>>>> +	pgshift = __ffs(bitmap->page_size);
>>>> +	npages = bitmap->length >> pgshift;
>>>
>>> npages = bitmap->length / bitmap->page_size;
>>>
>>> ? (if page_size is a bitmask it is badly named)
>>>
>>
>> It was a way to avoid a divide by zero, but
>> I can switch to the above and check for bitmap->page_size
>> being non-zero. Should be less obscure
> 
> Why would we get so far along with a 0 page size? Reject that when
> bitmap is created??
> 
Yeah, right now I have an extra

	if (!bitmap->page_size)
		return rc;

	npages = bitmap->length / bitmap->page_size;

It is nonsensical to consider the whole thing valid if page_size is 0.
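
Roughly, the check ends up as below (a sketch only; whether to also validate
that page_size is a power of two or one of the domain's supported page sizes
is an assumption of this example, not something the patch does):

	/* Reject a zero or non-power-of-two page_size before using it as a
	 * divisor or deriving a shift from it. */
	if (!bitmap->page_size || !is_power_of_2(bitmap->page_size))
		return -EINVAL;

	pgshift = __ffs(bitmap->page_size);
	npages = bitmap->length >> pgshift;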

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd
  2023-10-18 22:14   ` Jason Gunthorpe
@ 2023-10-19 17:48     ` Brett Creeley
  0 siblings, 0 replies; 84+ messages in thread
From: Brett Creeley @ 2023-10-19 17:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm,
	Brett Creeley, Yishai Hadas

On 10/18/2023 3:14 PM, Jason Gunthorpe wrote:
> On Wed, Oct 18, 2023 at 09:26:59PM +0100, Joao Martins wrote:
>> Both VFIO and IOMMUFD will need the iova bitmap for storing dirties and walking
>> the user bitmaps, so move the common dependency into IOMMUFD.  In doing
>> so, create the symbol IOMMUFD_DRIVER which designates the builtin code that
>> will be used by drivers when selected. Today this means MLX5_VFIO_PCI and
>> PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when
>> supporting dirty tracking and select IOMMUFD_DRIVER accordingly.
>>
>> Given that the symbol may be disabled, add header definitions in
>> iova_bitmap.h for when IOMMUFD_DRIVER=n
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommufd/Kconfig                 |  4 +++
>>   drivers/iommu/iommufd/Makefile                |  1 +
>>   drivers/{vfio => iommu/iommufd}/iova_bitmap.c |  0
>>   drivers/vfio/Makefile                         |  3 +--
>>   drivers/vfio/pci/mlx5/Kconfig                 |  1 +
>>   drivers/vfio/pci/pds/Kconfig                  |  1 +
>>   include/linux/iova_bitmap.h                   | 26 +++++++++++++++++++
>>   7 files changed, 34 insertions(+), 2 deletions(-)
>>   rename drivers/{vfio => iommu/iommufd}/iova_bitmap.c (100%)

Reviewed-by: Brett Creeley <brett.creeley@amd.com>

Thanks,

Brett
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
  2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
  2023-10-18 22:16   ` Jason Gunthorpe
@ 2023-10-19 17:48   ` Brett Creeley
  2023-10-20  5:47   ` Tian, Kevin
  2023-10-20 16:44   ` Alex Williamson
  3 siblings, 0 replies; 84+ messages in thread
From: Brett Creeley @ 2023-10-19 17:48 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	Alex Williamson, kvm, Brett Creeley, Yishai Hadas

On 10/18/2023 1:27 PM, Joao Martins wrote:
> Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
> export convention i.e. using the IOMMUFD namespace. In doing so,
> import the namespace in the current users. This means VFIO and the
> vfio-pci drivers that use iova_bitmap_set().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommufd/iova_bitmap.c | 8 ++++----
>   drivers/vfio/pci/mlx5/main.c        | 1 +
>   drivers/vfio/pci/pds/pci_drv.c      | 1 +
>   drivers/vfio/vfio_main.c            | 1 +
>   4 files changed, 7 insertions(+), 4 deletions(-)

Reviewed-by: Brett Creeley <brett.creeley@amd.com>

Thanks,

Brett

> 
> diff --git a/drivers/iommu/iommufd/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
> index f54b56388e00..0a92c9eeaf7f 100644
> --- a/drivers/iommu/iommufd/iova_bitmap.c
> +++ b/drivers/iommu/iommufd/iova_bitmap.c
> @@ -268,7 +268,7 @@ struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
>          iova_bitmap_free(bitmap);
>          return ERR_PTR(rc);
>   }
> -EXPORT_SYMBOL_GPL(iova_bitmap_alloc);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_alloc, IOMMUFD);
> 
>   /**
>    * iova_bitmap_free() - Frees an IOVA bitmap object
> @@ -290,7 +290,7 @@ void iova_bitmap_free(struct iova_bitmap *bitmap)
> 
>          kfree(bitmap);
>   }
> -EXPORT_SYMBOL_GPL(iova_bitmap_free);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_free, IOMMUFD);
> 
>   /*
>    * Returns the remaining bitmap indexes from mapped_total_index to process for
> @@ -389,7 +389,7 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
> 
>          return ret;
>   }
> -EXPORT_SYMBOL_GPL(iova_bitmap_for_each);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_for_each, IOMMUFD);
> 
>   /**
>    * iova_bitmap_set() - Records an IOVA range in bitmap
> @@ -423,4 +423,4 @@ void iova_bitmap_set(struct iova_bitmap *bitmap,
>                  cur_bit += nbits;
>          } while (cur_bit <= last_bit);
>   }
> -EXPORT_SYMBOL_GPL(iova_bitmap_set);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_set, IOMMUFD);
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> index 42ec574a8622..5cf2b491d15a 100644
> --- a/drivers/vfio/pci/mlx5/main.c
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -1376,6 +1376,7 @@ static struct pci_driver mlx5vf_pci_driver = {
> 
>   module_pci_driver(mlx5vf_pci_driver);
> 
> +MODULE_IMPORT_NS(IOMMUFD);
>   MODULE_LICENSE("GPL");
>   MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
>   MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
> index ab4b5958e413..dd8c00c895a2 100644
> --- a/drivers/vfio/pci/pds/pci_drv.c
> +++ b/drivers/vfio/pci/pds/pci_drv.c
> @@ -204,6 +204,7 @@ static struct pci_driver pds_vfio_pci_driver = {
> 
>   module_pci_driver(pds_vfio_pci_driver);
> 
> +MODULE_IMPORT_NS(IOMMUFD);
>   MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
>   MODULE_AUTHOR("Brett Creeley <brett.creeley@amd.com>");
>   MODULE_LICENSE("GPL");
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 40732e8ed4c6..a96d97da367d 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1693,6 +1693,7 @@ static void __exit vfio_cleanup(void)
>   module_init(vfio_init);
>   module_exit(vfio_cleanup);
> 
> +MODULE_IMPORT_NS(IOMMUFD);
>   MODULE_VERSION(DRIVER_VERSION);
>   MODULE_LICENSE("GPL v2");
>   MODULE_AUTHOR(DRIVER_AUTHOR);
> --
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-19  9:14     ` Joao Martins
  2023-10-19 10:33       ` Joao Martins
@ 2023-10-19 23:56       ` Jason Gunthorpe
  2023-10-20 10:12         ` Joao Martins
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-19 23:56 UTC (permalink / raw)
  To: Joao Martins
  Cc: Baolu Lu, iommu, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Thu, Oct 19, 2023 at 10:14:41AM +0100, Joao Martins wrote:

> > We should also support setting dirty tracking even if the domain has not
> > been attached to any device?
> > 
> Considering this rides on hwpt-alloc which attaches a device on domain
> allocation, then this shouldn't be possible in practice.

?? It doesn't.. iommufd_hwpt_alloc() passes immediate_attach=false. The
immediate attach is only for IOAS created auto domains which can't
support dirty tracking

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO
  2023-10-19  9:55     ` Joao Martins
@ 2023-10-19 23:56       ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-19 23:56 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Thu, Oct 19, 2023 at 10:55:57AM +0100, Joao Martins wrote:

> I added this below. It is relatively brief as I expect people to read what each
> of the APIs does. Unless I should expand on it at length here?
> 
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index ef8a1243eb57..43ed2f208503 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -422,6 +422,12 @@ enum iommu_hw_info_type {
>  /**
>   * enum iommufd_hw_info_capabilities
>   * @IOMMU_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
> + *                            If available, it means the following APIs
> + *                            are supported:
> + *
> + *                                   IOMMU_HWPT_GET_DIRTY_IOVA
> + *                                   IOMMU_HWPT_SET_DIRTY
> + *
>   */

That seems fine, thanks

Jason
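
To make the gating concrete, a hedged sketch of the userspace probe (names
follow this version of the series; the exact struct iommu_hw_info layout is
assumed here rather than quoted above):

	/* Probe the device's generic capabilities and bail out early if the
	 * IOMMU cannot dirty-track, before trying SET_DIRTY/GET_DIRTY_IOVA. */
	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
	};

	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return -1;

	if (!(info.out_capabilities & IOMMU_CAP_DIRTY_TRACKING))
		return -1;	/* dirty tracking ioctls won't be usable */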

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-19 11:58       ` Joao Martins
@ 2023-10-19 23:59         ` Jason Gunthorpe
  2023-10-20 14:43           ` Joao Martins
  2023-10-20  2:21         ` Baolu Lu
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-19 23:59 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:

> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
> map/unmap/iova_to_phys) does something a little off when it finds a non-present
> PTE. It allocates a page table to it; which is not OK in this specific case (I
> would argue it's neither for iova_to_phys but well maybe I misunderstand the
> expectation of that API).

Oh :(
 
> AMD has no such behaviour, though that driver per your earlier suggestion might
> need to wait until -rc1 for some of the refactorings get merged. Hopefully we
> don't need to wait for the last 3 series of AMD Driver refactoring (?) to be
> done as that looks to be more SVA related; Unless there's something more
> specific you are looking for prior to introducing AMD's domain_alloc_user().

I don't think we need to wait, it just needs to go on the cleaning list.
 
> Anyhow, let me fix this, and post an update. Perhaps it's best I target this for
> -rc1 and have improved page-table walkers all at once [the iommufd_log_perf
> thingie below unlikely to be part of this set right away]. I have been playing
> with the AMD driver a lot more on baremetal, so I am getting confident on the
> snippet below (even with big IOVA ranges). I'm also retrying to see in-house if
> there's now a rev3.0 Intel machine that I can post results for -rc1 (last time
> in v2 I didn't; but things could have changed).

I'd rather you keep it simple and send the walkers as followups to the
driver maintainers directly.

> > for themselves; so more and more I need to work on something like
> > iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
> > performance work obvious and 'standardized'

We have a mlx5 vfio driver in rdma-core and I have been thinking it
would be a nice basis for building an iommufd tester/benchmarker as it
has a wide set of "easilly" triggered functionality.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-19 11:58       ` Joao Martins
  2023-10-19 23:59         ` Jason Gunthorpe
@ 2023-10-20  2:21         ` Baolu Lu
  2023-10-20  7:01           ` Tian, Kevin
  2023-10-20  9:34           ` Joao Martins
  1 sibling, 2 replies; 84+ messages in thread
From: Baolu Lu @ 2023-10-20  2:21 UTC (permalink / raw)
  To: Joao Martins, Jason Gunthorpe
  Cc: baolu.lu, iommu, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 10/19/23 7:58 PM, Joao Martins wrote:
> On 19/10/2023 01:17, Joao Martins wrote:
>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>> +					 unsigned long iova, size_t size,
>>>> +					 unsigned long flags,
>>>> +					 struct iommu_dirty_bitmap *dirty)
>>>> +{
>>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>> +	unsigned long end = iova + size - 1;
>>>> +
>>>> +	do {
>>>> +		unsigned long pgsize = 0;
>>>> +		u64 *ptep, pte;
>>>> +
>>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>>>> +		if (ptep)
>>>> +			pte = READ_ONCE(*ptep);
>>> It is fine for now, but this is so slow for something that is such a
>>> fast path. We are optimizing away a TLB invalidation but leaving
>>> this???
>>>
>> More obvious reason is that I'm still working towards the 'faster' page table
>> walker. Then map/unmap code needs to do similar lookups so thought of reusing
>> the same functions as map/unmap initially. And improve it afterwards or when
>> introducing the splitting.
>>
>>> It is a radix tree, you walk trees by retaining your position at each
>>> level as you go (eg in a function per-level call chain or something)
>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>> still in the works),
> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
> map/unmap/iova_to_phys) does something a little off when it finds a non-present
> PTE. It allocates a page table to it; which is not OK in this specific case (I
> would argue it's neither for iova_to_phys but well maybe I misunderstand the
> expectation of that API).

pfn_to_dma_pte() doesn't allocate page for a non-present PTE if the
target_level parameter is set to 0. See below line 932.

  913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
  914                           unsigned long pfn, int *target_level,
  915                           gfp_t gfp)
  916 {

[...]

  927         while (1) {
  928                 void *tmp_page;
  929
  930                 offset = pfn_level_offset(pfn, level);
  931                 pte = &parent[offset];
  932                 if (!*target_level && (dma_pte_superpage(pte) || 
!dma_pte_present(pte)))
  933                         break;

So both iova_to_phys() and read_and_clear_dirty() are doing things
right:

	struct dma_pte *pte;
	int level = 0;

	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
                              &level, GFP_KERNEL);
	if (pte && dma_pte_present(pte)) {
		/* The PTE is valid, check anything you want! */
		... ...
	}

Or, I am overlooking something else?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols
  2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
  2023-10-18 22:14   ` Jason Gunthorpe
@ 2023-10-20  5:45   ` Tian, Kevin
  2023-10-20 16:44   ` Alex Williamson
  2 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  5:45 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> In preparation to move iova_bitmap into iommufd, export the rest of the API
> symbols that will be used by what could be built as modules, namely:
> 
> 	iova_bitmap_alloc
> 	iova_bitmap_free
> 	iova_bitmap_for_each
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd
  2023-10-18 20:26 ` [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd Joao Martins
  2023-10-18 22:14   ` Jason Gunthorpe
@ 2023-10-20  5:46   ` Tian, Kevin
  2023-10-20 16:44   ` Alex Williamson
  2 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  5:46 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao, Brett Creeley, Yishai Hadas

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> Both VFIO and IOMMUFD will need the iova bitmap for storing dirties and
> walking the user bitmaps, so move the common dependency into IOMMUFD.
> In doing so, create the symbol IOMMUFD_DRIVER which designates the
> builtin code that will be used by drivers when selected. Today this
> means MLX5_VFIO_PCI and PDS_VFIO_PCI. IOMMU drivers will do the same
> (in future patches) when supporting dirty tracking and select
> IOMMUFD_DRIVER accordingly.
> 
> Given that the symbol may be disabled, add header definitions in
> iova_bitmap.h for when IOMMUFD_DRIVER=n
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
  2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
  2023-10-18 22:16   ` Jason Gunthorpe
  2023-10-19 17:48   ` Brett Creeley
@ 2023-10-20  5:47   ` Tian, Kevin
  2023-10-20 16:44   ` Alex Williamson
  3 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  5:47 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao, Brett Creeley, Yishai Hadas

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
> export convention i.e. using the IOMMUFD namespace. In doing so,
> import the namespace in the current users. This means VFIO and the
> vfio-pci drivers that use iova_bitmap_set().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking
  2023-10-18 20:27 ` [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking Joao Martins
  2023-10-18 22:26   ` Jason Gunthorpe
  2023-10-19  1:45   ` Baolu Lu
@ 2023-10-20  5:54   ` Tian, Kevin
  2023-10-20 11:24     ` Joao Martins
  2 siblings, 1 reply; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  5:54 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> Add to iommu domain operations a set of callbacks to perform dirty
> tracking, particularly to start and stop tracking and to read and clear the
> dirty data.
> 
> Drivers are generally expected to dynamically change their translation
> structures to toggle the tracking and flush some form of control state
> structure that stands in the IOVA translation path. Though it's not
> mandatory, as drivers can also enable dirty tracking at boot, and just
> clear the dirty bits before setting dirty tracking. For each of the newly
> added IOMMU core APIs:
> 
> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when probing
> for capabilities of the device.

IOMMU_CAP_DIRTY_TRACKING is more readable.

> @@ -671,6 +724,9 @@ struct iommu_fwspec {
>  /* ATS is supported */
>  #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
> 
> +/* Read but do not clear any dirty bits */
> +#define IOMMU_DIRTY_NO_CLEAR			(1 << 0)
> +

better move to the place where iommu_dirty_ops is defined.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach
  2023-10-18 23:38     ` Joao Martins
@ 2023-10-20  5:55       ` Tian, Kevin
  0 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  5:55 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe
  Cc: iommu, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L, Sun, Yi Y,
	Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 7:39 AM
> 
> On 18/10/2023 23:38, Jason Gunthorpe wrote:
> > On Wed, Oct 18, 2023 at 09:27:02PM +0100, Joao Martins wrote:
> >> Throughout an IOMMU domain lifetime that wants to use dirty tracking,
> >> some guarantees are needed such that any device attached to the
> >> iommu_domain supports dirty tracking.
> >>
> >> The idea is to handle a case where IOMMUs in the system are asymmetric
> >> feature-wise and thus the capability may not be supported for all devices.
> >> The enforcement is done by adding a flag into HWPT_ALLOC namely:
> >>
> >> 	IOMMUFD_HWPT_ALLOC_ENFORCE_DIRTY
> >
> > Actually, can we change this name?
> >
> > IOMMUFD_HWPT_ALLOC_DIRTY_TRACKING
> >
> > ?
> >
> Yeap.
> 

Agree. Please also adjust the related texts in commit msg and code
comments.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
  2023-10-18 22:28   ` Jason Gunthorpe
@ 2023-10-20  6:09   ` Tian, Kevin
  2023-10-20 15:30     ` Joao Martins
  2023-10-20  7:56   ` Tian, Kevin
  2023-10-20 20:41   ` Joao Martins
  3 siblings, 1 reply; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  6:09 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> +
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	const struct iommu_dirty_ops *ops = domain->dirty_ops;
> +	int ret = 0;
> +
> +	if (!ops)
> +		return -EOPNOTSUPP;
> +
> +	down_read(&iopt->iova_rwsem);
> +
> +	/* Clear dirty bits from PTEs to ensure a clean snapshot */
> +	if (enable) {
> +		ret = iopt_clear_dirty_data(iopt, domain);
> +		if (ret)
> +			goto out_unlock;
> +	}

why not leave this to the actual driver which wants to always
enable dirty, instead of making it a general burden for all?

> +
> +/*
> + * enum iommufd_set_dirty_flags - Flags for steering dirty tracking
> + * @IOMMU_DIRTY_TRACKING_ENABLE: Enables dirty tracking

s/Enables/Enable/

> + */
> +enum iommufd_hwpt_set_dirty_flags {
> +	IOMMU_DIRTY_TRACKING_ENABLE = 1,

IOMMU_HWPT_DIRTY_TRACKING_ENABLE

> +
> +/**
> + * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)

IOMMU_HWPT_SET_DIRTY_TRACKING
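
For illustration, a minimal toggle using the names in this version of the
series (field names beyond size/flags/hwpt_id are assumptions of this sketch;
the renames suggested above would only change the command and flag names):

	static int hwpt_set_dirty(int iommufd, __u32 hwpt_id, bool enable)
	{
		struct iommu_hwpt_set_dirty cmd = {
			.size = sizeof(cmd),
			.flags = enable ? IOMMU_DIRTY_TRACKING_ENABLE : 0,
			.hwpt_id = hwpt_id,
		};

		/* Enabling also clears any stale dirty bits in the IOPTEs. */
		return ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &cmd);
	}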

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-18 20:27 ` [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
  2023-10-18 22:39   ` Jason Gunthorpe
  2023-10-19 10:01   ` Joao Martins
@ 2023-10-20  6:32   ` Tian, Kevin
  2023-10-20 11:53     ` Joao Martins
  2 siblings, 1 reply; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  6:32 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> Underneath it uses the IOMMU domain kernel API which will read the dirty
> bits, as well as atomically clearing the IOPTE dirty bit and flushing the
> IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the

what does 'atomically' try to convey here?

> +/**
> + * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)

IOMMU_HWPT_GET_DIRTY_BITMAP? IOVA usually means one address
but here we talk about a bitmap of which one bit represents a page.

> + * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
> + * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
> + * @flags: Flags to control dirty tracking status.
> + * @iova: base IOVA of the bitmap first bit
> + * @length: IOVA range size
> + * @page_size: page size granularity of each bit in the bitmap
> + * @data: bitmap where to set the dirty bits. The bitmap bits each
> + * represent a page_size which you deviate from an arbitrary iova.
> + * Checking a given IOVA is dirty:
> + *
> + *  data[(iova / page_size) / 64] & (1ULL << (iova % 64))

(1ULL << ((iova / page_size) % 64)

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO
  2023-10-18 20:27 ` [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO Joao Martins
  2023-10-18 22:44   ` Jason Gunthorpe
@ 2023-10-20  6:46   ` Tian, Kevin
  2023-10-20 11:52     ` Joao Martins
  1 sibling, 1 reply; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  6:46 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> +/**
> + * enum iommufd_hw_info_capabilities
> + * @IOMMU_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking

@IOMMU_HW_CAP_DIRTY_TRACKING: ...

>  /**
>   * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
>   * @size: sizeof(struct iommu_hw_info)
> @@ -430,6 +438,8 @@ enum iommu_hw_info_type {
>   *             the iommu type specific hardware information data
>   * @out_data_type: Output the iommu hardware info type as defined in the enum
>   *                 iommu_hw_info_type.
> + * @out_capabilities: Output the iommu capability info type as defined in the
> + *                    enum iommu_hw_capabilities.

"Output the 'generic' iommu capability info ..."

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty
  2023-10-18 20:27 ` [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty Joao Martins
  2023-10-18 22:54   ` Jason Gunthorpe
@ 2023-10-20  6:52   ` Tian, Kevin
  1 sibling, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  6:52 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> VFIO has an operation where it unmaps an IOVA while returning a bitmap with
> the dirty data. In reality the operation doesn't quite query the IO
> pagetables whether the PTE was dirty or not. Instead it marks as dirty
> anything that was mapped, and does so in one syscall.
> 
> In IOMMUFD the equivalent is done in two operations by querying with
> GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
> flushes given that after clearing dirty bits IOMMU implementations require
> invalidating their IOTLB, plus another invalidation needed for the UNMAP.
> To allow dirty bits to be queried faster, add a flag
> (IOMMU_GET_DIRTY_IOVA_NO_CLEAR) that requests to not clear the dirty bits
> from the PTEs (just reading them), under the expectation that the next
> operation is the unmap. An alternative is to unmap and just perpetually
> mark as dirty, as that's the same behaviour as today. So here equivalent
> functionality can be provided with unmap alone, and if real dirty info is
> required it will amortize the cost while querying.
> 
> There's still a race against DMA where in theory the unmap of the IOVA
> (when the guest invalidates the IOTLB via emulated iommu) would race
> against the VF performing DMA on the same IOVA. As discussed in [0], we are
> accepting to resolve this race as throwing away the DMA and it doesn't
> matter if it hit physical DRAM or not; the VM can't tell if we threw it
> away because the DMA was blocked or because we failed to copy the DRAM.
> 
> [0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
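
Illustrative only: the final-pass ordering the commit message describes, with
the flag name from the patch. Pairing it with IOMMU_IOAS_UNMAP for the unmap
step, and the field names not quoted above, are assumptions of this sketch.

	/* Read dirty bits without clearing them (saves one IOTLB flush), then
	 * unmap the range; the unmap's invalidation covers both. */
	struct iommu_hwpt_get_dirty_iova get = {
		.size = sizeof(get),
		.hwpt_id = hwpt_id,
		.flags = IOMMU_GET_DIRTY_IOVA_NO_CLEAR,
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = bitmap,
	};
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = iova,
		.length = length,
	};

	if (ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get))
		return -1;

	return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);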

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-20  2:21         ` Baolu Lu
@ 2023-10-20  7:01           ` Tian, Kevin
  2023-10-20  9:34           ` Joao Martins
  1 sibling, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  7:01 UTC (permalink / raw)
  To: Baolu Lu, Martins, Joao, Jason Gunthorpe
  Cc: iommu, Shameerali Kolothum Thodi, Liu, Yi L, Sun, Yi Y,
	Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm

> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Friday, October 20, 2023 10:22 AM
> 
> On 10/19/23 7:58 PM, Joao Martins wrote:
> > On 19/10/2023 01:17, Joao Martins wrote:
> >> On 19/10/2023 00:11, Jason Gunthorpe wrote:
> >>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
> >>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
> >>>> +					 unsigned long iova, size_t size,
> >>>> +					 unsigned long flags,
> >>>> +					 struct iommu_dirty_bitmap *dirty)
> >>>> +{
> >>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> >>>> +	unsigned long end = iova + size - 1;
> >>>> +
> >>>> +	do {
> >>>> +		unsigned long pgsize = 0;
> >>>> +		u64 *ptep, pte;
> >>>> +
> >>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
> >>>> +		if (ptep)
> >>>> +			pte = READ_ONCE(*ptep);
> >>> It is fine for now, but this is so slow for something that is such a
> >>> fast path. We are optimizing away a TLB invalidation but leaving
> >>> this???
> >>>
> >> More obvious reason is that I'm still working towards the 'faster' page
> table
> >> walker. Then map/unmap code needs to do similar lookups so thought of
> reusing
> >> the same functions as map/unmap initially. And improve it afterwards or
> when
> >> introducing the splitting.
> >>
> >>> It is a radix tree, you walk trees by retaining your position at each
> >>> level as you go (eg in a function per-level call chain or something)
> >>> then ++ is cheap. Re-searching the entire tree every time is madness.
> >> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for
> Intel;
> >> still in the works),
> > Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
> > map/unmap/iova_to_phys) does something a little off when it finds a non-
> present
> > PTE. It allocates a page table to it; which is not OK in this specific case (I
> > would argue it's neither for iova_to_phys but well maybe I misunderstand
> the
> > expectation of that API).
> 
> pfn_to_dma_pte() doesn't allocate page for a non-present PTE if the
> target_level parameter is set to 0. See below line 932.
> 
>   913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>   914                           unsigned long pfn, int *target_level,
>   915                           gfp_t gfp)
>   916 {
> 
> [...]
> 
>   927         while (1) {
>   928                 void *tmp_page;
>   929
>   930                 offset = pfn_level_offset(pfn, level);
>   931                 pte = &parent[offset];
>   932                 if (!*target_level && (dma_pte_superpage(pte) ||
> !dma_pte_present(pte)))
>   933                         break;
> 
> So both iova_to_phys() and read_and_clear_dirty() are doing things
> right:
> 
> 	struct dma_pte *pte;
> 	int level = 0;
> 
> 	pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>                               &level, GFP_KERNEL);
> 	if (pte && dma_pte_present(pte)) {
> 		/* The PTE is valid, check anything you want! */
> 		... ...
> 	}
> 
> Or, I am overlooking something else?
> 

This is correct.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-18 20:27 ` [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
  2023-10-19  3:04   ` Baolu Lu
@ 2023-10-20  7:53   ` Tian, Kevin
  2023-10-20  9:15     ` Baolu Lu
  1 sibling, 1 reply; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  7:53 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> +
> +	if (!IS_ERR(domain) && enforce_dirty) {
> +		if (to_dmar_domain(domain)->use_first_level) {
> +			iommu_domain_free(domain);
> +			return ERR_PTR(-EOPNOTSUPP);
> +		}
> +		domain->dirty_ops = &intel_dirty_ops;
> +	}

I don't understand why we should check use_first_level here. It's
a dead condition as alloc_user() only uses 2nd level. Later when
nested is introduced then we explicitly disallow enforce_dirty
on a nested user domain.

otherwise with Baolu's comments addressed:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
  2023-10-18 22:28   ` Jason Gunthorpe
  2023-10-20  6:09   ` Tian, Kevin
@ 2023-10-20  7:56   ` Tian, Kevin
  2023-10-20 20:41   ` Joao Martins
  3 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  7:56 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Tian, Kevin
> Sent: Friday, October 20, 2023 2:09 PM
> 
> > From: Joao Martins <joao.m.martins@oracle.com>
> > Sent: Thursday, October 19, 2023 4:27 AM
> >
> > +
> > +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> > +			    struct iommu_domain *domain, bool enable)
> > +{
> > +	const struct iommu_dirty_ops *ops = domain->dirty_ops;
> > +	int ret = 0;
> > +
> > +	if (!ops)
> > +		return -EOPNOTSUPP;
> > +
> > +	down_read(&iopt->iova_rwsem);
> > +
> > +	/* Clear dirty bits from PTEs to ensure a clean snapshot */
> > +	if (enable) {
> > +		ret = iopt_clear_dirty_data(iopt, domain);
> > +		if (ret)
> > +			goto out_unlock;
> > +	}
> 
> why not leave this to the actual driver which wants to always
> enable dirty, instead of making it a general burden for all?
> 

Looks like I was misled by the commit msg. This is really for the fact
that there is no guarantee from a previous cycle that the user will
quiesce the device, read/clear all the remaining dirty bits and then
disable dirty tracking. That's true for live migration but not if just
looking at this dirty tracking alone.

So,

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 13/18] iommufd/selftest: Expand mock_domain with dev_flags
  2023-10-18 20:27 ` [PATCH v4 13/18] iommufd/selftest: Expand mock_domain with dev_flags Joao Martins
@ 2023-10-20  7:57   ` Tian, Kevin
  0 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  7:57 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> Expand the mock_domain test to be able to manipulate the device capabilities.
> This allows testing with mockdev without dirty tracking support advertised
> and thus makes sure the enforce_dirty test does the expected.
> 
> To avoid breaking the IOMMUFD_TEST UABI, replicate the mock_domain struct
> and thus add an input dev_flags at the end.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 14/18] iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY
  2023-10-18 20:27 ` [PATCH v4 14/18] iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY Joao Martins
@ 2023-10-20  7:59   ` Tian, Kevin
  0 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  7:59 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> In order to selftest the iommu domain dirty enforcing, implement the
> necessary mock_domain support and add a new dev_flags to test that the
> hwpt_alloc/attach_device fails as expected.
> 
> Expand the existing mock_domain fixture with an enforce_dirty test that
> exercises the hwpt_alloc and device attachment.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* RE: [PATCH v4 15/18] iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:27 ` [PATCH v4 15/18] iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY Joao Martins
@ 2023-10-20  8:00   ` Tian, Kevin
  0 siblings, 0 replies; 84+ messages in thread
From: Tian, Kevin @ 2023-10-20  8:00 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm,
	Martins, Joao

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, October 19, 2023 4:27 AM
> 
> Change mock_domain to support dirty tracking and add tests to exercise
> the new SET_DIRTY API in the iommufd_dirty_tracking selftest fixture.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-20  7:53   ` Tian, Kevin
@ 2023-10-20  9:15     ` Baolu Lu
  0 siblings, 0 replies; 84+ messages in thread
From: Baolu Lu @ 2023-10-20  9:15 UTC (permalink / raw)
  To: Tian, Kevin, Martins, Joao, iommu
  Cc: baolu.lu, Jason Gunthorpe, Shameerali Kolothum Thodi, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm

On 2023/10/20 15:53, Tian, Kevin wrote:
>> From: Joao Martins<joao.m.martins@oracle.com>
>> Sent: Thursday, October 19, 2023 4:27 AM
>>
>> +
>> +	if (!IS_ERR(domain) && enforce_dirty) {
>> +		if (to_dmar_domain(domain)->use_first_level) {
>> +			iommu_domain_free(domain);
>> +			return ERR_PTR(-EOPNOTSUPP);
>> +		}
>> +		domain->dirty_ops = &intel_dirty_ops;
>> +	}
> I don't understand why we should check use_first_level here. It's
> a dead condition as alloc_user() only uses 2nd level. Later when
> nested is introduced then we explicitly disallow enforce_dirty
> on a nested user domain.

This is to restrict dirty tracking to the second level page table
*explicitly*. If we have a need to enable dirty tracking on the first
level in the future, we can remove this check.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-20  2:21         ` Baolu Lu
  2023-10-20  7:01           ` Tian, Kevin
@ 2023-10-20  9:34           ` Joao Martins
  2023-10-20 11:20             ` Joao Martins
  1 sibling, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-20  9:34 UTC (permalink / raw)
  To: Baolu Lu
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu, Yi Y Sun,
	Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm,
	Jason Gunthorpe

On 20/10/2023 03:21, Baolu Lu wrote:
> On 10/19/23 7:58 PM, Joao Martins wrote:
>> On 19/10/2023 01:17, Joao Martins wrote:
>>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>>> +                     unsigned long iova, size_t size,
>>>>> +                     unsigned long flags,
>>>>> +                     struct iommu_dirty_bitmap *dirty)
>>>>> +{
>>>>> +    struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>>> +    unsigned long end = iova + size - 1;
>>>>> +
>>>>> +    do {
>>>>> +        unsigned long pgsize = 0;
>>>>> +        u64 *ptep, pte;
>>>>> +
>>>>> +        ptep = fetch_pte(pgtable, iova, &pgsize);
>>>>> +        if (ptep)
>>>>> +            pte = READ_ONCE(*ptep);
>>>> It is fine for now, but this is so slow for something that is such a
>>>> fast path. We are optimizing away a TLB invalidation but leaving
>>>> this???
>>>>
>>> More obvious reason is that I'm still working towards the 'faster' page table
>>> walker. Then map/unmap code needs to do similar lookups so thought of reusing
>>> the same functions as map/unmap initially. And improve it afterwards or when
>>> introducing the splitting.
>>>
>>>> It is a radix tree, you walk trees by retaining your position at each
>>>> level as you go (eg in a function per-level call chain or something)
>>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>>> still in the works),
>> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
>> map/unmap/iova_to_phys) does something a little off when it finds a non-present
>> PTE. It allocates a page table to it; which is not OK in this specific case (I
>> would argue it's neither for iova_to_phys but well maybe I misunderstand the
>> expectation of that API).
> 
> pfn_to_dma_pte() doesn't allocate page for a non-present PTE if the
> target_level parameter is set to 0. See below line 932.
> 
>  913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>  914                           unsigned long pfn, int *target_level,
>  915                           gfp_t gfp)
>  916 {
> 
> [...]
> 
>  927         while (1) {
>  928                 void *tmp_page;
>  929
>  930                 offset = pfn_level_offset(pfn, level);
>  931                 pte = &parent[offset];
>  932                 if (!*target_level && (dma_pte_superpage(pte) ||
> !dma_pte_present(pte)))
>  933                         break;
> 
> So both iova_to_phys() and read_and_clear_dirty() are doing things
> right:
> 
>     struct dma_pte *pte;
>     int level = 0;
> 
>     pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>                              &level, GFP_KERNEL);
>     if (pte && dma_pte_present(pte)) {
>         /* The PTE is valid, check anything you want! */
>         ... ...
>     }
> 
> Or, I am overlooking something else?

You're right, thanks for keeping me straight -- I was already doing the
right thing. I'd forgotten about it in the midst of the other code -- probably
worth a comment in the caller to make it obvious.
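
Something along these lines in the caller, i.e. intel_iommu_read_and_clear_dirty()
(the loop is as in this patch; the comment wording is just a sketch):

	do {
		struct dma_pte *pte;
		int lvl = 0;

		/*
		 * Passing lvl == 0 makes pfn_to_dma_pte() only walk the
		 * existing page table and never allocate intermediate
		 * levels, so a non-present IOVA is simply skipped below.
		 */
		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
				     &lvl, GFP_ATOMIC);
		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
		if (!pte || !dma_pte_present(pte)) {
			iova += pgsize;
			continue;
		}

		if (dma_sl_pte_test_and_clear_dirty(pte, flags))
			iommu_dirty_bitmap_record(dirty, iova, pgsize);
		iova += pgsize;
	} while (iova < end);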

	Joao

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains
  2023-10-19 23:56       ` Jason Gunthorpe
@ 2023-10-20 10:12         ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 10:12 UTC (permalink / raw)
  To: Baolu Lu, Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu, Yi Y Sun,
	Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm



On 20/10/2023 00:56, Jason Gunthorpe wrote:
> On Thu, Oct 19, 2023 at 10:14:41AM +0100, Joao Martins wrote:
> 
>>> We should also support setting dirty tracking even if the domain has not
>>> been attached to any device?
>>>
>> Considering this rides on hwpt-alloc which attaches a device on domain
>> allocation, then this shouldn't be possible in practice.
> 
> ?? It doesn't.. iommufd_hwpt_alloc() pass immediate_attach=false. The
> immediate attach is only for IOAS created auto domains which can't
> support dirty tracking
I was mixing up the domain being created with a device it can validate against,
vs what actually ends up happening right after (in a separate operation), which
is the device being attached to the domain via ATTACH_IOMMU_PT. So technically
one can set dirty tracking on an empty domain.

	Joao

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-20  9:34           ` Joao Martins
@ 2023-10-20 11:20             ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 11:20 UTC (permalink / raw)
  To: Baolu Lu
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Yi Liu, Yi Y Sun,
	Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm,
	Jason Gunthorpe

On 20/10/2023 10:34, Joao Martins wrote:
> On 20/10/2023 03:21, Baolu Lu wrote:
>> On 10/19/23 7:58 PM, Joao Martins wrote:
>>> On 19/10/2023 01:17, Joao Martins wrote:
>>>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>>>> +                     unsigned long iova, size_t size,
>>>>>> +                     unsigned long flags,
>>>>>> +                     struct iommu_dirty_bitmap *dirty)
>>>>>> +{
>>>>>> +    struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>>>> +    unsigned long end = iova + size - 1;
>>>>>> +
>>>>>> +    do {
>>>>>> +        unsigned long pgsize = 0;
>>>>>> +        u64 *ptep, pte;
>>>>>> +
>>>>>> +        ptep = fetch_pte(pgtable, iova, &pgsize);
>>>>>> +        if (ptep)
>>>>>> +            pte = READ_ONCE(*ptep);
>>>>> It is fine for now, but this is so slow for something that is such a
>>>>> fast path. We are optimizing away a TLB invalidation but leaving
>>>>> this???
>>>>>
>>>> More obvious reason is that I'm still working towards the 'faster' page table
>>>> walker. Then map/unmap code needs to do similar lookups so thought of reusing
>>>> the same functions as map/unmap initially. And improve it afterwards or when
>>>> introducing the splitting.
>>>>
>>>>> It is a radix tree, you walk trees by retaining your position at each
>>>>> level as you go (eg in a function per-level call chain or something)
>>>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>>>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>>>> still in the works),
>>> Sigh, I realized that Intel's pfn_to_dma_pte() (main lookup function for
>>> map/unmap/iova_to_phys) does something a little off when it finds a non-present
>>> PTE. It allocates a page table to it; which is not OK in this specific case (I
>>> would argue it's neither for iova_to_phys but well maybe I misunderstand the
>>> expectation of that API).
>>
>> pfn_to_dma_pte() doesn't allocate page for a non-present PTE if the
>> target_level parameter is set to 0. See below line 932.
>>
>>  913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>>  914                           unsigned long pfn, int *target_level,
>>  915                           gfp_t gfp)
>>  916 {
>>
>> [...]
>>
>>  927         while (1) {
>>  928                 void *tmp_page;
>>  929
>>  930                 offset = pfn_level_offset(pfn, level);
>>  931                 pte = &parent[offset];
>>  932                 if (!*target_level && (dma_pte_superpage(pte) ||
>> !dma_pte_present(pte)))
>>  933                         break;
>>
>> So both iova_to_phys() and read_and_clear_dirty() are doing things
>> right:
>>
>>     struct dma_pte *pte;
>>     int level = 0;
>>
>>     pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>>                              &level, GFP_KERNEL);
>>     if (pte && dma_pte_present(pte)) {
>>         /* The PTE is valid, check anything you want! */
>>         ... ...
>>     }
>>
>> Or, I am overlooking something else?
> 
> You're right, thanks for keeping me straight -- I was already doing the
> right thing. I'd forgotten about it in the midst of the other code -- probably
> worth a comment in the caller to make it obvious.

For what it's worth, this is the improved page-table walker I have in staging (as
a separate patch) alongside AMD's. It is quite similar, except the AMD IOMMU has a
bigger featureset in the PTE page sizes it can represent, but the crux of the
walking is the same, bearing a different coding style across the IOMMU drivers.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 97558b420e35..f6990962af2a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4889,14 +4889,52 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
        return ret;
 }

+static int walk_dirty_dma_pte_level(struct dmar_domain *domain, int level,
+                                   struct dma_pte *pte, unsigned long start_pfn,
+                                   unsigned long last_pfn, unsigned long flags,
+                                   struct iommu_dirty_bitmap *dirty)
+{
+       unsigned long pfn, page_size;
+
+       pfn = start_pfn;
+       pte = &pte[pfn_level_offset(pfn, level)];
+
+       do {
+               unsigned long level_pfn = pfn & level_mask(level);
+               unsigned long level_last;
+
+               if (!dma_pte_present(pte))
+                       goto next;
+
+               if (level > 1 && !dma_pte_superpage(pte)) {
+                       level_last = level_pfn + level_size(level) - 1;
+                       level_last = min(level_last, last_pfn);
+                       walk_dirty_dma_pte_level(domain, level - 1,
+                                                phys_to_virt(dma_pte_addr(pte)),
+                                                pfn, level_last,
+                                                flags, dirty);
+               } else {
+                       page_size = level_size(level) << VTD_PAGE_SHIFT;
+
+                       if (dma_sl_pte_test_and_clear_dirty(pte, flags))
+                               iommu_dirty_bitmap_record(dirty,
+                                                         pfn << VTD_PAGE_SHIFT,
+                                                         page_size);
+               }
+next:
+               pfn = level_pfn + level_size(level);
+       } while (!first_pte_in_page(++pte) && pfn <= last_pfn);
+
+       return 0;
+}
+
 static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
                                            unsigned long iova, size_t size,
                                            unsigned long flags,
                                            struct iommu_dirty_bitmap *dirty)
 {
        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-       unsigned long end = iova + size - 1;
-       unsigned long pgsize;
+       unsigned long start_pfn, last_pfn;

        /*
         * IOMMUFD core calls into a dirty tracking disabled domain without an
@@ -4907,24 +4945,14 @@ static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
        if (!dmar_domain->dirty_tracking && dirty->bitmap)
                return -EINVAL;

-       do {
-               struct dma_pte *pte;
-               int lvl = 0;
-
-               pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
-                                    GFP_ATOMIC);
-               pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
-               if (!pte || !dma_pte_present(pte)) {
-                       iova += pgsize;
-                       continue;
-               }

-               if (dma_sl_pte_test_and_clear_dirty(pte, flags))
-                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
-               iova += pgsize;
-       } while (iova < end);
+       start_pfn = iova >> VTD_PAGE_SHIFT;
+       last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;

-       return 0;
+       return walk_dirty_dma_pte_level(dmar_domain,
+                                       agaw_to_level(dmar_domain->agaw),
+                                       dmar_domain->pgd, start_pfn, last_pfn,
+                                       flags, dirty);
 }

 const struct iommu_dirty_ops intel_dirty_ops = {

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking
  2023-10-20  5:54   ` Tian, Kevin
@ 2023-10-20 11:24     ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 11:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson,
	iommu, kvm



On 20/10/2023 06:54, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Thursday, October 19, 2023 4:27 AM
>>
>> Add to iommu domain operations a set of callbacks to perform dirty
>> tracking, particularly to start and stop tracking and to read and clear the
>> dirty data.
>>
>> Drivers are generally expected to dynamically change their translation
>> structures to toggle the tracking and flush some form of control state
>> structure that stands in the IOVA translation path. Though it's not
>> mandatory, as drivers can also enable dirty tracking at boot, and just
>> clear the dirty bits before setting dirty tracking. For each of the newly
>> added IOMMU core APIs:
>>
>> iommu_cap::IOMMU_CAP_DIRTY: new device iommu_capable value when
>> probing for
>> capabilities of the device.
> 
> IOMMU_CAP_DIRTY_TRACKING is more readable.
> 
OK
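
(For context, the container these callbacks hang off of ends up looking roughly
like this in the series -- just a sketch pieced together from the driver patches,
not the authoritative definition:)

	struct iommu_dirty_ops {
		int (*set_dirty_tracking)(struct iommu_domain *domain,
					  bool enabled);
		int (*read_and_clear_dirty)(struct iommu_domain *domain,
					    unsigned long iova, size_t size,
					    unsigned long flags,
					    struct iommu_dirty_bitmap *dirty);
	};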

>> @@ -671,6 +724,9 @@ struct iommu_fwspec {
>>  /* ATS is supported */
>>  #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
>>
>> +/* Read but do not clear any dirty bits */
>> +#define IOMMU_DIRTY_NO_CLEAR			(1 << 0)
>> +
> 
> better move to the place where iommu_dirty_ops is defined.
> 
OK

> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO
  2023-10-20  6:46   ` Tian, Kevin
@ 2023-10-20 11:52     ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 11:52 UTC (permalink / raw)
  To: Tian, Kevin, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm



On 20/10/2023 07:46, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Thursday, October 19, 2023 4:27 AM
>>
>> +/**
>> + * enum iommufd_hw_info_capabilities
>> + * @IOMMU_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty
>> tracking
> 
> @IOMMU_HW_CAP_DIRTY_TRACKING: ...
> 
OK
>>  /**
>>   * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
>>   * @size: sizeof(struct iommu_hw_info)
>> @@ -430,6 +438,8 @@ enum iommu_hw_info_type {
>>   *             the iommu type specific hardware information data
>>   * @out_data_type: Output the iommu hardware info type as defined in the
>> enum
>>   *                 iommu_hw_info_type.
>> + * @out_capabilities: Output the iommu capability info type as defined in
>> the
>> + *                    enum iommu_hw_capabilities.
> 
> "Output the 'generic' iommu capability info ..."
>

Yeap, 'generic' is important here.
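
As a side note, from userspace the probing would look roughly like this
(illustrative sketch only; it assumes the usual size/dev_id plumbing of
GET_HW_INFO and the IOMMU_HW_CAP_DIRTY_TRACKING name you suggest, with error
handling trimmed):

	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,	/* device bound to iommufd */
	};

	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return -errno;

	if (!(info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING))
		return -EOPNOTSUPP;	/* IOMMU can't report dirty bits */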

> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-20  6:32   ` Tian, Kevin
@ 2023-10-20 11:53     ` Joao Martins
  2023-10-20 13:40       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Joao Martins @ 2023-10-20 11:53 UTC (permalink / raw)
  To: Tian, Kevin, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm

On 20/10/2023 07:32, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Thursday, October 19, 2023 4:27 AM
>>
>> Underneath it uses the IOMMU domain kernel API which will read the dirty
>> bits, as well as atomically clearing the IOPTE dirty bit and flushing the
>> IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the
> 
> what does 'atomically' try to convey here?
> 
Meaning that the test/update of a PTE by the IOMMU and by the CPU must not
intersect, iow they have to happen in a mutually exclusive manner. e.g. the
IOMMU hw initiates an atomic transaction, checks if the PTE dirty bit is set
and updates it if it's not, while the CPU uses locked bit/cmpxchg manipulation
instructions to ensure the testing and clearing is done the same way. The iommu
hw never clears the bit, but the update needs to ensure that the IOMMU won't
lose info and miss setting dirty bits after the CPU finishes its instruction,
or vice-versa. But this sentence refers to the IOMMU driver implementation and
the IOMMU hardware it is 'frontending'.
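
To make the CPU side concrete, a minimal sketch of such a test-and-clear
(the helper name and dirty-bit position are illustrative, not the exact code
in this series; assumes 64-bit IOPTEs):

	/* Illustrative dirty bit position, not taken from any spec here */
	#define EXAMPLE_IOPTE_DIRTY_BIT	9

	static inline bool example_iopte_test_and_clear_dirty(u64 *ptep)
	{
		/*
		 * test_and_clear_bit() is a locked read-modify-write: either
		 * the IOMMU sets the bit before the clear (and we report it
		 * dirty), or it sets it after (and the next scan sees it),
		 * so no dirty information is lost.
		 */
		return test_and_clear_bit(EXAMPLE_IOPTE_DIRTY_BIT,
					  (unsigned long *)ptep);
	}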

>> +/**
>> + * struct iommu_hwpt_get_dirty_iova -
>> ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
> 
> IOMMU_HWPT_GET_DIRTY_BITMAP? IOVA usually means one address
> but here we talk about a bitmap of which one bit represents a page.
> 
My reading of 'IOVA' was actually in the plural form -- probably my bad English.

HWPT_GET_DIRTY_BITMAP is OK too (maybe better); I guess it's more explicit
about how the reported/returned data is structured.

>> + * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
>> + * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
>> + * @flags: Flags to control dirty tracking status.
>> + * @iova: base IOVA of the bitmap first bit
>> + * @length: IOVA range size
>> + * @page_size: page size granularity of each bit in the bitmap
>> + * @data: bitmap where to set the dirty bits. The bitmap bits each
>> + * represent a page_size which you deviate from an arbitrary iova.
>> + * Checking a given IOVA is dirty:
>> + *
>> + *  data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> 
> (1ULL << ((iova / page_size) % 64)
> 
Ah!
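
For the record, a small sketch of the corrected check from userspace (purely
illustrative; the helper and variable names are made up):

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Is the page containing 'iova' marked dirty in bitmap 'data'?
	 * 'iova' here is relative to the base IOVA the bitmap starts at.
	 */
	static inline bool example_iova_test_dirty(const uint64_t *data,
						   uint64_t iova,
						   uint64_t page_size)
	{
		uint64_t bit = iova / page_size;

		return data[bit / 64] & (1ULL << (bit % 64));
	}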

> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA
  2023-10-20 11:53     ` Joao Martins
@ 2023-10-20 13:40       ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-20 13:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, iommu, Shameerali Kolothum Thodi, Lu Baolu, Liu,
	Yi L, Sun, Yi Y, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Duan,
	Zhenzhong, Alex Williamson, kvm

On Fri, Oct 20, 2023 at 12:53:36PM +0100, Joao Martins wrote:

> >> + * struct iommu_hwpt_get_dirty_iova -
> >> ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
> > 
> > IOMMU_HWPT_GET_DIRTY_BITMAP? IOVA usually means one address
> > but here we talk about a bitmap of which one bit represents a page.
> > 
> My reading of 'IOVA' was actually in the plural form -- probably my bad English.
> 
> HWPT_GET_DIRTY_BITMAP is OK too (maybe better); I guess it's more explicit
> about how the reported/returned data is structured.

I tend to agree

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-19 23:59         ` Jason Gunthorpe
@ 2023-10-20 14:43           ` Joao Martins
  2023-10-20 21:22             ` Joao Martins
  2023-10-21 16:14             ` Jason Gunthorpe
  0 siblings, 2 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 14:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 20/10/2023 00:59, Jason Gunthorpe wrote:
> On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
>> AMD has no such behaviour, though that driver per your earlier suggestion might
>> need to wait until -rc1 for some of the refactorings get merged. Hopefully we
>> don't need to wait for the last 3 series of AMD Driver refactoring (?) to be
>> done as that looks to be more SVA related; Unless there's something more
>> specific you are looking for prior to introducing AMD's domain_alloc_user().
> 
> I don't think we need to wait, it just needs to go on the cleaning list.
>

I am not sure I followed. This suggests post-merge cleanups, which goes in a
different direction from your original comment? But maybe I am just not parsing
it right (sorry, just confused)

>>> for themselves; so more and more I need to work on something like
>>> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
>>> performance work obvious and 'standardized'
> 
> We have a mlx5 vfio driver in rdma-core and I have been thinking it
> would be a nice basis for building an iommufd tester/benchmarker as it
> has a wide set of "easilly" triggered functionality.

Oh woah, that's quite awesome; I'll take a closer look; I thought rdma-core
support for mlx5-vfio was to do direct usage of the firmware interface, but it
appears to be for regular RDMA apps as well. I do use some RDMA to exercise
iommu dirty tracking; but it's more like a rudimentary test inside the guest,
not something self-contained.

I was thinking of something more basic (for starters) and device-agnostic to
exercise the iommu side -- I gave the gup_test example as that's exactly what I
have in mind. Though having in-device knowledge is the ultimate tool to exercise
this free of migration problems/overheads/algorithms. Initially I was thinking of
a DPDK app where you program the device itself, but that is a bit too complex. But
if rdma-core can still use mlx5-VFIO while retaining the same semantics as a
'normal' RDMA app (e.g. one which registers some MRs and lets you do rdma
read/write) then this is definitely a good foundation.

	Joao

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY
  2023-10-20  6:09   ` Tian, Kevin
@ 2023-10-20 15:30     ` Joao Martins
  0 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 15:30 UTC (permalink / raw)
  To: Tian, Kevin, iommu
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Lu Baolu, Liu, Yi L,
	Sun, Yi Y, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Duan, Zhenzhong, Alex Williamson, kvm

On 20/10/2023 07:09, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Thursday, October 19, 2023 4:27 AM
>> +
>> +/**
>> + * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)
> 
> IOMMU_HWPT_SET_DIRTY_TRACKING

OK; I am riding on the assumption that the naming change ask
is to broadly replace structs/iommufd-cmds/ioctls/commit-msgs from
set_dirty to set_dirty_tracking

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols
  2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
  2023-10-18 22:14   ` Jason Gunthorpe
  2023-10-20  5:45   ` Tian, Kevin
@ 2023-10-20 16:44   ` Alex Williamson
  2 siblings, 0 replies; 84+ messages in thread
From: Alex Williamson @ 2023-10-20 16:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi,
	Lu Baolu, Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	kvm

On Wed, 18 Oct 2023 21:26:58 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> In preparation to move iova_bitmap into iommufd, export the rest of the API
> symbols that could be used by modules, namely:
> 
> 	iova_bitmap_alloc
> 	iova_bitmap_free
> 	iova_bitmap_for_each
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/vfio/iova_bitmap.c | 3 +++
>  1 file changed, 3 insertions(+)


Reviewed-by: Alex Williamson <alex.williamson@redhat.com>


> 
> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
> index 0848f920efb7..f54b56388e00 100644
> --- a/drivers/vfio/iova_bitmap.c
> +++ b/drivers/vfio/iova_bitmap.c
> @@ -268,6 +268,7 @@ struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
>  	iova_bitmap_free(bitmap);
>  	return ERR_PTR(rc);
>  }
> +EXPORT_SYMBOL_GPL(iova_bitmap_alloc);
>  
>  /**
>   * iova_bitmap_free() - Frees an IOVA bitmap object
> @@ -289,6 +290,7 @@ void iova_bitmap_free(struct iova_bitmap *bitmap)
>  
>  	kfree(bitmap);
>  }
> +EXPORT_SYMBOL_GPL(iova_bitmap_free);
>  
>  /*
>   * Returns the remaining bitmap indexes from mapped_total_index to process for
> @@ -387,6 +389,7 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
>  
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(iova_bitmap_for_each);
>  
>  /**
>   * iova_bitmap_set() - Records an IOVA range in bitmap


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd
  2023-10-18 20:26 ` [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd Joao Martins
  2023-10-18 22:14   ` Jason Gunthorpe
  2023-10-20  5:46   ` Tian, Kevin
@ 2023-10-20 16:44   ` Alex Williamson
  2 siblings, 0 replies; 84+ messages in thread
From: Alex Williamson @ 2023-10-20 16:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi,
	Lu Baolu, Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	kvm, Brett Creeley, Yishai Hadas

On Wed, 18 Oct 2023 21:26:59 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking
> the user bitmaps, so move the common dependency into IOMMUFD.  In doing
> so, create the symbol IOMMUFD_DRIVER which designates the builtin code that
> will be used by drivers when selected. Today this means MLX5_VFIO_PCI and
> PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when
> supporting dirty tracking and select IOMMUFD_DRIVER accordingly.
> 
> Given that the symbol may be disabled, add header definitions in
> iova_bitmap.h for when IOMMUFD_DRIVER=n
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/Kconfig                 |  4 +++
>  drivers/iommu/iommufd/Makefile                |  1 +
>  drivers/{vfio => iommu/iommufd}/iova_bitmap.c |  0
>  drivers/vfio/Makefile                         |  3 +--
>  drivers/vfio/pci/mlx5/Kconfig                 |  1 +
>  drivers/vfio/pci/pds/Kconfig                  |  1 +
>  include/linux/iova_bitmap.h                   | 26 +++++++++++++++++++
>  7 files changed, 34 insertions(+), 2 deletions(-)
>  rename drivers/{vfio => iommu/iommufd}/iova_bitmap.c (100%)


Reviewed-by: Alex Williamson <alex.williamson@redhat.com>


> 
> diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> index 99d4b075df49..1fa543204e89 100644
> --- a/drivers/iommu/iommufd/Kconfig
> +++ b/drivers/iommu/iommufd/Kconfig
> @@ -11,6 +11,10 @@ config IOMMUFD
>  
>  	  If you don't know what to do here, say N.
>  
> +config IOMMUFD_DRIVER
> +	bool
> +	default n
> +
>  if IOMMUFD
>  config IOMMUFD_VFIO_CONTAINER
>  	bool "IOMMUFD provides the VFIO container /dev/vfio/vfio"
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> index 8aeba81800c5..34b446146961 100644
> --- a/drivers/iommu/iommufd/Makefile
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -11,3 +11,4 @@ iommufd-y := \
>  iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o
>  
>  obj-$(CONFIG_IOMMUFD) += iommufd.o
> +obj-$(CONFIG_IOMMUFD_DRIVER) += iova_bitmap.o
> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
> similarity index 100%
> rename from drivers/vfio/iova_bitmap.c
> rename to drivers/iommu/iommufd/iova_bitmap.c
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index c82ea032d352..68c05705200f 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,8 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_VFIO) += vfio.o
>  
> -vfio-y += vfio_main.o \
> -	  iova_bitmap.o
> +vfio-y += vfio_main.o
>  vfio-$(CONFIG_VFIO_DEVICE_CDEV) += device_cdev.o
>  vfio-$(CONFIG_VFIO_GROUP) += group.o
>  vfio-$(CONFIG_IOMMUFD) += iommufd.o
> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
> index 7088edc4fb28..c3ced56b7787 100644
> --- a/drivers/vfio/pci/mlx5/Kconfig
> +++ b/drivers/vfio/pci/mlx5/Kconfig
> @@ -3,6 +3,7 @@ config MLX5_VFIO_PCI
>  	tristate "VFIO support for MLX5 PCI devices"
>  	depends on MLX5_CORE
>  	select VFIO_PCI_CORE
> +	select IOMMUFD_DRIVER
>  	help
>  	  This provides migration support for MLX5 devices using the VFIO
>  	  framework.
> diff --git a/drivers/vfio/pci/pds/Kconfig b/drivers/vfio/pci/pds/Kconfig
> index 407b3fd32733..fff368a8183b 100644
> --- a/drivers/vfio/pci/pds/Kconfig
> +++ b/drivers/vfio/pci/pds/Kconfig
> @@ -5,6 +5,7 @@ config PDS_VFIO_PCI
>  	tristate "VFIO support for PDS PCI devices"
>  	depends on PDS_CORE
>  	select VFIO_PCI_CORE
> +	select IOMMUFD_DRIVER
>  	help
>  	  This provides generic PCI support for PDS devices using the VFIO
>  	  framework.
> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
> index c006cf0a25f3..1c338f5e5b7a 100644
> --- a/include/linux/iova_bitmap.h
> +++ b/include/linux/iova_bitmap.h
> @@ -7,6 +7,7 @@
>  #define _IOVA_BITMAP_H_
>  
>  #include <linux/types.h>
> +#include <linux/errno.h>
>  
>  struct iova_bitmap;
>  
> @@ -14,6 +15,7 @@ typedef int (*iova_bitmap_fn_t)(struct iova_bitmap *bitmap,
>  				unsigned long iova, size_t length,
>  				void *opaque);
>  
> +#if IS_ENABLED(CONFIG_IOMMUFD_DRIVER)
>  struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
>  				      unsigned long page_size,
>  				      u64 __user *data);
> @@ -22,5 +24,29 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
>  			 iova_bitmap_fn_t fn);
>  void iova_bitmap_set(struct iova_bitmap *bitmap,
>  		     unsigned long iova, size_t length);
> +#else
> +static inline struct iova_bitmap *iova_bitmap_alloc(unsigned long iova,
> +						    size_t length,
> +						    unsigned long page_size,
> +						    u64 __user *data)
> +{
> +	return NULL;
> +}
> +
> +static inline void iova_bitmap_free(struct iova_bitmap *bitmap)
> +{
> +}
> +
> +static inline int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
> +				       iova_bitmap_fn_t fn)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void iova_bitmap_set(struct iova_bitmap *bitmap,
> +				   unsigned long iova, size_t length)
> +{
> +}
> +#endif
>  
>  #endif


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
  2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
                     ` (2 preceding siblings ...)
  2023-10-20  5:47   ` Tian, Kevin
@ 2023-10-20 16:44   ` Alex Williamson
  3 siblings, 0 replies; 84+ messages in thread
From: Alex Williamson @ 2023-10-20 16:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi,
	Lu Baolu, Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy, Zhenzhong Duan,
	kvm, Brett Creeley, Yishai Hadas

On Wed, 18 Oct 2023 21:27:00 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
> export convention i.e. using the IOMMUFD namespace. In doing so,
> import the namespace in the current users. This means VFIO and the
> vfio-pci drivers that use iova_bitmap_set().
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/iova_bitmap.c | 8 ++++----
>  drivers/vfio/pci/mlx5/main.c        | 1 +
>  drivers/vfio/pci/pds/pci_drv.c      | 1 +
>  drivers/vfio/vfio_main.c            | 1 +
>  4 files changed, 7 insertions(+), 4 deletions(-)
> 

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>


> diff --git a/drivers/iommu/iommufd/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
> index f54b56388e00..0a92c9eeaf7f 100644
> --- a/drivers/iommu/iommufd/iova_bitmap.c
> +++ b/drivers/iommu/iommufd/iova_bitmap.c
> @@ -268,7 +268,7 @@ struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
>  	iova_bitmap_free(bitmap);
>  	return ERR_PTR(rc);
>  }
> -EXPORT_SYMBOL_GPL(iova_bitmap_alloc);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_alloc, IOMMUFD);
>  
>  /**
>   * iova_bitmap_free() - Frees an IOVA bitmap object
> @@ -290,7 +290,7 @@ void iova_bitmap_free(struct iova_bitmap *bitmap)
>  
>  	kfree(bitmap);
>  }
> -EXPORT_SYMBOL_GPL(iova_bitmap_free);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_free, IOMMUFD);
>  
>  /*
>   * Returns the remaining bitmap indexes from mapped_total_index to process for
> @@ -389,7 +389,7 @@ int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
>  
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(iova_bitmap_for_each);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_for_each, IOMMUFD);
>  
>  /**
>   * iova_bitmap_set() - Records an IOVA range in bitmap
> @@ -423,4 +423,4 @@ void iova_bitmap_set(struct iova_bitmap *bitmap,
>  		cur_bit += nbits;
>  	} while (cur_bit <= last_bit);
>  }
> -EXPORT_SYMBOL_GPL(iova_bitmap_set);
> +EXPORT_SYMBOL_NS_GPL(iova_bitmap_set, IOMMUFD);
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> index 42ec574a8622..5cf2b491d15a 100644
> --- a/drivers/vfio/pci/mlx5/main.c
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -1376,6 +1376,7 @@ static struct pci_driver mlx5vf_pci_driver = {
>  
>  module_pci_driver(mlx5vf_pci_driver);
>  
> +MODULE_IMPORT_NS(IOMMUFD);
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>");
>  MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> diff --git a/drivers/vfio/pci/pds/pci_drv.c b/drivers/vfio/pci/pds/pci_drv.c
> index ab4b5958e413..dd8c00c895a2 100644
> --- a/drivers/vfio/pci/pds/pci_drv.c
> +++ b/drivers/vfio/pci/pds/pci_drv.c
> @@ -204,6 +204,7 @@ static struct pci_driver pds_vfio_pci_driver = {
>  
>  module_pci_driver(pds_vfio_pci_driver);
>  
> +MODULE_IMPORT_NS(IOMMUFD);
>  MODULE_DESCRIPTION(PDS_VFIO_DRV_DESCRIPTION);
>  MODULE_AUTHOR("Brett Creeley <brett.creeley@amd.com>");
>  MODULE_LICENSE("GPL");
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 40732e8ed4c6..a96d97da367d 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1693,6 +1693,7 @@ static void __exit vfio_cleanup(void)
>  module_init(vfio_init);
>  module_exit(vfio_cleanup);
>  
> +MODULE_IMPORT_NS(IOMMUFD);
>  MODULE_VERSION(DRIVER_VERSION);
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR(DRIVER_AUTHOR);


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-18 20:27 ` [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
  2023-10-18 23:11   ` Jason Gunthorpe
@ 2023-10-20 18:57   ` Joao Martins
  1 sibling, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 18:57 UTC (permalink / raw)
  To: Suravee Suthikulpanit, iommu
  Cc: Jason Gunthorpe, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu,
	Yi Liu, Yi Y Sun, Nicolin Chen, Joerg Roedel, Will Deacon,
	Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

Suravee,

On 18/10/2023 21:27, Joao Martins wrote:
> @@ -2379,6 +2407,69 @@ static bool amd_iommu_capable(struct device *dev, enum iommu_cap cap)
>  	return false;
>  }
>  
> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					bool enable)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct dev_table_entry *dev_table;
> +	struct iommu_dev_data *dev_data;
> +	struct amd_iommu *iommu;
> +	unsigned long flags;
> +	u64 pte_root;
> +
> +	spin_lock_irqsave(&pdomain->lock, flags);
> +	if (!(pdomain->dirty_tracking ^ enable)) {
> +		spin_unlock_irqrestore(&pdomain->lock, flags);
> +		return 0;
> +	}
> +
> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> +		iommu = rlookup_amd_iommu(dev_data->dev);
> +		if (!iommu)
> +			continue;
> +
> +		dev_table = get_dev_table(iommu);
> +		pte_root = dev_table[dev_data->devid].data[0];
> +
> +		pte_root = (enable ?
> +			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
> +
> +		/* Flush device DTE */
> +		dev_table[dev_data->devid].data[0] = pte_root;
> +		device_flush_dte(dev_data);
> +	}
> +
> +	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
> +	amd_iommu_domain_flush_tlb_pde(pdomain);
> +	amd_iommu_domain_flush_complete(pdomain);
> +	pdomain->dirty_tracking = enable;
> +	spin_unlock_irqrestore(&pdomain->lock, flags);
> +
> +	return 0;
> +}
> +

I'm adding the snippet below considering some earlier discussion on the Intel
driver. It only skips the domain flush when the domain has no devices (or
rlookup didn't give an iommu). Technically this is code that I mistakenly deleted
from rfcv1->rfcv2 after your first review, so I'm still retaining your
Reviewed-by; let me know if that's wrong.

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 6b4768ff66e1..c5be76e019bf 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2413,6 +2413,7 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
        struct protection_domain *pdomain = to_pdomain(domain);
        struct dev_table_entry *dev_table;
        struct iommu_dev_data *dev_data;
+       bool domain_flush = false;
        struct amd_iommu *iommu;
        unsigned long flags;
        u64 pte_root;
@@ -2437,11 +2438,14 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
                /* Flush device DTE */
                dev_table[dev_data->devid].data[0] = pte_root;
                device_flush_dte(dev_data);
+               domain_flush = true;
        }

        /* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
-       amd_iommu_domain_flush_tlb_pde(pdomain);
-       amd_iommu_domain_flush_complete(pdomain);
+       if (domain_flush) {
+               amd_iommu_domain_flush_tlb_pde(pdomain);
+               amd_iommu_domain_flush_complete(pdomain);
+       }
        pdomain->dirty_tracking = enable;
        spin_unlock_irqrestore(&pdomain->lock, flags);

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY
  2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
                     ` (2 preceding siblings ...)
  2023-10-20  7:56   ` Tian, Kevin
@ 2023-10-20 20:41   ` Joao Martins
  3 siblings, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 20:41 UTC (permalink / raw)
  To: iommu, Kevin Tian, Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, Lu Baolu, Yi Liu, Yi Y Sun,
	Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 18/10/2023 21:27, Joao Martins wrote:
> +
> +/*
> + * enum iommufd_set_dirty_flags - Flags for steering dirty tracking
> + * @IOMMU_DIRTY_TRACKING_ENABLE: Enables dirty tracking
> + */
> +enum iommufd_hwpt_set_dirty_flags {
> +	IOMMU_DIRTY_TRACKING_ENABLE = 1,
> +};
> +
> +/**
> + * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)
> + * @size: sizeof(struct iommu_hwpt_set_dirty)
> + * @flags: Flags to control dirty tracking status.
> + * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
> + *
> + * Toggle dirty tracking on an HW pagetable.
> + */
> +struct iommu_hwpt_set_dirty {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 hwpt_id;
> +	__u32 __reserved;
> +};
> +#define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
>  #endif

Noticed a docs inconsistency compared to other ioctls that pass flags, so I'm
applying the snippet below to this patch. I'll do a similar thing in the
GET_DIRTY_BITMAP patch, except that there the @flags doc says "Must be zero",
and then it changes into a comment similar to the one below when I introduce
the NO_CLEAR flag.

 /**
  * struct iommu_hwpt_set_dirty_tracking - ioctl(IOMMU_HWPT_SET_DIRTY_TRACKING)
  * @size: sizeof(struct iommu_hwpt_set_dirty_tracking)
- * @flags: Flags to control dirty tracking status.
+ * @flags: Combination of enum iommufd_hwpt_set_dirty_tracking_flags
  * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
  *
  * Toggle dirty tracking on an HW pagetable.
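
For completeness, a rough userspace sketch of toggling tracking with this ioctl
(illustrative only; the struct/ioctl names follow the rename above, the flag name
follows the enum earlier in this patch, and the fd/hwpt_id plumbing is assumed):

	struct iommu_hwpt_set_dirty_tracking cmd = {
		.size = sizeof(cmd),
		.flags = IOMMU_DIRTY_TRACKING_ENABLE,
		.hwpt_id = hwpt_id,	/* from IOMMU_HWPT_ALLOC */
	};

	/* iommufd is an open /dev/iommu; error handling trimmed */
	if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, &cmd))
		return -errno;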

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-20 14:43           ` Joao Martins
@ 2023-10-20 21:22             ` Joao Martins
  2023-10-21 16:14             ` Jason Gunthorpe
  1 sibling, 0 replies; 84+ messages in thread
From: Joao Martins @ 2023-10-20 21:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 20/10/2023 15:43, Joao Martins wrote:
> On 20/10/2023 00:59, Jason Gunthorpe wrote:
>> On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
>>> AMD has no such behaviour, though that driver per your earlier suggestion might
>>> need to wait until -rc1 for some of the refactorings get merged. Hopefully we
>>> don't need to wait for the last 3 series of AMD Driver refactoring (?) to be
>>> done as that looks to be more SVA related; Unless there's something more
>>> specific you are looking for prior to introducing AMD's domain_alloc_user().
>>
>> I don't think we need to wait, it just needs to go on the cleaning list.
>>
> 
> I am not sure I followed. This suggests post-merge cleanups, which goes in a
> different direction from your original comment? But maybe I am just not parsing
> it right (sorry, just confused)
> 
Oh, I think I now really understand what you originally meant. The wait applies
to other stuff that needs work unrelated to this, not specifically to these
patches, e.g. stuff like domain_alloc_paging from your example, with respect to
domain_alloc_user being reused as well for different domain types.

I will follow-up with v5 shortly with both drivers.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-20 14:43           ` Joao Martins
  2023-10-20 21:22             ` Joao Martins
@ 2023-10-21 16:14             ` Jason Gunthorpe
  2023-10-22  7:07               ` Yishai Hadas
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2023-10-21 16:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On Fri, Oct 20, 2023 at 03:43:57PM +0100, Joao Martins wrote:
> On 20/10/2023 00:59, Jason Gunthorpe wrote:
> > On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
> >> AMD has no such behaviour, though that driver per your earlier suggestion might
> >> need to wait until -rc1 for some of the refactorings get merged. Hopefully we
> >> don't need to wait for the last 3 series of AMD Driver refactoring (?) to be
> >> done as that looks to be more SVA related; Unless there's something more
> >> specific you are looking for prior to introducing AMD's domain_alloc_user().
> > 
> > I don't think we need to wait, it just needs to go on the cleaning list.
> >
> 
> I am not sure I followed. This suggests post-merge cleanups, which goes in a
> different direction from your original comment? But maybe I am just not parsing
> it right (sorry, just confused)

Yes post merge for the weirdo alloc flow

> >>> for themselves; so more and more I need to work on something like
> >>> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
> >>> performance work obvious and 'standardized'
> > 
> > We have a mlx5 vfio driver in rdma-core and I have been thinking it
> > would be a nice basis for building an iommufd tester/benchmarker as it
> > has a wide set of "easilly" triggered functionality.
>
> Oh woah, that's quite awesome; I'll take a closer look; I thought rdma-core
> support for mlx5-vfio was to do direct usage of the firmware interface, but it
> appears to be for regular RDMA apps as well. I do use some RDMA to exercise
> iommu dirty tracking; but it's more like a rudimentary test inside the guest,
> not something self-contained.

I can't remember anymore how much is supported, but supporting more is
not hard work. With a simple QP/CQ you can do all sorts of interesting
DMA.

Yishai would remember if QP/CQ got fully wired up

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
  2023-10-21 16:14             ` Jason Gunthorpe
@ 2023-10-22  7:07               ` Yishai Hadas
  0 siblings, 0 replies; 84+ messages in thread
From: Yishai Hadas @ 2023-10-22  7:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Joao Martins
  Cc: iommu, Kevin Tian, Shameerali Kolothum Thodi, Lu Baolu, Yi Liu,
	Yi Y Sun, Nicolin Chen, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Zhenzhong Duan, Alex Williamson, kvm

On 21/10/2023 19:14, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 03:43:57PM +0100, Joao Martins wrote:
>> On 20/10/2023 00:59, Jason Gunthorpe wrote:
>>> On Thu, Oct 19, 2023 at 12:58:29PM +0100, Joao Martins wrote:
>>>> AMD has no such behaviour, though that driver per your earlier suggestion might
>>>> need to wait until -rc1 for some of the refactorings get merged. Hopefully we
>>>> don't need to wait for the last 3 series of AMD Driver refactoring (?) to be
>>>> done as that looks to be more SVA related; Unless there's something more
>>>> specific you are looking for prior to introducing AMD's domain_alloc_user().
>>> I don't think we need to wait, it just needs to go on the cleaning list.
>>>
>> I am not sure I followed. This suggests post-merge cleanups, which goes in a
>> different direction from your original comment? But maybe I am just not parsing
>> it right (sorry, just confused)
> Yes post merge for the weirdo alloc flow
>
>>>>> for themselves; so more and more I need to work on something like
>>>>> iommufd_log_perf tool under tools/testing that is similar to the gup_perf to make all
>>>>> performance work obvious and 'standardized'
>>> We have a mlx5 vfio driver in rdma-core and I have been thinking it
>>> would be a nice basis for building an iommufd tester/benchmarker as it
>>> has a wide set of "easilly" triggered functionality.
>> Oh woah, that's quite awesome; I'll take a closer look; I thought rdma-core
>> support for mlx5-vfio was to do direct usage of the firmware interface, but it
>> appears to be for regular RDMA apps as well. I do use some RDMA to exercise
>> iommu dirty tracking; but it's more like a rudimentary test inside the guest,
>> not something self-contained.
> I can't remember anymore how much is supported, but supporting more is
> not hard work. With a simple QP/CQ you can do all sorts of interesting
> DMA.
>
> Yishai would remember if QP/CQ got fully wired up

For now, QP/CQ are supported only over the DEVX API (i.e. 
mlx5dv_devx_obj_create()) of the mlx5-vfio driver in rdma-core.

In that case, data-path for RDMA applications should be done by the 
application itself based on the mlx5 specification.

Yishai

>
> Jason



^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2023-10-22  7:10 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-18 20:26 [PATCH v4 00/18] IOMMUFD Dirty Tracking Joao Martins
2023-10-18 20:26 ` [PATCH v4 01/18] vfio/iova_bitmap: Export more API symbols Joao Martins
2023-10-18 22:14   ` Jason Gunthorpe
2023-10-20  5:45   ` Tian, Kevin
2023-10-20 16:44   ` Alex Williamson
2023-10-18 20:26 ` [PATCH v4 02/18] vfio: Move iova_bitmap into iommufd Joao Martins
2023-10-18 22:14   ` Jason Gunthorpe
2023-10-19 17:48     ` Brett Creeley
2023-10-20  5:46   ` Tian, Kevin
2023-10-20 16:44   ` Alex Williamson
2023-10-18 20:27 ` [PATCH v4 03/18] iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Joao Martins
2023-10-18 22:16   ` Jason Gunthorpe
2023-10-19 17:48   ` Brett Creeley
2023-10-20  5:47   ` Tian, Kevin
2023-10-20 16:44   ` Alex Williamson
2023-10-18 20:27 ` [PATCH v4 04/18] iommu: Add iommu_domain ops for dirty tracking Joao Martins
2023-10-18 22:26   ` Jason Gunthorpe
2023-10-19  1:45   ` Baolu Lu
2023-10-20  5:54   ` Tian, Kevin
2023-10-20 11:24     ` Joao Martins
2023-10-18 20:27 ` [PATCH v4 05/18] iommufd: Add a flag to enforce dirty tracking on attach Joao Martins
2023-10-18 22:26   ` Jason Gunthorpe
2023-10-18 22:38   ` Jason Gunthorpe
2023-10-18 23:38     ` Joao Martins
2023-10-20  5:55       ` Tian, Kevin
2023-10-18 20:27 ` [PATCH v4 06/18] iommufd: Add IOMMU_HWPT_SET_DIRTY Joao Martins
2023-10-18 22:28   ` Jason Gunthorpe
2023-10-20  6:09   ` Tian, Kevin
2023-10-20 15:30     ` Joao Martins
2023-10-20  7:56   ` Tian, Kevin
2023-10-20 20:41   ` Joao Martins
2023-10-18 20:27 ` [PATCH v4 07/18] iommufd: Add IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
2023-10-18 22:39   ` Jason Gunthorpe
2023-10-18 23:43     ` Joao Martins
2023-10-19 12:01       ` Jason Gunthorpe
2023-10-19 12:04         ` Joao Martins
2023-10-19 10:01   ` Joao Martins
2023-10-20  6:32   ` Tian, Kevin
2023-10-20 11:53     ` Joao Martins
2023-10-20 13:40       ` Jason Gunthorpe
2023-10-18 20:27 ` [PATCH v4 08/18] iommufd: Add capabilities to IOMMU_GET_HW_INFO Joao Martins
2023-10-18 22:44   ` Jason Gunthorpe
2023-10-19  9:55     ` Joao Martins
2023-10-19 23:56       ` Jason Gunthorpe
2023-10-20  6:46   ` Tian, Kevin
2023-10-20 11:52     ` Joao Martins
2023-10-18 20:27 ` [PATCH v4 09/18] iommufd: Add a flag to skip clearing of IOPTE dirty Joao Martins
2023-10-18 22:54   ` Jason Gunthorpe
2023-10-18 23:50     ` Joao Martins
2023-10-20  6:52   ` Tian, Kevin
2023-10-18 20:27 ` [PATCH v4 10/18] iommu/amd: Add domain_alloc_user based domain allocation Joao Martins
2023-10-18 22:58   ` Jason Gunthorpe
2023-10-18 23:54     ` Joao Martins
2023-10-18 20:27 ` [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
2023-10-18 23:11   ` Jason Gunthorpe
2023-10-19  0:17     ` Joao Martins
2023-10-19 11:58       ` Joao Martins
2023-10-19 23:59         ` Jason Gunthorpe
2023-10-20 14:43           ` Joao Martins
2023-10-20 21:22             ` Joao Martins
2023-10-21 16:14             ` Jason Gunthorpe
2023-10-22  7:07               ` Yishai Hadas
2023-10-20  2:21         ` Baolu Lu
2023-10-20  7:01           ` Tian, Kevin
2023-10-20  9:34           ` Joao Martins
2023-10-20 11:20             ` Joao Martins
2023-10-20 18:57   ` Joao Martins
2023-10-18 20:27 ` [PATCH v4 12/18] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
2023-10-19  3:04   ` Baolu Lu
2023-10-19  9:14     ` Joao Martins
2023-10-19 10:33       ` Joao Martins
2023-10-19 23:56       ` Jason Gunthorpe
2023-10-20 10:12         ` Joao Martins
2023-10-20  7:53   ` Tian, Kevin
2023-10-20  9:15     ` Baolu Lu
2023-10-18 20:27 ` [PATCH v4 13/18] iommufd/selftest: Expand mock_domain with dev_flags Joao Martins
2023-10-20  7:57   ` Tian, Kevin
2023-10-18 20:27 ` [PATCH v4 14/18] iommufd/selftest: Test IOMMU_HWPT_ALLOC_ENFORCE_DIRTY Joao Martins
2023-10-20  7:59   ` Tian, Kevin
2023-10-18 20:27 ` [PATCH v4 15/18] iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY Joao Martins
2023-10-20  8:00   ` Tian, Kevin
2023-10-18 20:27 ` [PATCH v4 16/18] iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_IOVA Joao Martins
2023-10-18 20:27 ` [PATCH v4 17/18] iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO Joao Martins
2023-10-18 20:27 ` [PATCH v4 18/18] iommufd/selftest: Test IOMMU_GET_DIRTY_IOVA_NO_CLEAR flag Joao Martins
