* [PATCH RFC 00/19] IOMMUFD Dirty Tracking
@ 2022-04-28 21:09 ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Presented herewith is a series that extends IOMMUFD with IOMMU
hardware support for the dirty bit in IOPTEs.

Today, AMD Milan (which has been out for a year now) supports it, while ARM
SMMUv3.2+ and VT-D rev3.x are expected to eventually come along.
The intended use-case is to support Live Migration with SR-IOV, with IOMMUs
that support it. Yishai Hadas will soon be submitting an RFC that covers the
PCI device dirty tracker via vfio.

At a quick glance, IOMMUFD lets the userspace VMM create an IOAS with a
set of IOVA ranges mapped to some physical memory, composing an IO
pagetable. This is then attached to a particular device, consequently
creating the protection domain to share a common IO page table
representing the endpoint's DMA-addressable guest address space.
(Hopefully I am not twisting the terminology here.) The resultant object
is a hw_pagetable object which represents the iommu_domain
object that will be directly manipulated. For more background on
IOMMUFD have a look at these two series[0][1] on the kernel and qemu
consumption respectively. The IOMMUFD UAPI, kAPI and the iommu core
kAPI are then extended to provide:

 1) Enabling or disabling dirty tracking on the iommu_domain. Modelled
as the most common case of changing hardware protection domain control
bits, plus the ARM-specific case of having to enable the per-PTE DBM control
bit. The 'real' tracking of whether dirty tracking is enabled or not is
stored in the vendor IOMMU, hence no new fields are added to iommufd
pagetable structures.

 2) Reading the IO PTEs and marshalling their dirtiness into a bitmap. The
bitmap thus describes the IOVAs that got written by the device (see the
sketch after item 3 below). While performing the marshalling, vendors also
need to clear the dirty bits from the IOPTEs and allow the kAPI caller to
batch the much needed IOTLB flush.
There's no copy of bitmaps to userspace-backed memory; everything is
zero-copy based. So far this is a test-and-clear kind of interface, given
that the IOPT walk is going to be expensive. It occurred to me to separate
the readout of dirty from the clearing of dirty in the IOPTEs.
I haven't opted for that, given that it would mean two lengthy IOPTE
walks and felt counter-performant.

 3) Unmapping an IOVA range while returning its dirty bits prior to
unmap. This case is specific to the non-nested vIOMMU case, where an
erroneous guest (or device) might be DMAing to an address while it is
being unmapped at the same time.
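
To make (2) concrete, here is a minimal sketch of the IOVA-to-bitmap mapping
assumed throughout the series: one bit per IO page of tracked IOVA space. The
helper name below is illustrative only and not part of the proposed kAPI:

    #include <linux/bitmap.h>
    #include <linux/minmax.h>

    /* Mark [iova, iova + length) as dirty in a bitmap starting at base_iova. */
    static void mark_iova_dirty(unsigned long *bitmap, unsigned long base_iova,
                                unsigned int pgshift, unsigned long iova,
                                unsigned long length)
    {
            unsigned long nbits = max(1UL, length >> pgshift);
            unsigned long offset = (iova - base_iova) >> pgshift;

            bitmap_set(bitmap, offset, nbits);
    }

iommufd pins the user-provided bitmap pages and hands them to the vendor
driver, which records dirty IOVAs with exactly this kind of arithmetic
(patch 1 adds iommu_dirty_bitmap_record() for that purpose).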

[See also the general remarks at the end, specifically the one regarding
 probing dirty tracking via a dedicated iommufd cap ioctl.]

The series is organized as follows:

* Patches 1-3: Take care of the iommu domain operations to be added and
extend the iommufd io-pagetable to set/clear dirty tracking, as well as
to read the dirty bits from the vendor pagetables. The idea is to abstract
iommu vendors from any notion of how bitmaps are stored or propagated back to
the caller, as well as allowing control/batching over the IOTLB flush. So
there's a data structure and a helper that only tell the upper layer that
an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walk
the bitmap user memory, and kmap them as needed. The IOMMU vendor just deals
with a 'dirty bitmap state' and records an IOVA as dirty in the
vendor IOMMU implementation.

* Patches 4-5: Add the new unmap domain op that returns whether the IOVA
got dirtied. I separated this from the rest of the set, as I am still
questioning the need for this API and whether this race fundamentally
needs to be handled. I guess the thinking is that live-migration
should be guest foolproof, but it is unclear how often the race happens
in practice to deem this a necessary unmap variant. Perhaps it might be
enough to fetch the dirty bits prior to the unmap? Feedback appreciated.

* Patches 6-8: Add the UAPIs for IOMMUFD, vfio-compat and selftests.
We should discuss whether to include the vfio-compat or not, given how
vfio-type1-iommu perpetually dirties any IOVA, which here I am replacing
with the IOMMU hw support. I haven't implemented the perpetual dirtying,
given its lack of usefulness over an IOMMU-backed implementation (or so
I think). The selftests mainly test the principal workflow; more corner
cases still need to be added.

Note: Given that there's no capability reporting for the new APIs, page
sizes, etc., a userspace app using the native IOMMUFD API will get
-EOPNOTSUPP when dirty tracking is not supported by the IOMMU hardware.

For completeness, and most importantly to make sure the new IOMMU core ops
capture the hardware building blocks, implementations were written for all
the IOMMUs that will eventually get IOMMU A/D support. So the second half
of the series presents *proof of concept* implementations for these IOMMUs:

* Patches 9-11: AMD IOMMU implementation, particularly for those having
HDSup support. Tested with a Qemu amd-iommu with HDSup emulated,
and also on an AMD Milan server IOMMU.

* Patches 12-17: Adapt the past series from Keqian Zhu[2], reworked
to do the dynamic set/clear dirty tracking and to implicitly clear
dirty bits on the readout. Given the lack of hardware and the difficulty
of getting this into an emulated SMMUv3 (given the dependency on the PE HTTU
and BBML2, IIUC), this is only compile tested. Hopefully I am not
getting the attribution wrong.

* Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu-based
intel-iommu with SSADS/SLADS emulation support.

To help testing/prototyping, qemu iommu emulation bits were written
to increase coverage of this code and hopefully make this more broadly
available to fellow contributors/devs. A separate series is submitted right
after this one, covering the Qemu IOMMUFD extensions for dirty tracking
alongside the A/D-bit emulation for its x86 iommus. Meanwhile it's also on
github (https://github.com/jpemartins/qemu/commits/iommufd).

Remarks / Observations:

* There's no capabilities API in IOMMUFD, so in this RFC each vendor checks
for support inside each of the newly added ops. Initially I was thinking of
having a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather than
bailing out with -EOPNOTSUPP), as well as a get_dirty_tracking
iommu-core API. On the UAPI, perhaps it might be better to have a single API
for capabilities in general (similar to KVM), at its simplest a subop
where the necessary info is conveyed on a per-subop basis?
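
Purely to illustrate the idea being floated here (the struct and field names
below are hypothetical and do not exist in this series or in any current
UAPI), such a generic capability ioctl with subops could look roughly like:

    #include <linux/types.h>

    /* Hypothetical sketch, for discussion only -- not part of this series. */
    struct iommu_hwpt_get_capabilities {
            __u32 size;   /* in: sizeof(struct iommu_hwpt_get_capabilities) */
            __u32 subop;  /* in: which capability to query, e.g. dirty tracking */
            __u64 flags;  /* out: subop-specific capability flags */
            __u64 data;   /* out: subop-specific info, e.g. supported
                           *      dirty-tracking granularities */
    };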

* The UAPI/kAPI could be generalized in the next iteration to also cover the
Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
It wasn't done, as I was not aware of a use-case. I am wondering
if the access bits could be used to do some form of zero page detection
(to just send the pages that got touched), although dirty bits could be
used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE
walk and marshalling into bitmaps, as well as the necessary IOTLB flush
batching, are all the same. The focus is on the dirty bit given that the
dirtiness IOVA feedback is used to select the pages that need to be
transferred to the destination while migration is happening.
Sidebar: Sadly, there are a lot fewer clever tricks that can be
done (compared to the CPU/KVM) without having the PCI device cooperate
(things like userfaultfd, wrprotect, etc would turn into nefarious IOMMU
perm faults and device DMA target aborts).
If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
bits, we can instead have the ioctls be named after
HWPT_SET_TRACKING() and add another argument which selects which bits to
enable tracking for (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
Likewise for the read_and_clear(), as all PTE bits follow the same logic
as dirty. Happy to readjust if folks think it is worthwhile.

* IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
only care about the first stage of IOMMU pagetables for hypervisors i.e.
tracking dirty GPAs (and not caring about dirty GIOVAs).

* Dirty bit tracking alone is not enough. Large IO pages tend to be the norm
when DMA mapping large ranges of IOVA space, when really the VMM wants the
smallest granularity possible to track (i.e. host base pages). A separate bit
of work will need to take care of demoting IOPTE page sizes at guest-runtime
to increase/decrease the dirty tracking granularity, likely in the form of an
IOAS page-size demote/promote within a previously mapped IOVA range.

Feedback is very much appreciated!

[0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
[1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
[2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@huawei.com/

	Joao

TODOs:
* More selftests for large/small iopte sizes;
* Better vIOMMU+VFIO testing (AMD doesn't support it);
* Performance efficiency of GET_DIRTY_IOVA in various workloads;
* Testing with a live migratable VF;

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3: Add feature detection for HTTU

Joao Martins (16):
  iommu: Add iommu_domain ops for dirty tracking
  iommufd: Dirty tracking for io_pagetable
  iommufd: Dirty tracking data support
  iommu: Add an unmap API that returns dirtied IOPTEs
  iommufd: Add a dirty bitmap to iopt_unmap_iova()
  iommufd: Dirty tracking IOCTLs for the hw_pagetable
  iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  iommufd: Add a test for dirty tracking ioctls
  iommu/amd: Access/Dirty bit support in IOPTEs
  iommu/amd: Add unmap_read_dirty() support
  iommu/amd: Print access/dirty bits if supported
  iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  iommu/arm-smmu-v3: Add unmap_read_dirty() support
  iommu/intel: Access/Dirty bit support for SL domains
  iommu/intel: Add unmap_read_dirty() support

Kunkun Jiang (2):
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

 drivers/iommu/amd/amd_iommu.h               |   1 +
 drivers/iommu/amd/amd_iommu_types.h         |  11 +
 drivers/iommu/amd/init.c                    |  12 +-
 drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
 drivers/iommu/amd/iommu.c                   |  99 ++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
 drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
 drivers/iommu/intel/pasid.c                 |  76 ++++++
 drivers/iommu/intel/pasid.h                 |   7 +
 drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
 drivers/iommu/iommu.c                       |  71 +++++-
 drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
 drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
 drivers/iommu/iommufd/io_pagetable.h        |   3 +-
 drivers/iommu/iommufd/ioas.c                |  35 ++-
 drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
 drivers/iommu/iommufd/iommufd_test.h        |   9 +
 drivers/iommu/iommufd/main.c                |   9 +
 drivers/iommu/iommufd/pages.c               |  79 +++++-
 drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
 drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
 include/linux/intel-iommu.h                 |  30 +++
 include/linux/io-pgtable.h                  |  20 ++
 include/linux/iommu.h                       |  64 +++++
 include/uapi/linux/iommufd.h                |  78 ++++++
 tools/testing/selftests/iommu/Makefile      |   1 +
 tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
 28 files changed, 2047 insertions(+), 75 deletions(-)

-- 
2.17.2


* [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add to the iommu domain operations a set of callbacks to
perform dirty tracking, particularly to start and stop
tracking, and finally to test and clear the dirty data.

Drivers are expected to dynamically change their hw protection
domain bits to toggle the tracking and flush some form of
control state structure that stands in the IOVA translation
path.

For reading and clearing dirty data, on all IOMMUs a transition
of any of the PTE access bits (Access, Dirty) implies flushing
the IOTLB to invalidate any stale IOTLB data regarding whether
or not the IOMMU should update said PTEs. The iommu core APIs
introduce a new structure for storing the dirties, although vendor
IOMMUs implementing .read_and_clear_dirty() just use
iommu_dirty_bitmap_record() to set the memory storing the dirties.
The underlying tracking/iteration of user bitmap memory is instead
done by iommufd, which takes care of initializing the dirty bitmap
*prior* to passing it to the IOMMU domain op.

This holds for the currently/to-be-supported IOMMUs with dirty
tracking support, particularly because the tracking is part of
the first stage tables and part of address translation. Below
it is described how each hardware deals with the hardware protection
domain control bits, to justify the added iommu core APIs. The
vendor IOMMU implementations will also explain in more detail
the dirty bit usage/clearing in the IOPTEs.

* x86 AMD:

For AMD, the dirty tracking control bits live in the Device Table entry,
and updating it must be followed by flushing the Device IOTLB. On AMD[1],
section "2.2.1 Updating Shared Tables", e.g.

> Each table can also have its contents cached by the IOMMU or
> peripheral IOTLBs. Therefore, after
> updating a table entry that can be cached, system software must
> send the IOMMU an appropriate
> invalidate command. Information in the peripheral IOTLBs must
> also be invalidated.

There's no mention of the particular bits that are cached or
not, but fetching a dev entry is part of address translation
as also depicted, so invalidate the device table to make
sure the next translations fetch a DTE entry with the HD bits set.

* x86 Intel (rev3.0+):

Likewise[2], set the SSADE bit in the scalable-mode PASID table entry
to enable Access/Dirty bits in the second stage page table. See the manual,
particularly "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
Considerations":

> When modifying root-entries, scalable-mode root-entries,
> context-entries, or scalable-mode context entries:
> Software must serially invalidate the context-cache,
> PASID-cache (if applicable), and the IOTLB. The serialization is
> required since hardware may utilize information from the
> context-caches (e.g., Domain-ID) to tag new entries inserted to
> the PASID-cache and IOTLB for processing in-flight requests.
> Section 6.5 describes the invalidation operations.

And also Table 23 ("Guidance to Software for Invalidations") in
"6.5.3.3 Guidance to Software for Invalidations" explicitly mentions:

> SSADE transition from 0 to 1 in a scalable-mode PASID-table
> entry with PGTT value of Second-stage or Nested

* ARM SMMUv3.2:

SMMUv3.2 needs the dirty bit toggled in the CD (or S2CD) descriptor,
followed by flushing/invalidating the IOMMU dev IOTLB.

Reference[0]: SMMU spec, "5.4.1 CD notes",

> The following CD fields are permitted to be cached as part of a
> translation or TLB entry, and alteration requires
> invalidation of any TLB entry that might have cached these
> fields, in addition to CD structure cache invalidation:
>
> ...
> HA, HD
> ...

The ARM SMMUv3 case is a tad different from its x86
counterparts, though. Rather than changing *only* the IOMMU domain device
entry to enable dirty tracking (and having a dedicated bit for dirtiness in
the IOPTE), ARM instead uses a dirty-bit modifier which is separately
enabled and changes the *existing* meaning of the access bits (for ro/rw),
to the point that marking the access bit read-only with the
dirty-bit-modifier enabled doesn't trigger a perm IO page fault.

In practice this means that changing the iommu context entry isn't enough,
and is in fact mostly useless IIUC (and can always be enabled). Dirtying
is only really enabled when the DBM pte bit is set (with the
CD.HD bit as a prereq).

To capture this h/w construct, an iommu core API is added which enables
dirty tracking on an IOVA range rather than on a device/context entry.
iommufd picks one or the other; the IOMMUFD core will favour the
device-context op, falling back to the IOVA-range alternative.

[0] https://developer.arm.com/documentation/ihi0070/latest
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://cdrdv2.intel.com/v1/dl/getContent/671081

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
 include/linux/io-pgtable.h |  6 +++++
 include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0c42ece25854..d18b9ddbcce4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/export.h>
 #include <linux/slab.h>
+#include <linux/highmem.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
 #include <linux/idr.h>
@@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
 	return user;
 }
 EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
+
+unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
+				       unsigned long iova, unsigned long length)
+{
+	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
+
+	nbits = max(1UL, length >> dirty->pgshift);
+	offset = (iova - dirty->iova) >> dirty->pgshift;
+	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
+	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
+	start_offset = dirty->start_offset;
+
+	while (nbits > 0) {
+		kaddr = kmap(dirty->pages[idx]) + start_offset;
+		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
+		bitmap_set(kaddr, offset, size);
+		kunmap(dirty->pages[idx]);
+		start_offset = offset = 0;
+		nbits -= size;
+		idx++;
+	}
+
+	if (dirty->gather)
+		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
+
+	return nbits;
+}
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 86af6f0a00a2..82b39925c21f 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -165,6 +165,12 @@ struct io_pgtable_ops {
 			      struct iommu_iotlb_gather *gather);
 	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
 				    unsigned long iova);
+	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
+				  unsigned long iova, size_t size,
+				  bool enabled);
+	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
+				    unsigned long iova, size_t size,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6ef2df258673..ca076365d77b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
 	bool			queued;
 };
 
+/**
+ * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
+ *
+ * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
+ * @pgshift: Page granularity of the bitmap
+ * @gather: Range information for a pending IOTLB flush
+ * @start_offset: Offset of the first user page
+ * @pages: User pages representing the bitmap region
+ * @npages: Number of user pages pinned
+ */
+struct iommu_dirty_bitmap {
+	unsigned long iova;
+	unsigned long pgshift;
+	struct iommu_iotlb_gather *gather;
+	unsigned long start_offset;
+	unsigned long npages;
+	struct page **pages;
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -275,6 +294,13 @@ struct iommu_ops {
  * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
+ * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
+ * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
+ *                            an iommu domain
+ * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
+ *                        into a bitmap, with a bit represented as a page.
+ *                        Reads the dirty PTE bits and clears it from IO
+ *                        pagetables.
  */
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
@@ -305,6 +331,15 @@ struct iommu_domain_ops {
 				  unsigned long quirks);
 
 	void (*free)(struct iommu_domain *domain);
+
+	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
+	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
+					unsigned long iova, size_t size,
+					struct iommu_iotlb_gather *iotlb_gather,
+					bool enabled);
+	int (*read_and_clear_dirty)(struct iommu_domain *domain,
+				    unsigned long iova, size_t size,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
@@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
 extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
 			      unsigned long iova, int flags);
 
+unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
+				       unsigned long iova, unsigned long length);
+
+static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
+					   unsigned long base,
+					   unsigned long pgshift,
+					   struct iommu_iotlb_gather *gather)
+{
+	memset(dirty, 0, sizeof(*dirty));
+	dirty->iova = base;
+	dirty->pgshift = pgshift;
+	dirty->gather = gather;
+
+	if (gather)
+		iommu_iotlb_gather_init(dirty->gather);
+}
+
 static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	if (domain->ops->flush_iotlb_all)
-- 
2.17.2
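
For illustration only, here is a rough sketch of how a vendor driver might
implement the new .read_and_clear_dirty() op on top of the helpers above.
foo_iopte_test_and_clear_dirty() is a made-up stand-in for the vendor's own
IOPTE walk; only iommu_dirty_bitmap_record() and struct iommu_dirty_bitmap
come from this patch:

    /* Hypothetical vendor implementation of .read_and_clear_dirty(). */
    static int foo_read_and_clear_dirty(struct iommu_domain *domain,
                                        unsigned long iova, size_t size,
                                        struct iommu_dirty_bitmap *dirty)
    {
            unsigned long end = iova + size;
            unsigned long pgsize;

            while (iova < end) {
                    /*
                     * Stand-in for the vendor IOPTE walk: returns true (and
                     * clears the IOPTE dirty bit) if the IOPTE covering @iova
                     * was dirty, and always reports the covering IOPTE page
                     * size in @pgsize so the walk can advance.
                     */
                    if (foo_iopte_test_and_clear_dirty(domain, iova, &pgsize))
                            iommu_dirty_bitmap_record(dirty, iova, pgsize);

                    iova += pgsize;
            }

            /* The IOTLB flush is batched via dirty->gather by the caller. */
            return 0;
    }

The caller (iommufd) initializes the bitmap with iommu_dirty_bitmap_init()
and syncs the gather once the whole IOVA range has been walked.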


* [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add an io_pagetable kernel API to toggle dirty tracking:

* iopt_set_dirty_tracking(iopt, [domain], state)

It receives either NULL (which means all domains) or an
iommu_domain. The intended caller of this is the hw_pagetable
object that is created on device attach, which passes an
iommu_domain. For now, the all-domains case is left for vfio-compat.

The hw protection domain dirty control is favored over the IOVA-range
alternative. For the latter, it iterates over all IOVA areas and calls
the iommu domain op to enable/disable it for each range.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 71 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  3 ++
 2 files changed, 74 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f9f3b06946bf..f4609ef369e0 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -276,6 +276,77 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
 	return 0;
 }
 
+static int __set_dirty_tracking_range_locked(struct iommu_domain *domain,
+					     struct io_pagetable *iopt,
+					     bool enable)
+{
+	const struct iommu_domain_ops *ops = domain->ops;
+	struct iommu_iotlb_gather gather;
+	struct iopt_area *area;
+	int ret = -EOPNOTSUPP;
+	unsigned long iova;
+	size_t size;
+
+	iommu_iotlb_gather_init(&gather);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		iova = iopt_area_iova(area);
+		size = iopt_area_last_iova(area) - iova;
+
+		if (ops->set_dirty_tracking_range) {
+			ret = ops->set_dirty_tracking_range(domain, iova,
+							    size, &gather,
+							    enable);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	iommu_iotlb_sync(domain, &gather);
+
+	return ret;
+}
+
+static int iommu_set_dirty_tracking(struct iommu_domain *domain,
+				    struct io_pagetable *iopt, bool enable)
+{
+	const struct iommu_domain_ops *ops = domain->ops;
+	int ret = -EOPNOTSUPP;
+
+	if (ops->set_dirty_tracking)
+		ret = ops->set_dirty_tracking(domain, enable);
+	else if (ops->set_dirty_tracking_range)
+		ret = __set_dirty_tracking_range_locked(domain, iopt,
+							enable);
+
+	return ret;
+}
+
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+			    struct iommu_domain *domain, bool enable)
+{
+	struct iommu_domain *dom;
+	unsigned long index;
+	int ret = -EOPNOTSUPP;
+
+	down_write(&iopt->iova_rwsem);
+	if (!domain) {
+		down_write(&iopt->domains_rwsem);
+		xa_for_each(&iopt->domains, index, dom) {
+			ret = iommu_set_dirty_tracking(dom, iopt, enable);
+			if (ret < 0)
+				break;
+		}
+		up_write(&iopt->domains_rwsem);
+	} else {
+		ret = iommu_set_dirty_tracking(domain, iopt, enable);
+	}
+
+	up_write(&iopt->iova_rwsem);
+	return ret;
+}
+
 struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 				  unsigned long *start_byte,
 				  unsigned long length)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index f55654278ac4..d00ef3b785c5 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+			    struct iommu_domain *domain, bool enable);
+
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long npages, struct page **out_pages, bool write);
 void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
-- 
2.17.2
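
As a quick usage illustration (a sketch only; where exactly the iopt and
domain pointers come from is glossed over here), the hw_pagetable attach
path passes its own iommu_domain, while vfio-compat can pass NULL to toggle
every domain in the io_pagetable:

    /* Enable dirty tracking on the domain backing one hw_pagetable. */
    ret = iopt_set_dirty_tracking(iopt, domain, true);

    /* vfio-compat style: disable tracking across all attached domains. */
    ret = iopt_set_dirty_tracking(iopt, NULL, false);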


* [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add an IO pagetable API, iopt_read_and_clear_dirty_data(), that
performs the reading of dirty IOPTEs for a given IOVA range and
then copies it back to userspace from each area-internal bitmap.

Underneath it uses the equivalent IOMMU API, which reads the
dirty bits, as well as atomically clearing the IOPTE dirty bit
and flushing the IOTLB at the end. The dirty bitmaps pass an
iotlb_gather to allow batching the dirty-bit updates.

Most of the complexity, though, is in the handling of the user
bitmaps to avoid copies back and forth. The bitmap user addresses
need to be iterated through, pinned and then passed
into the iommu core. The amount of bitmap data passed at a time for a
read_and_clear_dirty() is 1 page worth of pinned base page
pointers. That equates to 16M bits, or rather 64G of IOVA space that
can be reported as 'dirtied' (see the arithmetic below). The IOTLB is
flushed at the end of the whole scanned IOVA range, to defer as much
as possible the potential DMA performance penalty.
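
For reference, the arithmetic behind the 64G figure, assuming 4K base pages
and 8-byte page pointers:

    /*
     * 1 pinned page of page pointers = PAGE_SIZE / sizeof(struct page *)
     *                                = 4096 / 8            = 512 bitmap pages
     * 512 bitmap pages               = 512 * 4096 * 8 bits = 16M bits
     * 16M bits, one per 4K IO page   = 16M * 4K            = 64G of IOVA space
     */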

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 169 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  44 ++++++
 2 files changed, 213 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f4609ef369e0..835b5040fce9 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -14,6 +14,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
+#include <uapi/linux/iommufd.h>
 
 #include "io_pagetable.h"
 
@@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 	return ret;
 }
 
+int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
+			    struct iommufd_dirty_data *bitmap)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+	unsigned long bitmap_len;
+
+	bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
+
+	import_single_range(WRITE, bitmap->data, bitmap_len,
+			    &iter->bitmap_iov, &iter->bitmap_iter);
+	iter->iova = bitmap->iova;
+
+	/* Can record up to 64G at a time */
+	dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
+
+	return !dirty->pages ? -ENOMEM : 0;
+}
+
+void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+	if (dirty->pages) {
+		free_page((unsigned long) dirty->pages);
+		dirty->pages = NULL;
+	}
+}
+
+bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
+{
+	return iov_iter_count(&iter->bitmap_iter) > 0;
+}
+
+static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
+{
+	unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
+
+	left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
+
+	return left;
+}
+
+unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
+{
+	unsigned long left = iommufd_dirty_iter_bytes(iter);
+
+	return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
+}
+
+unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
+{
+	unsigned long skip = iter->bitmap_iter.iov_offset;
+
+	return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
+}
+
+void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
+{
+	iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
+}
+
+void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+	if (dirty->npages)
+		unpin_user_pages(dirty->pages, dirty->npages);
+}
+
+int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+	unsigned long npages;
+	unsigned long ret;
+	void *addr;
+
+	addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
+	npages = iov_iter_npages(&iter->bitmap_iter,
+				 PAGE_SIZE / sizeof(struct page *));
+
+	ret = pin_user_pages_fast((unsigned long) addr, npages,
+				  FOLL_WRITE, dirty->pages);
+	if (ret <= 0)
+		return -EINVAL;
+
+	dirty->npages = ret;
+	dirty->iova = iommufd_dirty_iova(iter);
+	dirty->start_offset = offset_in_page(addr);
+	return 0;
+}
+
+static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+				      struct iommufd_dirty_data *bitmap)
+{
+	const struct iommu_domain_ops *ops = domain->ops;
+	struct iommu_iotlb_gather gather;
+	struct iommufd_dirty_iter iter;
+	int ret = 0;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -EOPNOTSUPP;
+
+	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
+				__ffs(bitmap->page_size), &gather);
+	ret = iommufd_dirty_iter_init(&iter, bitmap);
+	if (ret)
+		return -ENOMEM;
+
+	for (; iommufd_dirty_iter_done(&iter);
+	     iommufd_dirty_iter_advance(&iter)) {
+		ret = iommufd_dirty_iter_get(&iter);
+		if (ret)
+			break;
+
+		ret = ops->read_and_clear_dirty(domain,
+			iommufd_dirty_iova(&iter),
+			iommufd_dirty_iova_length(&iter), &iter.dirty);
+
+		iommufd_dirty_iter_put(&iter);
+
+		if (ret)
+			break;
+	}
+
+	iommu_iotlb_sync(domain, &gather);
+	iommufd_dirty_iter_free(&iter);
+
+	return ret;
+}
+
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   struct iommufd_dirty_data *bitmap)
+{
+	unsigned long iova, length, iova_end;
+	struct iommu_domain *dom;
+	struct iopt_area *area;
+	unsigned long index;
+	int ret = -EOPNOTSUPP;
+
+	iova = bitmap->iova;
+	length = bitmap->length - 1;
+	if (check_add_overflow(iova, length, &iova_end))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	area = iopt_find_exact_area(iopt, iova, iova_end);
+	if (!area) {
+		up_read(&iopt->iova_rwsem);
+		return -ENOENT;
+	}
+
+	if (!domain) {
+		down_read(&iopt->domains_rwsem);
+		xa_for_each(&iopt->domains, index, dom) {
+			ret = iommu_read_and_clear_dirty(dom, bitmap);
+			if (ret)
+				break;
+		}
+		up_read(&iopt->domains_rwsem);
+	} else {
+		ret = iommu_read_and_clear_dirty(domain, bitmap);
+	}
+
+	up_read(&iopt->iova_rwsem);
+	return ret;
+}
+
 struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 				  unsigned long *start_byte,
 				  unsigned long length)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d00ef3b785c5..4c12b4a8f1a6 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -8,6 +8,8 @@
 #include <linux/xarray.h>
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
+#include <linux/iommu.h>
+#include <linux/uio.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -49,8 +51,50 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
+struct iommufd_dirty_data {
+	unsigned long iova;
+	unsigned long length;
+	unsigned long page_size;
+	unsigned long *data;
+};
+
 int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 			    struct iommu_domain *domain, bool enable);
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   struct iommufd_dirty_data *bitmap);
+
+struct iommufd_dirty_iter {
+	struct iommu_dirty_bitmap dirty;
+	struct iovec bitmap_iov;
+	struct iov_iter bitmap_iter;
+	unsigned long iova;
+};
+
+void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter);
+int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter);
+int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
+			    struct iommufd_dirty_data *bitmap);
+void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter);
+bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter);
+void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter);
+unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter);
+unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter);
+static inline unsigned long dirty_bitmap_bytes(unsigned long nr_pages)
+{
+	return (ALIGN(nr_pages, BITS_PER_TYPE(u64)) / BITS_PER_BYTE);
+}
+
+/*
+ * Input argument of number of bits to bitmap_set() is unsigned integer, which
+ * further casts to signed integer for unaligned multi-bit operation,
+ * __bitmap_set().
+ * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
+ * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
+ * system.
+ */
+#define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
+#define DIRTY_BITMAP_SIZE_MAX   dirty_bitmap_bytes(DIRTY_BITMAP_PAGES_MAX)
 
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long npages, struct page **out_pages, bool write);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Today, the dirty state is lost on unmap and the page would not be
migrated to the destination, potentially leading the guest into
error.

Add an unmap API that reads the dirty bit and sets it in the
user-passed bitmap. This unmap iommu API tackles a potentially racy
update to the dirty bit *when* DMA is done to an IOVA that is being
unmapped at the same time.

The new unmap_read_dirty/unmap_pages_read_dirty ops do not replace
the existing unmap ops; they are only used when explicitly called
with dirty bitmap data passed in.

It could be argued that such a guest is buggy and that, rather than a
special unmap path tackling this theoretical race, it would suffice
to fetch the dirty bits (with GET_DIRTY_IOVA) and then unmap the IOVA.
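
For illustration, a minimal caller-side sketch in kernel context; the
helper name unmap_collect_dirty() is hypothetical and not part of the
patch, only iommu_unmap() and iommu_unmap_read_dirty() are real:

	/*
	 * Sketch only (kernel context, linux/iommu.h): unmap an IOVA
	 * range, optionally collecting the dirty state of the pages
	 * being torn down. With dirty == NULL this degenerates to the
	 * plain iommu_unmap().
	 */
	static size_t unmap_collect_dirty(struct iommu_domain *domain,
					  unsigned long iova, size_t size,
					  struct iommu_dirty_bitmap *dirty)
	{
		if (dirty)
			return iommu_unmap_read_dirty(domain, iova, size, dirty);

		return iommu_unmap(domain, iova, size);
	}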

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommu.c      | 43 +++++++++++++++++++++++++++++++-------
 include/linux/io-pgtable.h | 10 +++++++++
 include/linux/iommu.h      | 12 +++++++++++
 3 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d18b9ddbcce4..cc04263709ee 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2289,12 +2289,25 @@ EXPORT_SYMBOL_GPL(iommu_map_atomic);
 
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
 				  unsigned long iova, size_t size,
-				  struct iommu_iotlb_gather *iotlb_gather)
+				  struct iommu_iotlb_gather *iotlb_gather,
+				  struct iommu_dirty_bitmap *dirty)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
 	size_t pgsize, count;
 
 	pgsize = iommu_pgsize(domain, iova, iova, size, &count);
+
+	if (dirty) {
+		if (!ops->unmap_read_dirty && !ops->unmap_pages_read_dirty)
+			return 0;
+
+		return ops->unmap_pages_read_dirty ?
+		       ops->unmap_pages_read_dirty(domain, iova, pgsize,
+						   count, iotlb_gather, dirty) :
+		       ops->unmap_read_dirty(domain, iova, pgsize,
+					     iotlb_gather, dirty);
+	}
+
 	return ops->unmap_pages ?
 	       ops->unmap_pages(domain, iova, pgsize, count, iotlb_gather) :
 	       ops->unmap(domain, iova, pgsize, iotlb_gather);
@@ -2302,7 +2315,8 @@ static size_t __iommu_unmap_pages(struct iommu_domain *domain,
 
 static size_t __iommu_unmap(struct iommu_domain *domain,
 			    unsigned long iova, size_t size,
-			    struct iommu_iotlb_gather *iotlb_gather)
+			    struct iommu_iotlb_gather *iotlb_gather,
+			    struct iommu_dirty_bitmap *dirty)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
 	size_t unmapped_page, unmapped = 0;
@@ -2337,9 +2351,8 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 	 * or we hit an area that isn't mapped.
 	 */
 	while (unmapped < size) {
-		unmapped_page = __iommu_unmap_pages(domain, iova,
-						    size - unmapped,
-						    iotlb_gather);
+		unmapped_page = __iommu_unmap_pages(domain, iova, size - unmapped,
+						    iotlb_gather, dirty);
 		if (!unmapped_page)
 			break;
 
@@ -2361,18 +2374,34 @@ size_t iommu_unmap(struct iommu_domain *domain,
 	size_t ret;
 
 	iommu_iotlb_gather_init(&iotlb_gather);
-	ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
+	ret = __iommu_unmap(domain, iova, size, &iotlb_gather, NULL);
 	iommu_iotlb_sync(domain, &iotlb_gather);
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
+size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
+			      unsigned long iova, size_t size,
+			      struct iommu_dirty_bitmap *dirty)
+{
+	struct iommu_iotlb_gather iotlb_gather;
+	size_t ret;
+
+	iommu_iotlb_gather_init(&iotlb_gather);
+	ret = __iommu_unmap(domain, iova, size, &iotlb_gather, dirty);
+	iommu_iotlb_sync(domain, &iotlb_gather);
+
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_unmap_read_dirty);
+
 size_t iommu_unmap_fast(struct iommu_domain *domain,
 			unsigned long iova, size_t size,
 			struct iommu_iotlb_gather *iotlb_gather)
 {
-	return __iommu_unmap(domain, iova, size, iotlb_gather);
+	return __iommu_unmap(domain, iova, size, iotlb_gather, NULL);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 82b39925c21f..c2ebfe037f5d 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -171,6 +171,16 @@ struct io_pgtable_ops {
 	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
 				    unsigned long iova, size_t size,
 				    struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_read_dirty)(struct io_pgtable_ops *ops,
+				   unsigned long iova,
+				   size_t size,
+				   struct iommu_iotlb_gather *gather,
+				   struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_pages_read_dirty)(struct io_pgtable_ops *ops,
+					 unsigned long iova,
+					 size_t pgsize, size_t pgcount,
+					 struct iommu_iotlb_gather *gather,
+					 struct iommu_dirty_bitmap *dirty);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ca076365d77b..7c66b4e00556 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -340,6 +340,15 @@ struct iommu_domain_ops {
 	int (*read_and_clear_dirty)(struct iommu_domain *domain,
 				    unsigned long iova, size_t size,
 				    struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_read_dirty)(struct iommu_domain *domain,
+				   unsigned long iova, size_t size,
+				   struct iommu_iotlb_gather *iotlb_gather,
+				   struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_pages_read_dirty)(struct iommu_domain *domain,
+					 unsigned long iova,
+					 size_t pgsize, size_t pgcount,
+					 struct iommu_iotlb_gather *iotlb_gather,
+					 struct iommu_dirty_bitmap *dirty);
 };
 
 /**
@@ -463,6 +472,9 @@ extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
 			    phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
+extern size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
+				     unsigned long iova, size_t size,
+				     struct iommu_dirty_bitmap *dirty);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
 			       unsigned long iova, size_t size,
 			       struct iommu_iotlb_gather *iotlb_gather);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add an argument to the kAPI that unmaps an IOVA from the attached
domains, so that it can also receive a bitmap.

When an iommufd_dirty_data bitmap is passed, the special dirty unmap
(iommu_unmap_read_dirty()) is called instead. The bitmap data is
iterated in IOVA chunks, similarly to read_and_clear_dirty(), using
the previously added iommufd_dirty_iter* helper functions.
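
For illustration, a minimal kernel-context sketch of driving the new
argument; the wrapper unmap_and_collect() and its parameters are
hypothetical and the 4K page size is an assumption, only
iopt_unmap_iova() and struct iommufd_dirty_data come from this
series:

	/*
	 * Sketch only: @iopt, @iova, @length and @udata are placeholders
	 * supplied by the caller; @udata is the userspace address of a
	 * precisely sized bitmap, or NULL to keep the previous
	 * plain-unmap behaviour.
	 */
	static int unmap_and_collect(struct io_pagetable *iopt,
				     unsigned long iova, unsigned long length,
				     unsigned long *udata)
	{
		struct iommufd_dirty_data dirty = {
			.iova = iova,
			.length = length,
			.page_size = SZ_4K,	/* assumes 4K IOMMU pgsize */
			.data = udata,
		};

		return iopt_unmap_iova(iopt, iova, length,
				       udata ? &dirty : NULL);
	}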

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 13 ++--
 drivers/iommu/iommufd/io_pagetable.h    |  3 +-
 drivers/iommu/iommufd/ioas.c            |  2 +-
 drivers/iommu/iommufd/iommufd_private.h |  4 +-
 drivers/iommu/iommufd/pages.c           | 79 +++++++++++++++++++++----
 drivers/iommu/iommufd/vfio_compat.c     |  2 +-
 6 files changed, 80 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 835b5040fce9..6f4117c629d4 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -542,13 +542,14 @@ struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 }
 
 static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
-			     struct iopt_pages *pages)
+			     struct iopt_pages *pages,
+			     struct iommufd_dirty_data *bitmap)
 {
 	/* Drivers have to unpin on notification. */
 	if (WARN_ON(atomic_read(&area->num_users)))
 		return -EBUSY;
 
-	iopt_area_unfill_domains(area, pages);
+	iopt_area_unfill_domains(area, pages, bitmap);
 	WARN_ON(atomic_read(&area->num_users));
 	iopt_abort_area(area);
 	iopt_put_pages(pages);
@@ -560,12 +561,13 @@ static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
  * @iopt: io_pagetable to act on
  * @iova: Starting iova to unmap
  * @length: Number of bytes to unmap
+ * @bitmap: Bitmap of dirtied IOVAs
  *
  * The requested range must exactly match an existing range.
  * Splitting/truncating IOVA mappings is not allowed.
  */
 int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-		    unsigned long length)
+		    unsigned long length, struct iommufd_dirty_data *bitmap)
 {
 	struct iopt_pages *pages;
 	struct iopt_area *area;
@@ -590,7 +592,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 	area->pages = NULL;
 	up_write(&iopt->iova_rwsem);
 
-	rc = __iopt_unmap_iova(iopt, area, pages);
+	rc = __iopt_unmap_iova(iopt, area, pages, bitmap);
+
 	up_read(&iopt->domains_rwsem);
 	return rc;
 }
@@ -614,7 +617,7 @@ int iopt_unmap_all(struct io_pagetable *iopt)
 		area->pages = NULL;
 		up_write(&iopt->iova_rwsem);
 
-		rc = __iopt_unmap_iova(iopt, area, pages);
+		rc = __iopt_unmap_iova(iopt, area, pages, NULL);
 		if (rc)
 			goto out_unlock_domains;
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index c8b6a60ff24c..c8baab25ab08 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -48,7 +48,8 @@ struct iopt_area {
 };
 
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
-void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages,
+			      struct iommufd_dirty_data *bitmap);
 
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84b..19d6591aa005 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,7 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 			rc = -EOVERFLOW;
 			goto out_put;
 		}
-		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
 	}
 
 out_put:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 4c12b4a8f1a6..3e3a97f623a1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -47,8 +47,6 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
 int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
 		   unsigned long *dst_iova, unsigned long start_byte,
 		   unsigned long length, int iommu_prot, unsigned int flags);
-int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-		    unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
 struct iommufd_dirty_data {
@@ -63,6 +61,8 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
 				   struct iommu_domain *domain,
 				   struct iommufd_dirty_data *bitmap);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, struct iommufd_dirty_data *bitmap);
 
 struct iommufd_dirty_iter {
 	struct iommu_dirty_bitmap dirty;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 3fd39e0201f5..722c77cbbe3a 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -144,16 +144,64 @@ static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
+					  unsigned long iova, size_t size,
+					  struct iommufd_dirty_data *bitmap,
+					  struct iommufd_dirty_iter *iter)
+{
+	size_t ret = 0;
+
+	ret = iommufd_dirty_iter_init(iter, bitmap);
+	WARN_ON(ret);
+
+	for (; iommufd_dirty_iter_done(iter);
+	     iommufd_dirty_iter_advance(iter)) {
+		ret = iommufd_dirty_iter_get(iter);
+		if (ret < 0)
+			break;
+
+		ret = iommu_unmap_read_dirty(domain,
+			iommufd_dirty_iova(iter),
+			iommufd_dirty_iova_length(iter), &iter->dirty);
+
+		iommufd_dirty_iter_put(iter);
+
+		/*
+		 * It is a logic error in this code or a driver bug
+		 * if the IOMMU unmaps something other than exactly
+		 * as requested.
+		 */
+		if (ret != size) {
+			WARN_ONCE(1, "unmapped %ld instead of %ld", ret, size);
+			break;
+		}
+	}
+
+	iommufd_dirty_iter_free(iter);
+}
+
 static void iopt_area_unmap_domain_range(struct iopt_area *area,
 					 struct iommu_domain *domain,
 					 unsigned long start_index,
-					 unsigned long last_index)
+					 unsigned long last_index,
+					 struct iommufd_dirty_data *bitmap)
 {
 	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
 
-	iommu_unmap_nofail(domain, start_iova,
-			   iopt_area_index_to_iova_last(area, last_index) -
-				   start_iova + 1);
+	if (bitmap) {
+		struct iommufd_dirty_iter iter;
+
+		iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
+					__ffs(bitmap->page_size), NULL);
+
+		iommu_unmap_read_dirty_nofail(domain, start_iova,
+			iopt_area_index_to_iova_last(area, last_index) -
+					   start_iova + 1, bitmap, &iter);
+	} else {
+		iommu_unmap_nofail(domain, start_iova,
+				   iopt_area_index_to_iova_last(area, last_index) -
+					   start_iova + 1);
+	}
 }
 
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
@@ -808,7 +856,8 @@ static bool interval_tree_fully_covers_area(struct rb_root_cached *root,
 static void __iopt_area_unfill_domain(struct iopt_area *area,
 				      struct iopt_pages *pages,
 				      struct iommu_domain *domain,
-				      unsigned long last_index)
+				      unsigned long last_index,
+				      struct iommufd_dirty_data *bitmap)
 {
 	unsigned long unmapped_index = iopt_area_index(area);
 	unsigned long cur_index = unmapped_index;
@@ -821,7 +870,8 @@ static void __iopt_area_unfill_domain(struct iopt_area *area,
 	if (interval_tree_fully_covers_area(&pages->domains_itree, area) ||
 	    interval_tree_fully_covers_area(&pages->users_itree, area)) {
 		iopt_area_unmap_domain_range(area, domain,
-					     iopt_area_index(area), last_index);
+					     iopt_area_index(area),
+					     last_index, bitmap);
 		return;
 	}
 
@@ -837,7 +887,7 @@ static void __iopt_area_unfill_domain(struct iopt_area *area,
 		batch_from_domain(&batch, domain, area, cur_index, last_index);
 		cur_index += batch.total_pfns;
 		iopt_area_unmap_domain_range(area, domain, unmapped_index,
-					     cur_index - 1);
+					     cur_index - 1, bitmap);
 		unmapped_index = cur_index;
 		iopt_pages_unpin(pages, &batch, batch_index, cur_index - 1);
 		batch_clear(&batch);
@@ -852,7 +902,8 @@ static void iopt_area_unfill_partial_domain(struct iopt_area *area,
 					    unsigned long end_index)
 {
 	if (end_index != iopt_area_index(area))
-		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+		__iopt_area_unfill_domain(area, pages, domain,
+					  end_index - 1, NULL);
 }
 
 /**
@@ -891,7 +942,7 @@ void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
 			     struct iommu_domain *domain)
 {
 	__iopt_area_unfill_domain(area, pages, domain,
-				  iopt_area_last_index(area));
+				  iopt_area_last_index(area), NULL);
 }
 
 /**
@@ -1004,7 +1055,7 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 			if (end_index != iopt_area_index(area))
 				iopt_area_unmap_domain_range(
 					area, domain, iopt_area_index(area),
-					end_index - 1);
+					end_index - 1, NULL);
 		} else {
 			iopt_area_unfill_partial_domain(area, pages, domain,
 							end_index);
@@ -1025,7 +1076,8 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
  * Called during area destruction. This unmaps the iova's covered by all the
  * area's domains and releases the PFNs.
  */
-void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages,
+			      struct iommufd_dirty_data *bitmap)
 {
 	struct io_pagetable *iopt = area->iopt;
 	struct iommu_domain *domain;
@@ -1041,10 +1093,11 @@ void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
 		if (domain != area->storage_domain)
 			iopt_area_unmap_domain_range(
 				area, domain, iopt_area_index(area),
-				iopt_area_last_index(area));
+				iopt_area_last_index(area), bitmap);
 
 	interval_tree_remove(&area->pages_node, &pages->domains_itree);
-	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	__iopt_area_unfill_domain(area, pages, area->storage_domain,
+				  iopt_area_last_index(area), bitmap);
 	area->storage_domain = NULL;
 out_unlock:
 	mutex_unlock(&pages->mutex);
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index 5b196de00ff9..dbe39404a105 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -148,7 +148,7 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
 		rc = iopt_unmap_all(&ioas->iopt);
 	else
-		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size);
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size, NULL);
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 06/19] iommufd: Dirty tracking IOCTLs for the hw_pagetable
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Every IOMMU driver should be able to implement the needed
iommu domain ops to perform dirty tracking.

Connect a hw_pagetable to the IOMMU core dirty tracking ops. This
exposes all of the functionality for the UAPI:

- Enable/disable dirty tracking on an IOMMU domain (hw_pagetable id)
- Read the dirtied IOVAs (which clears the IOPTE dirty bits under the hood)
- Unmap and get the dirtied IOVAs

In doing so the previously internal iommufd_dirty_data structure is
moved over as the UAPI intermediate structure for representing iommufd
dirty bitmaps.

Contrary to past incarnations, the IOVA range to be scanned or
unmapped is tied to the bitmap size, and thus puts the burden on the
application to make sure it passes a precisely sized bitmap address,
as opposed to allowing base_iova != iova; this simplifies things
further.
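
For illustration, a rough sketch of the intended userspace flow. The
struct, field and flag names below are taken from the kernel-side
handlers in this patch, but the exact UAPI layout (including the
assumed size-prefix convention and field types) lives in the
include/uapi/linux/iommufd.h hunk, so treat this strictly as a guess
rather than the definitive usage:

	#include <sys/ioctl.h>
	#include <linux/iommufd.h>	/* as extended by this patch */

	/* Hypothetical helper: enable tracking, then read one range. */
	static int track_and_read_dirty(int iommufd, __u32 hwpt_id,
					__u64 iova, __u64 length,
					unsigned long *bitmap_buf)
	{
		struct iommu_hwpt_set_dirty set = {
			.size = sizeof(set),	/* assumed size-prefixed */
			.hwpt_id = hwpt_id,
			.flags = IOMMU_DIRTY_TRACKING_ENABLED,
		};
		struct iommu_hwpt_get_dirty_iova get = {
			.size = sizeof(get),	/* assumed size-prefixed */
			.hwpt_id = hwpt_id,
			.bitmap = {
				.iova = iova,		/* must match an area exactly */
				.length = length,
				.page_size = 4096,	/* smallest supported pgsize */
				.data = bitmap_buf,	/* precisely sized bitmap */
			},
		};

		if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &set))
			return -1;

		/* reads and clears the IOPTE dirty bits for the range */
		return ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get);
	}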

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 79 +++++++++++++++++++++++++
 drivers/iommu/iommufd/ioas.c            | 33 +++++++++++
 drivers/iommu/iommufd/iommufd_private.h | 22 ++++---
 drivers/iommu/iommufd/main.c            |  9 +++
 include/uapi/linux/iommufd.h            | 78 ++++++++++++++++++++++++
 5 files changed, 214 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index bafd7d07918b..943bcc3898a4 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
 
 #include "iommufd_private.h"
 
@@ -140,3 +141,81 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 	}
 	iommufd_object_destroy_user(ictx, &hwpt->obj);
 }
+
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_set_dirty *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = -EOPNOTSUPP;
+	bool enable;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	ioas = hwpt->ioas;
+	enable = cmd->flags & IOMMU_DIRTY_TRACKING_ENABLED;
+
+	rc = iopt_set_dirty_tracking(&ioas->iopt, hwpt->domain, enable);
+
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+			     struct iommufd_dirty_data *bitmap)
+{
+	unsigned long pgshift, npages;
+	size_t iommu_pgsize;
+	int rc = -EINVAL;
+	u64 bitmap_size;
+
+	pgshift = __ffs(bitmap->page_size);
+	npages = bitmap->length >> pgshift;
+	bitmap_size = dirty_bitmap_bytes(npages);
+
+	if (!npages || (bitmap_size > DIRTY_BITMAP_SIZE_MAX))
+		return rc;
+
+	if (!access_ok((void __user *) bitmap->data, bitmap_size))
+		return rc;
+
+	iommu_pgsize = 1 << __ffs(ioas->iopt.iova_alignment);
+
+	/* allow only smallest supported pgsize */
+	if (bitmap->page_size != iommu_pgsize)
+		return rc;
+
+	if (bitmap->iova & (iommu_pgsize - 1))
+		return rc;
+
+	if (!bitmap->length || bitmap->length & (iommu_pgsize - 1))
+		return rc;
+
+	return 0;
+}
+
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_get_dirty_iova *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = -EOPNOTSUPP;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	ioas = hwpt->ioas;
+	rc = iommufd_check_iova_range(ioas, &cmd->bitmap);
+	if (rc)
+		goto out_put;
+
+	rc = iopt_read_and_clear_dirty_data(&ioas->iopt, hwpt->domain,
+					    &cmd->bitmap);
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 19d6591aa005..50bef46bc0bb 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,6 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 			rc = -EOVERFLOW;
 			goto out_put;
 		}
+
 		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
 	}
 
@@ -250,3 +251,35 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
+
+int iommufd_ioas_unmap_dirty(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap_dirty *cmd = ucmd->cmd;
+	struct iommufd_dirty_data *bitmap;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	/* The bitmaps would be gigantic */
+	bitmap = &cmd->bitmap;
+	if (bitmap->iova == 0 && bitmap->length == U64_MAX)
+		return -EINVAL;
+
+	if (bitmap->iova >= ULONG_MAX || bitmap->length >= ULONG_MAX) {
+		rc = -EOVERFLOW;
+		goto out_put;
+	}
+
+	rc = iommufd_check_iova_range(ioas, bitmap);
+	if (rc)
+		goto out_put;
+
+	rc = iopt_unmap_iova(&ioas->iopt, bitmap->iova, bitmap->length, bitmap);
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3e3a97f623a1..68c77cf4793f 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -10,6 +10,7 @@
 #include <linux/uaccess.h>
 #include <linux/iommu.h>
 #include <linux/uio.h>
+#include <uapi/linux/iommufd.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -49,13 +50,6 @@ int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
 		   unsigned long length, int iommu_prot, unsigned int flags);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
-struct iommufd_dirty_data {
-	unsigned long iova;
-	unsigned long length;
-	unsigned long page_size;
-	unsigned long *data;
-};
-
 int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 			    struct iommu_domain *domain, bool enable);
 int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
@@ -244,7 +238,10 @@ int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap_dirty(struct iommufd_ucmd *ucmd);
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+			     struct iommufd_dirty_data *bitmap);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
@@ -263,6 +260,17 @@ struct iommufd_hw_pagetable {
 	struct list_head devices;
 };
 
+static inline struct iommufd_hw_pagetable *iommufd_get_hwpt(
+					struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_HW_PAGETABLE),
+			    struct iommufd_hw_pagetable, obj);
+}
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_unmap_dirty(struct iommufd_ucmd *ucmd);
+
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
 			     struct device *dev);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 0e34426eec9f..4785fc9f4fb3 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -192,7 +192,10 @@ union ucmd_buffer {
 	struct iommu_ioas_iova_ranges iova_ranges;
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
+	struct iommu_ioas_unmap_dirty unmap_dirty;
 	struct iommu_destroy destroy;
+	struct iommu_hwpt_set_dirty set_dirty;
+	struct iommu_hwpt_get_dirty_iova get_dirty_iova;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -226,8 +229,14 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 		 length),
+	IOCTL_OP(IOMMU_IOAS_UNMAP_DIRTY, iommufd_ioas_unmap_dirty,
+		 struct iommu_ioas_unmap_dirty, bitmap.data),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+	IOCTL_OP(IOMMU_HWPT_SET_DIRTY, iommufd_hwpt_set_dirty,
+		 struct iommu_hwpt_set_dirty, __reserved),
+	IOCTL_OP(IOMMU_HWPT_GET_DIRTY_IOVA, iommufd_hwpt_get_dirty_iova,
+		 struct iommu_hwpt_get_dirty_iova, bitmap.data),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2c0f5ced4173..01c5da7a1ab7 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -43,6 +43,9 @@ enum {
 	IOMMUFD_CMD_IOAS_COPY,
 	IOMMUFD_CMD_IOAS_UNMAP,
 	IOMMUFD_CMD_VFIO_IOAS,
+	IOMMUFD_CMD_HWPT_SET_DIRTY,
+	IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA,
+	IOMMUFD_CMD_IOAS_UNMAP_DIRTY,
 };
 
 /**
@@ -220,4 +223,79 @@ struct iommu_vfio_ioas {
 	__u16 __reserved;
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+
+/**
+ * enum iommufd_set_dirty_flags - Flags for steering dirty tracking
+ * @IOMMU_DIRTY_TRACKING_DISABLED: Disables dirty tracking
+ * @IOMMU_DIRTY_TRACKING_ENABLED: Enables dirty tracking
+ */
+enum iommufd_set_dirty_flags {
+	IOMMU_DIRTY_TRACKING_DISABLED = 0,
+	IOMMU_DIRTY_TRACKING_ENABLED = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)
+ * @size: sizeof(struct iommu_hwpt_set_dirty)
+ * @flags: Flags to control dirty tracking status.
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ *
+ * Toggle dirty tracking on an HW pagetable.
+ */
+struct iommu_hwpt_set_dirty {
+	__u32 size;
+	__u32 flags;
+	__u32 hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
+
+/**
+ * struct iommufd_dirty_data - Dirty IOVA tracking bitmap
+ * @iova: base IOVA of the bitmap
+ * @length: length of the IOVA range covered by the bitmap
+ * @page_size: page size granularity of each bit in the bitmap
+ * @data: bitmap where to set the dirty bits. Each bit in the bitmap
+ * represents one page_size unit of IOVA, counted from the base @iova.
+ * Checking whether a given IOVA is dirty, with bit = (iova - @iova) / page_size:
+ *
+ *  data[bit / 64] & (1ULL << (bit % 64))
+ */
+struct iommufd_dirty_data {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 *data;
+};
+
+/**
+ * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
+ * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
+ * @bitmap: Bitmap of the range of IOVA to read out
+ */
+struct iommu_hwpt_get_dirty_iova {
+	__u32 size;
+	__u32 hwpt_id;
+	struct iommufd_dirty_data bitmap;
+};
+#define IOMMU_HWPT_GET_DIRTY_IOVA _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA)
+
+/**
+ * struct iommu_ioas_unmap_dirty - ioctl(IOMMU_IOAS_UNMAP_DIRTY)
+ * @size: sizeof(struct iommu_ioas_unmap_dirty)
+ * @ioas_id: IOAS ID to unmap the mapping of
+ * @bitmap: Dirty bitmap of the IOVA range to unmap
+ *
+ * Unmap an IOVA range and return a bitmap of the dirty bits.
+ * The iova/length must exactly match a range used with
+ * IOMMU_IOAS_PAGETABLE_MAP. Unlike IOMMU_IOAS_UNMAP, the 0 & U64_MAX
+ * unmap-all form is rejected here, as the bitmap would be unbounded.
+ */
+struct iommu_ioas_unmap_dirty {
+	__u32 size;
+	__u32 ioas_id;
+	struct iommufd_dirty_data bitmap;
+};
+#define IOMMU_IOAS_UNMAP_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP_DIRTY)
+
 #endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add the corresponding APIs for performing VFIO dirty tracking,
particularly the VFIO_IOMMU_DIRTY_PAGES ioctl sub-commands:
* VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocate
				     the @dirty_bitmap area
* VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and free
				    the @dirty_bitmap area
* VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch the dirty bitmap while
					  dirty tracking is active.

Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION capability, reporting
the domain's supported page size as iopt::iova_alignment and the
maximum dirty bitmap size as the same limit VFIO uses. Compared to the
VFIO type1 iommu, perpetual dirtying is not implemented; userspace gets
-EOPNOTSUPP for it, which today's userspace already handles.

Move the iommufd_get_pagesizes() definition ahead of the unmap path so
that the iommufd_vfio_unmap_dma() dirty support can validate the user
bitmap page size against the IOPT page size.
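
For context, this mirrors how existing VFIO userspace (e.g. QEMU)
drives the type1 dirty tracking UAPI that the compat layer now
services. A rough sketch of the start + get-bitmap sequence, where
container_fd, iova, size, pgsize and bitmap_buf are placeholders:

  struct vfio_iommu_type1_dirty_bitmap start = {
          .argsz = sizeof(start),
          .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
  };
  struct vfio_iommu_type1_dirty_bitmap *get;
  struct vfio_iommu_type1_dirty_bitmap_get *range;
  size_t argsz = sizeof(*get) + sizeof(*range);

  /* Start dirty tracking on the (compat) container */
  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &start);

  /* Fetch the bitmap for [iova, iova + size) at the smallest pgsize */
  get = calloc(1, argsz);
  range = (struct vfio_iommu_type1_dirty_bitmap_get *)get->data;
  get->argsz = argsz;
  get->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
  range->iova = iova;
  range->size = size;
  range->bitmap.pgsize = pgsize;
  range->bitmap.size = ((size / pgsize) + 7) / 8;
  range->bitmap.data = (__u64 *)bitmap_buf;
  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, get);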

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
 1 file changed, 209 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index dbe39404a105..2802f49cc10d 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -56,6 +56,16 @@ create_compat_ioas(struct iommufd_ctx *ictx)
 	return ioas;
 }
 
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+	/* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
+	 * the high bits too, and we need to decide if we should report that
+	 * iommufd supports less than PAGE_SIZE alignment or stick to strict
+	 * compatibility. qemu only cares about the first set bit.
+	 */
+	return ioas->iopt.iova_alignment;
+}
+
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
 {
 	struct iommu_vfio_ioas *cmd = ucmd->cmd;
@@ -130,9 +140,14 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
 				  void __user *arg)
 {
 	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
-	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL |
+		VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+	struct iommufd_dirty_data dirty, *dirtyp = NULL;
 	struct vfio_iommu_type1_dma_unmap unmap;
+	struct vfio_bitmap bitmap;
 	struct iommufd_ioas *ioas;
+	unsigned long pgshift;
+	size_t pgsize;
 	int rc;
 
 	if (copy_from_user(&unmap, arg, minsz))
@@ -141,14 +156,53 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
 	if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
 		return -EINVAL;
 
+	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+		unsigned long npages;
+
+		if (copy_from_user(&bitmap,
+				   (void __user *)(arg + minsz),
+				   sizeof(bitmap)))
+			return -EFAULT;
+
+		if (!access_ok((void __user *)bitmap.data, bitmap.size))
+			return -EINVAL;
+
+		pgshift = __ffs(bitmap.pgsize);
+		npages = unmap.size >> pgshift;
+
+		if (!npages || !bitmap.size ||
+		    (bitmap.size > DIRTY_BITMAP_SIZE_MAX) ||
+		    (bitmap.size < dirty_bitmap_bytes(npages)))
+			return -EINVAL;
+
+		dirty.iova = unmap.iova;
+		dirty.length = unmap.size;
+		dirty.data = bitmap.data;
+		dirty.page_size = 1 << pgshift;
+		dirtyp = &dirty;
+	}
+
 	ioas = get_compat_ioas(ictx);
 	if (IS_ERR(ioas))
 		return PTR_ERR(ioas);
 
+	pgshift = __ffs(iommufd_get_pagesizes(ioas));
+	pgsize = (size_t)1 << pgshift;
+
+	/* When dirty tracking is enabled, allow only min supported pgsize */
+	if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+	    (bitmap.pgsize != pgsize)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
 		rc = iopt_unmap_all(&ioas->iopt);
 	else
-		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size, NULL);
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size,
+				     dirtyp);
+
+out_put:
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
@@ -222,16 +276,6 @@ static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type)
 	return 0;
 }
 
-static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
-{
-	/* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
-	 * the high bits too, and we need to decide if we should report that
-	 * iommufd supports less than PAGE_SIZE alignment or stick to strict
-	 * compatibility. qemu only cares about the first set bit.
-	 */
-	return ioas->iopt.iova_alignment;
-}
-
 static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas,
 				 struct vfio_info_cap_header __user *cur,
 				 size_t avail)
@@ -289,6 +333,26 @@ static int iommufd_fill_cap_dma_avail(struct iommufd_ioas *ioas,
 	return sizeof(cap_dma);
 }
 
+static int iommufd_fill_cap_migration(struct iommufd_ioas *ioas,
+				      struct vfio_info_cap_header __user *cur,
+				      size_t avail)
+{
+	struct vfio_iommu_type1_info_cap_migration cap_mig = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION,
+			.version = 1,
+		},
+		.flags = 0,
+		.pgsize_bitmap = (size_t) 1 << __ffs(iommufd_get_pagesizes(ioas)),
+		.max_dirty_bitmap_size = DIRTY_BITMAP_SIZE_MAX,
+	};
+
+	if (avail >= sizeof(cap_mig) &&
+	    copy_to_user(cur, &cap_mig, sizeof(cap_mig)))
+		return -EFAULT;
+	return sizeof(cap_mig);
+}
+
 static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
 				       void __user *arg)
 {
@@ -298,6 +362,7 @@ static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
 	static const fill_cap_fn fill_fns[] = {
 		iommufd_fill_cap_iova,
 		iommufd_fill_cap_dma_avail,
+		iommufd_fill_cap_migration,
 	};
 	size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
 	struct vfio_info_cap_header __user *last_cap = NULL;
@@ -364,6 +429,137 @@ static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
 	return rc;
 }
 
+static int iommufd_vfio_dirty_pages_start(struct iommufd_ctx *ictx,
+				struct vfio_iommu_type1_dirty_bitmap *dirty)
+{
+	struct iommufd_ioas *ioas;
+	int ret = -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	ret = iopt_set_dirty_tracking(&ioas->iopt, NULL, true);
+
+	iommufd_put_object(&ioas->obj);
+
+	return ret;
+}
+
+static int iommufd_vfio_dirty_pages_stop(struct iommufd_ctx *ictx,
+				struct vfio_iommu_type1_dirty_bitmap *dirty)
+{
+	struct iommufd_ioas *ioas;
+	int ret;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	ret = iopt_set_dirty_tracking(&ioas->iopt, NULL, false);
+
+	iommufd_put_object(&ioas->obj);
+
+	return ret;
+}
+
+static int iommufd_vfio_dirty_pages_get_bitmap(struct iommufd_ctx *ictx,
+				struct vfio_iommu_type1_dirty_bitmap_get *range)
+{
+	struct iommufd_dirty_data bitmap;
+	uint64_t npages, bitmap_size;
+	struct iommufd_ioas *ioas;
+	unsigned long pgshift;
+	size_t iommu_pgsize;
+	int ret = -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	pgshift = __ffs(range->bitmap.pgsize);
+	npages = range->size >> pgshift;
+	bitmap_size = range->bitmap.size;
+
+	if (!npages || !bitmap_size || (bitmap_size > DIRTY_BITMAP_SIZE_MAX) ||
+	    (bitmap_size < dirty_bitmap_bytes(npages)))
+		goto out_put;
+
+	iommu_pgsize = 1 << __ffs(iommufd_get_pagesizes(ioas));
+
+	/* allow only smallest supported pgsize */
+	if (range->bitmap.pgsize != iommu_pgsize)
+		goto out_put;
+
+	if (range->iova & (iommu_pgsize - 1))
+		goto out_put;
+
+	if (!range->size || range->size & (iommu_pgsize - 1))
+		goto out_put;
+
+	bitmap.iova = range->iova;
+	bitmap.length = range->size;
+	bitmap.data = range->bitmap.data;
+	bitmap.page_size = 1 << pgshift;
+
+	ret = iopt_read_and_clear_dirty_data(&ioas->iopt, NULL, &bitmap);
+
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return ret;
+}
+
+static int iommufd_vfio_dirty_pages(struct iommufd_ctx *ictx, unsigned int cmd,
+				    void __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap, flags);
+	struct vfio_iommu_type1_dirty_bitmap dirty;
+	u32 supported_flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
+			VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
+			VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+	int ret = 0;
+
+	if (copy_from_user(&dirty, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (dirty.argsz < minsz || dirty.flags & ~supported_flags)
+		return -EINVAL;
+
+	/* only one flag should be set at a time */
+	if (__ffs(dirty.flags) != __fls(dirty.flags))
+		return -EINVAL;
+
+	if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
+		ret = iommufd_vfio_dirty_pages_start(ictx, &dirty);
+	} else if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
+		ret = iommufd_vfio_dirty_pages_stop(ictx, &dirty);
+	} else if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+		struct vfio_iommu_type1_dirty_bitmap_get range;
+		size_t data_size = dirty.argsz - minsz;
+
+		if (!data_size || data_size < sizeof(range))
+			return -EINVAL;
+
+		if (copy_from_user(&range, (void __user *)(arg + minsz),
+				   sizeof(range)))
+			return -EFAULT;
+
+		if (range.iova + range.size < range.iova)
+			return -EINVAL;
+
+		if (!access_ok((void __user *)range.bitmap.data,
+			       range.bitmap.size))
+			return -EINVAL;
+
+		ret = iommufd_vfio_dirty_pages_get_bitmap(ictx, &range);
+	}
+
+	return ret;
+}
+
+
 /* FIXME TODO:
 PowerPC SPAPR only:
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
@@ -394,6 +590,7 @@ int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
 	case VFIO_IOMMU_UNMAP_DMA:
 		return iommufd_vfio_unmap_dma(ictx, cmd, uarg);
 	case VFIO_IOMMU_DIRTY_PAGES:
+		return iommufd_vfio_dirty_pages(ictx, cmd, uarg);
 	default:
 		return -ENOIOCTLCMD;
 	}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add a new test ioctl for simulating dirty IOVAs in the mock domain,
and implement the mock iommu domain ops that provide dirty tracking
support.

The selftest exercises the main workflow of:

1) Setting/Clearing dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
3) Unmap and read dirty back
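
Checking the results in steps 2) and 3) boils down to mapping an IOVA
back to its bit in the user bitmap, relative to the base IOVA the
bitmap was requested with. A small helper sketch (iova_is_dirty() is
illustrative, not part of the patch):

  static bool iova_is_dirty(__u64 *bitmap, __u64 base_iova, __u64 iova,
                            __u64 page_size)
  {
          __u64 bit = (iova - base_iova) / page_size;

          return bitmap[bit / 64] & (1ULL << (bit % 64));
  }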

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iommufd_test.h    |   9 ++
 drivers/iommu/iommufd/selftest.c        | 137 +++++++++++++++++++++++-
 tools/testing/selftests/iommu/Makefile  |   1 +
 tools/testing/selftests/iommu/iommufd.c | 135 +++++++++++++++++++++++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index d22ef484af1a..90dafa513078 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -14,6 +14,7 @@ enum {
 	IOMMU_TEST_OP_MD_CHECK_REFS,
 	IOMMU_TEST_OP_ACCESS_PAGES,
 	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+	IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -57,6 +58,14 @@ struct iommu_test_cmd {
 		struct {
 			__u32 limit;
 		} memory_limit;
+		struct {
+			__u32 flags;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 page_size;
+			__aligned_u64 uptr;
+			__aligned_u64 out_nr_dirty;
+		} dirty;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a665719b493e..b02309722436 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+	MOCK_DIRTY_TRACK = 1,
 	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
 	/*
@@ -25,9 +26,11 @@ enum {
 	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
 	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
 	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 struct mock_iommu_domain {
+	unsigned long flags;
 	struct iommu_domain domain;
 	struct xarray pfns;
 };
@@ -133,7 +136,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 
 		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
 			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-			WARN_ON(!ent);
+
 			/*
 			 * iommufd generates unmaps that must be a strict
 			 * superset of the map's performend So every starting
@@ -143,12 +146,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 			 * passed to map_pages
 			 */
 			if (first) {
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_START_IOVA));
 				first = false;
 			}
 			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_LAST_IOVA));
 
 			iova += MOCK_IO_PAGE_SIZE;
@@ -171,6 +174,75 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
 	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = mock->flags;
+
+	/* No change? */
+	if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+		return -EINVAL;
+
+	flags = (enable ?
+		 flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+	mock->flags = flags;
+	return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+	void *ent, *old;
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK))
+		return -EINVAL;
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+		ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
+		if (ent &&
+		    (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)) {
+			unsigned long val;
+
+			/* Clear dirty */
+			val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			iommu_dirty_bitmap_record(dirty, cur, MOCK_IO_PAGE_SIZE);
+		}
+	}
+
+	return 0;
+}
+
+static size_t mock_domain_unmap_read_dirty(struct iommu_domain *domain,
+					   unsigned long iova, size_t page_size,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(page_size != MOCK_IO_PAGE_SIZE);
+
+	ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	if (ent && (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA) &&
+	    (mock->flags & MOCK_DIRTY_TRACK))
+		iommu_dirty_bitmap_record(dirty, iova, page_size);
+
+	return ent ? page_size : 0;
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -181,6 +253,9 @@ static const struct iommu_ops mock_ops = {
 			.map_pages = mock_domain_map_pages,
 			.unmap_pages = mock_domain_unmap_pages,
 			.iova_to_phys = mock_domain_iova_to_phys,
+			.set_dirty_tracking = mock_domain_set_dirty_tracking,
+			.read_and_clear_dirty = mock_domain_read_and_clear_dirty,
+			.unmap_read_dirty = mock_domain_unmap_read_dirty,
 		},
 };
 
@@ -442,6 +517,56 @@ static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
 	return rc;
 }
 
+static int iommufd_test_dirty(struct iommufd_ucmd *ucmd,
+			      unsigned int mockpt_id, unsigned long iova,
+			      size_t length, unsigned long page_size,
+			      void __user *uptr, u32 flags)
+{
+	unsigned long i, max = length / page_size;
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc, count = 0;
+
+	if (iova % page_size || length % page_size ||
+	    (uintptr_t)uptr % page_size)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * page_size;
+		void *ent, *old;
+
+		if (!test_bit(i, (unsigned long *) uptr))
+			continue;
+
+		ent = xa_load(&mock->pfns, cur / page_size);
+		if (ent) {
+			unsigned long val;
+
+			val = xa_to_value(ent) | MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / page_size,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			count++;
+		}
+	}
+
+	cmd->dirty.out_nr_dirty = count;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 void iommufd_selftest_destroy(struct iommufd_object *obj)
 {
 	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
@@ -486,6 +611,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 			cmd->access_pages.length,
 			u64_to_user_ptr(cmd->access_pages.uptr),
 			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_DIRTY:
+		return iommufd_test_dirty(
+			ucmd, cmd->id, cmd->dirty.iova,
+			cmd->dirty.length, cmd->dirty.page_size,
+			u64_to_user_ptr(cmd->dirty.uptr),
+			cmd->dirty.flags);
 	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
 		iommufd_test_memory_limit = cmd->memory_limit.limit;
 		return 0;
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
index 7bc38b3beaeb..48d4dcf11506 100644
--- a/tools/testing/selftests/iommu/Makefile
+++ b/tools/testing/selftests/iommu/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../tools/include/
 CFLAGS += -I../../../../include/uapi/
 CFLAGS += -I../../../../include/
 
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed94..3a494f7958f4 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -13,13 +13,18 @@
 #define __EXPORTED_HEADERS__
 #include <linux/iommufd.h>
 #include <linux/vfio.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 #include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+#define BITS_PER_BYTE 8
 
 static void *buffer;
+static void *bitmap;
 
 static unsigned long PAGE_SIZE;
 static unsigned long HUGEPAGE_SIZE;
 static unsigned long BUFFER_SIZE;
+static unsigned long BITMAP_SIZE;
 
 #define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
 
@@ -52,6 +57,10 @@ static __attribute__((constructor)) void setup_sizes(void)
 	BUFFER_SIZE = PAGE_SIZE * 16;
 	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
 	assert(rc || buffer || (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+
+	BITMAP_SIZE = BUFFER_SIZE / MOCK_PAGE_SIZE / BITS_PER_BYTE;
+	rc = posix_memalign(&bitmap, PAGE_SIZE, BUFFER_SIZE);
+	assert(rc || bitmap || (uintptr_t)bitmap % PAGE_SIZE == 0);
 }
 
 /*
@@ -546,6 +555,132 @@ TEST_F(iommufd_ioas, iova_ranges)
 	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
 }
 
+TEST_F(iommufd_ioas, dirty)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_hwpt_set_dirty set_dirty_cmd = {
+		.size = sizeof(set_dirty_cmd),
+		.flags = IOMMU_DIRTY_TRACKING_ENABLED,
+		.hwpt_id = self->ioas_id,
+	};
+	struct iommu_test_cmd dirty_cmd = {
+		.size = sizeof(dirty_cmd),
+		.op = IOMMU_TEST_OP_DIRTY,
+		.id = self->ioas_id,
+		.dirty = { .iova = MOCK_APERTURE_START,
+			   .length = BUFFER_SIZE,
+			   .page_size = MOCK_PAGE_SIZE,
+			   .uptr = (uintptr_t)bitmap },
+	};
+	struct iommu_hwpt_get_dirty_iova get_dirty_cmd = {
+		.size = sizeof(get_dirty_cmd),
+		.hwpt_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		}
+	};
+	struct iommu_ioas_unmap_dirty unmap_dirty_cmd = {
+		.size = sizeof(unmap_dirty_cmd),
+		.ioas_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		},
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+	unsigned long i, count, nbits = BITMAP_SIZE * BITS_PER_BYTE;
+
+	/* Toggle dirty with a domain and a single map */
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	set_dirty_cmd.hwpt_id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+
+	dirty_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	get_dirty_cmd.hwpt_id = mock_cmd.id;
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* Should be all zeroes */
+	for (i = 0; i < nbits; i++)
+		ASSERT_EQ(0, test_bit(i, (unsigned long *) bitmap));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_IOAS_UNMAP_DIRTY, &unmap_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	set_dirty_cmd.flags = IOMMU_DIRTY_TRACKING_DISABLED;
+	ASSERT_EQ(0,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	destroy_cmd.id = mock_cmd.mock_domain.device_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
 TEST_F(iommufd_ioas, access)
 {
 	struct iommu_ioas_map map_cmd = {
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls
@ 2022-04-28 21:09   ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Add a new test ioctl for simulating dirty IOVAs in the mock domain,
and implement the mock iommu domain ops that provide dirty tracking
support.

The selftest exercises the main workflow of:

1) Setting/Clearing dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
3) Unmap and read dirty back

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iommufd_test.h    |   9 ++
 drivers/iommu/iommufd/selftest.c        | 137 +++++++++++++++++++++++-
 tools/testing/selftests/iommu/Makefile  |   1 +
 tools/testing/selftests/iommu/iommufd.c | 135 +++++++++++++++++++++++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index d22ef484af1a..90dafa513078 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -14,6 +14,7 @@ enum {
 	IOMMU_TEST_OP_MD_CHECK_REFS,
 	IOMMU_TEST_OP_ACCESS_PAGES,
 	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+	IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -57,6 +58,14 @@ struct iommu_test_cmd {
 		struct {
 			__u32 limit;
 		} memory_limit;
+		struct {
+			__u32 flags;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 page_size;
+			__aligned_u64 uptr;
+			__aligned_u64 out_nr_dirty;
+		} dirty;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a665719b493e..b02309722436 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+	MOCK_DIRTY_TRACK = 1,
 	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
 	/*
@@ -25,9 +26,11 @@ enum {
 	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
 	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
 	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 struct mock_iommu_domain {
+	unsigned long flags;
 	struct iommu_domain domain;
 	struct xarray pfns;
 };
@@ -133,7 +136,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 
 		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
 			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-			WARN_ON(!ent);
+
 			/*
 			 * iommufd generates unmaps that must be a strict
 			 * superset of the map's performend So every starting
@@ -143,12 +146,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 			 * passed to map_pages
 			 */
 			if (first) {
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_START_IOVA));
 				first = false;
 			}
 			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_LAST_IOVA));
 
 			iova += MOCK_IO_PAGE_SIZE;
@@ -171,6 +174,75 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
 	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = mock->flags;
+
+	/* No change? */
+	if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+		return -EINVAL;
+
+	flags = (enable ?
+		 flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+	mock->flags = flags;
+	return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+	void *ent, *old;
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK))
+		return -EINVAL;
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+		ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
+		if (ent &&
+		    (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)) {
+			unsigned long val;
+
+			/* Clear dirty */
+			val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			iommu_dirty_bitmap_record(dirty, cur, MOCK_IO_PAGE_SIZE);
+		}
+	}
+
+	return 0;
+}
+
+static size_t mock_domain_unmap_read_dirty(struct iommu_domain *domain,
+					   unsigned long iova, size_t page_size,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(page_size != MOCK_IO_PAGE_SIZE);
+
+	ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	if (ent && (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA) &&
+	    (mock->flags & MOCK_DIRTY_TRACK))
+		iommu_dirty_bitmap_record(dirty, iova, page_size);
+
+	return ent ? page_size : 0;
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -181,6 +253,9 @@ static const struct iommu_ops mock_ops = {
 			.map_pages = mock_domain_map_pages,
 			.unmap_pages = mock_domain_unmap_pages,
 			.iova_to_phys = mock_domain_iova_to_phys,
+			.set_dirty_tracking = mock_domain_set_dirty_tracking,
+			.read_and_clear_dirty = mock_domain_read_and_clear_dirty,
+			.unmap_read_dirty = mock_domain_unmap_read_dirty,
 		},
 };
 
@@ -442,6 +517,56 @@ static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
 	return rc;
 }
 
+static int iommufd_test_dirty(struct iommufd_ucmd *ucmd,
+			      unsigned int mockpt_id, unsigned long iova,
+			      size_t length, unsigned long page_size,
+			      void __user *uptr, u32 flags)
+{
+	unsigned long i, max = length / page_size;
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc, count = 0;
+
+	if (iova % page_size || length % page_size ||
+	    (uintptr_t)uptr % page_size)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * page_size;
+		void *ent, *old;
+
+		if (!test_bit(i, (unsigned long *) uptr))
+			continue;
+
+		ent = xa_load(&mock->pfns, cur / page_size);
+		if (ent) {
+			unsigned long val;
+
+			val = xa_to_value(ent) | MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / page_size,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			count++;
+		}
+	}
+
+	cmd->dirty.out_nr_dirty = count;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 void iommufd_selftest_destroy(struct iommufd_object *obj)
 {
 	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
@@ -486,6 +611,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 			cmd->access_pages.length,
 			u64_to_user_ptr(cmd->access_pages.uptr),
 			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_DIRTY:
+		return iommufd_test_dirty(
+			ucmd, cmd->id, cmd->dirty.iova,
+			cmd->dirty.length, cmd->dirty.page_size,
+			u64_to_user_ptr(cmd->dirty.uptr),
+			cmd->dirty.flags);
 	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
 		iommufd_test_memory_limit = cmd->memory_limit.limit;
 		return 0;
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
index 7bc38b3beaeb..48d4dcf11506 100644
--- a/tools/testing/selftests/iommu/Makefile
+++ b/tools/testing/selftests/iommu/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../tools/include/
 CFLAGS += -I../../../../include/uapi/
 CFLAGS += -I../../../../include/
 
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed94..3a494f7958f4 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -13,13 +13,18 @@
 #define __EXPORTED_HEADERS__
 #include <linux/iommufd.h>
 #include <linux/vfio.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 #include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+#define BITS_PER_BYTE 8
 
 static void *buffer;
+static void *bitmap;
 
 static unsigned long PAGE_SIZE;
 static unsigned long HUGEPAGE_SIZE;
 static unsigned long BUFFER_SIZE;
+static unsigned long BITMAP_SIZE;
 
 #define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
 
@@ -52,6 +57,10 @@ static __attribute__((constructor)) void setup_sizes(void)
 	BUFFER_SIZE = PAGE_SIZE * 16;
 	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
 	assert(rc || buffer || (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+
+	BITMAP_SIZE = BUFFER_SIZE / MOCK_PAGE_SIZE / BITS_PER_BYTE;
+	rc = posix_memalign(&bitmap, PAGE_SIZE, BITMAP_SIZE);
+	assert(rc || bitmap || (uintptr_t)bitmap % PAGE_SIZE == 0);
 }
 
 /*
@@ -546,6 +555,132 @@ TEST_F(iommufd_ioas, iova_ranges)
 	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
 }
 
+TEST_F(iommufd_ioas, dirty)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_hwpt_set_dirty set_dirty_cmd = {
+		.size = sizeof(set_dirty_cmd),
+		.flags = IOMMU_DIRTY_TRACKING_ENABLED,
+		.hwpt_id = self->ioas_id,
+	};
+	struct iommu_test_cmd dirty_cmd = {
+		.size = sizeof(dirty_cmd),
+		.op = IOMMU_TEST_OP_DIRTY,
+		.id = self->ioas_id,
+		.dirty = { .iova = MOCK_APERTURE_START,
+			   .length = BUFFER_SIZE,
+			   .page_size = MOCK_PAGE_SIZE,
+			   .uptr = (uintptr_t)bitmap },
+	};
+	struct iommu_hwpt_get_dirty_iova get_dirty_cmd = {
+		.size = sizeof(get_dirty_cmd),
+		.hwpt_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		}
+	};
+	struct iommu_ioas_unmap_dirty unmap_dirty_cmd = {
+		.size = sizeof(unmap_dirty_cmd),
+		.ioas_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		},
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+	unsigned long i, count, nbits = BITMAP_SIZE * BITS_PER_BYTE;
+
+	/* Toggle dirty with a domain and a single map */
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	set_dirty_cmd.hwpt_id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+
+	dirty_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	get_dirty_cmd.hwpt_id = mock_cmd.id;
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* Should be all zeroes */
+	for (i = 0; i < nbits; i++)
+		ASSERT_EQ(0, test_bit(i, (unsigned long *) bitmap));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_IOAS_UNMAP_DIRTY, &unmap_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	set_dirty_cmd.flags = IOMMU_DIRTY_TRACKING_DISABLED;
+	ASSERT_EQ(0,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	destroy_cmd.id = mock_cmd.mock_domain.device_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
 TEST_F(iommufd_ioas, access)
 {
 	struct iommu_ioas_map map_cmd = {
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

The IOMMU advertises Access/Dirty bits if the extended feature register
reports it. Relevant AMD IOMMU SDM ref [0],
"1.3.8 Enhanced Support for Access and Dirty Bits".

To enable it, set the DTE flags in bits 7 and 8 to enable access, or
access+dirty. With that, the IOMMU starts marking the D and A flags on
every memory request or ATS translation request. It is up to the VMM to
steer whether dirty tracking is enabled or not, rather than the IOMMU
doing so unconditionally. Relevant AMD IOMMU SDM ref [0], "Table 7.
Device Table Entry (DTE) Field Definitions", particularly the entry
"HAD".

Toggling it on and off is relatively simple: it amounts to setting two
bits in the DTE and flushing the device DTE cache.

To read out what has been dirtied, use the existing AMD io-pgtable
support, walking the pagetables over each IOVA with fetch_pte(). The
IOTLB flushing is left to the caller (much like unmap), and
iommu_dirty_bitmap_record() is what adds the page ranges to invalidate.
This allows the caller to batch the flush over a big span of IOVA
space, without the IOMMU having to guess when to flush.
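
In rough pseudo-C, the expected calling pattern is the following (a
sketch only; how struct iommu_dirty_bitmap is wired to the user bitmap
and to the IOTLB gather is defined by the iommufd patches earlier in
the series):

	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;	/* refers to the user bitmap and @gather */

	iommu_iotlb_gather_init(&gather);

	/* each call records dirty IOVA ranges via iommu_dirty_bitmap_record()
	 * and accumulates the ranges that will need invalidation */
	domain->ops->read_and_clear_dirty(domain, iova, length, &dirty);

	/* one IOTLB flush for the whole harvested span */
	iommu_iotlb_sync(domain, &gather);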

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how the IOMMU hardware updates the dirty bit, and what
it expects from its subsequent clearing by the CPU, see:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

Quoting the SDM:

"The setting of accessed and dirty status bits in the page tables is
visible to both the CPU and the peripheral when sharing guest page
tables. The IOMMU interlocked operations to update A and D bits must be
64-bit operations and naturally aligned on a 64-bit boundary"

... and for the IOMMU update sequence of the Dirty bit, it essentially
states:

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write
intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the
atomic lock with no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

Access/Dirty bit readout also needs to consider the non-default page
sizes (aka replicated PTEs, as mentioned by the manual), as AMD
supports all power-of-two page sizes (except 512G) even though the
underlying IOTLB mappings are restricted to the same ones as supported
by the CPU (4K, 2M, 1G). It makes one wonder whether AMD_IOMMU_PGSIZES
ought to avoid advertising non-default page sizes at all when creating
an UNMANAGED domain, or when dirty tracking is being toggled on.
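
As a concrete example, a 32K mapping is encoded as eight replicated 4K
PTEs, so the readout has to OR the dirty state of all of them; roughly
(a sketch of the readout side of pte_test_and_clear_dirty() below):

	u64 *ptep = fetch_pte(pgtable, iova, &pgsize);	/* pgsize == SZ_32K */
	unsigned long i, count = PAGE_SIZE_PTE_COUNT(pgsize);	/* == 8 */
	bool dirty = false;

	for (i = 0; i < count; i++)
		dirty |= !!(READ_ONCE(ptep[i]) & IOMMU_PTE_HD);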

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/amd/amd_iommu.h       |  1 +
 drivers/iommu/amd/amd_iommu_types.h | 11 +++++
 drivers/iommu/amd/init.c            |  8 ++-
 drivers/iommu/amd/io_pgtable.c      | 56 +++++++++++++++++++++
 drivers/iommu/amd/iommu.c           | 77 +++++++++++++++++++++++++++++
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..2f16ad8f7514 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -34,6 +34,7 @@ extern int amd_iommu_reenable(int);
 extern int amd_iommu_enable_faulting(void);
 extern int amd_iommu_guest_ir;
 extern enum io_pgtable_fmt amd_iommu_pgtable;
+extern bool amd_iommu_had_support;
 
 /* IOMMUv2 specific functions */
 struct iommu_domain;
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 47108ed44fbb..c1eba8fce4bb 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -93,7 +93,9 @@
 #define FEATURE_HE		(1ULL<<8)
 #define FEATURE_PC		(1ULL<<9)
 #define FEATURE_GAM_VAPIC	(1ULL<<21)
+#define FEATURE_HASUP		(1ULL<<49)
 #define FEATURE_EPHSUP		(1ULL<<50)
+#define FEATURE_HDSUP		(1ULL<<52)
 #define FEATURE_SNP		(1ULL<<63)
 
 #define FEATURE_PASID_SHIFT	32
@@ -197,6 +199,7 @@
 /* macros and definitions for device table entries */
 #define DEV_ENTRY_VALID         0x00
 #define DEV_ENTRY_TRANSLATION   0x01
+#define DEV_ENTRY_HAD           0x07
 #define DEV_ENTRY_PPR           0x34
 #define DEV_ENTRY_IR            0x3d
 #define DEV_ENTRY_IW            0x3e
@@ -350,10 +353,16 @@
 #define PTE_LEVEL_PAGE_SIZE(level)			\
 	(1ULL << (12 + (9 * (level))))
 
+/*
+ * The IOPTE dirty bit
+ */
+#define IOMMU_PTE_HD_BIT (6)
+
 /*
  * Bit value definition for I/O PTE fields
  */
 #define IOMMU_PTE_PR (1ULL << 0)
+#define IOMMU_PTE_HD (1ULL << IOMMU_PTE_HD_BIT)
 #define IOMMU_PTE_U  (1ULL << 59)
 #define IOMMU_PTE_FC (1ULL << 60)
 #define IOMMU_PTE_IR (1ULL << 61)
@@ -364,6 +373,7 @@
  */
 #define DTE_FLAG_V  (1ULL << 0)
 #define DTE_FLAG_TV (1ULL << 1)
+#define DTE_FLAG_HAD (3ULL << 7)
 #define DTE_FLAG_IR (1ULL << 61)
 #define DTE_FLAG_IW (1ULL << 62)
 
@@ -390,6 +400,7 @@
 
 #define IOMMU_PAGE_MASK (((1ULL << 52) - 1) & ~0xfffULL)
 #define IOMMU_PTE_PRESENT(pte) ((pte) & IOMMU_PTE_PR)
+#define IOMMU_PTE_DIRTY(pte) ((pte) & IOMMU_PTE_HD)
 #define IOMMU_PTE_PAGE(pte) (iommu_phys_to_virt((pte) & IOMMU_PAGE_MASK))
 #define IOMMU_PTE_MODE(pte) (((pte) >> 9) & 0x07)
 
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b4a798c7b347..27f2cf61d0c6 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -149,6 +149,7 @@ struct ivmd_header {
 
 bool amd_iommu_dump;
 bool amd_iommu_irq_remap __read_mostly;
+bool amd_iommu_had_support __read_mostly;
 
 enum io_pgtable_fmt amd_iommu_pgtable = AMD_IOMMU_V1;
 
@@ -1986,8 +1987,13 @@ static int __init amd_iommu_init_pci(void)
 	for_each_iommu(iommu)
 		iommu_flush_all_caches(iommu);
 
-	if (!ret)
+	if (!ret) {
+		if (check_feature_on_all_iommus(FEATURE_HASUP) &&
+		    check_feature_on_all_iommus(FEATURE_HDSUP))
+			amd_iommu_had_support = true;
+
 		print_iommu_info();
+	}
 
 out:
 	return ret;
diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 6608d1717574..8325ef193093 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -478,6 +478,61 @@ static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned lo
 	return (__pte & ~offset_mask) | (iova & offset_mask);
 }
 
+static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size)
+{
+	bool dirty = false;
+	int i, count;
+
+	/*
+	 * 2.2.3.2 Host Dirty Support
+	 * When a non-default page size is used , software must OR the
+	 * Dirty bits in all of the replicated host PTEs used to map
+	 * the page. The IOMMU does not guarantee the Dirty bits are
+	 * set in all of the replicated PTEs. Any portion of the page
+	 * may have been written even if the Dirty bit is set in only
+	 * one of the replicated PTEs.
+	 */
+	count = PAGE_SIZE_PTE_COUNT(size);
+	for (i = 0; i < count; i++)
+		if (test_and_clear_bit(IOMMU_PTE_HD_BIT,
+					(unsigned long *) &ptep[i]))
+			dirty = true;
+
+	return dirty;
+}
+
+static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
+					 unsigned long iova, size_t size,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
+	unsigned long end = iova + size - 1;
+
+	do {
+		unsigned long pgsize = 0;
+		u64 *ptep, pte;
+
+		ptep = fetch_pte(pgtable, iova, &pgsize);
+		if (ptep)
+			pte = READ_ONCE(*ptep);
+		if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
+			pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
+			iova += pgsize;
+			continue;
+		}
+
+		/*
+		 * Mark the whole IOVA range as dirty even if only one of
+		 * the replicated PTEs were marked dirty.
+		 */
+		if (pte_test_and_clear_dirty(ptep, pgsize))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
 /*
  * ----------------------------------------------------
  */
@@ -519,6 +574,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	pgtable->iop.ops.map          = iommu_v1_map_page;
 	pgtable->iop.ops.unmap        = iommu_v1_unmap_page;
 	pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
+	pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
 
 	return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a1ada7bff44e..0a86392b2367 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2169,6 +2169,81 @@ static bool amd_iommu_capable(enum iommu_cap cap)
 	return false;
 }
 
+static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					bool enable)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct iommu_dev_data *dev_data;
+	bool dom_flush = false;
+
+	if (!amd_iommu_had_support)
+		return -EOPNOTSUPP;
+
+	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+		struct amd_iommu *iommu;
+		u64 pte_root;
+
+		iommu = amd_iommu_rlookup_table[dev_data->devid];
+		pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
+
+		/* No change? */
+		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
+			continue;
+
+		pte_root = (enable ?
+			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
+
+		/* Flush device DTE */
+		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
+		device_flush_dte(dev_data);
+		dom_flush = true;
+	}
+
+	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
+	if (dom_flush) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&pdomain->lock, flags);
+		amd_iommu_domain_flush_tlb_pde(pdomain);
+		amd_iommu_domain_flush_complete(pdomain);
+		spin_unlock_irqrestore(&pdomain->lock, flags);
+	}
+
+	return 0;
+}
+
+static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct iommu_dev_data *dev_data;
+	u64 dte;
+
+	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+		dte = amd_iommu_dev_table[dev_data->devid].data[0];
+		if (!(dte & DTE_FLAG_HAD))
+			return false;
+	}
+
+	return true;
+}
+
+static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					  unsigned long iova, size_t size,
+					  struct iommu_dirty_bitmap *dirty)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
+
+	if (!amd_iommu_get_dirty_tracking(domain))
+		return -EOPNOTSUPP;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -ENODEV;
+
+	return ops->read_and_clear_dirty(ops, iova, size, dirty);
+}
+
+
 static void amd_iommu_get_resv_regions(struct device *dev,
 				       struct list_head *head)
 {
@@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
 		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
 		.iotlb_sync	= amd_iommu_iotlb_sync,
 		.free		= amd_iommu_domain_free,
+		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
+		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
 	}
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

The AMD implementation of unmap_read_dirty() is pretty simple, as it
mostly reuses the unmap code with the extra addition of marshalling the
dirty bit into the bitmap as it walks the to-be-unmapped IOPTEs.

Extra care is taken, though, to switch over to cmpxchg as opposed to a
non-serialized store to the PTE, and to test the dirty bit only on the
value that cmpxchg actually cleared to 0.
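
That is, instead of a plain "pte[i] = 0" (which could lose a Dirty
update the IOMMU performs concurrently), the clearing is done along
these lines (a sketch of the intent, not the literal hunk below):

	u64 old = READ_ONCE(*pte);

	/* retry until we are the ones that cleared the PTE */
	while (cmpxchg64(pte, old, 0ULL) != old)
		old = READ_ONCE(*pte);

	dirty = IOMMU_PTE_DIRTY(old);	/* dirty state of the value we cleared */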

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/amd/io_pgtable.c | 44 +++++++++++++++++++++++++++++-----
 drivers/iommu/amd/iommu.c      | 22 +++++++++++++++++
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 8325ef193093..1868c3b58e6d 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist)
 	free_sub_pt(pt, mode, freelist);
 }
 
+static bool free_pte_dirty(u64 *pte, u64 pteval)
+{
+	bool dirty = false;
+
+	while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
+		dirty = true;
+
+	return dirty;
+}
+
 /*
  * Generic mapping functions. It maps a physical address into a DMA
  * address space. It allocates the page table pages if necessary.
@@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova,
 	return ret;
 }
 
-static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
-				      unsigned long iova,
-				      size_t size,
-				      struct iommu_iotlb_gather *gather)
+static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+					   unsigned long iova,
+					   size_t size,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
 {
 	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
 	unsigned long long unmapped;
@@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
 	while (unmapped < size) {
 		pte = fetch_pte(pgtable, iova, &unmap_size);
 		if (pte) {
-			int i, count;
+			unsigned long i, count;
+			bool pte_dirty = false;
 
 			count = PAGE_SIZE_PTE_COUNT(unmap_size);
 			for (i = 0; i < count; i++)
-				pte[i] = 0ULL;
+				pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
+
+			if (unlikely(pte_dirty && dirty))
+				iommu_dirty_bitmap_record(dirty, iova, unmap_size);
 		}
 
 		iova = (iova & ~(unmap_size - 1)) + unmap_size;
@@ -461,6 +476,22 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
 	return unmapped;
 }
 
+static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+					 unsigned long iova,
+					 size_t size,
+					 struct iommu_iotlb_gather *gather)
+{
+	return __iommu_v1_unmap_page(ops, iova, size, gather, NULL);
+}
+
+static unsigned long iommu_v1_unmap_page_read_dirty(struct io_pgtable_ops *ops,
+				unsigned long iova, size_t size,
+				struct iommu_iotlb_gather *gather,
+				struct iommu_dirty_bitmap *dirty)
+{
+	return __iommu_v1_unmap_page(ops, iova, size, gather, dirty);
+}
+
 static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned long iova)
 {
 	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
@@ -575,6 +606,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	pgtable->iop.ops.unmap        = iommu_v1_unmap_page;
 	pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
 	pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
+	pgtable->iop.ops.unmap_read_dirty = iommu_v1_unmap_page_read_dirty;
 
 	return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 0a86392b2367..a8fcb6e9a684 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 	return r;
 }
 
+static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
+					 unsigned long iova, size_t page_size,
+					 struct iommu_iotlb_gather *gather,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct protection_domain *domain = to_pdomain(dom);
+	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+	size_t r;
+
+	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
+	    (domain->iop.mode == PAGE_MODE_NONE))
+		return 0;
+
+	r = (ops->unmap_read_dirty) ?
+		ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
+
+	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+	return r;
+}
+
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
 					  dma_addr_t iova)
 {
@@ -2370,6 +2391,7 @@ const struct iommu_ops amd_iommu_ops = {
 		.free		= amd_iommu_domain_free,
 		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
 		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
+		.unmap_read_dirty = amd_iommu_unmap_read_dirty,
 	}
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 11/19] iommu/amd: Print access/dirty bits if supported
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Print the feature, much like other kernel-supported features.

One can still probe its actual hw support via sysfs, regardless
of what the kernel does.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/amd/init.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 27f2cf61d0c6..c410d127eb58 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1936,6 +1936,10 @@ static void print_iommu_info(void)
 
 			if (iommu->features & FEATURE_GAM_VAPIC)
 				pr_cont(" GA_vAPIC");
+			if (iommu->features & FEATURE_HASUP)
+				pr_cont(" HASup");
+			if (iommu->features & FEATURE_HDSUP)
+				pr_cont(" HDSup");
 
 			pr_cont("\n");
 		}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 12/19] iommu/arm-smmu-v3: Add feature detection for HTTU
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

From: Jean-Philippe Brucker <jean-philippe@linaro.org>

If the SMMU supports it and the kernel was built with HTTU support,
probe for Hardware Translation Table Update (HTTU) support, which
essentially enables hardware updates of the access and dirty flags.

Probe and set the smmu::features bits for Hardware Dirty and Hardware
Access. This is in preparation for enabling it in the stage-1 format
context descriptors.
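
Illustrative only (the actual users come in follow-up patches of this
series): code that wants to rely on hardware dirty updates would then
gate on the new feature bit, e.g.:

	if (!(smmu->features & ARM_SMMU_FEAT_HD))
		return -EOPNOTSUPP;	/* no hardware Dirty (HTTU HD) support */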

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
[joaomart: Change commit message to reflect the underlying changes]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 ++++
 2 files changed, 37 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fd49282c03a3..14609ece4e33 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3424,6 +3424,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	return 0;
 }
 
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+	u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
+	u32 features = 0;
+
+	switch (FIELD_GET(IDR0_HTTU, reg)) {
+	case IDR0_HTTU_ACCESS_DIRTY:
+		features |= ARM_SMMU_FEAT_HD;
+		fallthrough;
+	case IDR0_HTTU_ACCESS:
+		features |= ARM_SMMU_FEAT_HA;
+	}
+
+	if (smmu->dev->of_node)
+		smmu->features |= features;
+	else if (features != fw_features)
+		/* ACPI IORT sets the HTTU bits */
+		dev_warn(smmu->dev,
+			 "IDR0.HTTU overridden by FW configuration (0x%x)\n",
+			 fw_features);
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
 	u32 reg;
@@ -3484,6 +3506,8 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 			smmu->features |= ARM_SMMU_FEAT_E2H;
 	}
 
+	arm_smmu_get_httu(smmu, reg);
+
 	/*
 	 * The coherency feature as set by FW is used in preference to the ID
 	 * register, but warn on mismatch.
@@ -3669,6 +3693,14 @@ static int arm_smmu_device_acpi_probe(struct platform_device *pdev,
 	if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
 		smmu->features |= ARM_SMMU_FEAT_COHERENCY;
 
+	switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+	case IDR0_HTTU_ACCESS_DIRTY:
+		smmu->features |= ARM_SMMU_FEAT_HD;
+		fallthrough;
+	case IDR0_HTTU_ACCESS:
+		smmu->features |= ARM_SMMU_FEAT_HA;
+	}
+
 	return 0;
 }
 #else
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cd48590ada30..1487a80fdf1b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,9 @@
 #define IDR0_ASID16			(1 << 12)
 #define IDR0_ATS			(1 << 10)
 #define IDR0_HYP			(1 << 9)
+#define IDR0_HTTU			GENMASK(7, 6)
+#define IDR0_HTTU_ACCESS		1
+#define IDR0_HTTU_ACCESS_DIRTY		2
 #define IDR0_COHACC			(1 << 4)
 #define IDR0_TTF			GENMASK(3, 2)
 #define IDR0_TTF_AARCH64		2
@@ -639,6 +642,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_BTM		(1 << 16)
 #define ARM_SMMU_FEAT_SVA		(1 << 17)
 #define ARM_SMMU_FEAT_E2H		(1 << 18)
+#define ARM_SMMU_FEAT_HA		(1 << 19)
+#define ARM_SMMU_FEAT_HD		(1 << 20)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

From: Kunkun Jiang <jiangkunkun@huawei.com>

Detect the BBML feature and, if the SMMU supports it, transfer the
BBMLx quirk to io-pgtable.

BBML1 still requires marking the PTE nT prior to performing a
translation table update, while BBML2 requires neither break-before-make
nor setting the PTE nT bit. Dirty tracking needs to clear the dirty bit,
so checking for BBML2 tells us whether that prerequisite is met. See the
SMMUv3.2 manual, sections "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)"
and "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)".

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[joaomart: massage commit message with the need to have BBML quirk
 and add the Quirk io-pgtable flags]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++++++
 include/linux/io-pgtable.h                  |  3 +++
 3 files changed, 28 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 14609ece4e33..4dba53bde2e3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2203,6 +2203,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_BBML1)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
+	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
+
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops)
 		return -ENOMEM;
@@ -3591,6 +3596,20 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	/* IDR3 */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+	switch (FIELD_GET(IDR3_BBML, reg)) {
+	case IDR3_BBML0:
+		break;
+	case IDR3_BBML1:
+		smmu->features |= ARM_SMMU_FEAT_BBML1;
+		break;
+	case IDR3_BBML2:
+		smmu->features |= ARM_SMMU_FEAT_BBML2;
+		break;
+	default:
+		dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+		return -ENXIO;
+	}
+
 	if (FIELD_GET(IDR3_RIL, reg))
 		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1487a80fdf1b..e15750be1d95 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -54,6 +54,10 @@
 #define IDR1_SIDSIZE			GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3			0xc
+#define IDR3_BBML			GENMASK(12, 11)
+#define IDR3_BBML0			0
+#define IDR3_BBML1			1
+#define IDR3_BBML2			2
 #define IDR3_RIL			(1 << 10)
 
 #define ARM_SMMU_IDR5			0x14
@@ -644,6 +648,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H		(1 << 18)
 #define ARM_SMMU_FEAT_HA		(1 << 19)
 #define ARM_SMMU_FEAT_HD		(1 << 20)
+#define ARM_SMMU_FEAT_BBML1		(1 << 21)
+#define ARM_SMMU_FEAT_BBML2		(1 << 22)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index c2ebfe037f5d..d7626ca67dbf 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -85,6 +85,9 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_ARM_MTK_EXT	BIT(3)
 	#define IO_PGTABLE_QUIRK_ARM_TTBR1	BIT(5)
 	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
+	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
+	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
+
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
-- 
2.17.2
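
A standalone sketch (not from the posted patch) of how the IDR3.BBML
field drives the quirk selection done in arm_smmu_domain_finalise();
the field layout and levels mirror the patch, while the helper and
mask names here are illustrative:

#include <stdint.h>
#include <stdio.h>

#define IDR3_BBML_SHIFT		11
#define IDR3_BBML_MASK		(0x3u << IDR3_BBML_SHIFT)	/* bits [12:11] */

#define QUIRK_BBML1		(1u << 7)
#define QUIRK_BBML2		(1u << 8)

/* Returns the io-pgtable quirk to request, or 0 when BBML is absent. */
static uint32_t bbml_quirk(uint32_t idr3)
{
	switch ((idr3 & IDR3_BBML_MASK) >> IDR3_BBML_SHIFT) {
	case 1:
		return QUIRK_BBML1;	/* nT bit still needed on updates */
	case 2:
		return QUIRK_BBML2;	/* live updates without break-before-make */
	default:
		return 0;
	}
}

int main(void)
{
	uint32_t idr3 = 2u << IDR3_BBML_SHIFT;	/* pretend the SMMU reports BBML2 */

	printf("quirk = %#x\n", bbml_quirk(idr3));
	return 0;
}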


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

The .read_and_clear_dirty() IOMMU domain op takes care of reading the
dirty bits (i.e. the PTE has DBM set and AP[2] clear) and marshalling
them into a bitmap at a given page granularity.

While reading the dirty bits we also set the PTE AP[2] bit, marking
the entry as writable-clean again.

Structure it in a way that the IOPTE walker is generic, passing a
function pointer that decides what to do on a per-PTE basis. This is
useful for a followup patch where we supply an io-pgtable op to enable
DBM when starting/stopping dirty tracking.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Co-developed-by: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 ++++++
 drivers/iommu/io-pgtable-arm.c              | 102 +++++++++++++++++++-
 2 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4dba53bde2e3..232057d20197 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2743,6 +2743,32 @@ static int arm_smmu_enable_nesting(struct iommu_domain *domain)
 	return ret;
 }
 
+static int arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
+					 unsigned long iova, size_t size,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	int ret;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
+		return -ENODEV;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+		return -EINVAL;
+
+	if (!ops || !ops->read_and_clear_dirty) {
+		pr_err_once("io-pgtable don't support dirty tracking\n");
+		return -ENODEV;
+	}
+
+	ret = ops->read_and_clear_dirty(ops, iova, size, dirty);
+
+	return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2871,6 +2897,7 @@ static struct iommu_ops arm_smmu_ops = {
 		.iova_to_phys		= arm_smmu_iova_to_phys,
 		.enable_nesting		= arm_smmu_enable_nesting,
 		.free			= arm_smmu_domain_free,
+		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
 	}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 94ff319ae8ac..3c99028d315a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -75,6 +75,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
@@ -84,7 +85,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
 					 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -93,6 +94,9 @@
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY_BIT	7
+#define ARM_LPAE_PTE_AP_WRITABLE	(ARM_LPAE_PTE_AP_RDONLY | \
+					 ARM_LPAE_PTE_DBM)
 #define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
 #define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
 
@@ -737,6 +741,101 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
 	return iopte_to_paddr(pte, data) | iova;
 }
 
+static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
+					   arm_lpae_iopte *ptep, void *opaque)
+{
+	struct iommu_dirty_bitmap *dirty = opaque;
+	arm_lpae_iopte pte;
+
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if (pte & ARM_LPAE_PTE_AP_WRITABLE)
+		return 0;
+
+	if (!(pte & ARM_LPAE_PTE_DBM))
+		return 0;
+
+	iommu_dirty_bitmap_record(dirty, iova, size);
+	set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
+	return 0;
+}
+
+static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
+				 unsigned long iova, size_t size,
+				 int lvl, arm_lpae_iopte *ptep,
+				 int (*fn)(unsigned long iova, size_t size,
+					   arm_lpae_iopte *pte, void *opaque),
+				 void *opaque)
+{
+	arm_lpae_iopte pte;
+	struct io_pgtable *iop = &data->iop;
+	size_t base, next_size;
+	int ret;
+
+	if (WARN_ON_ONCE(!fn))
+		return -EINVAL;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return -EINVAL;
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+		if (iopte_leaf(pte, lvl, iop->fmt))
+			return fn(iova, size, ptep, opaque);
+
+		/* Current level is table, traverse next level */
+		next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+		ptep = iopte_deref(pte, data);
+		for (base = 0; base < size; base += next_size) {
+			ret = __arm_lpae_iopte_walk(data, iova + base,
+						    next_size, lvl + 1, ptep,
+						    fn, opaque);
+			if (ret)
+				return ret;
+		}
+		return 0;
+	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
+		return fn(iova, size, ptep, opaque);
+	}
+
+	/* Keep on walkin */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_iopte_walk(data, iova, size, lvl + 1, ptep,
+				     fn, opaque);
+}
+
+static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
+					 unsigned long iova, size_t size,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return -EINVAL;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return -EINVAL;
+
+	if (data->iop.fmt != ARM_64_LPAE_S1 &&
+	    data->iop.fmt != ARM_32_LPAE_S1)
+		return -EINVAL;
+
+	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
+				     __arm_lpae_read_and_clear_dirty, dirty);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -817,6 +916,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.unmap		= arm_lpae_unmap,
 		.unmap_pages	= arm_lpae_unmap_pages,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
+		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
 	};
 
 	return data;
-- 
2.17.2
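
A standalone sketch (not from the posted patch) of the DBM/AP[2] state
that the walker above records: with DBM set, a clear AP[2] means the
device wrote through the page (dirty), and setting AP[2] puts it back
to writable-clean. Bit positions mirror ARM_LPAE_PTE_DBM and
ARM_LPAE_PTE_AP_RDONLY; the helper names are illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t iopte;

#define PTE_DBM		(1ULL << 51)	/* Dirty Bit Modifier */
#define PTE_AP_RDONLY	(1ULL << 7)	/* AP[2]: set = read-only / writable-clean */

static bool pte_is_dirty(iopte pte)
{
	return (pte & PTE_DBM) && !(pte & PTE_AP_RDONLY);
}

/* Report dirtiness, then flip the entry back to writable-clean. */
static bool read_and_clear_dirty(iopte *ptep)
{
	bool dirty = pte_is_dirty(*ptep);

	if (dirty)
		*ptep |= PTE_AP_RDONLY;
	return dirty;
}

int main(void)
{
	iopte pte = PTE_DBM;	/* DBM set, AP[2] clear: hardware marked it dirty */

	printf("first read:  %d\n", read_and_clear_dirty(&pte));	/* 1 */
	printf("second read: %d\n", read_and_clear_dirty(&pte));	/* 0 */
	return 0;
}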


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Similarly to .read_and_clear_dirty(), use the page table walker helper
functions and set the DBM|RDONLY bits, thus switching the IOPTE to
writable-clean.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 ++++++++++++
 drivers/iommu/io-pgtable-arm.c              | 52 +++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 232057d20197..1ca72fcca930 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2769,6 +2769,34 @@ static int arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
 	return ret;
 }
 
+static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
+				       unsigned long iova, size_t size,
+				       struct iommu_iotlb_gather *iotlb_gather,
+				       bool enabled)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	int ret;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
+		return -ENODEV;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+		return -EINVAL;
+
+	if (!ops || !ops->set_dirty_tracking) {
+		pr_err_once("io-pgtable don't support dirty tracking\n");
+		return -ENODEV;
+	}
+
+	ret = ops->set_dirty_tracking(ops, iova, size, enabled);
+	iommu_iotlb_gather_add_range(iotlb_gather, iova, size);
+
+	return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2898,6 +2926,7 @@ static struct iommu_ops arm_smmu_ops = {
 		.enable_nesting		= arm_smmu_enable_nesting,
 		.free			= arm_smmu_domain_free,
 		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
+		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
 	}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 3c99028d315a..361410aa836c 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -76,6 +76,7 @@
 #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
 #define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
+#define ARM_LPAE_PTE_DBM_BIT		51
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
@@ -836,6 +837,56 @@ static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
 				     __arm_lpae_read_and_clear_dirty, dirty);
 }
 
+static int __arm_lpae_set_dirty_modifier(unsigned long iova, size_t size,
+					 arm_lpae_iopte *ptep, void *opaque)
+{
+	bool enabled = *((bool *) opaque);
+	arm_lpae_iopte pte;
+
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_RDONLY)
+		return -EINVAL;
+
+	if (!(enabled ^ !(pte & ARM_LPAE_PTE_DBM)))
+		return 0;
+
+	pte = enabled ? pte | (ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY) :
+		pte & ~(ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY);
+
+	WRITE_ONCE(*ptep, pte);
+	return 0;
+}
+
+
+static int arm_lpae_set_dirty_tracking(struct io_pgtable_ops *ops,
+				       unsigned long iova, size_t size,
+				       bool enabled)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return -EINVAL;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return -EINVAL;
+
+	if (data->iop.fmt != ARM_64_LPAE_S1 &&
+	    data->iop.fmt != ARM_32_LPAE_S1)
+		return -EINVAL;
+
+	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
+				     __arm_lpae_set_dirty_modifier, &enabled);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -917,6 +968,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.unmap_pages	= arm_lpae_unmap_pages,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
+		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
 	};
 
 	return data;
-- 
2.17.2
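
A simplified standalone sketch (not from the posted patch) of the
per-PTE modifier: enabling tracking sets DBM|AP[2] so the entry starts
out writable-clean, disabling clears both. The real
__arm_lpae_set_dirty_modifier() additionally rejects genuinely
read-only entries and skips no-op transitions; the bit positions mirror
the patch and the helper name is illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t iopte;

#define PTE_DBM		(1ULL << 51)
#define PTE_AP_RDONLY	(1ULL << 7)

static void set_dirty_modifier(iopte *ptep, bool enable)
{
	if (enable)
		*ptep |= PTE_DBM | PTE_AP_RDONLY;	/* writable-clean; HW clears AP[2] on write */
	else
		*ptep &= ~(PTE_DBM | PTE_AP_RDONLY);	/* back to a plain writable entry */
}

int main(void)
{
	iopte pte = 0;	/* other attribute bits elided for the sketch */

	set_dirty_modifier(&pte, true);
	printf("enabled:  DBM=%d AP[2]=%d\n",
	       !!(pte & PTE_DBM), !!(pte & PTE_AP_RDONLY));
	set_dirty_modifier(&pte, false);
	printf("disabled: DBM=%d AP[2]=%d\n",
	       !!(pte & PTE_DBM), !!(pte & PTE_AP_RDONLY));
	return 0;
}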


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

From: Kunkun Jiang <jiangkunkun@huawei.com>

As nested mode is not upstreamed yet, we only aim to support dirty log
tracking for stage-1 with io-pgtable mapping (meaning SVA mappings are
not supported). If HTTU is supported, we enable the HA/HD bits in the
SMMU CD and transfer the ARM_HD quirk to io-pgtable.

We additionally filter out HD|HA if they are not supported. The CD.HD
bit is not particularly useful unless we toggle the DBM bit in the PTE
entries.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[joaomart:Convey HD|HA bits over to the context descriptor
 and update commit message]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
 include/linux/io-pgtable.h                  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1ca72fcca930..5f728f8f20a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
 		 * this substream's traffic
 		 */
 	} else { /* (1) and (2) */
+		struct arm_smmu_device *smmu = smmu_domain->smmu;
+		u64 tcr = cd->tcr;
+
 		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
 		cdptr[2] = 0;
 		cdptr[3] = cpu_to_le64(cd->mair);
 
+		if (!(smmu->features & ARM_SMMU_FEAT_HD))
+			tcr &= ~CTXDESC_CD_0_TCR_HD;
+		if (!(smmu->features & ARM_SMMU_FEAT_HA))
+			tcr &= ~CTXDESC_CD_0_TCR_HA;
+
 		/*
 		 * STE is live, and the SMMU might read dwords of this CD in any
 		 * order. Ensure that it observes valid values before reading
@@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
 			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
 			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
+			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
 			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
 	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
 
@@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_HD)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
 	if (smmu->features & ARM_SMMU_FEAT_BBML1)
 		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
 	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e15750be1d95..ff32242f2fdb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -292,6 +292,9 @@
 #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
 #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
 
+#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
+#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
+
 #define CTXDESC_CD_0_AA64		(1UL << 41)
 #define CTXDESC_CD_0_S			(1UL << 44)
 #define CTXDESC_CD_0_R			(1UL << 45)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index d7626ca67dbf..a11902ae9cf1 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -87,6 +87,7 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
 	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
 	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
+	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
 
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
-- 
2.17.2
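
A standalone sketch (not from the posted patch) of the CD.TCR filtering
above: HA/HD are requested unconditionally when building the context
descriptor and then masked off if the SMMU did not advertise HTTU. The
bit positions mirror CTXDESC_CD_0_TCR_HA/HD and ARM_SMMU_FEAT_HA/HD;
the function name is illustrative:

#include <stdint.h>
#include <stdio.h>

#define FEAT_HA		(1u << 19)
#define FEAT_HD		(1u << 20)
#define CD_0_TCR_HD	(1ULL << 42)
#define CD_0_TCR_HA	(1ULL << 43)

static uint64_t filter_httu(uint64_t tcr, uint32_t features)
{
	if (!(features & FEAT_HD))
		tcr &= ~CD_0_TCR_HD;
	if (!(features & FEAT_HA))
		tcr &= ~CD_0_TCR_HA;
	return tcr;
}

int main(void)
{
	uint64_t tcr = CD_0_TCR_HA | CD_0_TCR_HD;	/* requested unconditionally */

	/* SMMU only advertises HA: HD must not reach the CD. */
	printf("tcr = %#llx\n", (unsigned long long)filter_httu(tcr, FEAT_HA));
	return 0;
}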


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Mostly reuse the existing unmap code, with the extra addition of
marshalling dirty bits into a bitmap at a given page granularity. To
tackle the race with hardware dirty-bit updates, switch away from a
plain store to a cmpxchg() and, once it succeeds, check whether the
IOVA was dirtied.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++
 drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++----
 2 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5f728f8f20a2..d1fb757056cc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2499,6 +2499,22 @@ static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long io
 	return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
 }
 
+static size_t arm_smmu_unmap_pages_read_dirty(struct iommu_domain *domain,
+					      unsigned long iova, size_t pgsize,
+					      size_t pgcount,
+					      struct iommu_iotlb_gather *gather,
+					      struct iommu_dirty_bitmap *dirty)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+	if (!ops)
+		return 0;
+
+	return ops->unmap_pages_read_dirty(ops, iova, pgsize, pgcount,
+					   gather, dirty);
+}
+
 static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2938,6 +2954,7 @@ static struct iommu_ops arm_smmu_ops = {
 		.free			= arm_smmu_domain_free,
 		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
 		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
+		.unmap_pages_read_dirty	= arm_smmu_unmap_pages_read_dirty,
 	}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 361410aa836c..143ee7d73f88 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -259,10 +259,30 @@ static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cf
 		__arm_lpae_sync_pte(ptep, 1, cfg);
 }
 
+static bool __arm_lpae_clear_dirty_pte(arm_lpae_iopte *ptep,
+				       struct io_pgtable_cfg *cfg)
+{
+	arm_lpae_iopte tmp;
+	bool dirty = false;
+
+	do {
+		tmp = cmpxchg64(ptep, *ptep, 0);
+		if ((tmp & ARM_LPAE_PTE_DBM) &&
+		    !(tmp & ARM_LPAE_PTE_AP_RDONLY))
+			dirty = true;
+	} while (tmp);
+
+	if (!cfg->coherent_walk)
+		__arm_lpae_sync_pte(ptep, 1, cfg);
+
+	return dirty;
+}
+
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			       struct iommu_iotlb_gather *gather,
 			       unsigned long iova, size_t size, size_t pgcount,
-			       int lvl, arm_lpae_iopte *ptep);
+			       int lvl, arm_lpae_iopte *ptep,
+			       struct iommu_dirty_bitmap *dirty);
 
 static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 				phys_addr_t paddr, arm_lpae_iopte prot,
@@ -306,8 +326,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
 
 			tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+			/*
+			 * No need for dirty bitmap as arm_lpae_init_pte() is
+			 * only called from __arm_lpae_map()
+			 */
 			if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
-					     lvl, tblp) != sz) {
+					     lvl, tblp, NULL) != sz) {
 				WARN_ON(1);
 				return -EINVAL;
 			}
@@ -564,7 +589,8 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 				       struct iommu_iotlb_gather *gather,
 				       unsigned long iova, size_t size,
 				       arm_lpae_iopte blk_pte, int lvl,
-				       arm_lpae_iopte *ptep, size_t pgcount)
+				       arm_lpae_iopte *ptep, size_t pgcount,
+				       struct iommu_dirty_bitmap *dirty)
 {
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 	arm_lpae_iopte pte, *tablep;
@@ -617,13 +643,15 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 		return num_entries * size;
 	}
 
-	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl, tablep);
+	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
+				lvl, tablep, dirty);
 }
 
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			       struct iommu_iotlb_gather *gather,
 			       unsigned long iova, size_t size, size_t pgcount,
-			       int lvl, arm_lpae_iopte *ptep)
+			       int lvl, arm_lpae_iopte *ptep,
+			       struct iommu_dirty_bitmap *dirty)
 {
 	arm_lpae_iopte pte;
 	struct io_pgtable *iop = &data->iop;
@@ -649,7 +677,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			if (WARN_ON(!pte))
 				break;
 
-			__arm_lpae_clear_pte(ptep, &iop->cfg);
+			if (likely(!dirty))
+				__arm_lpae_clear_pte(ptep, &iop->cfg);
+			else if (__arm_lpae_clear_dirty_pte(ptep, &iop->cfg))
+				iommu_dirty_bitmap_record(dirty, iova, size);
+
 
 			if (!iopte_leaf(pte, lvl, iop->fmt)) {
 				/* Also flush any partial walks */
@@ -671,17 +703,20 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		 * minus the part we want to unmap
 		 */
 		return arm_lpae_split_blk_unmap(data, gather, iova, size, pte,
-						lvl + 1, ptep, pgcount);
+						lvl + 1, ptep, pgcount, dirty);
 	}
 
 	/* Keep on walkin' */
 	ptep = iopte_deref(pte, data);
-	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
+	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
+				lvl + 1, ptep, dirty);
 }
 
-static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
-				   size_t pgsize, size_t pgcount,
-				   struct iommu_iotlb_gather *gather)
+static size_t __arm_lpae_unmap_pages(struct io_pgtable_ops *ops,
+				     unsigned long iova,
+				     size_t pgsize, size_t pgcount,
+				     struct iommu_iotlb_gather *gather,
+				     struct iommu_dirty_bitmap *dirty)
 {
 	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
@@ -697,13 +732,29 @@ static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iov
 		return 0;
 
 	return __arm_lpae_unmap(data, gather, iova, pgsize, pgcount,
-				data->start_level, ptep);
+				data->start_level, ptep, dirty);
+}
+
+static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
+				   size_t pgsize, size_t pgcount,
+				   struct iommu_iotlb_gather *gather)
+{
+	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, NULL);
 }
 
 static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
 			     size_t size, struct iommu_iotlb_gather *gather)
 {
-	return arm_lpae_unmap_pages(ops, iova, size, 1, gather);
+	return __arm_lpae_unmap_pages(ops, iova, size, 1, gather, NULL);
+}
+
+static size_t arm_lpae_unmap_pages_read_dirty(struct io_pgtable_ops *ops,
+					      unsigned long iova,
+					      size_t pgsize, size_t pgcount,
+					      struct iommu_iotlb_gather *gather,
+					      struct iommu_dirty_bitmap *dirty)
+{
+	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, dirty);
 }
 
 static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
@@ -969,6 +1020,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
 		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
+		.unmap_pages_read_dirty     = arm_lpae_unmap_pages_read_dirty,
 	};
 
 	return data;
-- 
2.17.2
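
A standalone sketch (not from the posted patch) of the unmap-time dirty
readout: the entry is torn down atomically and the old value tells us
whether hardware dirtied it in the meantime. A C11 atomic exchange
stands in here for the kernel's cmpxchg64() retry loop; the bit
positions mirror the patch and the helper name is illustrative:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_DBM		(1ULL << 51)
#define PTE_AP_RDONLY	(1ULL << 7)

/* Zero the entry and report whether it was dirty (DBM set, AP[2] clear). */
static bool clear_pte_read_dirty(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_exchange(ptep, 0);

	return (old & PTE_DBM) && !(old & PTE_AP_RDONLY);
}

int main(void)
{
	_Atomic uint64_t pte = PTE_DBM;	/* dirtied by the device before unmap */

	printf("unmap saw dirty: %d\n", clear_pte_read_dirty(&pte));
	printf("unmap saw dirty: %d\n", clear_pte_read_dirty(&pte));	/* already zero */
	return 0;
}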


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

The IOMMU advertises Access/Dirty bits if the extended capability
DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
stage table, though, has no bit for advertising it unless referenced
via a scalable-mode PASID entry. Relevant Intel IOMMU SDM refs are
"3.6.2 Accessed, Extended Accessed, and Dirty Flags" for the first
stage table and "3.7.2 Accessed and Dirty Flags" for the second stage
table.

To enable it, scalable mode for the second-stage table is required, so
limit the use of the dirty bit to scalable mode and discard DMAR
domains configured for first stage. To use SSADS, we set bit 9 (SSADE)
in the scalable-mode PASID table entry. When doing so, flush all iommu
caches. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

The dirty bit of the second-stage PTE sits in the same position (bit 9).
The IOTLB caches some attributes, including dirtiness, when SSADE is
enabled, so we also need to flush the IOTLB to make sure the IOMMU
attempts to set the dirty bit again. The relevant manual chapter on
hardware translation is chapter 6, with special mention of:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
We probably shouldn't be as aggressive as flushing everything; this
needs checking against hardware (and the invalidation guidance) to
understand what exactly needs flushing.
---
 drivers/iommu/intel/iommu.c | 109 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c |  76 +++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |   7 +++
 include/linux/intel-iommu.h |  14 +++++
 4 files changed, 206 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ce33f85c72ab..92af43f27241 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 	}
 }
 
+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	unsigned long flags;
+	int ret = -EINVAL;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	if (list_empty(&dmar_domain->devices)) {
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		return ret;
+	}
+
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		if (!info->dev || (info->domain != dmar_domain))
+			continue;
+
+		/* Dirty tracking is second-stage level SM only */
+		if ((info->domain && domain_use_first_level(info->domain)) ||
+		    !ecap_slads(info->iommu->ecap) ||
+		    !sm_supported(info->iommu) || !intel_iommu_sm) {
+			ret = -EOPNOTSUPP;
+			continue;
+		}
+
+		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
+						     info->dev, PASID_RID2PASID,
+						     enable);
+		if (ret)
+			break;
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	/*
+	 * We need to flush the context TLB and IOTLB with any cached
+	 * translations to force incoming DMA requests to have their IOTLB
+	 * entries tagged with A/D bits
+	 */
+	intel_flush_iotlb_all(domain);
+	return ret;
+}
+
+static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		if (!info->dev || (info->domain != dmar_domain))
+			continue;
+
+		/* Dirty tracking is second-stage level SM only */
+		if ((info->domain && domain_use_first_level(info->domain)) ||
+		    !ecap_slads(info->iommu->ecap) ||
+		    !sm_supported(info->iommu) || !intel_iommu_sm) {
+			ret = -EOPNOTSUPP;
+			continue;
+		}
+
+		if (!intel_pasid_dirty_tracking_enabled(info->iommu, info->domain,
+						 info->dev, PASID_RID2PASID)) {
+			ret = -EINVAL;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	return ret;
+}
+
+static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long end = iova + size - 1;
+	unsigned long pgsize;
+	int ret;
+
+	ret = intel_iommu_get_dirty_tracking(domain);
+	if (ret)
+		return ret;
+
+	do {
+		struct dma_pte *pte;
+		int lvl = 0;
+
+		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl);
+		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
+		if (!pte || !dma_pte_present(pte)) {
+			iova += pgsize;
+			continue;
+		}
+
+		/* The PTE was dirtied by the device, record it in the bitmap */
+		if (dma_sl_pte_test_and_clear_dirty(pte))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.domain_alloc		= intel_iommu_domain_alloc,
@@ -5119,6 +5226,8 @@ const struct iommu_ops intel_iommu_ops = {
 		.iotlb_sync		= intel_iommu_tlb_sync,
 		.iova_to_phys		= intel_iommu_iova_to_phys,
 		.free			= intel_iommu_domain_free,
+		.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
+		.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
 	}
 };
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 10fb82ea467d..90c7e018bc5c 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -331,6 +331,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
 }
 
+static inline u64 pasid_get_bits(u64 *ptr)
+{
+	return READ_ONCE(*ptr);
+}
+
 /*
  * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
  * PASID entry.
@@ -389,6 +394,36 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe)
 	pasid_set_bits(&pe->val[0], 1 << 1, 0);
 }
 
+/*
+ * Enable second level A/D bits by setting the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_set_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
+}
+
+/*
+ * Disable second level A/D bits by clearing the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_clear_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 0);
+}
+
+/*
+ * Check whether second level A/D bits are enabled, i.e. whether the
+ * SLADE (Second Level Access Dirty Enable) field (Bit 9) of a
+ * scalable mode PASID entry is set.
+ */
+static inline bool pasid_get_ssade(struct pasid_entry *pe)
+{
+	return pasid_get_bits(&pe->val[0]) & (1 << 9);
+}
+
 /*
  * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a
  * scalable mode PASID entry.
@@ -725,6 +760,47 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	return 0;
 }
 
+/*
+ * Set up dirty tracking on a second only translation type.
+ */
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled)
+{
+	struct pasid_entry *pte;
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		return -ENODEV;
+	}
+
+	if (enabled)
+		pasid_set_ssade(pte);
+	else
+		pasid_clear_ssade(pte);
+	return 0;
+}
+
+/*
+ * Check if dirty tracking is enabled on a second only translation type.
+ */
+bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
+					struct dmar_domain *domain,
+					struct device *dev, u32 pasid)
+{
+	struct pasid_entry *pte;
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		return false;
+	}
+
+	return pasid_get_ssade(pte);
+}
+
 /*
  * Set up the scalable mode pasid entry for passthrough translation type.
  */
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index ab4408c824a5..3dab86017228 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -115,6 +115,13 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled);
+bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
+					struct dmar_domain *domain,
+					struct device *dev, u32 pasid);
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 5cfda90b2cca..1328d1805197 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -47,6 +47,9 @@
 #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
 #define DMA_FL_PTE_XD		BIT_ULL(63)
 
+#define DMA_SL_PTE_DIRTY_BIT	9
+#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
+
 #define ADDR_WIDTH_5LEVEL	(57)
 #define ADDR_WIDTH_4LEVEL	(48)
 
@@ -677,6 +680,17 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }
 
+static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
+{
+	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
+}
+
+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
+{
+	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
+				  (unsigned long *)&pte->val);
+}
+
 static inline bool dma_pte_superpage(struct dma_pte *pte)
 {
 	return (pte->val & DMA_PTE_LARGE_PAGE);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 19/19] iommu/intel: Add unmap_read_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Similar to the other IOMMUs, base unmap_read_dirty() on how unmap()
works, with the exception of a non-racy clear of the PTE so that it
can return whether the entry was dirty or not.
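
The requirement boils down to an atomic swap that zaps the PTE and
reports what was previously there; a minimal sketch (the patch below
spells this out with a cmpxchg64() loop instead):

static inline bool example_clear_pte_report_dirty(struct dma_pte *pte)
{
	/* Atomically zero the PTE and test the dirty bit of the old value */
	return xchg(&pte->val, 0ULL) & DMA_SL_PTE_DIRTY;
}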

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/intel/iommu.c | 43 ++++++++++++++++++++++++++++---------
 include/linux/intel-iommu.h | 16 ++++++++++++++
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 92af43f27241..e80e98f5202b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1317,7 +1317,8 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				struct dma_pte *pte, unsigned long pfn,
 				unsigned long start_pfn, unsigned long last_pfn,
-				struct list_head *freelist)
+				struct list_head *freelist,
+				struct iommu_dirty_bitmap *dirty)
 {
 	struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1338,7 +1339,11 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 			if (level > 1 && !dma_pte_superpage(pte))
 				dma_pte_list_pagetables(domain, level - 1, pte, freelist);
 
-			dma_clear_pte(pte);
+			if (dma_clear_pte_dirty(pte) && dirty)
+				iommu_dirty_bitmap_record(dirty,
+					pfn << VTD_PAGE_SHIFT,
+					level_size(level) << VTD_PAGE_SHIFT);
+
 			if (!first_pte)
 				first_pte = pte;
 			last_pte = pte;
@@ -1347,7 +1352,7 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 			dma_pte_clear_level(domain, level - 1,
 					    phys_to_virt(dma_pte_addr(pte)),
 					    level_pfn, start_pfn, last_pfn,
-					    freelist);
+					    freelist, dirty);
 		}
 next:
 		pfn = level_pfn + level_size(level);
@@ -1362,7 +1367,8 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
    the page tables, and may have cached the intermediate levels. The
    pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-			 unsigned long last_pfn, struct list_head *freelist)
+			 unsigned long last_pfn, struct list_head *freelist,
+			 struct iommu_dirty_bitmap *dirty)
 {
 	BUG_ON(!domain_pfn_supported(domain, start_pfn));
 	BUG_ON(!domain_pfn_supported(domain, last_pfn));
@@ -1370,7 +1376,8 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+			    domain->pgd, 0, start_pfn, last_pfn, freelist,
+			    dirty);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2031,7 +2038,8 @@ static void domain_exit(struct dmar_domain *domain)
 	if (domain->pgd) {
 		LIST_HEAD(freelist);
 
-		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+			     NULL);
 		put_pages_list(&freelist);
 	}
 
@@ -4125,7 +4133,8 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
 			struct intel_iommu *iommu;
 			LIST_HEAD(freelist);
 
-			domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);
+			domain_unmap(si_domain, start_vpfn, last_vpfn,
+				     &freelist, NULL);
 
 			rcu_read_lock();
 			for_each_active_iommu(iommu, drhd)
@@ -4737,7 +4746,8 @@ static int intel_iommu_map_pages(struct iommu_domain *domain,
 
 static size_t intel_iommu_unmap(struct iommu_domain *domain,
 				unsigned long iova, size_t size,
-				struct iommu_iotlb_gather *gather)
+				struct iommu_iotlb_gather *gather,
+				struct iommu_dirty_bitmap *dirty)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	unsigned long start_pfn, last_pfn;
@@ -4753,7 +4763,7 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	start_pfn = iova >> VTD_PAGE_SHIFT;
 	last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
 
-	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist);
+	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist, dirty);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
@@ -4771,7 +4781,19 @@ static size_t intel_iommu_unmap_pages(struct iommu_domain *domain,
 	unsigned long pgshift = __ffs(pgsize);
 	size_t size = pgcount << pgshift;
 
-	return intel_iommu_unmap(domain, iova, size, gather);
+	return intel_iommu_unmap(domain, iova, size, gather, NULL);
+}
+
+static size_t intel_iommu_unmap_read_dirty(struct iommu_domain *domain,
+					   unsigned long iova,
+					   size_t pgsize, size_t pgcount,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
+{
+	unsigned long pgshift = __ffs(pgsize);
+	size_t size = pgcount << pgshift;
+
+	return intel_iommu_unmap(domain, iova, size, gather, dirty);
 }
 
 static void intel_iommu_tlb_sync(struct iommu_domain *domain,
@@ -5228,6 +5250,7 @@ const struct iommu_ops intel_iommu_ops = {
 		.free			= intel_iommu_domain_free,
 		.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
 		.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
+		.unmap_pages_read_dirty = intel_iommu_unmap_read_dirty,
 	}
 };
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 1328d1805197..c7f0801ccba6 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -664,6 +664,22 @@ static inline void dma_clear_pte(struct dma_pte *pte)
 	pte->val = 0;
 }
 
+static inline bool dma_clear_pte_dirty(struct dma_pte *pte)
+{
+	bool dirty = false;
+	u64 val;
+
+	val = READ_ONCE(pte->val);
+
+	do {
+		val = cmpxchg64(&pte->val, val, 0);
+		if (val & DMA_SL_PTE_DIRTY)
+			dirty = true;
+	} while (val);
+
+	return dirty;
+}
+
 static inline u64 dma_pte_addr(struct dma_pte *pte)
 {
 #ifdef CONFIG_64BIT
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-29  5:45   ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  5:45 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Presented herewith is a series that extends IOMMUFD to have IOMMU
> hardware support for dirty bit in the IOPTEs.
> 
> Today, AMD Milan (which been out for a year now) supports it while ARM
> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
> The intended use-case is to support Live Migration with SR-IOV, with

this should not be restricted to SR-IOV.

> IOMMUs
> that support it. Yishai Hadas will be soon submiting an RFC that covers the
> PCI device dirty tracker via vfio.
> 
> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
> set of a IOVA ranges mapped to some physical memory composing an IO
> pagetable. This is then attached to a particular device, consequently
> creating the protection domain to share a common IO page table
> representing the endporint DMA-addressable guest address space.
> (Hopefully I am not twisting the terminology here) The resultant object

Just remove VMM/guest/... since iommufd is not specific to virtualization. 

> is an hw_pagetable object which represents the iommu_domain
> object that will be directly manipulated. For more background on
> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
> kAPI is then extended to provide:
> 
>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
> as the most common case of changing hardware protection domain control

didn't get what 'most common case' here tries to explain

> bits, and ARM specific case of having to enable the per-PTE DBM control
> bit. The 'real' tracking of whether dirty tracking is enabled or not is
> stored in the vendor IOMMU, hence no new fields are added to iommufd
> pagetable structures.
> 
>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
> thus describe the IOVAs that got written by the device. While performing
> the marshalling also vendors need to clear the dirty bits from IOPTE and

s/vendors/iommu drivers/ 

> allow the kAPI caller to batch the much needed IOTLB flush.
> There's no copy of bitmaps to userspace backed memory, all is zerocopy
> based. So far this is a test-and-clear kind of interface given that the
> IOPT walk is going to be expensive. It occured to me to separate
> the readout of dirty, and the clearing of dirty from IOPTEs.
> I haven't opted for that one, given that it would mean two lenghty IOPTE
> walks and felt counter-performant.

me too. that doesn't feel like a performant way.

> 
>  3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific for non-nested vIOMMU case where an
> erronous guest (or device) DMAing to an address being unmapped at the
> same time.

An erroneous attempt like the above cannot anticipate which DMAs will
succeed in that window, so the end behavior is undefined. For an
undefined behavior nothing will be broken by losing some bits dirtied
in the window between reading back dirty bits of the range and
actually calling unmap. From guest p.o.v. all those are black-box
hardware logic to serve a virtual iotlb invalidation request which just
cannot be completed in one cycle.

Hence in reality this is probably not required except to meet the vfio
compat requirement. Conceptually, though, returning dirty bits at unmap
is more accurate.

I'm slightly inclined to abandon it in iommufd uAPI.
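
To make that alternative concrete, it amounts to calling the two
primitives back to back; a sketch with illustrative naming and no
error handling:

static size_t unmap_after_read_dirty(struct iommu_domain *domain,
				     unsigned long iova, size_t pgsize,
				     size_t pgcount,
				     struct iommu_iotlb_gather *gather,
				     struct iommu_dirty_bitmap *dirty)
{
	/* Harvest dirty state for the range first (best effort)... */
	domain->ops->read_and_clear_dirty(domain, iova, pgsize * pgcount,
					  dirty);

	/* ...then tear the mappings down through the regular unmap path */
	return domain->ops->unmap_pages(domain, iova, pgsize, pgcount, gather);
}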

> 
> [See at the end too, on general remarks, specifically the one regarding
>  probing dirty tracking via a dedicated iommufd cap ioctl]
> 
> The series is organized as follows:
> 
> * Patches 1-3: Takes care of the iommu domain operations to be added and
> extends iommufd io-pagetable to set/clear dirty tracking, as well as
> reading the dirty bits from the vendor pagetables. The idea is to abstract
> iommu vendors from any idea of how bitmaps are stored or propagated
> back to
> the caller, as well as allowing control/batching over IOTLB flush. So
> there's a data structure and an helper that only tells the upper layer that
> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking

why do we need another pinning here? any page mapped in iommu page
table is supposed to have been pinned already...

> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
> just has
> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
> vendor IOMMU implementor.
> 
> * Patches 4-5: Adds the new unmap domain op that returns whether the
> IOVA
> got dirtied. I separated this from the rest of the set, as I am still
> questioning the need for this API and whether this race needs to be
> fundamentally be handled. I guess the thinking is that live-migration
> should be guest foolproof, but how much the race happens in pratice to
> deem this as a necessary unmap variant. Perhaps maybe it might be enough
> fetching dirty bits prior to the unmap? Feedback appreciated.

I think so as aforementioned.

> 
> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
> We should discuss whether to include the vfio-compat or not. Given how
> vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
> with the IOMMU hw support. I haven't implemented the perpectual dirtying
> given his lack of usefullness over an IOMMU-backed implementation (or so
> I think). The selftests, test mainly the principal workflow, still needs
> to get added more corner cases.

Or, put another way, could we keep vfio-compat as type1 does today, i.e.
restricting iommu dirty tracking to the iommufd native uAPI only?

> 
> Note: Given that there's no capability for new APIs, or page sizes or
> etc, the userspace app using IOMMUFD native API would gather -
> EOPNOTSUPP
> when dirty tracking is not supported by the IOMMU hardware.
> 
> For completeness and most importantly to make sure the new IOMMU core
> ops
> capture the hardware blocks, all the IOMMUs that will eventually get IOMMU
> A/D
> support were implemented. So the next half of the series presents *proof of
> concept* implementations for IOMMUs:
> 
> * Patches 9-11: AMD IOMMU implementation, particularly on those having
> HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
> and also on a AMD Milan server IOMMU.
> 
> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
> to do the dynamic set/clear dirty tracking, and immplicitly clearing
> dirty bits on the readout. Given the lack of hardware and difficulty
> to get this in an emulated SMMUv3 (given the dependency on the PE HTTU
> and BBML2, IIUC) then this is only compiled tested. Hopefully I am not
> getting the attribution wrong.
> 
> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
> based intel-iommu with SSADS/SLADS emulation support.
> 
> To help testing/prototypization, qemu iommu emulation bits were written
> to increase coverage of this code and hopefully make this more broadly
> available for fellow contributors/devs. A separate series is submitted right
> after this covering the Qemu IOMMUFD extensions for dirty tracking,
> alongside
> its x86 iommus emulation A/D bits. Meanwhile it's also on github
> (https://github.com/jpemartins/qemu/commits/iommufd)
> 
> Remarks / Observations:
> 
> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks

there was discussion about adding a device capability uAPI somewhere.

> what has access in each of the newly added ops. Initially I was thinking to
> have a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather
> than
> bailing out with EOPNOTSUP) as well as an get_dirty_tracking
> iommu-core API. On the UAPI, perhaps it might be better to have a single API
> for capabilities in general (similar to KVM)  and at the simplest is a subop
> where the necessary info is conveyed on a per-subop basis?

Probably this can be reported as a device cap, as support for the dirty
bit is an immutable property of the iommu serving that device. Userspace
can enable dirty tracking on a hwpt if all attached devices claim the
support and the kernel will do the same verification.

btw do we still want to keep vfio type1 behavior as the fallback i.e. mark
all pinned pages as dirty when iommu dirty support is missing? From uAPI
naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
implementation like vfio type1.
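
Something along these lines, where the structure layout and the
capability flag are placeholders rather than existing code:

static bool example_hwpt_supports_dirty(struct iommufd_hw_pagetable *hwpt)
{
	struct iommufd_device *idev;

	/* every attached device must sit behind an IOMMU with the cap */
	list_for_each_entry(idev, &hwpt->devices, devices_item)
		if (!device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING))
			return false;

	return true;
}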

> 
> * The UAPI/kAPI could be generalized over the next iteration to also cover
> Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
> It wasn't done, as I was not aware of a use-case. I am wondering
> if the access-bits could be used to do some form of zero page detection
> (to just send the pages that got touched), although dirty-bits could be
> used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE

I'm not a fan of adding support for uncertain usages. Compared to this
I'd give higher priority to large page break-down, as without it it's
hard to find a real-world deployment for this work. 😊

> walk and marshalling into bitmaps as well as the necessary IOTLB flush
> batching are all the same. The focus is on dirty bit given that the
> dirtyness IOVA feedback is used to select the pages that need to be
> transfered
> to the destination while migration is happening.
> Sidebar: Sadly, there's a lot less clever possible tricks that can be
> done (compared to the CPU/KVM) without having the PCI device cooperate
> (like
> userfaultfd, wrprotect, etc as those would turn into nepharious IOMMU
> perm faults and devices with DMA target aborts).
> If folks thing the UAPI/iommu-kAPI should be agnostic to any PTE A/D
> bits, we can instead have the ioctls be named after
> HWPT_SET_TRACKING() and add another argument which asks which bits to
> enabling tracking
> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
> Likewise for the read_and_clear() as all PTE bits follow the same logic
> as dirty. Happy to readjust if folks think it is worthwhile.
> 
> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
> only care about the first stage of IOMMU pagetables for hypervisors i.e.
> tracking dirty GPAs (and not caring about dirty GIOVAs).

Hypervisor uses second-stage while guest manages first-stage in nesting.

> 
> * Dirty bit tracking only, is not enough. Large IO pages tend to be the norm
> when DMA mapping large ranges of IOVA space, when really the VMM wants
> the
> smallest granularity possible to track(i.e. host base pages). A separate bit
> of work will need to take care demoting IOPTE page sizes at guest-runtime to
> increase/decrease the dirty tracking granularity, likely under the form of a
> IOAS demote/promote page-size within a previously mapped IOVA range.
> 
> Feedback is very much appreciated!

Thanks for the work!

> 
> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
> iommufd_jgg@nvidia.com/
> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-
> zhukeqian1@huawei.com/
> 
> 	Joao
> 
> TODOs:
> * More selftests for large/small iopte sizes;
> * Better vIOMMU+VFIO testing (AMD doesn't support it);
> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
> * Testing with a live migrateable VF;
> 
> Jean-Philippe Brucker (1):
>   iommu/arm-smmu-v3: Add feature detection for HTTU
> 
> Joao Martins (16):
>   iommu: Add iommu_domain ops for dirty tracking
>   iommufd: Dirty tracking for io_pagetable
>   iommufd: Dirty tracking data support
>   iommu: Add an unmap API that returns dirtied IOPTEs
>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>   iommufd: Add a test for dirty tracking ioctls
>   iommu/amd: Access/Dirty bit support in IOPTEs
>   iommu/amd: Add unmap_read_dirty() support
>   iommu/amd: Print access/dirty bits if supported
>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>   iommu/intel: Access/Dirty bit support for SL domains
>   iommu/intel: Add unmap_read_dirty() support
> 
> Kunkun Jiang (2):
>   iommu/arm-smmu-v3: Add feature detection for BBML
>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
> 
>  drivers/iommu/amd/amd_iommu.h               |   1 +
>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>  drivers/iommu/amd/init.c                    |  12 +-
>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>  drivers/iommu/intel/pasid.h                 |   7 +
>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>  drivers/iommu/iommu.c                       |  71 +++++-
>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>  drivers/iommu/iommufd/main.c                |   9 +
>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>  include/linux/intel-iommu.h                 |  30 +++
>  include/linux/io-pgtable.h                  |  20 ++
>  include/linux/iommu.h                       |  64 +++++
>  include/uapi/linux/iommufd.h                |  78 ++++++
>  tools/testing/selftests/iommu/Makefile      |   1 +
>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>  28 files changed, 2047 insertions(+), 75 deletions(-)
> 
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
@ 2022-04-29  5:45   ` Tian, Kevin
  0 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  5:45 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, Martins, Joao,
	David Woodhouse, Robin Murphy

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Presented herewith is a series that extends IOMMUFD to have IOMMU
> hardware support for dirty bit in the IOPTEs.
> 
> Today, AMD Milan (which been out for a year now) supports it while ARM
> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
> The intended use-case is to support Live Migration with SR-IOV, with

this should not be restricted to SR-IOV.

> IOMMUs
> that support it. Yishai Hadas will be soon submiting an RFC that covers the
> PCI device dirty tracker via vfio.
> 
> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
> set of a IOVA ranges mapped to some physical memory composing an IO
> pagetable. This is then attached to a particular device, consequently
> creating the protection domain to share a common IO page table
> representing the endporint DMA-addressable guest address space.
> (Hopefully I am not twisting the terminology here) The resultant object

Just remove VMM/guest/... since iommufd is not specific to virtualization. 

> is an hw_pagetable object which represents the iommu_domain
> object that will be directly manipulated. For more background on
> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
> kAPI is then extended to provide:
> 
>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
> as the most common case of changing hardware protection domain control

didn't get what 'most common case' here tries to explain

> bits, and ARM specific case of having to enable the per-PTE DBM control
> bit. The 'real' tracking of whether dirty tracking is enabled or not is
> stored in the vendor IOMMU, hence no new fields are added to iommufd
> pagetable structures.
> 
>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
> thus describe the IOVAs that got written by the device. While performing
> the marshalling also vendors need to clear the dirty bits from IOPTE and

s/vendors/iommu drivers/ 

> allow the kAPI caller to batch the much needed IOTLB flush.
> There's no copy of bitmaps to userspace backed memory, all is zerocopy
> based. So far this is a test-and-clear kind of interface given that the
> IOPT walk is going to be expensive. It occured to me to separate
> the readout of dirty, and the clearing of dirty from IOPTEs.
> I haven't opted for that one, given that it would mean two lenghty IOPTE
> walks and felt counter-performant.

me too. that doesn't feel like a performant way.

> 
>  3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific for non-nested vIOMMU case where an
> erronous guest (or device) DMAing to an address being unmapped at the
> same time.

An erroneous attempt like the above cannot anticipate which DMAs will
succeed in that window, so the end behavior is undefined. For an
undefined behavior nothing will be broken by losing some bits dirtied
in the window between reading back dirty bits of the range and
actually calling unmap. From guest p.o.v. all those are black-box
hardware logic to serve a virtual iotlb invalidation request which just
cannot be completed in one cycle.

Hence in reality this is probably not required except to meet the vfio
compat requirement. Conceptually, though, returning dirty bits at unmap
is more accurate.

I'm slightly inclined to abandon it in iommufd uAPI.

> 
> [See at the end too, on general remarks, specifically the one regarding
>  probing dirty tracking via a dedicated iommufd cap ioctl]
> 
> The series is organized as follows:
> 
> * Patches 1-3: Takes care of the iommu domain operations to be added and
> extends iommufd io-pagetable to set/clear dirty tracking, as well as
> reading the dirty bits from the vendor pagetables. The idea is to abstract
> iommu vendors from any idea of how bitmaps are stored or propagated
> back to
> the caller, as well as allowing control/batching over IOTLB flush. So
> there's a data structure and an helper that only tells the upper layer that
> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking

why do we need another pinning here? any page mapped in iommu page
table is supposed to have been pinned already...

> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
> just has
> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
> vendor IOMMU implementor.
> 
> * Patches 4-5: Adds the new unmap domain op that returns whether the
> IOVA
> got dirtied. I separated this from the rest of the set, as I am still
> questioning the need for this API and whether this race needs to be
> fundamentally be handled. I guess the thinking is that live-migration
> should be guest foolproof, but how much the race happens in pratice to
> deem this as a necessary unmap variant. Perhaps maybe it might be enough
> fetching dirty bits prior to the unmap? Feedback appreciated.

I think so as aforementioned.

> 
> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
> We should discuss whether to include the vfio-compat or not. Given how
> vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
> with the IOMMU hw support. I haven't implemented the perpectual dirtying
> given his lack of usefullness over an IOMMU-backed implementation (or so
> I think). The selftests, test mainly the principal workflow, still needs
> to get added more corner cases.

Or, put another way, could we keep vfio-compat as type1 does today, i.e.
restricting iommu dirty tracking to the iommufd native uAPI only?

> 
> Note: Given that there's no capability for new APIs, or page sizes or
> etc, the userspace app using IOMMUFD native API would gather -
> EOPNOTSUPP
> when dirty tracking is not supported by the IOMMU hardware.
> 
> For completeness and most importantly to make sure the new IOMMU core
> ops
> capture the hardware blocks, all the IOMMUs that will eventually get IOMMU
> A/D
> support were implemented. So the next half of the series presents *proof of
> concept* implementations for IOMMUs:
> 
> * Patches 9-11: AMD IOMMU implementation, particularly on those having
> HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
> and also on a AMD Milan server IOMMU.
> 
> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
> to do the dynamic set/clear dirty tracking, and immplicitly clearing
> dirty bits on the readout. Given the lack of hardware and difficulty
> to get this in an emulated SMMUv3 (given the dependency on the PE HTTU
> and BBML2, IIUC) then this is only compiled tested. Hopefully I am not
> getting the attribution wrong.
> 
> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
> based intel-iommu with SSADS/SLADS emulation support.
> 
> To help testing/prototypization, qemu iommu emulation bits were written
> to increase coverage of this code and hopefully make this more broadly
> available for fellow contributors/devs. A separate series is submitted right
> after this covering the Qemu IOMMUFD extensions for dirty tracking,
> alongside
> its x86 iommus emulation A/D bits. Meanwhile it's also on github
> (https://github.com/jpemartins/qemu/commits/iommufd)
> 
> Remarks / Observations:
> 
> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks

there was discussion about adding a device capability uAPI somewhere.

> what it has access to in each of the newly added ops. Initially I was
> thinking to have a HWPT_GET_DIRTY to probe how dirty tracking is supported
> (rather than bailing out with EOPNOTSUPP) as well as a get_dirty_tracking
> iommu-core API. On the UAPI, perhaps it might be better to have a single
> API for capabilities in general (similar to KVM), at its simplest a subop
> where the necessary info is conveyed on a per-subop basis?

probably this can be reported as a device cap, as support for the dirty bit
is an immutable property of the iommu serving that device. Userspace can
enable dirty tracking on a hwpt if all attached devices claim the support,
and the kernel will do the same verification.
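
A minimal sketch of that verification at enable time (names are made up
here; the capability flag, the hwpt device list and the helper are all
assumptions, not something this series defines):

/*
 * Hypothetical sketch: only enable dirty tracking on a hw_pagetable when
 * every attached device's IOMMU claims support for it.
 * IOMMU_CAP_DIRTY_TRACKING and the hwpt->devices list are assumed names.
 */
static int iommufd_hwpt_enable_dirty(struct iommufd_hw_pagetable *hwpt)
{
	struct iommufd_device *idev;

	list_for_each_entry(idev, &hwpt->devices, devices_item) {
		if (!iommu_capable(idev->dev->bus, IOMMU_CAP_DIRTY_TRACKING))
			return -EOPNOTSUPP;
	}

	return iopt_set_dirty_tracking(&hwpt->ioas->iopt, hwpt->domain, true);
}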

btw do we still want to keep the vfio type1 behavior as the fallback, i.e.
mark all pinned pages as dirty when iommu dirty support is missing? From a
uAPI naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
implementation like vfio type1.

> 
> * The UAPI/kAPI could be generalized over the next iteration to also cover
> the Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
> It wasn't done, as I was not aware of a use-case. I am wondering
> if the access-bits could be used to do some form of zero page detection
> (to just send the pages that got touched), although dirty-bits could be
> used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE

I'm not a fan of adding support for uncertain usages. Compared to this
I'd give higher priority to large page break-down, as w/o it it's hard to
find real-world deployment of this work. 😊

> walk and marshalling into bitmaps, as well as the necessary IOTLB flush
> batching, are all the same. The focus is on the dirty bit given that the
> IOVA dirtiness feedback is used to select the pages that need to be
> transferred to the destination while migration is happening.
> Sidebar: Sadly, there are far fewer clever tricks that can be
> done (compared to the CPU/KVM) without having the PCI device cooperate
> (like userfaultfd, wrprotect, etc, as those would turn into nefarious
> IOMMU perm faults and devices with DMA target aborts).
> If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
> bits, we can instead have the ioctls be named after
> HWPT_SET_TRACKING() and add another argument which selects which bits to
> enable tracking for
> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
> Likewise for the read_and_clear(), as all PTE bits follow the same logic
> as dirty. Happy to readjust if folks think it is worthwhile.
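
For illustration, such a generalized ioctl could look roughly like the
sketch below; only the HWPT_SET_TRACKING() name and the IOMMUFD_* selectors
come from the paragraph above, the struct layout and field names are made
up:

/* Hypothetical uAPI sketch of the HWPT_SET_TRACKING() idea above. */
enum iommufd_tracking_bits {
	IOMMUFD_DIRTY		= 1 << 0,
	IOMMUFD_ACCESS		= 1 << 1,
	IOMMUFD_ACCESS_NONCPU	= 1 << 2,
};

struct iommu_hwpt_set_tracking {
	__u32 size;	/* sizeof(struct iommu_hwpt_set_tracking) */
	__u32 hwpt_id;	/* hw_pagetable object to operate on */
	__u32 tracking;	/* which PTE bits to track, IOMMUFD_* above */
	__u32 enable;	/* 1 = start tracking, 0 = stop */
};

The read_and_clear() side would presumably take the same 'tracking'
selector.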
> 
> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
> only care about the first stage of IOMMU pagetables for hypervisors i.e.
> tracking dirty GPAs (and not caring about dirty GIOVAs).

The hypervisor uses the second-stage while the guest manages the first-stage in nesting.

> 
> * Dirty bit tracking alone is not enough. Large IO pages tend to be the
> norm when DMA mapping large ranges of IOVA space, when really the VMM
> wants the smallest granularity possible to track (i.e. host base pages).
> A separate bit of work will need to take care of demoting IOPTE page sizes
> at guest runtime to increase/decrease the dirty tracking granularity,
> likely in the form of an IOAS demote/promote page-size operation within a
> previously mapped IOVA range.
> 
> Feedback is very much appreciated!

Thanks for the work!

> 
> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@huawei.com/
> 
> 	Joao
> 
> TODOs:
> * More selftests for large/small iopte sizes;
> * Better vIOMMU+VFIO testing (AMD doesn't support it);
> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
> * Testing with a live migrateable VF;
> 
> Jean-Philippe Brucker (1):
>   iommu/arm-smmu-v3: Add feature detection for HTTU
> 
> Joao Martins (16):
>   iommu: Add iommu_domain ops for dirty tracking
>   iommufd: Dirty tracking for io_pagetable
>   iommufd: Dirty tracking data support
>   iommu: Add an unmap API that returns dirtied IOPTEs
>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>   iommufd: Add a test for dirty tracking ioctls
>   iommu/amd: Access/Dirty bit support in IOPTEs
>   iommu/amd: Add unmap_read_dirty() support
>   iommu/amd: Print access/dirty bits if supported
>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>   iommu/intel: Access/Dirty bit support for SL domains
>   iommu/intel: Add unmap_read_dirty() support
> 
> Kunkun Jiang (2):
>   iommu/arm-smmu-v3: Add feature detection for BBML
>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
> 
>  drivers/iommu/amd/amd_iommu.h               |   1 +
>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>  drivers/iommu/amd/init.c                    |  12 +-
>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>  drivers/iommu/intel/pasid.h                 |   7 +
>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>  drivers/iommu/iommu.c                       |  71 +++++-
>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>  drivers/iommu/iommufd/main.c                |   9 +
>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>  include/linux/intel-iommu.h                 |  30 +++
>  include/linux/io-pgtable.h                  |  20 ++
>  include/linux/iommu.h                       |  64 +++++
>  include/uapi/linux/iommufd.h                |  78 ++++++
>  tools/testing/selftests/iommu/Makefile      |   1 +
>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>  28 files changed, 2047 insertions(+), 75 deletions(-)
> 
> --
> 2.17.2

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  7:54     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  7:54 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Add to iommu domain operations a set of callbacks to
> perform dirty tracking, particularly to start and stop
> tracking and finally to test and clear the dirty data.

to be consistent with other context, s/test/read/

> 
> Drivers are expected to dynamically change their hw protection
> domain bits to toggle the tracking and flush some form of

'hw protection domain bits' sounds a bit weird. what about
just using 'translation structures'?

> control state structure that stands in the IOVA translation
> path.
> 
> For reading and clearing dirty data, in all IOMMUs a transition
> of any of the PTE access bits (Access, Dirty) implies flushing
> the IOTLB to invalidate any stale IOTLB data about whether
> or not the IOMMU should update the said PTEs. The iommu core APIs
> introduce a new structure for storing the dirties, albeit vendor
> IOMMUs implementing .read_and_clear_dirty() just use

s/vendor IOMMUs/iommu drivers/

btw, according to past history on the iommu mailing list it sounds like
'vendor' is not a welcome term in the kernel, while there are
many occurrences of it in this series.

[...]
> Although, the ARM SMMUv3 case is a tad different from its x86
> counterparts. Rather than changing *only* the IOMMU domain device entry
> to enable dirty tracking (and having a dedicated bit for dirtiness in the
> IOPTE), ARM instead uses a dirty-bit modifier which is separately enabled,
> and changes the *existing* meaning of the access bits (for ro/rw), to the
> point that marking the access bit read-only with the dirty-bit modifier
> enabled doesn't trigger a perm IO page fault.
> 
> In practice this means that changing the iommu context isn't enough
> and is in fact mostly useless IIUC (and can always be enabled). Dirtying
> is only really enabled when the DBM pte bit is enabled (with the
> CD.HD bit as a prereq).
> 
> To capture this h/w construct an iommu core API is added which enables
> dirty tracking on an IOVA range rather than a device/context entry.
> iommufd picks one or the other, and the IOMMUFD core will favour the
> device-context op, falling back to the IOVA-range alternative.

Above doesn't convince me on the necessity of introducing two ops
here. Even for ARM it can accept a per-domain op and then walk the
page table to manipulate any modifier for existing mappings. It
doesn't matter whether it sets one bit in the context entry or multiple
bits in the page table.

[...]
> +

Missing comment for this function.

> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap
> *dirty,
> +				       unsigned long iova, unsigned long length)
> +{
> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> +
> +	nbits = max(1UL, length >> dirty->pgshift);
> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> +	start_offset = dirty->start_offset;

could you elaborate on the purpose of dirty->start_offset? Why doesn't
dirty->iova start at offset 0 of the bitmap?

> +
> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> +		bitmap_set(kaddr, offset, size);
> +		kunmap(dirty->pages[idx]);

what about the overhead of kmap/kunmap when it's done for every
dirtied page (as done in patch 18)?
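
One way the overhead could perhaps be reduced (a sketch only; the
cur_idx/cur_kaddr fields would be hypothetical additions to
iommu_dirty_bitmap, and kmap_local_page() is used instead of kmap()) is to
keep the current bitmap page mapped across consecutive records and only
remap when the page index changes:

/*
 * Sketch: cache the mapping of the current bitmap page so that
 * consecutive iommu_dirty_bitmap_record() calls hitting the same page
 * skip the map/unmap cycle. cur_idx/cur_kaddr are hypothetical fields.
 */
static unsigned long *dirty_bitmap_get_kaddr(struct iommu_dirty_bitmap *dirty,
					     unsigned long idx)
{
	if (dirty->cur_kaddr && dirty->cur_idx == idx)
		return dirty->cur_kaddr;

	if (dirty->cur_kaddr)
		kunmap_local(dirty->cur_kaddr);

	dirty->cur_kaddr = kmap_local_page(dirty->pages[idx]);
	dirty->cur_idx = idx;
	return dirty->cur_kaddr;
}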

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  8:07     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  8:07 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> +static int __set_dirty_tracking_range_locked(struct iommu_domain
> *domain,

I suppose anything using iommu_domain as the first argument should
be put in the iommu layer. Here it's more reasonable to use iopt
as the first argument, or simply merge this with the next function.

> +					     struct io_pagetable *iopt,
> +					     bool enable)
> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iopt_area *area;
> +	int ret = -EOPNOTSUPP;
> +	unsigned long iova;
> +	size_t size;
> +
> +	iommu_iotlb_gather_init(&gather);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {

how is this different from leaving the iommu driver to walk the page table
and poke the modifier bit for all present PTEs? As commented on the last
patch, this may allow removing the range op completely.

> +		iova = iopt_area_iova(area);
> +		size = iopt_area_last_iova(area) - iova;
> +
> +		if (ops->set_dirty_tracking_range) {
> +			ret = ops->set_dirty_tracking_range(domain, iova,
> +							    size, &gather,
> +							    enable);
> +			if (ret < 0)
> +				break;
> +		}
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +
> +	return ret;
> +}
> +
> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
> +				    struct io_pagetable *iopt, bool enable)

similarly rename to __iopt_set_dirty_tracking() and use iopt as the
leading argument.

> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	int ret = -EOPNOTSUPP;
> +
> +	if (ops->set_dirty_tracking)
> +		ret = ops->set_dirty_tracking(domain, enable);
> +	else if (ops->set_dirty_tracking_range)
> +		ret = __set_dirty_tracking_range_locked(domain, iopt,
> +							enable);
> +
> +	return ret;
> +}
> +
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	struct iommu_domain *dom;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (!domain) {
> +		down_write(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
> +			if (ret < 0)
> +				break;
> +		}
> +		up_write(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
> +	}
> +
> +	up_write(&iopt->iova_rwsem);
> +	return ret;
> +}
> +
>  struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long
> iova,
>  				  unsigned long *start_byte,
>  				  unsigned long length)
> diff --git a/drivers/iommu/iommufd/iommufd_private.h
> b/drivers/iommu/iommufd/iommufd_private.h
> index f55654278ac4..d00ef3b785c5 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt,
> unsigned long iova,
>  		    unsigned long length);
>  int iopt_unmap_all(struct io_pagetable *iopt);
> 
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable);
> +
>  int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>  		      unsigned long npages, struct page **out_pages, bool
> write);
>  void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  8:12     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  8:12 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
[...]
> +
> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +				      struct iommufd_dirty_data *bitmap)

At a glance this function and all the previous helpers don't rely on any
iommufd objects, except that the new structures are named
iommufd_xxx.

I wonder whether moving all of them to the iommu layer would make
more sense here.

> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iommufd_dirty_iter iter;
> +	int ret = 0;
> +
> +	if (!ops || !ops->read_and_clear_dirty)
> +		return -EOPNOTSUPP;
> +
> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
> +				__ffs(bitmap->page_size), &gather);
> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
> +	if (ret)
> +		return -ENOMEM;
> +
> +	for (; iommufd_dirty_iter_done(&iter);
> +	     iommufd_dirty_iter_advance(&iter)) {
> +		ret = iommufd_dirty_iter_get(&iter);
> +		if (ret)
> +			break;
> +
> +		ret = ops->read_and_clear_dirty(domain,
> +			iommufd_dirty_iova(&iter),
> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
> +
> +		iommufd_dirty_iter_put(&iter);
> +
> +		if (ret)
> +			break;
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +	iommufd_dirty_iter_free(&iter);
> +
> +	return ret;
> +}
> +

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  8:28     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  8:28 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Similar to .read_and_clear_dirty() use the page table
> walker helper functions and set the DBM|RDONLY bits, thus
> switching the IOPTE to writeable-clean.

this should not be a one-off if the operation needs to be
applied to the IOPTEs. Say a map request comes right after
set_dirty_tracking() is called. If it's agreed to remove
the range op then the smmu driver should record the tracking
status internally and then apply the modifier to all new
mappings automatically until dirty tracking is disabled.
Otherwise the same logic needs to be kept in iommufd to
call set_dirty_tracking_range() explicitly for every new
iopt_area created within the tracking window.
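
A rough sketch of the first option (the 'enable_dirty' field and the
IOMMU_DIRTY_TRACK prot modifier are hypothetical names; the io-pgtable map
callback is the existing one):

/*
 * Sketch: remember the tracking state in the SMMU domain and fold the
 * DBM/writeable-clean modifier into the prot of any mapping created
 * while tracking is enabled.
 */
static int arm_smmu_map_tracked(struct arm_smmu_domain *smmu_domain,
				unsigned long iova, phys_addr_t paddr,
				size_t size, int prot, gfp_t gfp)
{
	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;

	if (READ_ONCE(smmu_domain->enable_dirty))	/* hypothetical field */
		prot |= IOMMU_DIRTY_TRACK;		/* hypothetical flag */

	return ops->map(ops, iova, paddr, size, prot, gfp);
}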

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  9:03     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  9:03 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:10 AM
> 
> IOMMU advertises Access/Dirty bits if the extended capability
> DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
> stage table, though, has no bit for advertising it, unless referenced via

first-stage is compatible with the CPU page table thus a/d bit support is
implied. But for dirty tracking I'm fine with only supporting it
with the second-stage, as the first-stage will be used only for the guest
in the nesting case (though in concept the first-stage could also be used
for IOVA when nesting is disabled, there is no plan to do so on Intel
platforms).

> a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
> table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
> stage table "3.7.2 Accessed and Dirty Flags".
> 
> To enable it, scalable-mode for the second-stage table is required,
> so limit the use of the dirty-bit to scalable-mode and discard the
> first-stage configured DMAR domains. To use SSADS, we set a bit in

above is inaccurate. dirty bit is only supported in scalable mode so
there is no limit per se.

> the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When

"To use SSADS, we set bit 9 (SSADE) in the scalable-mode PASID table
entry"

> doing so, flush all iommu caches. Relevant SDM refs:
> 
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>  Table 23. Guidance to Software for Invalidations"
> 
> Dirty bit on the PTE is located in the same location (bit 9). The IOTLB

I'm not sure what information 'same location' here tries to convey...

> caches some attributes when SSADE is enabled and dirty-ness information,

be direct that the dirty bit is cached in IOTLB thus any change of that
bit requires flushing IOTLB

> so we also need to flush the IOTLB to make sure the IOMMU attempts to set
> the dirty bit again. The relevant manual coverage of the hardware
> translation is chapter 6, with special mention of:
> 
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Probably shouldn't be as aggressive as to flush all; needs
> checking with hardware (and the invalidations guidance) to understand
> what exactly needs flushing.

yes, definitely not required to flush all. You can follow table 23
for software guidance for invalidations.

> ---
>  drivers/iommu/intel/iommu.c | 109
> ++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel/pasid.c |  76 +++++++++++++++++++++++++
>  drivers/iommu/intel/pasid.h |   7 +++
>  include/linux/intel-iommu.h |  14 +++++
>  4 files changed, 206 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index ce33f85c72ab..92af43f27241 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct
> iommu_domain *domain,
>  	}
>  }
> 
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	int ret = -EINVAL;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	if (list_empty(&dmar_domain->devices)) {
> +		spin_unlock_irqrestore(&device_domain_lock, flags);
> +		return ret;
> +	}

or return success here and just don't set any dirty bitmap in
read_and_clear_dirty()?

btw I think every iommu driver needs to record the tracking status,
so that later, if a device which doesn't claim dirty tracking support is
attached to a domain which already has dirty_tracking enabled,
the attach request can be rejected, once the capability
uAPI is introduced.
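
Something along these lines, perhaps (a sketch; the 'dirty_tracking' flag
in dmar_domain is an assumed addition, ecap_slads() as used in the patch):

/*
 * Sketch of the attach-time check: refuse attaching a device whose IOMMU
 * lacks SSADS to a domain that already has dirty tracking enabled.
 */
static int domain_check_dirty_tracking(struct dmar_domain *dmar_domain,
				       struct intel_iommu *iommu)
{
	if (dmar_domain->dirty_tracking && !ecap_slads(iommu->ecap))
		return -EINVAL;

	return 0;
}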

> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		if (!info->dev || (info->domain != dmar_domain))
> +			continue;

why would there be a device linked under a dmar_domain but its
internal domain pointer doesn't point to that dmar_domain?

> +
> +		/* Dirty tracking is second-stage level SM only */
> +		if ((info->domain && domain_use_first_level(info->domain))
> ||
> +		    !ecap_slads(info->iommu->ecap) ||
> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {

sm_supported() already covers the check on intel_iommu_sm.

> +			ret = -EOPNOTSUPP;
> +			continue;
> +		}
> +
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info-
> >domain,
> +						     info->dev,
> PASID_RID2PASID,
> +						     enable);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	/*
> +	 * We need to flush context TLB and IOTLB with any cached
> translations
> +	 * to force the incoming DMA requests for have its IOTLB entries
> tagged
> +	 * with A/D bits
> +	 */
> +	intel_flush_iotlb_all(domain);
> +	return ret;
> +}
> +
> +static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		if (!info->dev || (info->domain != dmar_domain))
> +			continue;
> +
> +		/* Dirty tracking is second-stage level SM only */
> +		if ((info->domain && domain_use_first_level(info->domain))
> ||
> +		    !ecap_slads(info->iommu->ecap) ||
> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
> +			ret = -EOPNOTSUPP;
> +			continue;
> +		}
> +
> +		if (!intel_pasid_dirty_tracking_enabled(info->iommu, info-
> >domain,
> +						 info->dev, PASID_RID2PASID))
> {
> +			ret = -EINVAL;
> +			break;
> +		}
> +	}
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}

All of the above can be translated to a single status bit in dmar_domain.
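
e.g. something like the below, with 'dirty_tracking' being a hypothetical
new field in dmar_domain that set_dirty_tracking() sets/clears:

/* Sketch: with a domain-wide flag the per-device loop above goes away. */
static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);

	return READ_ONCE(dmar_domain->dirty_tracking) ? 0 : -EINVAL;
}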

> +
> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain
> *domain,
> +					    unsigned long iova, size_t size,
> +					    struct iommu_dirty_bitmap *dirty)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	unsigned long end = iova + size - 1;
> +	unsigned long pgsize;
> +	int ret;
> +
> +	ret = intel_iommu_get_dirty_tracking(domain);
> +	if (ret)
> +		return ret;
> +
> +	do {
> +		struct dma_pte *pte;
> +		int lvl = 0;
> +
> +		pte = pfn_to_dma_pte(dmar_domain, iova >>
> VTD_PAGE_SHIFT, &lvl);

it's probably fine as the starting point but moving forward this could
be further optimized so there is no need to walk from L4->L3->L2->L1
for every pte.

> +		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
> +		if (!pte || !dma_pte_present(pte)) {
> +			iova += pgsize;
> +			continue;
> +		}
> +
> +		/* It is writable, set the bitmap */
> +		if (dma_sl_pte_test_and_clear_dirty(pte))
> +			iommu_dirty_bitmap_record(dirty, iova, pgsize);
> +		iova += pgsize;
> +	} while (iova < end);
> +
> +	return 0;
> +}
> +
>  const struct iommu_ops intel_iommu_ops = {
>  	.capable		= intel_iommu_capable,
>  	.domain_alloc		= intel_iommu_domain_alloc,
> @@ -5119,6 +5226,8 @@ const struct iommu_ops intel_iommu_ops = {
>  		.iotlb_sync		= intel_iommu_tlb_sync,
>  		.iova_to_phys		= intel_iommu_iova_to_phys,
>  		.free			= intel_iommu_domain_free,
> +		.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
> +		.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
>  	}
>  };
> 
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 10fb82ea467d..90c7e018bc5c 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -331,6 +331,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask,
> u64 bits)
>  	WRITE_ONCE(*ptr, (old & ~mask) | bits);
>  }
> 
> +static inline u64 pasid_get_bits(u64 *ptr)
> +{
> +	return READ_ONCE(*ptr);
> +}
> +
>  /*
>   * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
>   * PASID entry.
> @@ -389,6 +394,36 @@ static inline void pasid_set_fault_enable(struct
> pasid_entry *pe)
>  	pasid_set_bits(&pe->val[0], 1 << 1, 0);
>  }
> 
> +/*
> + * Enable second level A/D bits by setting the SLADE (Second Level
> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry.
> + */
> +static inline void pasid_set_ssade(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
> +}
> +
> +/*
> + * Disable second level A/D bits by clearing the SLADE (Second Level
> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry.
> + */
> +static inline void pasid_clear_ssade(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 9, 0);
> +}
> +
> +/*
> + * Checks whether second level A/D bits are enabled, i.e. whether the
> + * SLADE (Second Level Access Dirty Enable) field (Bit 9) of a scalable
> + * mode PASID entry is set.
> + */
> +static inline bool pasid_get_ssade(struct pasid_entry *pe)
> +{
> +	return pasid_get_bits(&pe->val[0]) & (1 << 9);
> +}
> +
>  /*
>   * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a
>   * scalable mode PASID entry.
> @@ -725,6 +760,47 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
>  	return 0;
>  }
> 
> +/*
> + * Set up dirty tracking on a second only translation type.
> + */
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled)
> +{
> +	struct pasid_entry *pte;
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n",
> pasid);
> +		return -ENODEV;
> +	}
> +
> +	if (enabled)
> +		pasid_set_ssade(pte);
> +	else
> +		pasid_clear_ssade(pte);
> +	return 0;
> +}
> +
> +/*
> + * Check whether dirty tracking is enabled on a second-only translation type.
> + */
> +bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
> +					struct dmar_domain *domain,
> +					struct device *dev, u32 pasid)
> +{
> +	struct pasid_entry *pte;
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n",
> pasid);
> +		return false;
> +	}
> +
> +	return pasid_get_ssade(pte);
> +}
> +
>  /*
>   * Set up the scalable mode pasid entry for passthrough translation type.
>   */
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index ab4408c824a5..3dab86017228 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -115,6 +115,13 @@ int intel_pasid_setup_first_level(struct intel_iommu
> *iommu,
>  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, u32 pasid);
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled);
> +bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
> +					struct dmar_domain *domain,
> +					struct device *dev, u32 pasid);
>  int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, u32 pasid);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 5cfda90b2cca..1328d1805197 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -47,6 +47,9 @@
>  #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
>  #define DMA_FL_PTE_XD		BIT_ULL(63)
> 
> +#define DMA_SL_PTE_DIRTY_BIT	9
> +#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
> +
>  #define ADDR_WIDTH_5LEVEL	(57)
>  #define ADDR_WIDTH_4LEVEL	(48)
> 
> @@ -677,6 +680,17 @@ static inline bool dma_pte_present(struct dma_pte
> *pte)
>  	return (pte->val & 3) != 0;
>  }
> 
> +static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
> +{
> +	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
> +}
> +
> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
> +{
> +	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
> +				  (unsigned long *)&pte->val);
> +}
> +
>  static inline bool dma_pte_superpage(struct dma_pte *pte)
>  {
>  	return (pte->val & DMA_PTE_LARGE_PAGE);
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29  5:45   ` Tian, Kevin
@ 2022-04-29 10:27     ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 06:45, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Presented herewith is a series that extends IOMMUFD to have IOMMU
>> hardware support for dirty bit in the IOPTEs.
>>
>> Today, AMD Milan (which been out for a year now) supports it while ARM
>> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
>> The intended use-case is to support Live Migration with SR-IOV, with
> 
> this should not be restricted to SR-IOV.
> 
True. Should have written PCI Devices as that is orthogonal to SF/S-IOV/SR-IOV.

>> IOMMUs
>> that support it. Yishai Hadas will be soon submiting an RFC that covers the
>> PCI device dirty tracker via vfio.
>>
>> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
>> set of a IOVA ranges mapped to some physical memory composing an IO
>> pagetable. This is then attached to a particular device, consequently
>> creating the protection domain to share a common IO page table
>> representing the endporint DMA-addressable guest address space.
>> (Hopefully I am not twisting the terminology here) The resultant object
> 
> Just remove VMM/guest/... since iommufd is not specific to virtualization. 
> 
/me nods

>> is an hw_pagetable object which represents the iommu_domain
>> object that will be directly manipulated. For more background on
>> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
>> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
>> kAPI is then extended to provide:
>>
>>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
>> as the most common case of changing hardware protection domain control
> 
> didn't get what 'most common case' here tries to explain
> 
Most common case because, out of the three IOMMUs analyzed, two of them (Intel, AMD) require
changing per-device context bits rather than page table entries, which is what ARM changes.

>> bits, and ARM specific case of having to enable the per-PTE DBM control
>> bit. The 'real' tracking of whether dirty tracking is enabled or not is
>> stored in the vendor IOMMU, hence no new fields are added to iommufd
>> pagetable structures.
>>
>>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
>> thus describe the IOVAs that got written by the device. While performing
>> the marshalling also vendors need to clear the dirty bits from IOPTE and
> 
> s/vendors/iommu drivers/ 
> 
OK, I will avoid the `vendor` term going forward.

>> allow the kAPI caller to batch the much needed IOTLB flush.
>> There's no copy of bitmaps to userspace backed memory, all is zerocopy
>> based. So far this is a test-and-clear kind of interface given that the
>> IOPT walk is going to be expensive. It occured to me to separate
>> the readout of dirty, and the clearing of dirty from IOPTEs.
>> I haven't opted for that one, given that it would mean two lenghty IOPTE
>> walks and felt counter-performant.
> 
> me too. that doesn't feel like a performant way.
> 
>>
>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>> unmap. This case is specific for non-nested vIOMMU case where an
>> erronous guest (or device) DMAing to an address being unmapped at the
>> same time.
> 
> an erroneous attempt like above cannot anticipate which DMAs can
> succeed in that window thus the end behavior is undefined. For an
> undefined behavior nothing will be broken by losing some bits dirtied
> in the window between reading back dirty bits of the range and
> actually calling unmap. From guest p.o.v. all those are black-box
> hardware logic to serve a virtual iotlb invalidation request which just
> cannot be completed in one cycle.
> 
> Hence in reality probably this is not required except to meet vfio
> compat requirement. Just in concept returning dirty bits at unmap
> is more accurate.
> 
> I'm slightly inclined to abandon it in iommufd uAPI.
> 

OK, it seems I am not far off from your thoughts.

I'll see what others think too, and if so I'll remove the unmap_dirty.

Because if vfio-compat doesn't get the iommu hw dirty support, then there would
be no users of unmap_dirty.

>>
>> [See at the end too, on general remarks, specifically the one regarding
>>  probing dirty tracking via a dedicated iommufd cap ioctl]
>>
>> The series is organized as follows:
>>
>> * Patches 1-3: Takes care of the iommu domain operations to be added and
>> extends iommufd io-pagetable to set/clear dirty tracking, as well as
>> reading the dirty bits from the vendor pagetables. The idea is to abstract
>> iommu vendors from any idea of how bitmaps are stored or propagated
>> back to
>> the caller, as well as allowing control/batching over IOTLB flush. So
>> there's a data structure and an helper that only tells the upper layer that
>> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking
> 
> why do we need another pinning here? any page mapped in iommu page
> table is supposed to have been pinned already...
> 

The pinning is for the user bitmap data, not the IOVAs. This is mainly to avoid
copying the dirty bitmap data back to userspace. And this happens for every 2M of
bitmap data (i.e. representing 64G of IOVA space, with one page tracking 128M of
IOVA, assuming the worst case of base pages).

I think I can't just use/deref user memory bluntly, and the IOMMU core ought to
work with kernel buffers instead.
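
To put numbers on the coverage above, here is a tiny standalone sketch of the math
(assuming 4K base pages and one bit per base page; the program is only an illustration,
not code from this series):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t page_size = 4096;		/* base page */
	const uint64_t bits_per_page = page_size * 8;	/* 32768 bits */

	/* one pinned bitmap page: each bit covers one base page of IOVA */
	uint64_t iova_per_bitmap_page = bits_per_page * page_size;

	/* a 2M chunk of bitmap is 512 such pages */
	uint64_t iova_per_2m_bitmap =
		((2ULL << 20) / page_size) * iova_per_bitmap_page;

	printf("one bitmap page covers %llu MiB of IOVA\n",
	       (unsigned long long)(iova_per_bitmap_page >> 20));	/* 128 */
	printf("2M of bitmap covers %llu GiB of IOVA\n",
	       (unsigned long long)(iova_per_2m_bitmap >> 30));	/* 64 */
	return 0;
}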

>> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
>> just has
>> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
>> vendor IOMMU implementor.
>>
>> * Patches 4-5: Adds the new unmap domain op that returns whether the
>> IOVA
>> got dirtied. I separated this from the rest of the set, as I am still
>> questioning the need for this API and whether this race needs to be
>> fundamentally be handled. I guess the thinking is that live-migration
>> should be guest foolproof, but how much the race happens in pratice to
>> deem this as a necessary unmap variant. Perhaps maybe it might be enough
>> fetching dirty bits prior to the unmap? Feedback appreciated.
> 
> I think so as aforementioned.
> 
/me nods

>>
>> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
>> We should discuss whether to include the vfio-compat or not. Given how
>> vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
>> with the IOMMU hw support. I haven't implemented the perpectual dirtying
>> given his lack of usefullness over an IOMMU-backed implementation (or so
>> I think). The selftests, test mainly the principal workflow, still needs
>> to get added more corner cases.
> 
> Or in another way could we keep vfio-compat as type1 does today, i.e.
> restricting iommu dirty tacking only to iommufd native uAPI?
> 
I suppose?

Another option is not exposing the type1 migration capability.
See further below.

>>
>> Note: Given that there's no capability for new APIs, or page sizes or
>> etc, the userspace app using IOMMUFD native API would gather -
>> EOPNOTSUPP
>> when dirty tracking is not supported by the IOMMU hardware.
>>
>> For completeness and most importantly to make sure the new IOMMU core
>> ops
>> capture the hardware blocks, all the IOMMUs that will eventually get IOMMU
>> A/D
>> support were implemented. So the next half of the series presents *proof of
>> concept* implementations for IOMMUs:
>>
>> * Patches 9-11: AMD IOMMU implementation, particularly on those having
>> HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
>> and also on a AMD Milan server IOMMU.
>>
>> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
>> to do the dynamic set/clear dirty tracking, and immplicitly clearing
>> dirty bits on the readout. Given the lack of hardware and difficulty
>> to get this in an emulated SMMUv3 (given the dependency on the PE HTTU
>> and BBML2, IIUC) then this is only compiled tested. Hopefully I am not
>> getting the attribution wrong.
>>
>> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
>> based intel-iommu with SSADS/SLADS emulation support.
>>
>> To help testing/prototypization, qemu iommu emulation bits were written
>> to increase coverage of this code and hopefully make this more broadly
>> available for fellow contributors/devs. A separate series is submitted right
>> after this covering the Qemu IOMMUFD extensions for dirty tracking,
>> alongside
>> its x86 iommus emulation A/D bits. Meanwhile it's also on github
>> (https://github.com/jpemartins/qemu/commits/iommufd)
>>
>> Remarks / Observations:
>>
>> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks
> 
> there was discussion adding device capability uAPI somewhere.
> 
Ack. Let me know if there are pointers to that conversation, as I seem to have missed it.

>> what has access in each of the newly added ops. Initially I was thinking to
>> have a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather
>> than
>> bailing out with EOPNOTSUP) as well as an get_dirty_tracking
>> iommu-core API. On the UAPI, perhaps it might be better to have a single API
>> for capabilities in general (similar to KVM)  and at the simplest is a subop
>> where the necessary info is conveyed on a per-subop basis?
> 
> probably this can be reported as a device cap as supporting of dirty bit is
> an immutable property of the iommu serving that device. 

I wasn't quite sure how this mapped into the rest of the potential features to probe
in the iommufd grand scheme of things. I'll get it properly done for the next iteration. In
the kernel, I was wondering if this could be tracked in the iommu_domain, given that virtually
all supporting iommu drivers will need to track dirty-tracking status on a per-domain
basis. But that structure is devoid of any state :/ so I suppose each iommu driver tracks it
in its private structures (which part of me was trying to avoid).
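
To illustrate (purely a hypothetical sketch, not code from this series), the per-driver
tracking could be as small as one bit in the driver-private domain structure, e.g. on
the Intel side:

/*
 * Hypothetical: record dirty-tracking state in the driver-private domain,
 * since struct iommu_domain itself carries no such state.
 */
struct dmar_domain {
	/* ... existing fields ... */
	u8 dirty_tracking:1;	/* toggled by the set_dirty_tracking() op */
};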

> Userspace can
> enable dirty tracking on a hwpt if all attached devices claim the support
> and kernel will does the same verification.
> 
Sorry to be dense, but this is not up to 'devices', given that they take no part in the tracking?
I guess by 'devices' you mean the software idea of it, i.e. the iommu context created for
attaching a given physical device, not the physical device itself.

> btw do we still want to keep vfio type1 behavior as the fallback i.e. mark
> all pinned pages as dirty when iommu dirty support is missing? From uAPI
> naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
> implementation like vfio type1.
> 
Maybe let's not give userspace the illusion that dirty tracking is supported?
I wonder how much of this can be done in userspace
without the iommu pretending to be doing said tracking, if all we are doing is setting
all IOVAs as dirty.

The issue /I think/ with the perpetual dirtiness is that it's not that useful
in practice, and gives a false impression of any tracking happening. It really only looks
useful for testing a vfio-pci vendor driver, and one has to set a @downtime-limit
so gigantic that the VMM doesn't conclude the migration can't
converge given the very high rate of dirty pages.

For the testing in general, my idea was to have iommu emulation to fill that gap.

>> * The UAPI/kAPI could be generalized over the next iteration to also cover
>> Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
>> It wasn't done, as I was not aware of a use-case. I am wondering
>> if the access-bits could be used to do some form of zero page detection
>> (to just send the pages that got touched), although dirty-bits could be
>> used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE
> 
> I'm not fan of adding support for uncertain usages. 

The suggestion above was really because the logic doesn't change much.

But I guess there's no point in fattening the UAPI if there's no use-case.

> Comparing to this
> I'd give higher priority to large page break-down as w/o it it's hard to
> find real-world deployment on this work. 😊
> 
Yeap. Once I hash out the comments I get here in terms of
direction, that's what I will be focusing on next (unless someone else wants
to take on that adventure).

>> walk and marshalling into bitmaps as well as the necessary IOTLB flush
>> batching are all the same. The focus is on dirty bit given that the
>> dirtyness IOVA feedback is used to select the pages that need to be
>> transfered
>> to the destination while migration is happening.
>> Sidebar: Sadly, there's a lot less clever possible tricks that can be
>> done (compared to the CPU/KVM) without having the PCI device cooperate
>> (like
>> userfaultfd, wrprotect, etc as those would turn into nepharious IOMMU
>> perm faults and devices with DMA target aborts).
>> If folks thing the UAPI/iommu-kAPI should be agnostic to any PTE A/D
>> bits, we can instead have the ioctls be named after
>> HWPT_SET_TRACKING() and add another argument which asks which bits to
>> enabling tracking
>> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
>> Likewise for the read_and_clear() as all PTE bits follow the same logic
>> as dirty. Happy to readjust if folks think it is worthwhile.
>>
>> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
>> only care about the first stage of IOMMU pagetables for hypervisors i.e.
>> tracking dirty GPAs (and not caring about dirty GIOVAs).
> 
> Hypervisor uses second-stage while guest manages first-stage in nesting.
> 
/me nods

>>
>> * Dirty bit tracking only, is not enough. Large IO pages tend to be the norm
>> when DMA mapping large ranges of IOVA space, when really the VMM wants
>> the
>> smallest granularity possible to track(i.e. host base pages). A separate bit
>> of work will need to take care demoting IOPTE page sizes at guest-runtime to
>> increase/decrease the dirty tracking granularity, likely under the form of a
>> IOAS demote/promote page-size within a previously mapped IOVA range.
>>
>> Feedback is very much appreciated!
> 
> Thanks for the work!
> 
Thanks for the feedback thus far and in the rest of the patches too!

>>
>> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com/
>> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
>> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-
>> zhukeqian1@huawei.com/
>>
>> 	Joao
>>
>> TODOs:
>> * More selftests for large/small iopte sizes;
>> * Better vIOMMU+VFIO testing (AMD doesn't support it);
>> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
>> * Testing with a live migrateable VF;
>>
>> Jean-Philippe Brucker (1):
>>   iommu/arm-smmu-v3: Add feature detection for HTTU
>>
>> Joao Martins (16):
>>   iommu: Add iommu_domain ops for dirty tracking
>>   iommufd: Dirty tracking for io_pagetable
>>   iommufd: Dirty tracking data support
>>   iommu: Add an unmap API that returns dirtied IOPTEs
>>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>>   iommufd: Add a test for dirty tracking ioctls
>>   iommu/amd: Access/Dirty bit support in IOPTEs
>>   iommu/amd: Add unmap_read_dirty() support
>>   iommu/amd: Print access/dirty bits if supported
>>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>>   iommu/intel: Access/Dirty bit support for SL domains
>>   iommu/intel: Add unmap_read_dirty() support
>>
>> Kunkun Jiang (2):
>>   iommu/arm-smmu-v3: Add feature detection for BBML
>>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
>>
>>  drivers/iommu/amd/amd_iommu.h               |   1 +
>>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>>  drivers/iommu/amd/init.c                    |  12 +-
>>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>>  drivers/iommu/intel/pasid.h                 |   7 +
>>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>>  drivers/iommu/iommu.c                       |  71 +++++-
>>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>>  drivers/iommu/iommufd/main.c                |   9 +
>>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>>  include/linux/intel-iommu.h                 |  30 +++
>>  include/linux/io-pgtable.h                  |  20 ++
>>  include/linux/iommu.h                       |  64 +++++
>>  include/uapi/linux/iommufd.h                |  78 ++++++
>>  tools/testing/selftests/iommu/Makefile      |   1 +
>>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>>  28 files changed, 2047 insertions(+), 75 deletions(-)
>>
>> --
>> 2.17.2
> 

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29  7:54     ` Tian, Kevin
@ 2022-04-29 10:44       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On 4/29/22 08:54, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particulary to start and stop
>> tracking and finally to test and clear the dirty data.
> 
> to be consistent with other context, s/test/read/
> 
/me nods

>>
>> Drivers are expected to dynamically change its hw protection
>> domain bits to toggle the tracking and flush some form of
> 
> 'hw protection domain bits' sounds a bit weird. what about
> just using 'translation structures'?
> 
I'll replace it with that instead.

>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
> 
> s/vendor IOMMUs/iommu drivers/
> 
> btw according to past history in iommu mailing list sounds like
> 'vendor' is not a term welcomed in the kernel, while there are
> many occurrences in this series.
> 
Hmm, I wasn't aware actually.

Will move away from using 'vendor'.

> [...]
>> Although, The ARM SMMUv3 case is a tad different that its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry
>> to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking access bit read-only but with dirty-bit-modifier enabled
>> doesn't trigger an perm io page fault.
>>
>> In pratice this means that changing iommu context isn't enough
>> and in fact mostly useless IIUC (and can be always enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Above doesn't convince me on the necessity of introducing two ops
> here. Even for ARM it can accept a per-domain op and then walk the
> page table to manipulate any modifier for existing mappings. It
> doesn't matter whether it sets one bit in the context entry or multiple
> bits in the page table.
> 
OK

> [...]
>> +
> 
> Miss comment for this function.
> 
ack

>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap
>> *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
> 
> could you elaborate the purpose of dirty->start_offset? Why dirty->iova
> doesn't start at offset 0 of the bitmap?
> 

It is to deal with page-unaligned addresses.

Like if the start of the bitmap -- and hence the bitmap base IOVA for the first bit of the
bitmap -- isn't page-aligned and instead starts at some offset within a given page. Thus start_offset
is there to know which bit in the pinned page dirty::iova corresponds to.
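
A small standalone illustration of that offset math (made-up address and 4K pages,
just to show the idea, not code from the series):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t page_size = 4096;

	/* user-supplied bitmap address, not page aligned (made-up value) */
	uint64_t bitmap_uaddr = 0x7f1234567a10ULL;

	/* pinning works on whole pages, so the pinned page starts here ... */
	uint64_t pinned_page = bitmap_uaddr & ~(page_size - 1);

	/* ... and the bitmap for the base IOVA starts this many bytes in */
	uint64_t start_offset = bitmap_uaddr & (page_size - 1);

	printf("pinned page at 0x%llx, bitmap starts at byte offset %llu (bit %llu)\n",
	       (unsigned long long)pinned_page, (unsigned long long)start_offset,
	       (unsigned long long)start_offset * 8);
	return 0;
}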

>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap(dirty->pages[idx]);
> 
> what about the overhead of kmap/kunmap when it's done for every
> dirtied page (as done in patch 18)?

Isn't it an overhead mainly with highmem? Otherwise it ends up being page_to_virt(...)

But anyway the kmaps should be cached, and torn down when pinning the next user data.

Performance analysis is also something I want to fully hash out (as mentioned in the cover
letter).

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29  8:07     ` Tian, Kevin
@ 2022-04-29 10:48       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:48 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On 4/29/22 09:07, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> +static int __set_dirty_tracking_range_locked(struct iommu_domain
>> *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
OK

>> +					     struct io_pagetable *iopt,
>> +					     bool enable)
>> +{
>> +	const struct iommu_domain_ops *ops = domain->ops;
>> +	struct iommu_iotlb_gather gather;
>> +	struct iopt_area *area;
>> +	int ret = -EOPNOTSUPP;
>> +	unsigned long iova;
>> +	size_t size;
>> +
>> +	iommu_iotlb_gather_init(&gather);
>> +
>> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
>> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving iommu driver to walk the page table
> and the poke the modifier bit for all present PTEs? 

It isn't. Moving towards a single op makes this simpler for the iommu core API.

> As commented in last
> patch this may allow removing the range op completely.
> 
Yes.

>> +		iova = iopt_area_iova(area);
>> +		size = iopt_area_last_iova(area) - iova;
>> +
>> +		if (ops->set_dirty_tracking_range) {
>> +			ret = ops->set_dirty_tracking_range(domain, iova,
>> +							    size, &gather,
>> +							    enable);
>> +			if (ret < 0)
>> +				break;
>> +		}
>> +	}
>> +
>> +	iommu_iotlb_sync(domain, &gather);
>> +
>> +	return ret;
>> +}
>> +
>> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +				    struct io_pagetable *iopt, bool enable)
> 
> similarly rename to __iopt_set_dirty_tracking() and use iopt as the
> leading argument.
> 
/me nods
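
Roughly, once the range op goes away, the iopt-level helper could collapse into something
like the sketch below (assuming a per-domain ops->set_dirty_tracking(domain, enable)
callback; this is only an outline of the direction discussed, not the actual patch):

static int __iopt_set_dirty_tracking(struct io_pagetable *iopt,
				     struct iommu_domain *domain, bool enable)
{
	const struct iommu_domain_ops *ops = domain->ops;

	if (!ops->set_dirty_tracking)
		return -EOPNOTSUPP;

	/*
	 * The driver toggles its own translation structures (context/CD
	 * bits, or DBM on existing PTEs for ARM) and flushes the IOTLB
	 * itself; iommufd no longer walks the areas here.
	 */
	return ops->set_dirty_tracking(domain, enable);
}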

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-29  8:12     ` Tian, Kevin
@ 2022-04-29 10:54       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 09:12, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
> [...]
>> +
>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +				      struct iommufd_dirty_data *bitmap)
> 
> In a glance this function and all previous helpers doesn't rely on any
> iommufd objects except that the new structures are named as
> iommufd_xxx. 
> 
> I wonder whether moving all of them to the iommu layer would make
> more sense here.
> 
I suppose, instinctively, I was trying to make this tied to iommufd only,
to avoid getting it called in cases we don't expect once it's made a generic
exported kernel facility.

(note: iommufd can be built as a module).

>> +{
>> +	const struct iommu_domain_ops *ops = domain->ops;
>> +	struct iommu_iotlb_gather gather;
>> +	struct iommufd_dirty_iter iter;
>> +	int ret = 0;
>> +
>> +	if (!ops || !ops->read_and_clear_dirty)
>> +		return -EOPNOTSUPP;
>> +
>> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
>> +				__ffs(bitmap->page_size), &gather);
>> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
>> +	if (ret)
>> +		return -ENOMEM;
>> +
>> +	for (; iommufd_dirty_iter_done(&iter);
>> +	     iommufd_dirty_iter_advance(&iter)) {
>> +		ret = iommufd_dirty_iter_get(&iter);
>> +		if (ret)
>> +			break;
>> +
>> +		ret = ops->read_and_clear_dirty(domain,
>> +			iommufd_dirty_iova(&iter),
>> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
>> +
>> +		iommufd_dirty_iter_put(&iter);
>> +
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	iommu_iotlb_sync(domain, &gather);
>> +	iommufd_dirty_iter_free(&iter);
>> +
>> +	return ret;
>> +}
>> +

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29  8:28     ` Tian, Kevin
@ 2022-04-29 11:05       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 09:28, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Similar to .read_and_clear_dirty() use the page table
>> walker helper functions and set DBM|RDONLY bit, thus
>> switching the IOPTE to writeable-clean.
> 
> this should not be one-off if the operation needs to be
> applied to IOPTE. Say a map request comes right after
> set_dirty_tracking() is called. If it's agreed to remove
> the range op then smmu driver should record the tracking
> status internally and then apply the modifier to all the new
> mappings automatically before dirty tracking is disabled.
> Otherwise the same logic needs to be kept in iommufd to
> call set_dirty_tracking_range() explicitly for every new
> iopt_area created within the tracking window.

Gah, I totally missed that. New mappings aren't
carrying over the "DBM is set" state. This needs a new io-pgtable
quirk, applied after dirty tracking is toggled on.

I can adjust, but I am torn about including this in a future
iteration given that I can't really test any of this stuff.
I might drop the driver until I have hardware/emulation I can
use (or maybe others can take over this). It was included
for revising the iommu core ops and checking whether iommufd was
affected by it.

I'll delete the range op, and let the SMMUv3 driver walk its
own IO pgtables.
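
For illustration, the quirk could end up looking roughly like the sketch below in the
io-pgtable-arm prot-to-pte path (the quirk name is hypothetical, and ARM_LPAE_PTE_DBM
stands for the DBM bit handling added in patches 12-17; an outline, not the actual code):

/*
 * Sketch: with a (hypothetical) IO_PGTABLE_QUIRK_ARM_HD quirk set once dirty
 * tracking is enabled, new writeable mappings start out as writeable-clean:
 * DBM set plus AP[2] read-only, so the first DMA write marks the IOPTE dirty
 * instead of faulting.
 */
static arm_lpae_iopte arm_lpae_pte_mkdbm(struct arm_lpae_io_pgtable *data,
					 arm_lpae_iopte pte)
{
	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_HD)
		pte |= ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY;
	return pte;
}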

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 11:11     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:11 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang

On 2022-04-28 22:09, Joao Martins wrote:
> From: Kunkun Jiang <jiangkunkun@huawei.com>
> 
> This detects BBML feature and if SMMU supports it, transfer BBMLx
> quirk to io-pgtable.
> 
> BBML1 requires still marking PTE nT prior to performing a
> translation table update, while BBML2 requires neither break-before-make
> nor PTE nT bit being set. For dirty tracking it needs to clear
> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"

You can drop this, and the dependencies on BBML elsewhere, until you get 
round to the future large-page-splitting work, since that's the only 
thing this represents. Not much point having the feature flags without 
an actual implementation, or any users.

Robin.

> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> [joaomart: massage commit message with the need to have BBML quirk
>   and add the Quirk io-pgtable flags]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++++++++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++++++
>   include/linux/io-pgtable.h                  |  3 +++
>   3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 14609ece4e33..4dba53bde2e3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2203,6 +2203,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   		.iommu_dev	= smmu->dev,
>   	};
>   
> +	if (smmu->features & ARM_SMMU_FEAT_BBML1)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
> +	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
> +
>   	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
>   	if (!pgtbl_ops)
>   		return -ENOMEM;
> @@ -3591,6 +3596,20 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>   
>   	/* IDR3 */
>   	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> +	switch (FIELD_GET(IDR3_BBML, reg)) {
> +	case IDR3_BBML0:
> +		break;
> +	case IDR3_BBML1:
> +		smmu->features |= ARM_SMMU_FEAT_BBML1;
> +		break;
> +	case IDR3_BBML2:
> +		smmu->features |= ARM_SMMU_FEAT_BBML2;
> +		break;
> +	default:
> +		dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
> +		return -ENXIO;
> +	}
> +
>   	if (FIELD_GET(IDR3_RIL, reg))
>   		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
>   
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 1487a80fdf1b..e15750be1d95 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -54,6 +54,10 @@
>   #define IDR1_SIDSIZE			GENMASK(5, 0)
>   
>   #define ARM_SMMU_IDR3			0xc
> +#define IDR3_BBML			GENMASK(12, 11)
> +#define IDR3_BBML0			0
> +#define IDR3_BBML1			1
> +#define IDR3_BBML2			2
>   #define IDR3_RIL			(1 << 10)
>   
>   #define ARM_SMMU_IDR5			0x14
> @@ -644,6 +648,8 @@ struct arm_smmu_device {
>   #define ARM_SMMU_FEAT_E2H		(1 << 18)
>   #define ARM_SMMU_FEAT_HA		(1 << 19)
>   #define ARM_SMMU_FEAT_HD		(1 << 20)
> +#define ARM_SMMU_FEAT_BBML1		(1 << 21)
> +#define ARM_SMMU_FEAT_BBML2		(1 << 22)
>   	u32				features;
>   
>   #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index c2ebfe037f5d..d7626ca67dbf 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -85,6 +85,9 @@ struct io_pgtable_cfg {
>   	#define IO_PGTABLE_QUIRK_ARM_MTK_EXT	BIT(3)
>   	#define IO_PGTABLE_QUIRK_ARM_TTBR1	BIT(5)
>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
> +	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
> +	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
> +
>   	unsigned long			quirks;
>   	unsigned long			pgsize_bitmap;
>   	unsigned int			ias;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 11:05       ` Joao Martins
@ 2022-04-29 11:19         ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:19 UTC (permalink / raw)
  To: Joao Martins, Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 2022-04-29 12:05, Joao Martins wrote:
> On 4/29/22 09:28, Tian, Kevin wrote:
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Sent: Friday, April 29, 2022 5:09 AM
>>>
>>> Similar to .read_and_clear_dirty() use the page table
>>> walker helper functions and set DBM|RDONLY bit, thus
>>> switching the IOPTE to writeable-clean.
>>
>> this should not be one-off if the operation needs to be
>> applied to IOPTE. Say a map request comes right after
>> set_dirty_tracking() is called. If it's agreed to remove
>> the range op then smmu driver should record the tracking
>> status internally and then apply the modifier to all the new
>> mappings automatically before dirty tracking is disabled.
>> Otherwise the same logic needs to be kept in iommufd to
>> call set_dirty_tracking_range() explicitly for every new
>> iopt_area created within the tracking window.
> 
> Gah, I totally missed that by mistake. New mappings aren't
> carrying over the "DBM is set". This needs a new io-pgtable
> quirk added post dirty-tracking toggling.
> 
> I can adjust, but I am at odds on including this in a future
> iteration given that I can't really test any of this stuff.
> Might drop the driver until I have hardware/emulation I can
> use (or maybe others can take over this). It was included
> for revising the iommu core ops and whether iommufd was
> affected by it.
> 
> I'll delete the range op, and let smmu v3 driver walk its
> own IO pgtables.

TBH I'd be inclined to just enable DBM unconditionally in 
arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
dynamically (especially on a live domain) seems more trouble than it's 
worth.
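
Concretely, that would reduce the driver side to something like this
(assuming the feature flag and quirk names from this series; untested):

	/* arm_smmu_domain_finalise(): always hand DBM to io-pgtable */
	if (smmu->features & ARM_SMMU_FEAT_HD)
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

	/*
	 * No set_dirty_tracking()/set_dirty_tracking_range() left to call;
	 * read_and_clear_dirty() alone collects and re-arms the bits.
	 */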

Robin.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-29  9:03     ` Tian, Kevin
@ 2022-04-29 11:20       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 10:03, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:10 AM
>>
>> IOMMU advertises Access/Dirty bits if the extended capability
>> DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
>> stage table, though, has not bit for advertising, unless referenced via
> 
> first-stage is compatible to CPU page table thus a/d bit support is
> implied. 

Ah! That clarifies something the manual wasn't quite so clear about :)
I mean, I understood what you just said from reading the manual, but I
wasn't /really 100% sure/.

> But for dirty tracking I'm I'm fine with only supporting it
> with second-stage as first-stage will be used only for guest in the
> nesting case (though in concept first-stage could also be used for
> IOVA when nesting is disabled there is no plan to do so on Intel
> platforms).
> 
Cool.

>> a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
>> table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
>> stage table "3.7.2 Accessed and Dirty Flags".
>>
>> To enable it scalable-mode for the second-stage table is required,
>> solimit the use of dirty-bit to scalable-mode and discarding the
>> first stage configured DMAR domains. To use SSADS, we set a bit in
> 
> above is inaccurate. dirty bit is only supported in scalable mode so
> there is no limit per se.
> 
OK.

>> the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
> 
> "To use SSADS, we set bit 9 (SSADE) in the scalable-mode PASID table
> entry"
> 
/me nods

>> doing so, flush all iommu caches. Relevant SDM refs:
>>
>> "3.7.2 Accessed and Dirty Flags"
>> "6.5.3.3 Guidance to Software for Invalidations,
>>  Table 23. Guidance to Software for Invalidations"
>>
>> Dirty bit on the PTE is located in the same location (bit 9). The IOTLB
> 
> I'm not sure what information 'same location' here tries to convey...
> 

The SSADE bit in the PASID table entry *and* the dirty bit in the PTE are
both bit 9. (On AMD, for example, they sit on different bits.)

That's what 'location' meant, not the actual storage of those bits, of course :)
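
To spell it out (macro names invented here, not taken from the patch):

	/* scalable-mode PASID table entry: SSADE is bit 9 */
	#define VTD_PASID_PTE_SSADE	BIT_ULL(9)
	/* second-stage PTE: the dirty flag is also bit 9 once SSADE is set */
	#define VTD_SL_PTE_DIRTY	BIT_ULL(9)
	/* AMD keeps its equivalent flags at different bit positions */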

>> caches some attributes when SSADE is enabled and dirty-ness information,
> 
> be direct that the dirty bit is cached in IOTLB thus any change of that
> bit requires flushing IOTLB
> 
OK, will make it clearer.

>> so we also need to flush IOTLB to make sure IOMMU attempts to set the
>> dirty bit again. Relevant manuals over the hardware translation is
>> chapter 6 with some special mention to:
>>
>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>> "6.2.4 IOTLB"
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Shouldn't probably be as aggresive as to flush all; needs
>> checking with hardware (and invalidations guidance) as to understand
>> what exactly needs flush.
> 
> yes, definitely not required to flush all. You can follow table 23
> for software guidance for invalidations.
> 
/me nods

>> ---
>>  drivers/iommu/intel/iommu.c | 109
>> ++++++++++++++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.c |  76 +++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.h |   7 +++
>>  include/linux/intel-iommu.h |  14 +++++
>>  4 files changed, 206 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index ce33f85c72ab..92af43f27241 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct
>> iommu_domain *domain,
>>  	}
>>  }
>>
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					  bool enable)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	unsigned long flags;
>> +	int ret = -EINVAL;
>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	if (list_empty(&dmar_domain->devices)) {
>> +		spin_unlock_irqrestore(&device_domain_lock, flags);
>> +		return ret;
>> +	}
> 
> or return success here and just don't set any dirty bitmap in
> read_and_clear_dirty()?
> 
Yeap.

> btw I think every iommu driver needs to record the tracking status
> so later if a device which doesn't claim dirty tracking support is
> attached to a domain which already has dirty_tracking enabled
> then the attach request should be rejected. once the capability
> uAPI is introduced.
> 
Good point.
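
Something along these lines in the attach path, once the capability uAPI
exists (dirty_tracking below is a hypothetical per-domain status bit):

	/* refuse a non-capable device on a domain already being tracked */
	if (dmar_domain->dirty_tracking &&
	    (!sm_supported(iommu) || !ecap_slads(iommu->ecap)))
		return -EINVAL;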

>> +
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		if (!info->dev || (info->domain != dmar_domain))
>> +			continue;
> 
> why would there be a device linked under a dmar_domain but its
> internal domain pointer doesn't point to that dmar_domain?
> 
I think I got a little confused when using this list with something else.
Let me fix that.

>> +
>> +		/* Dirty tracking is second-stage level SM only */
>> +		if ((info->domain && domain_use_first_level(info->domain))
>> ||
>> +		    !ecap_slads(info->iommu->ecap) ||
>> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
> 
> sm_supported() already covers the check on intel_iommu_sm.
> 
/me nods, removed it.

>> +			ret = -EOPNOTSUPP;
>> +			continue;
>> +		}
>> +
>> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info-
>>> domain,
>> +						     info->dev,
>> PASID_RID2PASID,
>> +						     enable);
>> +		if (ret)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	/*
>> +	 * We need to flush context TLB and IOTLB with any cached
>> translations
>> +	 * to force the incoming DMA requests for have its IOTLB entries
>> tagged
>> +	 * with A/D bits
>> +	 */
>> +	intel_flush_iotlb_all(domain);
>> +	return ret;
>> +}
>> +
>> +static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		if (!info->dev || (info->domain != dmar_domain))
>> +			continue;
>> +
>> +		/* Dirty tracking is second-stage level SM only */
>> +		if ((info->domain && domain_use_first_level(info->domain))
>> ||
>> +		    !ecap_slads(info->iommu->ecap) ||
>> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
>> +			ret = -EOPNOTSUPP;
>> +			continue;
>> +		}
>> +
>> +		if (!intel_pasid_dirty_tracking_enabled(info->iommu, info-
>>> domain,
>> +						 info->dev, PASID_RID2PASID))
>> {
>> +			ret = -EINVAL;
>> +			break;
>> +		}
>> +	}
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	return ret;
>> +}
> 
> All above can be translated to a single status bit in dmar_domain.
> 
Yes.

I wrestled a bit over making this a domain op, which would then tie into
a tracking bit in the iommu domain (or the driver's representation of it).
That's why you see a get_dirty_tracking() helper here and in the AMD IOMMU
counterpart. But I figured I would tie it into the capability part instead.
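
A rough sketch of the status-bit approach (the dirty_tracking field is a
hypothetical addition to struct dmar_domain):

	static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
						    unsigned long iova, size_t size,
						    struct iommu_dirty_bitmap *dirty)
	{
		struct dmar_domain *dmar_domain = to_dmar_domain(domain);

		/* set/cleared under lock by intel_iommu_set_dirty_tracking() */
		if (!dmar_domain->dirty_tracking)
			return -EINVAL;

		/* ... walk the second-stage tables and fill @dirty ... */
		return 0;
	}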

>> +
>> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain
>> *domain,
>> +					    unsigned long iova, size_t size,
>> +					    struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	unsigned long end = iova + size - 1;
>> +	unsigned long pgsize;
>> +	int ret;
>> +
>> +	ret = intel_iommu_get_dirty_tracking(domain);
>> +	if (ret)
>> +		return ret;
>> +
>> +	do {
>> +		struct dma_pte *pte;
>> +		int lvl = 0;
>> +
>> +		pte = pfn_to_dma_pte(dmar_domain, iova >>
>> VTD_PAGE_SHIFT, &lvl);
> 
> it's probably fine as the starting point but moving forward this could
> be further optimized so there is no need to walk from L4->L3->L2->L1
> for every pte.
> 

Yes. This is actually part of my TODO on Performance (in the cover letter).

Both AMD and Intel could use their own dedicated lookups.
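
For example (helper name hypothetical), the walk could descend once per
leaf table and then scan its PTEs linearly instead of calling
pfn_to_dma_pte() for every IOVA:

	static void slpte_scan_dirty(struct dma_pte *pte, unsigned long iova,
				     unsigned long nr_pages,
				     struct iommu_dirty_bitmap *dirty)
	{
		for (; nr_pages; nr_pages--, pte++, iova += VTD_PAGE_SIZE) {
			/* test-and-clear the SL dirty bit (bit 9); IOTLB flush
			 * is batched by the caller */
			if (test_and_clear_bit(9, (unsigned long *)&pte->val))
				iommu_dirty_bitmap_record(dirty, iova,
							  VTD_PAGE_SIZE);
		}
	}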

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 11:35     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:35 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, Alex Williamson, Will Deacon,
	David Woodhouse

On 2022-04-28 22:09, Joao Martins wrote:
> From: Kunkun Jiang <jiangkunkun@huawei.com>
> 
> As nested mode is not upstreamed now, we just aim to support dirty
> log tracking for stage1 with io-pgtable mapping (means not support
> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
> CD and transfer ARM_HD quirk to io-pgtable.
> 
> We additionally filter out HD|HA if not supportted. The CD.HD bit
> is not particularly useful unless we toggle the DBM bit in the PTE
> entries.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> [joaomart:Convey HD|HA bits over to the context descriptor
>   and update commit message]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>   include/linux/io-pgtable.h                  |  1 +
>   3 files changed, 15 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 1ca72fcca930..5f728f8f20a2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>   		 * this substream's traffic
>   		 */
>   	} else { /* (1) and (2) */
> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
> +		u64 tcr = cd->tcr;
> +
>   		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>   		cdptr[2] = 0;
>   		cdptr[3] = cpu_to_le64(cd->mair);
>   
> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
> +			tcr &= ~CTXDESC_CD_0_TCR_HA;

This is very backwards...

> +
>   		/*
>   		 * STE is live, and the SMMU might read dwords of this CD in any
>   		 * order. Ensure that it observes valid values before reading
> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>   			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>   			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>   			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |

...these should be set in io-pgtable's TCR value *if* io-pgtable is 
using DBM, then propagated through from there like everything else.
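
i.e. roughly (with hypothetical ha/hd fields added to the io-pgtable
arm_lpae_s1_cfg.tcr struct when the HD quirk is accepted):

	/* io-pgtable-arm.c */
	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_HD) {
		cfg->arm_lpae_s1_cfg.tcr.ha = 1;
		cfg->arm_lpae_s1_cfg.tcr.hd = 1;
	}

	/* arm-smmu-v3.c, propagated like the other TCR fields */
	FIELD_PREP(CTXDESC_CD_0_TCR_HA, tcr->ha) |
	FIELD_PREP(CTXDESC_CD_0_TCR_HD, tcr->hd) |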

>   			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>   	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>   
> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   		.iommu_dev	= smmu->dev,
>   	};
>   
> +	if (smmu->features & ARM_SMMU_FEAT_HD)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
least because you don't have any of the relevant business for 
synchronising non-coherent PTEs in your walk functions, but it's also 
implementation-defined whether HTTU even operates on non-cacheable 
pagetables, and frankly you just don't want to go there ;)

Robin.

>   	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>   	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index e15750be1d95..ff32242f2fdb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -292,6 +292,9 @@
>   #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>   
> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
> +
>   #define CTXDESC_CD_0_AA64		(1UL << 41)
>   #define CTXDESC_CD_0_S			(1UL << 44)
>   #define CTXDESC_CD_0_R			(1UL << 45)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index d7626ca67dbf..a11902ae9cf1 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>   	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>   	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>   
>   	unsigned long			quirks;
>   	unsigned long			pgsize_bitmap;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 11:53     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:53 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm

On 2022-04-28 22:09, Joao Martins wrote:
> Mostly reuses unmap existing code with the extra addition of
> marshalling into a bitmap of a page size. To tackle the race,
> switch away from a plain store to a cmpxchg() and check whether
> IOVA was dirtied or not once it succeeds.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++
>   drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++----
>   2 files changed, 82 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 5f728f8f20a2..d1fb757056cc 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2499,6 +2499,22 @@ static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long io
>   	return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
>   }
>   
> +static size_t arm_smmu_unmap_pages_read_dirty(struct iommu_domain *domain,
> +					      unsigned long iova, size_t pgsize,
> +					      size_t pgcount,
> +					      struct iommu_iotlb_gather *gather,
> +					      struct iommu_dirty_bitmap *dirty)
> +{
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
> +
> +	if (!ops)
> +		return 0;
> +
> +	return ops->unmap_pages_read_dirty(ops, iova, pgsize, pgcount,
> +					   gather, dirty);
> +}
> +
>   static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
>   {
>   	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -2938,6 +2954,7 @@ static struct iommu_ops arm_smmu_ops = {
>   		.free			= arm_smmu_domain_free,
>   		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
>   		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
> +		.unmap_pages_read_dirty	= arm_smmu_unmap_pages_read_dirty,
>   	}
>   };
>   
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 361410aa836c..143ee7d73f88 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -259,10 +259,30 @@ static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cf
>   		__arm_lpae_sync_pte(ptep, 1, cfg);
>   }
>   
> +static bool __arm_lpae_clear_dirty_pte(arm_lpae_iopte *ptep,
> +				       struct io_pgtable_cfg *cfg)
> +{
> +	arm_lpae_iopte tmp;
> +	bool dirty = false;
> +
> +	do {
> +		tmp = cmpxchg64(ptep, *ptep, 0);
> +		if ((tmp & ARM_LPAE_PTE_DBM) &&
> +		    !(tmp & ARM_LPAE_PTE_AP_RDONLY))
> +			dirty = true;
> +	} while (tmp);
> +
> +	if (!cfg->coherent_walk)
> +		__arm_lpae_sync_pte(ptep, 1, cfg);

Note that this doesn't do enough, since it's only making the CPU's 
clearing of the PTE visible to the SMMU; the cmpxchg could have happily 
succeeded on a stale cached copy of the writeable-clean PTE regardless 
of what the SMMU might have done in the meantime. If we were to even 
pretend to cope with a non-coherent SMMU writing back to the pagetables, 
I think we'd have to scrap the current DMA API approach and make the CPU 
view of the pagetables non-cacheable as well, but as mentioned, there's 
no guarantee that that would even be useful anyway.

Robin.

> +
> +	return dirty;
> +}
> +
>   static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   			       struct iommu_iotlb_gather *gather,
>   			       unsigned long iova, size_t size, size_t pgcount,
> -			       int lvl, arm_lpae_iopte *ptep);
> +			       int lvl, arm_lpae_iopte *ptep,
> +			       struct iommu_dirty_bitmap *dirty);
>   
>   static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>   				phys_addr_t paddr, arm_lpae_iopte prot,
> @@ -306,8 +326,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>   			size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
>   
>   			tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +			/*
> +			 * No need for dirty bitmap as arm_lpae_init_pte() is
> +			 * only called from __arm_lpae_map()
> +			 */
>   			if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
> -					     lvl, tblp) != sz) {
> +					     lvl, tblp, NULL) != sz) {
>   				WARN_ON(1);
>   				return -EINVAL;
>   			}
> @@ -564,7 +589,8 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
>   				       struct iommu_iotlb_gather *gather,
>   				       unsigned long iova, size_t size,
>   				       arm_lpae_iopte blk_pte, int lvl,
> -				       arm_lpae_iopte *ptep, size_t pgcount)
> +				       arm_lpae_iopte *ptep, size_t pgcount,
> +				       struct iommu_dirty_bitmap *dirty)
>   {
>   	struct io_pgtable_cfg *cfg = &data->iop.cfg;
>   	arm_lpae_iopte pte, *tablep;
> @@ -617,13 +643,15 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
>   		return num_entries * size;
>   	}
>   
> -	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl, tablep);
> +	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
> +				lvl, tablep, dirty);
>   }
>   
>   static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   			       struct iommu_iotlb_gather *gather,
>   			       unsigned long iova, size_t size, size_t pgcount,
> -			       int lvl, arm_lpae_iopte *ptep)
> +			       int lvl, arm_lpae_iopte *ptep,
> +			       struct iommu_dirty_bitmap *dirty)
>   {
>   	arm_lpae_iopte pte;
>   	struct io_pgtable *iop = &data->iop;
> @@ -649,7 +677,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   			if (WARN_ON(!pte))
>   				break;
>   
> -			__arm_lpae_clear_pte(ptep, &iop->cfg);
> +			if (likely(!dirty))
> +				__arm_lpae_clear_pte(ptep, &iop->cfg);
> +			else if (__arm_lpae_clear_dirty_pte(ptep, &iop->cfg))
> +				iommu_dirty_bitmap_record(dirty, iova, size);
> +
>   
>   			if (!iopte_leaf(pte, lvl, iop->fmt)) {
>   				/* Also flush any partial walks */
> @@ -671,17 +703,20 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   		 * minus the part we want to unmap
>   		 */
>   		return arm_lpae_split_blk_unmap(data, gather, iova, size, pte,
> -						lvl + 1, ptep, pgcount);
> +						lvl + 1, ptep, pgcount, dirty);
>   	}
>   
>   	/* Keep on walkin' */
>   	ptep = iopte_deref(pte, data);
> -	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
> +	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
> +				lvl + 1, ptep, dirty);
>   }
>   
> -static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
> -				   size_t pgsize, size_t pgcount,
> -				   struct iommu_iotlb_gather *gather)
> +static size_t __arm_lpae_unmap_pages(struct io_pgtable_ops *ops,
> +				     unsigned long iova,
> +				     size_t pgsize, size_t pgcount,
> +				     struct iommu_iotlb_gather *gather,
> +				     struct iommu_dirty_bitmap *dirty)
>   {
>   	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>   	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> @@ -697,13 +732,29 @@ static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iov
>   		return 0;
>   
>   	return __arm_lpae_unmap(data, gather, iova, pgsize, pgcount,
> -				data->start_level, ptep);
> +				data->start_level, ptep, dirty);
> +}
> +
> +static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, NULL);
>   }
>   
>   static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
>   			     size_t size, struct iommu_iotlb_gather *gather)
>   {
> -	return arm_lpae_unmap_pages(ops, iova, size, 1, gather);
> +	return __arm_lpae_unmap_pages(ops, iova, size, 1, gather, NULL);
> +}
> +
> +static size_t arm_lpae_unmap_pages_read_dirty(struct io_pgtable_ops *ops,
> +					      unsigned long iova,
> +					      size_t pgsize, size_t pgcount,
> +					      struct iommu_iotlb_gather *gather,
> +					      struct iommu_dirty_bitmap *dirty)
> +{
> +	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, dirty);
>   }
>   
>   static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
> @@ -969,6 +1020,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>   		.iova_to_phys	= arm_lpae_iova_to_phys,
>   		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
>   		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
> +		.unmap_pages_read_dirty     = arm_lpae_unmap_pages_read_dirty,
>   	};
>   
>   	return data;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-29 11:11     ` Robin Murphy
@ 2022-04-29 11:54       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:54 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 4/29/22 12:11, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>> quirk to io-pgtable.
>>
>> BBML1 requires still marking PTE nT prior to performing a
>> translation table update, while BBML2 requires neither break-before-make
>> nor PTE nT bit being set. For dirty tracking it needs to clear
>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
> 
> You can drop this, and the dependencies on BBML elsewhere, until you get 
> round to the future large-page-splitting work, since that's the only 
> thing this represents. Not much point having the feature flags without 
> an actual implementation, or any users.
> 
OK.

My thinking was that BBML2 *also* meant that we don't need the
break-before-make procedure when switching translation table entries.
From what you say, it seems BBML2 only refers to this in the context of
switching between hugepages and normal pages (?), not in general to all
bits of the PTE (which we would need when switching from writeable-dirty
to writeable-clean with DBM set).

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
@ 2022-04-29 11:54       ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:54 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 12:11, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>> quirk to io-pgtable.
>>
>> BBML1 requires still marking PTE nT prior to performing a
>> translation table update, while BBML2 requires neither break-before-make
>> nor PTE nT bit being set. For dirty tracking it needs to clear
>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
> 
> You can drop this, and the dependencies on BBML elsewhere, until you get 
> round to the future large-page-splitting work, since that's the only 
> thing this represents. Not much point having the feature flags without 
> an actual implementation, or any users.
> 
OK.

My thinking was that BBML2 *also* meant that we don't need the
break-before-make procedure when switching translation table entries.
From what you say, it seems BBML2 only refers to this in the context of
switching between hugepages and normal pages (?), not in general to all
bits of the PTE (which we would need when switching from writeable-dirty
to writeable-clean with DBM set).
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29  8:07     ` Tian, Kevin
@ 2022-04-29 11:56       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 11:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm

On Fri, Apr 29, 2022 at 08:07:14AM +0000, Tian, Kevin wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> > Sent: Friday, April 29, 2022 5:09 AM
> > 
> > +static int __set_dirty_tracking_range_locked(struct iommu_domain
> > *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
> > +					     struct io_pagetable *iopt,
> > +					     bool enable)
> > +{
> > +	const struct iommu_domain_ops *ops = domain->ops;
> > +	struct iommu_iotlb_gather gather;
> > +	struct iopt_area *area;
> > +	int ret = -EOPNOTSUPP;
> > +	unsigned long iova;
> > +	size_t size;
> > +
> > +	iommu_iotlb_gather_init(&gather);
> > +
> > +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> > +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving iommu driver to walk the page table
> and the poke the modifier bit for all present PTEs? As commented in last
> patch this may allow removing the range op completely.

Yea, I'm not super keen on the two ops either, especially since they
are so wildly different.

I would expect that set_dirty_tracking turns on tracking for the
entire iommu domain, for all present and future maps.

While set_dirty_tracking_range - I guess it only covers the given range,
so if we make a new map then the new range will be untracked? But that
is racy: we have to map and then call set_dirty_tracking_range.

It seems better for the iommu driver to deal with this, and ARM should
atomically make new maps dirty-tracking-enabled..
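
As a rough sketch of that direction (the dirty_tracking field and the
arm_smmu_update_dbm() helper below are made-up names for illustration,
not from this series): the driver records the tracking state on its
domain and honours it both for existing PTEs and at map time:

static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain, bool enable)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

	/* Remembered so that later map() calls pick the right PTE bits */
	smmu_domain->dirty_tracking = enable;

	/* Hypothetical walk flipping DBM/AP on already-present leaf PTEs */
	return arm_smmu_update_dbm(smmu_domain, enable);
}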

> > +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> > +			    struct iommu_domain *domain, bool enable)
> > +{
> > +	struct iommu_domain *dom;
> > +	unsigned long index;
> > +	int ret = -EOPNOTSUPP;

Returns EOPNOTSUPP if the xarray is empty?

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
@ 2022-04-29 11:56       ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 11:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, kvm, Will Deacon,
	Cornelia Huck, Alex Williamson, iommu, Martins, Joao,
	David Woodhouse, Robin Murphy

On Fri, Apr 29, 2022 at 08:07:14AM +0000, Tian, Kevin wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> > Sent: Friday, April 29, 2022 5:09 AM
> > 
> > +static int __set_dirty_tracking_range_locked(struct iommu_domain
> > *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
> > +					     struct io_pagetable *iopt,
> > +					     bool enable)
> > +{
> > +	const struct iommu_domain_ops *ops = domain->ops;
> > +	struct iommu_iotlb_gather gather;
> > +	struct iopt_area *area;
> > +	int ret = -EOPNOTSUPP;
> > +	unsigned long iova;
> > +	size_t size;
> > +
> > +	iommu_iotlb_gather_init(&gather);
> > +
> > +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> > +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving iommu driver to walk the page table
> and the poke the modifier bit for all present PTEs? As commented in last
> patch this may allow removing the range op completely.

Yea, I'm not super keen on the two ops either, especially since they
are so wildly different.

I would expect that set_dirty_tracking turns on tracking for the
entire iommu domain, for all present and future maps.

While set_dirty_tracking_range - I guess it only covers the given range,
so if we make a new map then the new range will be untracked? But that
is racy: we have to map and then call set_dirty_tracking_range.

It seems better for the iommu driver to deal with this, and ARM should
atomically make new maps dirty-tracking-enabled..

> > +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> > +			    struct iommu_domain *domain, bool enable)
> > +{
> > +	struct iommu_domain *dom;
> > +	unsigned long index;
> > +	int ret = -EOPNOTSUPP;

Returns EOPNOTSUPP if the xarray is empty?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 11:19         ` Robin Murphy
@ 2022-04-29 12:06           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:06 UTC (permalink / raw)
  To: Robin Murphy, Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 12:19, Robin Murphy wrote:
> On 2022-04-29 12:05, Joao Martins wrote:
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
>>
>> I'll delete the range op, and let smmu v3 driver walk its
>> own IO pgtables.
> 
> TBH I'd be inclined to just enable DBM unconditionally in 
> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> dynamically (especially on a live domain) seems more trouble that it's 
> worth.

Hmmm, but then it would strip the userland/VMM of any sort of control
(contrary to what we can do on the CPU/KVM side). e.g. the first time you
do GET_DIRTY_IOVA it would return all IOVAs dirtied since the beginning
of guest time, as opposed to only those dirtied after you enabled
dirty tracking.

We do add the TCR values unconditionally if supported, but not
the actual dirty tracking.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
@ 2022-04-29 12:06           ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:06 UTC (permalink / raw)
  To: Robin Murphy, Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 12:19, Robin Murphy wrote:
> On 2022-04-29 12:05, Joao Martins wrote:
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
>>
>> I'll delete the range op, and let smmu v3 driver walk its
>> own IO pgtables.
> 
> TBH I'd be inclined to just enable DBM unconditionally in 
> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> dynamically (especially on a live domain) seems more trouble that it's 
> worth.

Hmmm, but then it would strip the userland/VMM of any sort of control
(contrary to what we can do on the CPU/KVM side). e.g. the first time you
do GET_DIRTY_IOVA it would return all IOVAs dirtied since the beginning
of guest time, as opposed to only those dirtied after you enabled
dirty tracking.

We do add the TCR values unconditionally if supported, but not
the actual dirty tracking.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 12:08     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
> +
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length)
> +{

Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
will want to use this same data structure.

> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;

kmap_local?
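
i.e. something like (sketch; kmap_local_page() is the cheaper, CPU-local
mapping, and kunmap_local() accepts any address within the mapped page):

	kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
	/* ... set the dirty bits as before ... */
	kunmap_local(kaddr);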

> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
> + * @pgshift: Page granularity of the bitmap
> + * @gather: Range information for a pending IOTLB flush
> + * @start_offset: Offset of the first user page
> + * @pages: User pages representing the bitmap region
> + * @npages: Number of user pages pinned
> + */
> +struct iommu_dirty_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	struct iommu_iotlb_gather *gather;
> +	unsigned long start_offset;
> +	unsigned long npages;
> +	struct page **pages;

In many (all?) cases I would expect this to be called from a process
context; can we just store the __user pointer here, or is the idea
that with modern kernels poking a u64 to userspace is slower than a
kmap?

I'm particularly concerned that this starts to require high-order
allocations with more than 2M of bitmap.. Maybe one direction is
to GUP 2M chunks at a time and walk the __user pointer.
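
A rough sketch of that direction, pinning at most 2M of bitmap at a time
and walking a __user pointer (all the names below are illustrative, not
from the series):

#define BITMAP_CHUNK_SIZE	SZ_2M	/* pin at most 2M of bitmap at once */

struct dirty_bitmap_iter {
	void __user *uptr;	/* start of the user bitmap */
	unsigned long offset;	/* bytes already consumed */
	long npinned;
	struct page *pages[BITMAP_CHUNK_SIZE / PAGE_SIZE];
};

/* Pin the next chunk of the user bitmap; caller unpins when done */
static long dirty_bitmap_iter_pin(struct dirty_bitmap_iter *iter)
{
	unsigned long start = (unsigned long)iter->uptr + iter->offset;

	iter->npinned = pin_user_pages_fast(start, BITMAP_CHUNK_SIZE / PAGE_SIZE,
					    FOLL_WRITE, iter->pages);
	return iter->npinned;
}

static void dirty_bitmap_iter_unpin(struct dirty_bitmap_iter *iter)
{
	unpin_user_pages_dirty_lock(iter->pages, iter->npinned, true);
}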

> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   unsigned long base,
> +					   unsigned long pgshift,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	memset(dirty, 0, sizeof(*dirty));
> +	dirty->iova = base;
> +	dirty->pgshift = pgshift;
> +	dirty->gather = gather;
> +
> +	if (gather)
> +		iommu_iotlb_gather_init(dirty->gather);
> +}

I would expect all the GUPing logic to be here too?

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
@ 2022-04-29 12:08     ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
> +
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length)
> +{

Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
will want to use this same data structure.

> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;

kmap_local?

> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
> + * @pgshift: Page granularity of the bitmap
> + * @gather: Range information for a pending IOTLB flush
> + * @start_offset: Offset of the first user page
> + * @pages: User pages representing the bitmap region
> + * @npages: Number of user pages pinned
> + */
> +struct iommu_dirty_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	struct iommu_iotlb_gather *gather;
> +	unsigned long start_offset;
> +	unsigned long npages;
> +	struct page **pages;

In many (all?) cases I would expect this to be called from a process
context; can we just store the __user pointer here, or is the idea
that with modern kernels poking a u64 to userspace is slower than a
kmap?

I'm particularly concerned that this starts to require high-order
allocations with more than 2M of bitmap.. Maybe one direction is
to GUP 2M chunks at a time and walk the __user pointer.

> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   unsigned long base,
> +					   unsigned long pgshift,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	memset(dirty, 0, sizeof(*dirty));
> +	dirty->iova = base;
> +	dirty->pgshift = pgshift;
> +	dirty->gather = gather;
> +
> +	if (gather)
> +		iommu_iotlb_gather_init(dirty->gather);
> +}

I would expect all the GUPing logic to be here too?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-29 10:54       ` Joao Martins
@ 2022-04-29 12:09         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:09 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
> On 4/29/22 09:12, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Friday, April 29, 2022 5:09 AM
> > [...]
> >> +
> >> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> >> +				      struct iommufd_dirty_data *bitmap)
> > 
> > In a glance this function and all previous helpers doesn't rely on any
> > iommufd objects except that the new structures are named as
> > iommufd_xxx. 
> > 
> > I wonder whether moving all of them to the iommu layer would make
> > more sense here.
> > 
> I suppose, instinctively, I was trying to make this tie to iommufd only,
> to avoid getting it called in cases we don't except when made as a generic
> exported kernel facility.
> 
> (note: iommufd can be built as a module).

Yeah, I think that is a reasonable reason to put iommufd only stuff in
iommufd.ko rather than bloat the static kernel.

You could put it in a new .c file though so there is some logical
modularity?

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
@ 2022-04-29 12:09         ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:09 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Tian, Kevin, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
> On 4/29/22 09:12, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Friday, April 29, 2022 5:09 AM
> > [...]
> >> +
> >> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> >> +				      struct iommufd_dirty_data *bitmap)
> > 
> > In a glance this function and all previous helpers doesn't rely on any
> > iommufd objects except that the new structures are named as
> > iommufd_xxx. 
> > 
> > I wonder whether moving all of them to the iommu layer would make
> > more sense here.
> > 
> I suppose, instinctively, I was trying to make this tie to iommufd only,
> to avoid getting it called in cases we don't except when made as a generic
> exported kernel facility.
> 
> (note: iommufd can be built as a module).

Yeah, I think that is a reasonable reason to put iommufd only stuff in
iommufd.ko rather than bloat the static kernel.

You could put it in a new .c file though so there is some logical
modularity?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-29 11:35     ` Robin Murphy
@ 2022-04-29 12:10       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:10 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 4/29/22 12:35, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> As nested mode is not upstreamed now, we just aim to support dirty
>> log tracking for stage1 with io-pgtable mapping (means not support
>> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
>> CD and transfer ARM_HD quirk to io-pgtable.
>>
>> We additionally filter out HD|HA if not supportted. The CD.HD bit
>> is not particularly useful unless we toggle the DBM bit in the PTE
>> entries.
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> [joaomart:Convey HD|HA bits over to the context descriptor
>>   and update commit message]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>   include/linux/io-pgtable.h                  |  1 +
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 1ca72fcca930..5f728f8f20a2 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>   		 * this substream's traffic
>>   		 */
>>   	} else { /* (1) and (2) */
>> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
>> +		u64 tcr = cd->tcr;
>> +
>>   		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>   		cdptr[2] = 0;
>>   		cdptr[3] = cpu_to_le64(cd->mair);
>>   
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> 
> This is very backwards...
> 
Yes.

>> +
>>   		/*
>>   		 * STE is live, and the SMMU might read dwords of this CD in any
>>   		 * order. Ensure that it observes valid values before reading
>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
> 
> ...these should be set in io-pgtable's TCR value *if* io-pgatble is 
> using DBM, then propagated through from there like everything else.
> 

So the DBM bit supersedes the TCR bit -- that's strange? Say you mark a PTE as
writeable-clean with DBM set but TCR.HD unset .. won't that trigger a
permission fault? I need to re-read that section of the manual, as I didn't
get that impression from it.

>>   			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>   	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>   
>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>   		.iommu_dev	= smmu->dev,
>>   	};
>>   
>> +	if (smmu->features & ARM_SMMU_FEAT_HD)
>> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
> 
> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
> least because you don't have any of the relevant business for 
> synchronising non-coherent PTEs in your walk functions, but it's also 
> implementation-defined whether HTTU even operates on non-cacheable 
> pagetables, and frankly you just don't want to go there ;)
> 
/me nods OK.
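
i.e. presumably the next spin would only hand the quirk to io-pgtable for
coherent SMMUs, along the lines of (sketch using the flag names from this
series):

	if ((smmu->features & ARM_SMMU_FEAT_HD) &&
	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;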

> Robin.
> 
>>   	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>   	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index e15750be1d95..ff32242f2fdb 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -292,6 +292,9 @@
>>   #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>>   
>> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
>> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
>> +
>>   #define CTXDESC_CD_0_AA64		(1UL << 41)
>>   #define CTXDESC_CD_0_S			(1UL << 44)
>>   #define CTXDESC_CD_0_R			(1UL << 45)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index d7626ca67dbf..a11902ae9cf1 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
>> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>>   
>>   	unsigned long			quirks;
>>   	unsigned long			pgsize_bitmap;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
@ 2022-04-29 12:10       ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:10 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 12:35, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> As nested mode is not upstreamed now, we just aim to support dirty
>> log tracking for stage1 with io-pgtable mapping (means not support
>> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
>> CD and transfer ARM_HD quirk to io-pgtable.
>>
>> We additionally filter out HD|HA if not supportted. The CD.HD bit
>> is not particularly useful unless we toggle the DBM bit in the PTE
>> entries.
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> [joaomart:Convey HD|HA bits over to the context descriptor
>>   and update commit message]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>   include/linux/io-pgtable.h                  |  1 +
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 1ca72fcca930..5f728f8f20a2 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>   		 * this substream's traffic
>>   		 */
>>   	} else { /* (1) and (2) */
>> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
>> +		u64 tcr = cd->tcr;
>> +
>>   		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>   		cdptr[2] = 0;
>>   		cdptr[3] = cpu_to_le64(cd->mair);
>>   
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> 
> This is very backwards...
> 
Yes.

>> +
>>   		/*
>>   		 * STE is live, and the SMMU might read dwords of this CD in any
>>   		 * order. Ensure that it observes valid values before reading
>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
> 
> ...these should be set in io-pgtable's TCR value *if* io-pgatble is 
> using DBM, then propagated through from there like everything else.
> 

So the DBM bit supersedes the TCR bit -- that's strange? Say you mark a PTE as
writeable-clean with DBM set but TCR.HD unset .. won't that trigger a
permission fault? I need to re-read that section of the manual, as I didn't
get that impression from it.

>>   			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>   	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>   
>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>   		.iommu_dev	= smmu->dev,
>>   	};
>>   
>> +	if (smmu->features & ARM_SMMU_FEAT_HD)
>> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
> 
> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
> least because you don't have any of the relevant business for 
> synchronising non-coherent PTEs in your walk functions, but it's also 
> implementation-defined whether HTTU even operates on non-cacheable 
> pagetables, and frankly you just don't want to go there ;)
> 
/me nods OK.

> Robin.
> 
>>   	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>   	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index e15750be1d95..ff32242f2fdb 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -292,6 +292,9 @@
>>   #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>>   
>> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
>> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
>> +
>>   #define CTXDESC_CD_0_AA64		(1UL << 41)
>>   #define CTXDESC_CD_0_S			(1UL << 44)
>>   #define CTXDESC_CD_0_R			(1UL << 45)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index d7626ca67dbf..a11902ae9cf1 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
>> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>>   
>>   	unsigned long			quirks;
>>   	unsigned long			pgsize_bitmap;
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 12:14     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:

> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
> +					  unsigned long iova, size_t size,
> +					  struct iommufd_dirty_data *bitmap,
> +					  struct iommufd_dirty_iter *iter)
> +{

This shouldn't be a nofail - that is only for paths that trigger from
destroy/error unwind, which read-dirty never does. The return code
has to be propagated.

It needs some more thought on how to organize this.. only unfill_domains
needs this path, but it is shared with the error unwind paths and
cannot generally fail..

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
@ 2022-04-29 12:14     ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:

> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
> +					  unsigned long iova, size_t size,
> +					  struct iommufd_dirty_data *bitmap,
> +					  struct iommufd_dirty_iter *iter)
> +{

This shouldn't be a nofail - that is only for paths that trigger from
destroy/error unwind, which read-dirty never does. The return code
has to be propagated.

It needs some more thought on how to organize this.. only unfill_domains
needs this path, but it is shared with the error unwind paths and
cannot generally fail..

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 12:19     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
> Add the correspondent APIs for performing VFIO dirty tracking,
> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
> 				     the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
> 				    the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
> tracking is active.
> 
> Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION
> whereas it gets set the domain configured page size the same as
> iopt::iova_alignment and maximum dirty bitmap size same
> as VFIO. Compared to VFIO type1 iommu, the perpectual dirtying is
> not implemented and userspace gets -EOPNOTSUPP which is handled by
> today's userspace.
> 
> Move iommufd_get_pagesizes() definition prior to unmap for
> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
> size against IOPT pagesize.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
>  1 file changed, 209 insertions(+), 12 deletions(-)

I think I would probably not do this patch; it has behavior that is
quite different from the current vfio - i.e. the interaction with
mdevs, and I don't intend to fix that. So, with this patch and an mdev,
vfio_compat will return all-not-dirty but current vfio will
return all-dirty - and that is significant enough to break qemu.

We've made a qemu patch to allow qemu to be happy if dirty tracking is
not supported in the vfio container for migration, which is part of
the v2 enablement series. That seems like the better direction.

I can see why this is useful to test with the current qemu however.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
@ 2022-04-29 12:19     ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
> Add the correspondent APIs for performing VFIO dirty tracking,
> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
> 				     the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
> 				    the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
> tracking is active.
> 
> Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION
> whereas it gets set the domain configured page size the same as
> iopt::iova_alignment and maximum dirty bitmap size same
> as VFIO. Compared to VFIO type1 iommu, the perpectual dirtying is
> not implemented and userspace gets -EOPNOTSUPP which is handled by
> today's userspace.
> 
> Move iommufd_get_pagesizes() definition prior to unmap for
> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
> size against IOPT pagesize.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
>  1 file changed, 209 insertions(+), 12 deletions(-)

I think I would probably not do this patch; it has behavior that is
quite different from the current vfio - i.e. the interaction with
mdevs, and I don't intend to fix that. So, with this patch and an mdev,
vfio_compat will return all-not-dirty but current vfio will
return all-dirty - and that is significant enough to break qemu.

We've made a qemu patch to allow qemu to be happy if dirty tracking is
not supported in the vfio container for migration, which is part of
the v2 enablement series. That seems like the better direction.

I can see why this is useful to test with the current qemu however.

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 12:06           ` Joao Martins
@ 2022-04-29 12:23             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:

> > TBH I'd be inclined to just enable DBM unconditionally in 
> > arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> > dynamically (especially on a live domain) seems more trouble that it's 
> > worth.
> 
> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
> to what we can do on the CPU/KVM side). e.g. the first time you do
> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
> of guest time, as opposed to those only after you enabled dirty-tracking.

It just means that on SMMU the start tracking op clears all the dirty
bits.

I also suppose you'd want to install the IOPTEs as dirty to
avoid a performance regression writing out new dirties for cases where
we don't dirty track? And then the start tracking op would switch this
so map creates non-dirty IOPTEs?
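
Sketching that idea (ARM_LPAE_PTE_DBM is assumed here to be the DBM bit
definition the series adds; the dirty_tracking field is illustrative):
with tracking off, map() installs writeable-dirty PTEs so HTTU never has
to update them, and once tracking starts map() switches to
writeable-clean:

	pte |= ARM_LPAE_PTE_DBM;
	if (smmu_domain->dirty_tracking)
		/* writeable-clean: HW clears AP_RDONLY on the first write */
		pte |= ARM_LPAE_PTE_AP_RDONLY;
	/* else: writeable-dirty from the start, nothing for HTTU to write */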

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
@ 2022-04-29 12:23             ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Tian, Kevin, Yishai Hadas, kvm,
	Robin Murphy, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Will Deacon

On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:

> > TBH I'd be inclined to just enable DBM unconditionally in 
> > arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> > dynamically (especially on a live domain) seems more trouble that it's 
> > worth.
> 
> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
> to what we can do on the CPU/KVM side). e.g. the first time you do
> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
> of guest time, as opposed to those only after you enabled dirty-tracking.

It just means that on SMMU the start tracking op clears all the dirty
bits.

I also suppose you'd want to install the IOPTEs as dirty to
avoid a performance regression writing out new dirties for cases where
we don't dirty track? And then the start tracking op would switch this
so map creates non-dirty IOPTEs?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-29 11:54       ` Joao Martins
@ 2022-04-29 12:26         ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 12:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 2022-04-29 12:54, Joao Martins wrote:
> On 4/29/22 12:11, Robin Murphy wrote:
>> On 2022-04-28 22:09, Joao Martins wrote:
>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>
>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>> quirk to io-pgtable.
>>>
>>> BBML1 requires still marking PTE nT prior to performing a
>>> translation table update, while BBML2 requires neither break-before-make
>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>
>> You can drop this, and the dependencies on BBML elsewhere, until you get
>> round to the future large-page-splitting work, since that's the only
>> thing this represents. Not much point having the feature flags without
>> an actual implementation, or any users.
>>
> OK.
> 
> My thinking was that the BBML2 meant *also* that we don't need that break-before-make
> thingie upon switching translation table entries. It seems that from what you
> say, BBML2 then just refers to this but only on the context of switching between
> hugepages/normal pages (?), not in general on all bits of the PTE (which we woud .. upon
> switching from writeable-dirty to writeable-clean with DBM-set).

Yes, BBML is purely about swapping between a block (hugepage) entry and 
a table representing the exact equivalent mapping.

A break-before-make procedure isn't required when just changing 
permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
but as mentioned I think we could probably just not do that anyway.

Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
@ 2022-04-29 12:26         ` Robin Murphy
  0 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 12:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 2022-04-29 12:54, Joao Martins wrote:
> On 4/29/22 12:11, Robin Murphy wrote:
>> On 2022-04-28 22:09, Joao Martins wrote:
>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>
>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>> quirk to io-pgtable.
>>>
>>> BBML1 requires still marking PTE nT prior to performing a
>>> translation table update, while BBML2 requires neither break-before-make
>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>
>> You can drop this, and the dependencies on BBML elsewhere, until you get
>> round to the future large-page-splitting work, since that's the only
>> thing this represents. Not much point having the feature flags without
>> an actual implementation, or any users.
>>
> OK.
> 
> My thinking was that the BBML2 meant *also* that we don't need that break-before-make
> thingie upon switching translation table entries. It seems that from what you
> say, BBML2 then just refers to this but only on the context of switching between
> hugepages/normal pages (?), not in general on all bits of the PTE (which we woud .. upon
> switching from writeable-dirty to writeable-clean with DBM-set).

Yes, BBML is purely about swapping between a block (hugepage) entry and 
a table representing the exact equivalent mapping.

A break-before-make procedure isn't required when just changing 
permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
but as mentioned I think we could probably just not do that anyway.

Robin.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29 10:27     ` Joao Martins
@ 2022-04-29 12:38       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:38 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 11:27:58AM +0100, Joao Martins wrote:
> >>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >> unmap. This case is specific for non-nested vIOMMU case where an
> >> erronous guest (or device) DMAing to an address being unmapped at the
> >> same time.
> > 
> > an erroneous attempt like above cannot anticipate which DMAs can
> > succeed in that window thus the end behavior is undefined. For an
> > undefined behavior nothing will be broken by losing some bits dirtied
> > in the window between reading back dirty bits of the range and
> > actually calling unmap. From guest p.o.v. all those are black-box
> > hardware logic to serve a virtual iotlb invalidation request which just
> > cannot be completed in one cycle.
> > 
> > Hence in reality probably this is not required except to meet vfio
> > compat requirement. Just in concept returning dirty bits at unmap
> > is more accurate.
> > 
> > I'm slightly inclined to abandon it in iommufd uAPI.
> 
> OK, it seems I am not far off from your thoughts.
> 
> I'll see what others think too, and if so I'll remove the unmap_dirty.
> 
> Because if vfio-compat doesn't get the iommu hw dirty support, then there would
> be no users of unmap_dirty.

I'm inclined to agree with Kevin.

If the VM does a rogue DMA while unmapping its vIOMMU then it will
already randomly get or lose that DMA. Adding the dirty tracking race
during live migration just further biases that randomness toward
losing.  Since we don't relay protection faults to the guest there is
no guest-observable difference, IMHO.

In any case, I don't think the implementation here for unmap_dirty is
race free?  So, if we are doing all this complexity just to make the
race smaller, I don't see the point.

To make it race free I think you have to write-protect the IOPTE, then
synchronize the IOTLB, read back the dirty state, then unmap and
synchronize the IOTLB again. That has such a high performance cost that
I'm not convinced it is worthwhile - and if it has to be two-step like
this then it would be cleaner to introduce a 'writeprotect and read
dirty' op instead of overloading unmap. We don't need to micro-optimize
away the extra IO page table walk when we are already doing two
invalidations of overhead..
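
In outline the two-step sequence would be something like the below (the
write_protect op is hypothetical and the signatures are approximate; this
is only to show where the two invalidations land):

	/* 1) Revoke write access so no new dirty bits can appear */
	ops->write_protect(domain, iova, size);
	iommu_iotlb_sync(domain, &gather);	/* first invalidation */

	/* 2) Dirty state is now stable: harvest it, then tear down */
	ops->read_and_clear_dirty(domain, iova, size, &dirty);
	iommu_unmap(domain, iova, size);	/* second invalidation */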

> >> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks
> > 
> > there was discussion adding device capability uAPI somewhere.
> > 
> ack let me know if there was snippets to the conversation as I seem to have missed that.

It was just a discussion pending something we actually needed to report.

It would be a very simple ioctl taking in the device ID and filling in a
struct of stuff.
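
e.g. roughly (an entirely hypothetical layout, just to show the shape):

struct iommu_device_caps {
	__u32 size;		/* sizeof(struct iommu_device_caps) */
	__u32 dev_id;		/* device to query */
	__aligned_u64 flags;	/* capability bits, e.g. dirty tracking */
};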
 
> > probably this can be reported as a device cap as supporting of dirty bit is
> > an immutable property of the iommu serving that device. 

It is an easier fit to read it out of the iommu_domain after device
attach though - since we don't need to build new kernel infrastructure
to query it from a device.
 
> > Userspace can
> > enable dirty tracking on a hwpt if all attached devices claim the support
> > and kernel will does the same verification.
> 
> Sorry to be dense but this is not up to 'devices' given they take no
> part in the tracking?  I guess by 'devices' you mean the software
> idea of it i.e. the iommu context created for attaching a said
> physical device, not the physical device itself.

Indeed, an hwpt represents an iommu_domain, and if the iommu_domain has
dirty tracking ops set then that is an inherent property of the domain
and does not suddenly go away when a new device is attached.
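
i.e. the check can simply be whether the domain ops are populated, along
the lines of (sketch; structure and op names approximate):

static bool hwpt_supports_dirty(struct iommufd_hw_pagetable *hwpt)
{
	const struct iommu_domain_ops *ops = hwpt->domain->ops;

	/* A property of the iommu_domain, regardless of attached devices */
	return ops->set_dirty_tracking && ops->read_and_clear_dirty;
}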
 
Jason

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-29 12:10       ` Joao Martins
@ 2022-04-29 12:46         ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 12:46 UTC (permalink / raw)
  To: Joao Martins
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 2022-04-29 13:10, Joao Martins wrote:
> On 4/29/22 12:35, Robin Murphy wrote:
>> On 2022-04-28 22:09, Joao Martins wrote:
>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>
>>> As nested mode is not upstreamed yet, we just aim to support dirty
>>> log tracking for stage 1 with io-pgtable mapping (meaning SVA mapping
>>> is not supported). If HTTU is supported, we enable the HA/HD bits in
>>> the SMMU CD and convey the ARM_HD quirk to io-pgtable.
>>>
>>> We additionally filter out HD|HA if not supported. The CD.HD bit
>>> is not particularly useful unless we also toggle the DBM bit in the PTE
>>> entries.
>>>
>>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>>> [joaomart:Convey HD|HA bits over to the context descriptor
>>>    and update commit message]
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>>    include/linux/io-pgtable.h                  |  1 +
>>>    3 files changed, 15 insertions(+)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index 1ca72fcca930..5f728f8f20a2 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>>    		 * this substream's traffic
>>>    		 */
>>>    	} else { /* (1) and (2) */
>>> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
>>> +		u64 tcr = cd->tcr;
>>> +
>>>    		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>>    		cdptr[2] = 0;
>>>    		cdptr[3] = cpu_to_le64(cd->mair);
>>>    
>>> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
>>> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
>>> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
>>> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
>>
>> This is very backwards...
>>
> Yes.
> 
>>> +
>>>    		/*
>>>    		 * STE is live, and the SMMU might read dwords of this CD in any
>>>    		 * order. Ensure that it observes valid values before reading
>>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>>    			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>>    			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>>    			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>>> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
>>
>> ...these should be set in io-pgtable's TCR value *if* io-pgtable is
>> using DBM, then propagated through from there like everything else.
>>
> 
> So the DBM bit supersedes the TCR bit -- that's strange? Say if you mark a PTE as
> writable-clean with DBM set but TCR.HD unset, then won't it trigger a perm-fault?
> I need to re-read that section of the manual, as I didn't get that impression from the above.

No, architecturally, the {TCR,CD}.HD bit is still the "master switch" 
for whether the DBM field in PTEs is interpreted or not, but in terms of 
our abstraction, we only need to care about setting HD if io-pgtable is 
actually going to want to use DBM, so we may as well leave it to 
io-pgtable to tell us canonically. The logical interface here in general 
is that we use the initial io_pgtable_cfg to tell it what it *can* use, 
but then we read back afterwards to see exactly what it has chosen to 
do, and I think HA/HD also fit perfectly into that paradigm.
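
Something like the following shape, as a sketch (assuming the
IO_PGTABLE_QUIRK_ARM_HD quirk from this series; exact placement in the
driver may well differ):

	/* tell io-pgtable what it *can* use */
	if (smmu->features & ARM_SMMU_FEAT_HD &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);

	/* ...and read back what it actually chose to do */
	if (pgtbl_cfg.quirks & IO_PGTABLE_QUIRK_ARM_HD)
		cfg->cd.tcr |= CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD;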

Robin.

>>>    			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>>    	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>>    
>>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>>    		.iommu_dev	= smmu->dev,
>>>    	};
>>>    
>>> +	if (smmu->features & ARM_SMMU_FEAT_HD)
>>> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
>>
>> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not
>> least because you don't have any of the relevant business for
>> synchronising non-coherent PTEs in your walk functions, but it's also
>> implementation-defined whether HTTU even operates on non-cacheable
>> pagetables, and frankly you just don't want to go there ;)
>>
> /me nods OK.
> 
>> Robin.
>>
>>>    	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>>    		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>>    	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> index e15750be1d95..ff32242f2fdb 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> @@ -292,6 +292,9 @@
>>>    #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>>>    #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>>>    
>>> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
>>> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
>>> +
>>>    #define CTXDESC_CD_0_AA64		(1UL << 41)
>>>    #define CTXDESC_CD_0_S			(1UL << 44)
>>>    #define CTXDESC_CD_0_R			(1UL << 45)
>>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>>> index d7626ca67dbf..a11902ae9cf1 100644
>>> --- a/include/linux/io-pgtable.h
>>> +++ b/include/linux/io-pgtable.h
>>> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>>>    	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>>>    	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>>>    	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
>>> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>>>    
>>>    	unsigned long			quirks;
>>>    	unsigned long			pgsize_bitmap;

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 13:40     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-29 13:40 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

Hi Joao,

Thanks for doing this.

On 2022/4/29 05:09, Joao Martins wrote:
> Add to the iommu domain operations a set of callbacks to
> perform dirty tracking, particularly to start and stop
> tracking, and finally to test and clear the dirty data.
> 
> Drivers are expected to dynamically change their hw protection
> domain bits to toggle the tracking and flush some form of
> control state structure that stands in the IOVA translation
> path.
> 
> For reading and clearing dirty data, in all IOMMUs a transition
> of any of the PTE access bits (Access, Dirty) implies flushing
> the IOTLB to invalidate any stale data it may hold about whether
> or not the IOMMU should update the said PTEs. The iommu core APIs
> introduce a new structure for storing the dirties; vendor
> IOMMUs implementing .read_and_clear_dirty() just use
> iommu_dirty_bitmap_record() to set the memory storing the dirties.
> The underlying tracking/iteration of the user bitmap memory is instead
> done by iommufd, which takes care of initializing the dirty bitmap
> *prior* to passing it to the IOMMU domain op.
> 
> So far, for the currently/to-be-supported IOMMUs with dirty tracking
> support, this is the case particularly because the tracking is part of
> the first stage tables and part of address translation. Below
> it is described how hardware deals with the hardware protection
> domain control bits, to justify the added iommu core APIs. The
> vendor IOMMU implementations will also explain in more detail
> the dirty bit usage/clearing in the IOPTEs.
> 
> * x86 AMD:
> 
> The same thing applies for AMD, particularly the Device Table,
> followed by flushing the Device IOTLB. On AMD[1],
> section "2.2.1 Updating Shared Tables", e.g.
> 
>> Each table can also have its contents cached by the IOMMU or
> peripheral IOTLBs. Therefore, after
> updating a table entry that can be cached, system software must
> send the IOMMU an appropriate
> invalidate command. Information in the peripheral IOTLBs must
> also be invalidated.
> 
> There's no mention of particular bits that are cached or
> not but fetching a dev entry is part of address translation
> as also depicted, so invalidate the device table to make
> sure the next translations fetch a DTE entry with the HD bits set.
> 
> * x86 Intel (rev3.0+):
> 
> Likewise[2] set the SSADE bit in the scalable-entry second stage table
> to enable Access/Dirty bits in the second stage page table. See manual,
> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
> Considerations"
> 
>> When modifying root-entries, scalable-mode root-entries,
> context-entries, or scalable-mode context
> entries:
>> Software must serially invalidate the context-cache,
> PASID-cache (if applicable), and the IOTLB.  The serialization is
> required since hardware may utilize information from the
> context-caches (e.g., Domain-ID) to tag new entries inserted to
> the PASID-cache and IOTLB for processing in-flight requests.
> Section 6.5 describe the invalidation operations.
> 
> And also "Table 23. Guidance to Software for Invalidations" in
> section "6.5.3.3 Guidance to Software for Invalidations"
> explicitly mentions
> 
>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
> entry with PGTT value of Second-stage or Nested
> 
> * ARM SMMUV3.2:
> 
> SMMUv3.2 needs to toggle the dirty tracking bits in the
> CD (or S2CD) descriptor and to flush/invalidate
> the IOMMU dev IOTLB.
> 
> Reference[0]: SMMU spec, "5.4.1 CD notes",
> 
>> The following CD fields are permitted to be cached as part of a
> translation or TLB entry, and alteration requires
> invalidation of any TLB entry that might have cached these
> fields, in addition to CD structure cache invalidation:
> 
> ...
> HA, HD
> ...
> 
> The ARM SMMUv3 case is a tad different from its x86
> counterparts, though. Rather than changing *only* the IOMMU domain device entry to
> enable dirty tracking (and having a dedicated bit for dirtiness in the IOPTE),
> ARM instead uses a dirty-bit modifier which is separately enabled, and
> changes the *existing* meaning of the access bits (for ro/rw), to the point
> that marking the access bits read-only but with the dirty-bit modifier enabled
> doesn't trigger a permission IO page fault.
> 
> In practice this means that changing the iommu context isn't enough,
> and is in fact mostly useless IIUC (and can always be enabled). Dirtying
> is only really enabled when the DBM pte bit is set (with the
> CD.HD bit as a prereq).
> 
> To capture this h/w construct, an iommu core API is added which enables
> dirty tracking on an IOVA range rather than on a device/context entry.
> iommufd picks one or the other, and the IOMMUFD core will favour the
> device-context op, falling back to the IOVA-range alternative.

Instead of specification words, I'd like to read more about why the
callbacks are needed and how they should be implemented and consumed.
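
For illustration, a driver-side consumer would presumably look roughly
like the sketch below (loose pseudocode: the PTE-walk helpers are
invented, only iommu_dirty_bitmap_record() is from this series):

	static int foo_read_and_clear_dirty(struct iommu_domain *domain,
					    unsigned long iova, size_t size,
					    struct iommu_dirty_bitmap *dirty)
	{
		/* walk the leaf IOPTEs covering [iova, iova + size) */
		for_each_leaf_iopte(domain, iova, size, pte) {	  /* pseudocode */
			if (iopte_test_and_clear_dirty(pte))	  /* pseudocode */
				iommu_dirty_bitmap_record(dirty, iopte_iova(pte),
							  iopte_size(pte));
		}
		return 0;
	}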

> 
> [0] https://developer.arm.com/documentation/ihi0070/latest
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://cdrdv2.intel.com/v1/dl/getContent/671081
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
>   include/linux/io-pgtable.h |  6 +++++
>   include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
>   3 files changed, 86 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0c42ece25854..d18b9ddbcce4 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -15,6 +15,7 @@
>   #include <linux/init.h>
>   #include <linux/export.h>
>   #include <linux/slab.h>
> +#include <linux/highmem.h>
>   #include <linux/errno.h>
>   #include <linux/iommu.h>
>   #include <linux/idr.h>
> @@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>   	return user;
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
> +
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length)
> +{
> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> +
> +	nbits = max(1UL, length >> dirty->pgshift);
> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> +	start_offset = dirty->start_offset;
> +
> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> +		bitmap_set(kaddr, offset, size);
> +		kunmap(dirty->pages[idx]);
> +		start_offset = offset = 0;
> +		nbits -= size;
> +		idx++;
> +	}
> +
> +	if (dirty->gather)
> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
> +
> +	return nbits;
> +}
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 86af6f0a00a2..82b39925c21f 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -165,6 +165,12 @@ struct io_pgtable_ops {
>   			      struct iommu_iotlb_gather *gather);
>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>   				    unsigned long iova);
> +	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
> +				  unsigned long iova, size_t size,
> +				  bool enabled);
> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
> +				    unsigned long iova, size_t size,
> +				    struct iommu_dirty_bitmap *dirty);
>   };
>   
>   /**
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 6ef2df258673..ca076365d77b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
>   	bool			queued;
>   };
>   
> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
> + * @pgshift: Page granularity of the bitmap
> + * @gather: Range information for a pending IOTLB flush
> + * @start_offset: Offset of the first user page
> + * @pages: User pages representing the bitmap region
> + * @npages: Number of user pages pinned
> + */
> +struct iommu_dirty_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	struct iommu_iotlb_gather *gather;
> +	unsigned long start_offset;
> +	unsigned long npages;

I haven't found where "npages" is used in this patch. It would be better
to add it when it's really used. Sorry if I missed anything.

> +	struct page **pages;
> +};
> +
>   /**
>    * struct iommu_ops - iommu ops and capabilities
>    * @capable: check capability
> @@ -275,6 +294,13 @@ struct iommu_ops {
>    * @enable_nesting: Enable nesting
>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>    * @free: Release the domain after use.
> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
> + * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
> + *                            an iommu domain
> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
> + *                        into a bitmap, with a bit represented as a page.
> + *                        Reads the dirty PTE bits and clears it from IO
> + *                        pagetables.
>    */
>   struct iommu_domain_ops {
>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> @@ -305,6 +331,15 @@ struct iommu_domain_ops {
>   				  unsigned long quirks);
>   
>   	void (*free)(struct iommu_domain *domain);
> +
> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
> +	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
> +					unsigned long iova, size_t size,
> +					struct iommu_iotlb_gather *iotlb_gather,
> +					bool enabled);

It seems that we are adding two callbacks for the same purpose. Which
should the IOMMU drivers choose to support? Is there any functional
difference between the two? How should the caller choose which to use?

> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
> +				    unsigned long iova, size_t size,
> +				    struct iommu_dirty_bitmap *dirty);
>   };
>   
>   /**
> @@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
>   extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>   			      unsigned long iova, int flags);
>   
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length);
> +
> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   unsigned long base,
> +					   unsigned long pgshift,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	memset(dirty, 0, sizeof(*dirty));
> +	dirty->iova = base;
> +	dirty->pgshift = pgshift;
> +	dirty->gather = gather;
> +
> +	if (gather)
> +		iommu_iotlb_gather_init(dirty->gather);
> +}
> +
>   static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
>   {
>   	if (domain->ops->flush_iotlb_all)

Best regards,
baolu

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29 12:08     ` Jason Gunthorpe via iommu
@ 2022-04-29 14:26       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 13:08, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
> 
> Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
> will want to use this same data structure.
> 
OK.

>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
> 
> kmap_local?
> 
/me nods

>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	struct iommu_iotlb_gather *gather;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
>> +	struct page **pages;
> 
> In many (all?) cases I would expect this to be called from a process
> context, can we just store the __user pointer here, or is the idea
> that with modern kernels poking a u64 to userspace is slower than a
> kmap?
> 
I have both options implemented; I'll need to measure it. Code-wise it would be
a lot simpler to just poke at the userspace addresses (that was my first
prototype of this) but I felt that poking at kernel addresses was safer and
avoided assumptions about the context (from the iommu driver). I can bring back
the former alternative if this was the wrong thing to do.

> I'm particularly concerned that this starts to require high-order
> allocations with more than 2M of bitmap. Maybe one direction is
> to GUP 2M chunks at a time and walk the __user pointer.
> 
That's what I am doing here. We GUP 2M of *bitmap* at a time,
which is about 1 page's worth of struct page pointers. That is enough
to read the dirties of 64G of IOVA in the worst-case scenario (i.e. with base pages).
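
(For reference, assuming 4K base pages and one bit per page: 2M of bitmap
is 2^24 bits, covering 2^24 * 4K = 64G of IOVA, while pinning those 2M of
user memory takes 2M / 4K = 512 struct page pointers, i.e. roughly one 4K
page of pointers.)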

>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +					   unsigned long base,
>> +					   unsigned long pgshift,
>> +					   struct iommu_iotlb_gather *gather)
>> +{
>> +	memset(dirty, 0, sizeof(*dirty));
>> +	dirty->iova = base;
>> +	dirty->pgshift = pgshift;
>> +	dirty->gather = gather;
>> +
>> +	if (gather)
>> +		iommu_iotlb_gather_init(dirty->gather);
>> +}
> 
> I would expect all the GUPing logic to be here too?

I had this in the iommufd_dirty_iter logic given that the iommu iteration
logic is in the parent structure that stores iommu_dirty_data.

My thinking with this patch was just to have what the IOMMU driver needs.

Actually, if anything, this helper above ought to be moved to a later patch.

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-29 12:19     ` Jason Gunthorpe via iommu
@ 2022-04-29 14:27       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 13:19, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
>> Add the correspondent APIs for performing VFIO dirty tracking,
>> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
>> 				     the area @dirty_bitmap
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
>> 				    the area @dirty_bitmap
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
>> tracking is active.
>>
>> Advertise VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION,
>> where the configured domain page size is set the same as
>> iopt::iova_alignment and the maximum dirty bitmap size is the same
>> as VFIO's. Compared to the VFIO type1 iommu, the perpetual dirtying is
>> not implemented and userspace gets -EOPNOTSUPP, which is handled by
>> today's userspace.
>>
>> Move iommufd_get_pagesizes() definition prior to unmap for
>> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
>> size against IOPT pagesize.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
>>  1 file changed, 209 insertions(+), 12 deletions(-)
> 
> I think I would probably not do this patch, it has behavior that is
> quite different from the current vfio - ie the interaction with the
> mdevs, and I don't intend to fix that. 

I'll drop this, until I hear otherwise.

I wasn't sure what people were leaning towards, and keeping the perpetual-dirty
stuff didn't feel right for a new UAPI either.

> So, with this patch and a mdev
> then vfio_compat will return all-not-dirty but current vfio will
> return all-dirty - and that is significant enough to break qemu.
> 
Ack

> We've made a qemu patch to allow qemu to be happy if dirty tracking is
> not supported in the vfio container for migration, which is part of
> the v2 enablement series. That seems like the better direction.
> 
So in my auditing/testing, the listener callbacks are called but the dirty ioctls
return an error at start, and it bails out early on sync. I suppose migration
won't really work, as no pages get marked dirty and whatnot, but it could
cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
cleaning out the constant error messages you get, and not even attempting
migration by introducing a migration blocker early on ... should it fetch
no migration capability?

> I can see why this is useful to test with the current qemu however.

Yes, it is indeed useful for testing.

I am wondering if we can still emulate that in userspace, given that the expectation
from each GET_BITMAP call is to get all dirties, and likewise for the type1 unmap dirty.
Unless I have missed something obvious.

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29 11:56       ` Jason Gunthorpe via iommu
@ 2022-04-29 14:28         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:28 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 12:56, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 08:07:14AM +0000, Tian, Kevin wrote:
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Sent: Friday, April 29, 2022 5:09 AM
>>>
>>> +static int __set_dirty_tracking_range_locked(struct iommu_domain
>>> *domain,
>>
>> suppose anything using iommu_domain as the first argument should
>> be put in the iommu layer. Here it's more reasonable to use iopt
>> as the first argument or simply merge with the next function.
>>
>>> +					     struct io_pagetable *iopt,
>>> +					     bool enable)
>>> +{
>>> +	const struct iommu_domain_ops *ops = domain->ops;
>>> +	struct iommu_iotlb_gather gather;
>>> +	struct iopt_area *area;
>>> +	int ret = -EOPNOTSUPP;
>>> +	unsigned long iova;
>>> +	size_t size;
>>> +
>>> +	iommu_iotlb_gather_init(&gather);
>>> +
>>> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
>>> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
>>
>> how is this different from leaving the iommu driver to walk the page table
>> and poke the modifier bit for all present PTEs? As commented in the last
>> patch this may allow removing the range op completely.
> 
> Yea, I'm not super keen on the two ops either, especially since they
> are so wildly different.
> 
/me ack

> I would expect that set_dirty_tracking turns on tracking for the
> entire iommu domain, for all present and future maps
> 
Yes.

I didn't do that correctly on ARM, nor on device attach
(for x86, e.g. on hotplug).

> While set_dirty_tracking_range - I guess it only does the range, so if
> we make a new map then the new range will be untracked? But that is
> now racy, we have to map and then call set_dirty_tracking_range
> 
> It seems better for the iommu driver to deal with this and ARM should
> atomically make the new maps dirty tracking..
> 

In the next iteration I'll need to fix the way IOMMUs handle dirty-tracking
probing and tracking in their private intermediate structures.

But yes, I was trying to transfer this to the iommu driver (perhaps in a
convoluted way).

>>> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>> +			    struct iommu_domain *domain, bool enable)
>>> +{
>>> +	struct iommu_domain *dom;
>>> +	unsigned long index;
>>> +	int ret = -EOPNOTSUPP;
> 
> Returns EOPNOTSUPP if the xarray is empty?
> 
Argh no. Maybe -EINVAL is better here.

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-29 12:09         ` Jason Gunthorpe via iommu
@ 2022-04-29 14:33           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 13:09, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
>> On 4/29/22 09:12, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>> [...]
>>>> +
>>>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>>>> +				      struct iommufd_dirty_data *bitmap)
>>>
>>> In a glance this function and all previous helpers doesn't rely on any
>>> iommufd objects except that the new structures are named as
>>> iommufd_xxx. 
>>>
>>> I wonder whether moving all of them to the iommu layer would make
>>> more sense here.
>>>
>> I suppose, instinctively, I was trying to make this tie to iommufd only,
>> to avoid getting it called in cases we don't except when made as a generic
>> exported kernel facility.
>>
>> (note: iommufd can be built as a module).
> 
> Yeah, I think that is a reasonable reason to put iommufd only stuff in
> iommufd.ko rather than bloat the static kernel.
> 
> You could put it in a new .c file though so there is some logical
> modularity?

I can do that (iommu.c / dirty.c if no better idea comes to mind,
suggestions welcome :)).

Although I should say that there's some dependency on iopt structures and
whatnot, so I have to see if this is a change for the better. I'll respond
here should it be dubious.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-29 12:26         ` Robin Murphy
@ 2022-04-29 14:34           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 13:26, Robin Murphy wrote:
> On 2022-04-29 12:54, Joao Martins wrote:
>> On 4/29/22 12:11, Robin Murphy wrote:
>>> On 2022-04-28 22:09, Joao Martins wrote:
>>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>>
>>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>>> quirk to io-pgtable.
>>>>
>>>> BBML1 requires still marking PTE nT prior to performing a
>>>> translation table update, while BBML2 requires neither break-before-make
>>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>>
>>> You can drop this, and the dependencies on BBML elsewhere, until you get
>>> round to the future large-page-splitting work, since that's the only
>>> thing this represents. Not much point having the feature flags without
>>> an actual implementation, or any users.
>>>
>> OK.
>>
>> My thinking was that the BBML2 meant *also* that we don't need that break-before-make
>> thingie upon switching translation table entries. It seems that from what you
>> say, BBML2 then just refers to this but only on the context of switching between
>> hugepages/normal pages (?), not in general on all bits of the PTE (which we woud .. upon
>> switching from writeable-dirty to writeable-clean with DBM-set).
> 
> Yes, BBML is purely about swapping between a block (hugepage) entry and 
> a table representing the exact equivalent mapping.
> 
> A break-before-make procedure isn't required when just changing 
> permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
> but as mentioned I think we could probably just not do that anyway.

Interesting, thanks for the clarification.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29 14:26       ` Joao Martins
@ 2022-04-29 14:35         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 14:35 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Fri, Apr 29, 2022 at 03:26:41PM +0100, Joao Martins wrote:

> I had this in the iommufd_dirty_iter logic given that the iommu iteration
> logic is in the parent structure that stores iommu_dirty_data.
> 
> My thinking with this patch was just to have what the IOMMU driver needs.

I would put the whole mechanism in one patch, even though most of the
code will live in iommufd; then it would be clear how it works.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
  2022-04-29 12:14     ` Jason Gunthorpe via iommu
@ 2022-04-29 14:36       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On 4/29/22 13:14, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:
> 
>> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
>> +					  unsigned long iova, size_t size,
>> +					  struct iommufd_dirty_data *bitmap,
>> +					  struct iommufd_dirty_iter *iter)
>> +{
> 
> This shouldn't be a nofail - that is only for path that trigger from
> destroy/error unwindow, which read dirty never does. The return code
> has to be propogated.
> 
> It needs some more thought how to organize this.. only unfill_domains
> needs this path, but it is shared with the error unwind paths and
> cannot generally fail..

It's part of the reason I split this part out, as it didn't strike me as a natural
extension of the API.
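
The direction would be to keep a nofail wrapper only for the destroy/unwind
paths and have the read-dirty flavour return its error. A minimal sketch of
that split (the nofail helper name is illustrative, adapted from the one
quoted above; the exact plumbing is still to be worked out):

/* Destroy/error-unwind paths: cannot fail, so only warn on the impossible */
static void iommu_unmap_nofail(struct iommu_domain *domain,
			       unsigned long iova, size_t size)
{
	size_t ret;

	ret = iommu_unmap(domain, iova, size);
	WARN_ON(ret != size);
}

/* Read-dirty flavour keeps (and propagates) the return code instead */
static int iommu_unmap_read_dirty(struct iommu_domain *domain,
				  unsigned long iova, size_t size,
				  struct iommufd_dirty_data *bitmap,
				  struct iommufd_dirty_iter *iter);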

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-29 14:27       ` Joao Martins
@ 2022-04-29 14:36         ` Jason Gunthorpe
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 14:36 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Fri, Apr 29, 2022 at 03:27:00PM +0100, Joao Martins wrote:

> > We've made a qemu patch to allow qemu to be happy if dirty tracking is
> > not supported in the vfio container for migration, which is part of
> > the v2 enablement series. That seems like the better direction.
> > 
> So in my auditing/testing, the listener callbacks are called but the dirty ioctls
> return an error at start, and bails out early on sync. I suppose migration
> won't really work, as no pages aren't set and what not but it could
> cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
> cleaning out the constant error messages you get and not even attempt
> migration by introducing a migration blocker early on ... should it fetch
> no migration capability?

It really just means pre-copy doesn't work and we can skip it, though
I'm not sure exactly what the qemu patch ended up doing.. I think it
will be posted by Monday

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 12:23             ` Jason Gunthorpe via iommu
@ 2022-04-29 14:45               ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 13:23, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
> 
>>> TBH I'd be inclined to just enable DBM unconditionally in 
>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
>>> dynamically (especially on a live domain) seems more trouble that it's 
>>> worth.
>>
>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>> to what we can do on the CPU/KVM side). e.g. the first time you do
>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>> of guest time, as opposed to those only after you enabled dirty-tracking.
> 
> It just means that on SMMU the start tracking op clears all the dirty
> bits.
> 
Hmm, OK. But aren't we just picking a poison here? On ARM it's the difference
between setting the DBM bit and putting the IOPTE as writeable-clean (which
is clearing another bit) versus read-and-clear-when-dirty-tracking-starts, which means
we need to re-walk the pagetables to clear one bit.

It's walking over ranges regardless.

> I also suppose you'd also want to install the IOPTEs as dirty to
> avoid a performance regression writing out new dirties for cases where
> we don't dirty track? And then the start tracking op will switch this
> so map creates non-dirty IOPTEs?

If we end up always enabling DBM + CD.HD, perhaps it makes sense for the IOTLB to cache
the dirty bit until we clear those bits.

But really, the way this series was /trying/ to do it still feels like the least pain,
and that way we have the same expectations from all IOMMUs from the iommufd
perspective too.
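
For reference, the stage-1 encoding in question (writeable-clean vs.
writeable-dirty) comes down to two bits. A standalone sketch, with the bit
positions as I read them from the Arm VMSA for stage-1 descriptors:

#include <stdbool.h>
#include <stdint.h>

#define PTE_AP_RDONLY	(1ULL << 7)	/* AP[2]: 1 = read-only */
#define PTE_DBM		(1ULL << 51)	/* Dirty Bit Modifier */

/* With DBM set, HW clears AP[2] on the first write: writeable-dirty */
static bool pte_is_writeable_dirty(uint64_t pte)
{
	return (pte & PTE_DBM) && !(pte & PTE_AP_RDONLY);
}

/* "Clearing" the dirty state means making the PTE writeable-clean again */
static uint64_t pte_mk_writeable_clean(uint64_t pte)
{
	return pte | PTE_AP_RDONLY;
}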

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-29 14:36         ` Jason Gunthorpe
@ 2022-04-29 14:52           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 15:36, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 03:27:00PM +0100, Joao Martins wrote:
> 
>>> We've made a qemu patch to allow qemu to be happy if dirty tracking is
>>> not supported in the vfio container for migration, which is part of
>>> the v2 enablement series. That seems like the better direction.
>>>
>> So in my auditing/testing, the listener callbacks are called but the dirty ioctls
>> return an error at start, and bails out early on sync. I suppose migration
>> won't really work, as no pages aren't set and what not but it could
>> cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
>> cleaning out the constant error messages you get and not even attempt
>> migration by introducing a migration blocker early on ... should it fetch
>> no migration capability?
> 
> It really just means pre-copy doesn't work and we can skip it, though
> I'm not sure exactly what the qemu patch ended up doing.. I think it
> will be posted by Monday
> 
Ha, or that :D i.e.

Why bother checking if there are dirty pages periodically when we can just do it at the
beginning, and at the end when we pause the guest (and DMA)? Maybe it prevents a whole
bunch of copying in the interim, and this patch of yours might be an improvement.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29 12:38       ` Jason Gunthorpe via iommu
@ 2022-04-29 15:20         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 15:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 13:38, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 11:27:58AM +0100, Joao Martins wrote:
>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>> erronous guest (or device) DMAing to an address being unmapped at the
>>>> same time.
>>>
>>> an erroneous attempt like above cannot anticipate which DMAs can
>>> succeed in that window thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From guest p.o.v. all those are black-box
>>> hardware logic to serve a virtual iotlb invalidation request which just
>>> cannot be completed in one cycle.
>>>
>>> Hence in reality probably this is not required except to meet vfio
>>> compat requirement. Just in concept returning dirty bits at unmap
>>> is more accurate.
>>>
>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>
>> OK, it seems I am not far off from your thoughts.
>>
>> I'll see what others think too, and if so I'll remove the unmap_dirty.
>>
>> Because if vfio-compat doesn't get the iommu hw dirty support, then there would
>> be no users of unmap_dirty.
> 
> I'm inclined to agree with Kevin.
> 
> If the VM does do a rouge DMA while unmapping its vIOMMU then already
> it will randomly get or loose that DMA. Adding the dirty tracking race
> during live migration just further bias's that randomness toward
> loose.  Since we don't relay protection faults to the guest there is
> no guest observable difference, IMHO.
> 
Hmm, we don't /yet/. I don't know if that is going to change at some point.

We do propagate MCEs for example (and AER?). And I suppose with nesting
IO page faults will be propagated. Albeit that is a different thing from the
problem above.

Albeit even if we do, the IO page faults induced by the unmap-and-read-dirty
ought not to be propagated to the guest.

> In any case, I don't think the implementation here for unmap_dirty is
> race free?  So, if we are doing all this complexity just to make the
> race smaller, I don't see the point.
> 
+1

> To make it race free I think you have to write protect the IOPTE then
> synchronize the IOTLB, read back the dirty, then unmap and synchronize
> the IOTLB again. 

That would indeed fully close the race with the IOTLB. But damn, it would
be expensive.

> That has such a high performance cost I'm not
> convinced it is worthwhile - and if it has to be two step like this
> then it would be cleaner to introduce a 'writeprotect and read dirty'
> op instead of overloading unmap. 

I can switch to that kind of primitive, should the group deem it
necessary. But it feels like we are leaning more towards a no.

> We don't need to microoptimize away
> the extra io page table walk when we are already doing two
> invalidations in the overhead..
> 
IIUC, fully closing the race as above might be incompatible with SMMUv3,
given that we need to clear DBM (or CD.HD) to move the IOPTEs
from writeable-clean to read-only, but then the dirty bit loses its
meaning. Oh wait, unless rather than comparing against writeable-clean
we clear DBM and then just check whether the PTE was RO or RW to determine
dirty (provided we discard any IO page faults happening between wrprotect
and read-dirty).
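
Spelling out the two-step variant being discussed, just as a sketch (the
write-protect op is hypothetical and does not exist in this series):

static int iopt_wrprotect_unmap_read_dirty(struct iommu_domain *domain,
					   unsigned long iova, size_t size,
					   struct iommu_dirty_bitmap *dirty)
{
	int ret;

	/* 1) Downgrade the IOPTEs so the device can no longer dirty them */
	ret = domain->ops->write_protect(domain, iova, size); /* hypothetical op */
	if (ret)
		return ret;
	iommu_flush_iotlb_all(domain);

	/* 2) Dirty state is now stable: read it back and record it */
	ret = domain->ops->read_and_clear_dirty(domain, iova, size, dirty);
	if (ret)
		return ret;

	/* 3) Tear down the mapping; iommu_unmap() invalidates once more */
	if (iommu_unmap(domain, iova, size) != size)
		return -EFAULT;

	return 0;
}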

>>>> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks
>>>
>>> there was discussion adding device capability uAPI somewhere.
>>>
>> ack let me know if there was snippets to the conversation as I seem to have missed that.
> 
> It was just discssion pending something we actually needed to report.
> 
> Would be a very simple ioctl taking in the device ID and fulling a
> struct of stuff.
>  
Yeap.

>>> probably this can be reported as a device cap as supporting of dirty bit is
>>> an immutable property of the iommu serving that device. 
> 
> It is an easier fit to read it out of the iommu_domain after device
> attach though - since we don't need to build new kernel infrastructure
> to query it from a device.
>  
That would be more like working on a hwpt_id instead of a device_id for that
previously mentioned ioctl. Something like IOMMUFD_CHECK_EXTENSION,
which receives a capability nr (or additionally a hwpt_id) and returns a struct of
something. That is more future-proof for new kinds of stuff, e.g. fetching the
whole domain hardware capabilities available in the platform (or device, when passed a
hwpt_id), or platform reserved ranges (like the HT hole that AMD systems have, or
the 4G hole in x86). Right now it is all buried in sysfs, or sometimes in sysfs but
specific to the device, even though some of that info is orthogonal to the device.
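
As a strawman for the shape of that ioctl (everything below is illustrative,
not an existing iommufd uAPI):

#include <linux/types.h>

struct iommu_check_extension {
	__u32 size;	/* sizeof(struct iommu_check_extension) */
	__u32 hwpt_id;	/* hw_pagetable (iommu_domain) to query, or 0 */
	__u64 cap;	/* e.g. a hypothetical IOMMUFD_CAP_DIRTY_TRACKING */
	__u64 flags;	/* out: capability-specific data */
};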

>>> Userspace can
>>> enable dirty tracking on a hwpt if all attached devices claim the support
>>> and kernel will does the same verification.
>>
>> Sorry to be dense but this is not up to 'devices' given they take no
>> part in the tracking?  I guess by 'devices' you mean the software
>> idea of it i.e. the iommu context created for attaching a said
>> physical device, not the physical device itself.
> 
> Indeed, an hwpt represents an iommu_domain and if the iommu_domain has
> dirty tracking ops set then that is an inherent propery of the domain
> and does not suddenly go away when a new device is attached.
>  
> Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29 13:40     ` Baolu Lu
@ 2022-04-29 15:27       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 15:27 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On 4/29/22 14:40, Baolu Lu wrote:
> Hi Joao,
> 
> Thanks for doing this.
> 
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particulary to start and stop
>> tracking and finally to test and clear the dirty data.
>>
>> Drivers are expected to dynamically change its hw protection
>> domain bits to toggle the tracking and flush some form of
>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
>> iommu_dirty_bitmap_record() to set the memory storing dirties.
>> The underlying tracking/iteration of user bitmap memory is instead
>> done by iommufd which takes care of initializing the dirty bitmap
>> *prior* to passing to the IOMMU domain op.
>>
>> So far for currently/to-be-supported IOMMUs with dirty tracking
>> support this particularly because the tracking is part of
>> first stage tables and part of address translation. Below
>> it is mentioned how hardware deal with the hardware protection
>> domain control bits, to justify the added iommu core APIs.
>> vendor IOMMU implementation will also explain in more detail on
>> the dirty bit usage/clearing in the IOPTEs.
>>
>> * x86 AMD:
>>
>> The same thing for AMD particularly the Device Table
>> respectivally, followed by flushing the Device IOTLB. On AMD[1],
>> section "2.2.1 Updating Shared Tables", e.g.
>>
>>> Each table can also have its contents cached by the IOMMU or
>> peripheral IOTLBs. Therefore, after
>> updating a table entry that can be cached, system software must
>> send the IOMMU an appropriate
>> invalidate command. Information in the peripheral IOTLBs must
>> also be invalidated.
>>
>> There's no mention of particular bits that are cached or
>> not but fetching a dev entry is part of address translation
>> as also depicted, so invalidate the device table to make
>> sure the next translations fetch a DTE entry with the HD bits set.
>>
>> * x86 Intel (rev3.0+):
>>
>> Likewise[2] set the SSADE bit in the scalable-entry second stage table
>> to enable Access/Dirty bits in the second stage page table. See manual,
>> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
>> Considerations"
>>
>>> When modifying root-entries, scalable-mode root-entries,
>> context-entries, or scalable-mode context
>> entries:
>>> Software must serially invalidate the context-cache,
>> PASID-cache (if applicable), and the IOTLB.  The serialization is
>> required since hardware may utilize information from the
>> context-caches (e.g., Domain-ID) to tag new entries inserted to
>> the PASID-cache and IOTLB for processing in-flight requests.
>> Section 6.5 describe the invalidation operations.
>>
>> And also the whole chapter "" Table "Table 23.  Guidance to
>> Software for Invalidations" in "6.5.3.3 Guidance to Software for
>> Invalidations" explicitly mentions
>>
>>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
>> entry with PGTT value of Second-stage or Nested
>>
>> * ARM SMMUV3.2:
>>
>> SMMUv3.2 needs to toggle the dirty bit descriptor
>> over the CD (or S2CD) for toggling and flush/invalidate
>> the IOMMU dev IOTLB.
>>
>> Reference[0]: SMMU spec, "5.4.1 CD notes",
>>
>>> The following CD fields are permitted to be cached as part of a
>> translation or TLB entry, and alteration requires
>> invalidation of any TLB entry that might have cached these
>> fields, in addition to CD structure cache invalidation:
>>
>> ...
>> HA, HD
>> ...
>>
>> Although, The ARM SMMUv3 case is a tad different that its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking access bit read-only but with dirty-bit-modifier enabled
>> doesn't trigger an perm io page fault.
>>
>> In pratice this means that changing iommu context isn't enough
>> and in fact mostly useless IIUC (and can be always enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Instead of specification words, I'd like to read more about why the
> callbacks are needed and how should they be implemented and consumed.
> 
OK. I can extend the commit message towards that.

This was roughly my paranoid mind trying to capture all three so dumping
some of the pointers I read (and in the other commits too) is for future
consultation as well.

>>
>> [0] https://developer.arm.com/documentation/ihi0070/latest
>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>> [2] https://cdrdv2.intel.com/v1/dl/getContent/671081
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
>>   include/linux/io-pgtable.h |  6 +++++
>>   include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 86 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 0c42ece25854..d18b9ddbcce4 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -15,6 +15,7 @@
>>   #include <linux/init.h>
>>   #include <linux/export.h>
>>   #include <linux/slab.h>
>> +#include <linux/highmem.h>
>>   #include <linux/errno.h>
>>   #include <linux/iommu.h>
>>   #include <linux/idr.h>
>> @@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>>   	return user;
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap(dirty->pages[idx]);
>> +		start_offset = offset = 0;
>> +		nbits -= size;
>> +		idx++;
>> +	}
>> +
>> +	if (dirty->gather)
>> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
>> +
>> +	return nbits;
>> +}
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 86af6f0a00a2..82b39925c21f 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -165,6 +165,12 @@ struct io_pgtable_ops {
>>   			      struct iommu_iotlb_gather *gather);
>>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>>   				    unsigned long iova);
>> +	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
>> +				  unsigned long iova, size_t size,
>> +				  bool enabled);
>> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 6ef2df258673..ca076365d77b 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
>>   	bool			queued;
>>   };
>>   
>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	struct iommu_iotlb_gather *gather;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
> 
> I haven't found where "npages" is used in this patch. It's better to add
> it when it's really used? Sorry if I missed anything.
> 
Yeap, you're right. This was an oversight when I was moving code around.

But I might introduce all the code that uses/manipulates this structure.
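
Roughly, the intended use on the iommufd side is: pin the pages backing the
user bitmap, initialize the tracker, let the driver record dirty IOVAs, then
batch the IOTLB flush. A sketch (function name illustrative, pinning elided):

static int iopt_read_and_clear_dirty(struct iommu_domain *domain,
				     unsigned long iova, size_t length,
				     unsigned long pgshift,
				     unsigned long first_page_offset,
				     struct page **pinned_pages,
				     unsigned long npinned)
{
	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;
	int ret;

	iommu_dirty_bitmap_init(&dirty, iova, pgshift, &gather);
	dirty.start_offset = first_page_offset;	/* offset into the first pinned page */
	dirty.pages = pinned_pages;		/* pinned user pages backing the bitmap */
	dirty.npages = npinned;

	ret = domain->ops->read_and_clear_dirty(domain, iova, length, &dirty);

	/* flush whatever the driver gathered, in one go */
	iommu_iotlb_sync(domain, &gather);

	return ret;
}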

>> +	struct page **pages;
>> +};
>> +
>>   /**
>>    * struct iommu_ops - iommu ops and capabilities
>>    * @capable: check capability
>> @@ -275,6 +294,13 @@ struct iommu_ops {
>>    * @enable_nesting: Enable nesting
>>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>    * @free: Release the domain after use.
>> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
>> + * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
>> + *                            an iommu domain
>> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
>> + *                        into a bitmap, with a bit represented as a page.
>> + *                        Reads the dirty PTE bits and clears it from IO
>> + *                        pagetables.
>>    */
>>   struct iommu_domain_ops {
>>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>> @@ -305,6 +331,15 @@ struct iommu_domain_ops {
>>   				  unsigned long quirks);
>>   
>>   	void (*free)(struct iommu_domain *domain);
>> +
>> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
>> +	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
>> +					unsigned long iova, size_t size,
>> +					struct iommu_iotlb_gather *iotlb_gather,
>> +					bool enabled);
> 
> It seems that we are adding two callbacks for the same purpose. How
> should the IOMMU drivers select to support? Any functional different
> between these two? How should the caller select to use?
> 

x86 wouldn't need to care about the second one as it's all on a per-domain
basis. See the last two patches for how I sketched Intel IOMMU support.

Albeit the second callback is going to be removed, based on this morning's discussion.

But originally it was to cover how SMMUv3.2 dirty tracking only really
gets enabled on a PTE basis rather than on the iommu domain. But this
is deferred now to be up to the iommu driver (when it needs to) ... to walk
its pagetables and set DBM (or maybe from the beginning, currently in debate).

>> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> @@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
>>   extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>>   			      unsigned long iova, int flags);
>>   
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length);
>> +
>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +					   unsigned long base,
>> +					   unsigned long pgshift,
>> +					   struct iommu_iotlb_gather *gather)
>> +{
>> +	memset(dirty, 0, sizeof(*dirty));
>> +	dirty->iova = base;
>> +	dirty->pgshift = pgshift;
>> +	dirty->gather = gather;
>> +
>> +	if (gather)
>> +		iommu_iotlb_gather_init(dirty->gather);
>> +}
>> +
>>   static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
>>   {
>>   	if (domain->ops->flush_iotlb_all)
> 
> Best regards,
> baolu

Thanks!

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
@ 2022-04-29 15:27       ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 15:27 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm, iommu

On 4/29/22 14:40, Baolu Lu wrote:
> Hi Joao,
> 
> Thanks for doing this.
> 
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particulary to start and stop
>> tracking and finally to test and clear the dirty data.
>>
>> Drivers are expected to dynamically change its hw protection
>> domain bits to toggle the tracking and flush some form of
>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
>> iommu_dirty_bitmap_record() to set the memory storing dirties.
>> The underlying tracking/iteration of user bitmap memory is instead
>> done by iommufd which takes care of initializing the dirty bitmap
>> *prior* to passing to the IOMMU domain op.
>>
>> So far for currently/to-be-supported IOMMUs with dirty tracking
>> support this particularly because the tracking is part of
>> first stage tables and part of address translation. Below
>> it is mentioned how hardware deal with the hardware protection
>> domain control bits, to justify the added iommu core APIs.
>> vendor IOMMU implementation will also explain in more detail on
>> the dirty bit usage/clearing in the IOPTEs.
>>
>> * x86 AMD:
>>
>> The same thing for AMD particularly the Device Table
>> respectivally, followed by flushing the Device IOTLB. On AMD[1],
>> section "2.2.1 Updating Shared Tables", e.g.
>>
>>> Each table can also have its contents cached by the IOMMU or
>> peripheral IOTLBs. Therefore, after
>> updating a table entry that can be cached, system software must
>> send the IOMMU an appropriate
>> invalidate command. Information in the peripheral IOTLBs must
>> also be invalidated.
>>
>> There's no mention of particular bits that are cached or
>> not but fetching a dev entry is part of address translation
>> as also depicted, so invalidate the device table to make
>> sure the next translations fetch a DTE entry with the HD bits set.
>>
>> * x86 Intel (rev3.0+):
>>
>> Likewise[2] set the SSADE bit in the scalable-entry second stage table
>> to enable Access/Dirty bits in the second stage page table. See manual,
>> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
>> Considerations"
>>
>>> When modifying root-entries, scalable-mode root-entries,
>> context-entries, or scalable-mode context
>> entries:
>>> Software must serially invalidate the context-cache,
>> PASID-cache (if applicable), and the IOTLB.  The serialization is
>> required since hardware may utilize information from the
>> context-caches (e.g., Domain-ID) to tag new entries inserted to
>> the PASID-cache and IOTLB for processing in-flight requests.
>> Section 6.5 describe the invalidation operations.
>>
>> And also the whole chapter "" Table "Table 23.  Guidance to
>> Software for Invalidations" in "6.5.3.3 Guidance to Software for
>> Invalidations" explicitly mentions
>>
>>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
>> entry with PGTT value of Second-stage or Nested
>>
>> * ARM SMMUV3.2:
>>
>> SMMUv3.2 needs to toggle the dirty bit descriptor
>> over the CD (or S2CD) for toggling and flush/invalidate
>> the IOMMU dev IOTLB.
>>
>> Reference[0]: SMMU spec, "5.4.1 CD notes",
>>
>>> The following CD fields are permitted to be cached as part of a
>> translation or TLB entry, and alteration requires
>> invalidation of any TLB entry that might have cached these
>> fields, in addition to CD structure cache invalidation:
>>
>> ...
>> HA, HD
>> ...
>>
>> Although, The ARM SMMUv3 case is a tad different that its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking access bit read-only but with dirty-bit-modifier enabled
>> doesn't trigger an perm io page fault.
>>
>> In pratice this means that changing iommu context isn't enough
>> and in fact mostly useless IIUC (and can be always enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Instead of specification words, I'd like to read more about why the
> callbacks are needed and how should they be implemented and consumed.
> 
OK. I can extend the commit message towards that.

This was roughly my paranoid mind trying to capture all three so dumping
some of the pointers I read (and in the other commits too) is for future
consultation as well.

>>
>> [0] https://developer.arm.com/documentation/ihi0070/latest
>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>> [2] https://cdrdv2.intel.com/v1/dl/getContent/671081
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
>>   include/linux/io-pgtable.h |  6 +++++
>>   include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 86 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 0c42ece25854..d18b9ddbcce4 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -15,6 +15,7 @@
>>   #include <linux/init.h>
>>   #include <linux/export.h>
>>   #include <linux/slab.h>
>> +#include <linux/highmem.h>
>>   #include <linux/errno.h>
>>   #include <linux/iommu.h>
>>   #include <linux/idr.h>
>> @@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>>   	return user;
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap(dirty->pages[idx]);
>> +		start_offset = offset = 0;
>> +		nbits -= size;
>> +		idx++;
>> +	}
>> +
>> +	if (dirty->gather)
>> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
>> +
>> +	return nbits;
>> +}
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 86af6f0a00a2..82b39925c21f 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -165,6 +165,12 @@ struct io_pgtable_ops {
>>   			      struct iommu_iotlb_gather *gather);
>>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>>   				    unsigned long iova);
>> +	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
>> +				  unsigned long iova, size_t size,
>> +				  bool enabled);
>> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 6ef2df258673..ca076365d77b 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
>>   	bool			queued;
>>   };
>>   
>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	struct iommu_iotlb_gather *gather;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
> 
> I haven't found where "npages" is used in this patch. It's better to add
> it when it's really used? Sorry if I missed anything.
> 
Yeap, you're right. This was an oversight when I was moving code around.

But I might instead introduce it together with all the code that
uses/manipulates this structure.
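
For reference, the expected consumption from a driver's
->read_and_clear_dirty() is roughly the below; the walker is made up purely
for illustration, only the two bitmap helpers are from this patch:

static int example_read_and_clear_dirty(struct iommu_domain *domain,
					unsigned long iova, size_t size,
					struct iommu_dirty_bitmap *dirty)
{
	unsigned long end = iova + size, pgsize = 1UL << dirty->pgshift;

	for (; iova < end; iova += pgsize) {
		/* vendor-specific IOPTE lookup and test-and-clear, elided */
		if (example_iopte_test_and_clear_dirty(domain, iova))
			/* sets bits in dirty->pages and batches the IOTLB flush */
			iommu_dirty_bitmap_record(dirty, iova, pgsize);
	}

	return 0;
}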

>> +	struct page **pages;
>> +};
>> +
>>   /**
>>    * struct iommu_ops - iommu ops and capabilities
>>    * @capable: check capability
>> @@ -275,6 +294,13 @@ struct iommu_ops {
>>    * @enable_nesting: Enable nesting
>>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>    * @free: Release the domain after use.
>> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
>> + * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
>> + *                            an iommu domain
>> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
>> + *                        into a bitmap, with a bit represented as a page.
>> + *                        Reads the dirty PTE bits and clears it from IO
>> + *                        pagetables.
>>    */
>>   struct iommu_domain_ops {
>>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>> @@ -305,6 +331,15 @@ struct iommu_domain_ops {
>>   				  unsigned long quirks);
>>   
>>   	void (*free)(struct iommu_domain *domain);
>> +
>> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
>> +	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
>> +					unsigned long iova, size_t size,
>> +					struct iommu_iotlb_gather *iotlb_gather,
>> +					bool enabled);
> 
> It seems that we are adding two callbacks for the same purpose. How
> should the IOMMU drivers select to support? Any functional different
> between these two? How should the caller select to use?
> 

x86 wouldn't need to care about the second one, as it's all on a per-domain
basis. See the last two patches for how I sketched Intel IOMMU support.

Albeit the second callback is going to be removed, based on this morning's
discussion.

But originally it was there to cover how SMMUv3.2 dirty tracking only really
gets enabled on a per-PTE basis rather than on the iommu domain. That is now
deferred to the iommu driver, which (when it needs to) walks its pagetables
and sets DBM (or maybe sets it from the beginning -- currently in debate).
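
For the record, the split between the two callbacks was meant to be consumed
roughly like the below (simplified sketch with a made-up helper name, not
the actual iopt code; it just favours the per-domain op and falls back to
the range op):

static int example_set_dirty_tracking(struct iommu_domain *domain,
				      unsigned long iova, size_t size,
				      struct iommu_iotlb_gather *gather,
				      bool enable)
{
	const struct iommu_domain_ops *ops = domain->ops;

	/* x86 (AMD/Intel): a single domain/context-wide toggle */
	if (ops->set_dirty_tracking)
		return ops->set_dirty_tracking(domain, enable);

	/* SMMUv3.2: DBM is a per-PTE modifier, so walk the IOVA range */
	if (ops->set_dirty_tracking_range)
		return ops->set_dirty_tracking_range(domain, iova, size,
						     gather, enable);

	return -EOPNOTSUPP;
}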

>> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> @@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
>>   extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>>   			      unsigned long iova, int flags);
>>   
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length);
>> +
>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +					   unsigned long base,
>> +					   unsigned long pgshift,
>> +					   struct iommu_iotlb_gather *gather)
>> +{
>> +	memset(dirty, 0, sizeof(*dirty));
>> +	dirty->iova = base;
>> +	dirty->pgshift = pgshift;
>> +	dirty->gather = gather;
>> +
>> +	if (gather)
>> +		iommu_iotlb_gather_init(dirty->gather);
>> +}
>> +
>>   static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
>>   {
>>   	if (domain->ops->flush_iotlb_all)
> 
> Best regards,
> baolu

Thanks!

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 14:45               ` Joao Martins
@ 2022-04-29 16:11                 ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 16:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
> On 4/29/22 13:23, Jason Gunthorpe wrote:
> > On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
> > 
> >>> TBH I'd be inclined to just enable DBM unconditionally in 
> >>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> >>> dynamically (especially on a live domain) seems more trouble that it's 
> >>> worth.
> >>
> >> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
> >> to what we can do on the CPU/KVM side). e.g. the first time you do
> >> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
> >> of guest time, as opposed to those only after you enabled dirty-tracking.
> > 
> > It just means that on SMMU the start tracking op clears all the dirty
> > bits.
> > 
> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
> we need to re-walk the pagetables to clear one bit.

Yes, I don't think an iopte walk is avoidable?

> It's walking over ranges regardless.

Also, keep in mind that start should always come up in a zero-dirties state
on all platforms. So all implementations need to do something to wipe the
dirty state, either explicitly during start or by restoring everything to
clean during stop.

A common use model might be to just destroy the iommu_domain without doing
stop, so preferring to clear the io page table at stop might be a better
overall design.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 16:11                 ` Jason Gunthorpe via iommu
@ 2022-04-29 16:40                   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 16:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 17:11, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>
>>>>> TBH I'd be inclined to just enable DBM unconditionally in 
>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
>>>>> dynamically (especially on a live domain) seems more trouble that it's 
>>>>> worth.
>>>>
>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>
>>> It just means that on SMMU the start tracking op clears all the dirty
>>> bits.
>>>
>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>> we need to re-walk the pagetables to clear one bit.
> 
> Yes, I don't think a iopte walk is avoidable?
> 
Correct -- exactly why I am still leaning more towards enabling the DBM bit
only at start, versus enabling DBM at domain creation while clearing dirty
at start.

>> It's walking over ranges regardless.
> 
> Also, keep in mind start should always come up in a 0 dirties state on
> all platforms. So all implementations need to do something to wipe the
> dirty state, either explicitly during start or restoring all clean
> during stop.
> 
> A common use model might be to just destroy the iommu_domain without
> doing stop so prefering the clearing io page table at stop might be a
> better overall design.

If we want to ensure that the IOPTE dirty state is immutable before start
and after stop, maybe this behaviour could be a new flag in set-dirty-tracking
(or be implicit, as you suggest). But ... hmm, at the same time, I wonder if
it's better to let userspace fetch the dirties that were there /right after
stopping/ (via GET_DIRTY_IOVA) rather than just discarding them implicitly at
SET_DIRTY_TRACKING(0|1).
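
Going back to the earlier point, the two alternatives I keep weighing for
SMMUv3, in rough pseudo-C on top of the io_pgtable_ops added earlier in the
series (neither is real code; the function names are made up and the caller
of option B just throws the bitmap contents away):

/* A: DBM stays clear; 'start' walks the range and sets DBM in the IOPTEs */
static int smmu_start_tracking_set_dbm(struct io_pgtable_ops *ops,
				       unsigned long iova, size_t size)
{
	return ops->set_dirty_tracking(ops, iova, size, true);
}

/*
 * B: DBM is set at domain finalise time; 'start' only wipes any stale
 * dirty state via a read-and-clear whose result is discarded.
 */
static int smmu_start_tracking_clear(struct io_pgtable_ops *ops,
				     unsigned long iova, size_t size,
				     struct iommu_dirty_bitmap *discard)
{
	return ops->read_and_clear_dirty(ops, iova, size, discard);
}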

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 16:40                   ` Joao Martins
@ 2022-04-29 16:46                     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 16:46 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 05:40:56PM +0100, Joao Martins wrote:

> > A common use model might be to just destroy the iommu_domain without
> > doing stop so prefering the clearing io page table at stop might be a
> > better overall design.
> 
> If we want to ensure that the IOPTE dirty state is immutable before start
> and after stop maybe this behaviour could be a new flag in the set-dirty-tracking
> (or be implicit as you suggest).  but ... hmm, at the same time, I wonder if
> it's better to let userspace fetch the dirties that were there /right after stopping/
> (via GET_DIRTY_IOVA) rather than just discarding them implicitly at SET_DIRTY_TRACKING(0|1).

It is not immutable, it is just the idea that there are no left-over
false-dirties after start returns.

Combined with the realization that in many cases we don't need to do a
stop, but will just destroy the whole iommu_domain.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 16:40                   ` Joao Martins
@ 2022-04-29 19:20                     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 19:20 UTC (permalink / raw)
  To: Joao Martins, Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse

On 2022-04-29 17:40, Joao Martins wrote:
> On 4/29/22 17:11, Jason Gunthorpe wrote:
>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>
>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>> worth.
>>>>>
>>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>
>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>> bits.
>>>>
>>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>>> we need to re-walk the pagetables to clear one bit.
>>
>> Yes, I don't think a iopte walk is avoidable?
>>
> Correct -- exactly why I am still more learning towards enable DBM bit only at start
> versus enabling DBM at domain-creation while clearing dirty at start.

I'd say it's largely down to whether you want the bother of 
communicating a dynamic behaviour change into io-pgtable. The big 
advantage of having it just use DBM all the time is that you don't have 
to do that, and the "start tracking" operation is then nothing more than 
a normal "read and clear" operation but ignoring the read result.

At this point I'd much rather opt for simplicity, and leave the fancier 
stuff to revisit later if and when somebody does demonstrate a 
significant overhead from using DBM when not strictly needed.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 23:51     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-29 23:51 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	struct iommu_domain *dom;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (!domain) {
> +		down_write(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
> +			if (ret < 0)
> +				break;

Do you need to roll back to the original state before returning failure?
Some domains will already have had dirty bit tracking enabled at that point.
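
e.g. something along these lines (completely untested, just to illustrate
the unwind):

	unsigned long undo;	/* declared at the top of the function */

	xa_for_each(&iopt->domains, index, dom) {
		ret = iommu_set_dirty_tracking(dom, iopt, enable);
		if (ret < 0) {
			/* restore the domains that were already switched */
			xa_for_each(&iopt->domains, undo, dom) {
				if (undo >= index)
					break;
				iommu_set_dirty_tracking(dom, iopt, !enable);
			}
			break;
		}
	}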

> +		}
> +		up_write(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
> +	}
> +
> +	up_write(&iopt->iova_rwsem);
> +	return ret;
> +}

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-30  4:11     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-30  4:11 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> Add an IO pagetable API iopt_read_and_clear_dirty_data() that
> performs the reading of dirty IOPTEs for a given IOVA range and
> then copying back to userspace from each area-internal bitmap.
> 
> Underneath it uses the IOMMU equivalent API which will read the
> dirty bits, as well as atomically clearing the IOPTE dirty bit
> and flushing the IOTLB at the end. The dirty bitmaps pass an
> iotlb_gather to allow batching the dirty-bit updates.
> 
> Most of the complexity, though, is in the handling of the user
> bitmaps to avoid copies back and forth. The bitmap user addresses
> need to be iterated through, pinned and then passing the pages
> into iommu core. The amount of bitmap data passed at a time for a
> read_and_clear_dirty() is 1 page worth of pinned base page
> pointers. That equates to 16M bits, or rather 64G of data that
> can be returned as 'dirtied'. The IOTLB is flushed at the end of
> the whole scanned IOVA range, to defer as much as possible the
> potential DMA performance penalty.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommufd/io_pagetable.c    | 169 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  44 ++++++
>   2 files changed, 213 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index f4609ef369e0..835b5040fce9 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -14,6 +14,7 @@
>   #include <linux/err.h>
>   #include <linux/slab.h>
>   #include <linux/errno.h>
> +#include <uapi/linux/iommufd.h>
>   
>   #include "io_pagetable.h"
>   
> @@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>   	return ret;
>   }
>   
> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
> +			    struct iommufd_dirty_data *bitmap)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +	unsigned long bitmap_len;
> +
> +	bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
> +
> +	import_single_range(WRITE, bitmap->data, bitmap_len,
> +			    &iter->bitmap_iov, &iter->bitmap_iter);
> +	iter->iova = bitmap->iova;
> +
> +	/* Can record up to 64G at a time */
> +	dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
> +
> +	return !dirty->pages ? -ENOMEM : 0;
> +}
> +
> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->pages) {
> +		free_page((unsigned long) dirty->pages);
> +		dirty->pages = NULL;
> +	}
> +}
> +
> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
> +{
> +	return iov_iter_count(&iter->bitmap_iter) > 0;
> +}
> +
> +static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
> +{
> +	unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
> +
> +	left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
> +
> +	return left;
> +}
> +
> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
> +{
> +	unsigned long left = iommufd_dirty_iter_bytes(iter);
> +
> +	return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
> +}
> +
> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
> +{
> +	unsigned long skip = iter->bitmap_iter.iov_offset;
> +
> +	return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
> +}
> +
> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
> +{
> +	iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
> +}
> +
> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->npages)
> +		unpin_user_pages(dirty->pages, dirty->npages);
> +}
> +
> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +	unsigned long npages;
> +	unsigned long ret;
> +	void *addr;
> +
> +	addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
> +	npages = iov_iter_npages(&iter->bitmap_iter,
> +				 PAGE_SIZE / sizeof(struct page *));
> +
> +	ret = pin_user_pages_fast((unsigned long) addr, npages,
> +				  FOLL_WRITE, dirty->pages);
> +	if (ret <= 0)
> +		return -EINVAL;
> +
> +	dirty->npages = ret;
> +	dirty->iova = iommufd_dirty_iova(iter);
> +	dirty->start_offset = offset_in_page(addr);
> +	return 0;
> +}
> +
> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +				      struct iommufd_dirty_data *bitmap)

This looks more like a helper in the iommu core. How about

	iommufd_read_clear_domain_dirty()?

> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iommufd_dirty_iter iter;
> +	int ret = 0;
> +
> +	if (!ops || !ops->read_and_clear_dirty)
> +		return -EOPNOTSUPP;
> +
> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
> +				__ffs(bitmap->page_size), &gather);
> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
> +	if (ret)
> +		return -ENOMEM;
> +
> +	for (; iommufd_dirty_iter_done(&iter);
> +	     iommufd_dirty_iter_advance(&iter)) {
> +		ret = iommufd_dirty_iter_get(&iter);
> +		if (ret)
> +			break;
> +
> +		ret = ops->read_and_clear_dirty(domain,
> +			iommufd_dirty_iova(&iter),
> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
> +
> +		iommufd_dirty_iter_put(&iter);
> +
> +		if (ret)
> +			break;
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +	iommufd_dirty_iter_free(&iter);
> +
> +	return ret;
> +}
> +
> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
> +				   struct iommu_domain *domain,
> +				   struct iommufd_dirty_data *bitmap)
> +{
> +	unsigned long iova, length, iova_end;
> +	struct iommu_domain *dom;
> +	struct iopt_area *area;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	iova = bitmap->iova;
> +	length = bitmap->length - 1;
> +	if (check_add_overflow(iova, length, &iova_end))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);
> +	if (!area) {
> +		up_read(&iopt->iova_rwsem);
> +		return -ENOENT;
> +	}
> +
> +	if (!domain) {
> +		down_read(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_read_and_clear_dirty(dom, bitmap);

Perhaps use @domain directly, hence no need for @dom?

	xa_for_each(&iopt->domains, index, domain) {
		ret = iommu_read_and_clear_dirty(domain, bitmap);

> +			if (ret)
> +				break;
> +		}
> +		up_read(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_read_and_clear_dirty(domain, bitmap);
> +	}
> +
> +	up_read(&iopt->iova_rwsem);
> +	return ret;
> +}
> +
>   struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
>   				  unsigned long *start_byte,
>   				  unsigned long length)
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index d00ef3b785c5..4c12b4a8f1a6 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -8,6 +8,8 @@
>   #include <linux/xarray.h>
>   #include <linux/refcount.h>
>   #include <linux/uaccess.h>
> +#include <linux/iommu.h>
> +#include <linux/uio.h>
>   
>   struct iommu_domain;
>   struct iommu_group;
> @@ -49,8 +51,50 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>   		    unsigned long length);
>   int iopt_unmap_all(struct io_pagetable *iopt);
>   
> +struct iommufd_dirty_data {
> +	unsigned long iova;
> +	unsigned long length;
> +	unsigned long page_size;
> +	unsigned long *data;
> +};

How about adding some comments around this struct? Any alignment
requirement for iova/length? What does @data stand for?

> +
>   int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>   			    struct iommu_domain *domain, bool enable);
> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
> +				   struct iommu_domain *domain,
> +				   struct iommufd_dirty_data *bitmap);
> +
> +struct iommufd_dirty_iter {
> +	struct iommu_dirty_bitmap dirty;
> +	struct iovec bitmap_iov;
> +	struct iov_iter bitmap_iter;
> +	unsigned long iova;
> +};

Same here.

> +
> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter);
> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter);
> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
> +			    struct iommufd_dirty_data *bitmap);
> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter);
> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter);
> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter);
> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter);
> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter);
> +static inline unsigned long dirty_bitmap_bytes(unsigned long nr_pages)
> +{
> +	return (ALIGN(nr_pages, BITS_PER_TYPE(u64)) / BITS_PER_BYTE);
> +}
> +
> +/*
> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> + * further casts to signed integer for unaligned multi-bit operation,
> + * __bitmap_set().
> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> + * system.
> + */
> +#define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
> +#define DIRTY_BITMAP_SIZE_MAX   dirty_bitmap_bytes(DIRTY_BITMAP_PAGES_MAX)
>   
>   int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>   		      unsigned long npages, struct page **out_pages, bool write);

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-30  5:12     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-30  5:12 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> Today, the dirty state is lost and the page wouldn't be migrated to
> destination potentially leading the guest into error.
> 
> Add an unmap API that reads the dirty bit and sets it in the
> user passed bitmap. This unmap iommu API tackles a potentially
> racy update to the dirty bit *when* doing DMA on a iova that is
> being unmapped at the same time.
> 
> The new unmap_read_dirty/unmap_pages_read_dirty does not replace
> the unmap pages, but rather only when explicit called with an dirty
> bitmap data passed in.
> 
> It could be said that the guest is buggy and rather than a special unmap
> path tackling the theoretical race ... it would suffice fetching the
> dirty bits (with GET_DIRTY_IOVA), and then unmap the IOVA.

I am not sure whether this API could solve the race.

size_t iommu_unmap(struct iommu_domain *domain,
                    unsigned long iova, size_t size)
{
         struct iommu_iotlb_gather iotlb_gather;
         size_t ret;

         iommu_iotlb_gather_init(&iotlb_gather);
         ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
         iommu_iotlb_sync(domain, &iotlb_gather);

         return ret;
}

The PTEs are cleared before iotlb invalidation. What if a DMA write
happens after PTE clearing and before the iotlb invalidation with the
PTE happening to be cached?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-30  6:12     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-30  6:12 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
>   	}
>   }
>   
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	int ret = -EINVAL;

	if (domain_use_first_level(dmar_domain))
		return -EOPNOTSUPP;

> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	if (list_empty(&dmar_domain->devices)) {
> +		spin_unlock_irqrestore(&device_domain_lock, flags);
> +		return ret;
> +	}

I agreed with Kevin's suggestion in his reply.

> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		if (!info->dev || (info->domain != dmar_domain))
> +			continue;

This check is redundant.

> +
> +		/* Dirty tracking is second-stage level SM only */
> +		if ((info->domain && domain_use_first_level(info->domain)) ||
> +		    !ecap_slads(info->iommu->ecap) ||
> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
> +			ret = -EOPNOTSUPP;
> +			continue;

Perhaps break and return -EOPNOTSUPP directly here? We are not able to
support a mixed mode, right?

> +		}
> +
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
> +						     info->dev, PASID_RID2PASID,
> +						     enable);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	/*
> +	 * We need to flush context TLB and IOTLB with any cached translations
> +	 * to force the incoming DMA requests for have its IOTLB entries tagged
> +	 * with A/D bits
> +	 */
> +	intel_flush_iotlb_all(domain);
> +	return ret;
> +}
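
i.e. folding the comments above together, something along these lines
(untested sketch; note it simply succeeds for an empty device list, which
may or may not match what Kevin suggested):

static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
					  bool enable)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	struct device_domain_info *info;
	unsigned long flags;
	int ret = 0;

	/* Dirty tracking is second-stage level SM only */
	if (!intel_iommu_sm || domain_use_first_level(dmar_domain))
		return -EOPNOTSUPP;

	spin_lock_irqsave(&device_domain_lock, flags);
	list_for_each_entry(info, &dmar_domain->devices, link) {
		if (!sm_supported(info->iommu) ||
		    !ecap_slads(info->iommu->ecap)) {
			ret = -EOPNOTSUPP;
			break;
		}

		ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
						       info->dev, PASID_RID2PASID,
						       enable);
		if (ret)
			break;
	}
	spin_unlock_irqrestore(&device_domain_lock, flags);

	/*
	 * Flush any cached translations so that incoming DMA requests get
	 * their IOTLB entries tagged with A/D bits.
	 */
	intel_flush_iotlb_all(domain);
	return ret;
}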

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 19:20                     ` Robin Murphy
@ 2022-05-02 11:52                       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 11:52 UTC (permalink / raw)
  To: Robin Murphy, Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

On 4/29/22 20:20, Robin Murphy wrote:
> On 2022-04-29 17:40, Joao Martins wrote:
>> On 4/29/22 17:11, Jason Gunthorpe wrote:
>>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>>
>>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>>> worth.
>>>>>>
>>>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>>
>>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>>> bits.
>>>>>
>>>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>>>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>>>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>>>> we need to re-walk the pagetables to clear one bit.
>>>
>>> Yes, I don't think a iopte walk is avoidable?
>>>
>> Correct -- exactly why I am still more learning towards enable DBM bit only at start
>> versus enabling DBM at domain-creation while clearing dirty at start.
> 
> I'd say it's largely down to whether you want the bother of 
> communicating a dynamic behaviour change into io-pgtable. The big 
> advantage of having it just use DBM all the time is that you don't have 
> to do that, and the "start tracking" operation is then nothing more than 
> a normal "read and clear" operation but ignoring the read result.
> 
> At this point I'd much rather opt for simplicity, and leave the fancier 
> stuff to revisit later if and when somebody does demonstrate a 
> significant overhead from using DBM when not strictly needed.

OK -- I did get the code simplicity part[*]. Albeit my concern is that last
point: if there's anything fundamentally affecting DMA performance then
any SMMU user would see it even if they don't care at all about DBM (i.e. regular
baremetal/non-vm iommu usage).

[*] It was how I had this initially PoC-ed. And really all IOMMU drivers dirty tracking
could be simplified to be always-enabled, and start/stop is essentially flushing/clearing
dirties. Albeit I like that this is only really used (by hardware) when needed and any
other DMA user isn't affected.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-05-02 11:52                       ` Joao Martins
@ 2022-05-02 11:57                         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 11:57 UTC (permalink / raw)
  To: Robin Murphy, Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

[my MUA made the message a tad crooked with the quotations]

On 5/2/22 12:52, Joao Martins wrote:
> On 4/29/22 20:20, Robin Murphy wrote:
>> On 2022-04-29 17:40, Joao Martins wrote:
>>> On 4/29/22 17:11, Jason Gunthorpe wrote:
>>>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>>>
>>>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>>>> worth.
>>>>>>>
>>>>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>>>
>>>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>>>> bits.
>>>>>>
>>>>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>>>>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>>>>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>>>>> we need to re-walk the pagetables to clear one bit.
>>>>
>>>> Yes, I don't think a iopte walk is avoidable?
>>>>
>>> Correct -- exactly why I am still more learning towards enable DBM bit only at start
>>> versus enabling DBM at domain-creation while clearing dirty at start.
>>
>> I'd say it's largely down to whether you want the bother of 
>> communicating a dynamic behaviour change into io-pgtable. The big 
>> advantage of having it just use DBM all the time is that you don't have 
>> to do that, and the "start tracking" operation is then nothing more than 
>> a normal "read and clear" operation but ignoring the read result.
>>
>> At this point I'd much rather opt for simplicity, and leave the fancier 
>> stuff to revisit later if and when somebody does demonstrate a 
>> significant overhead from using DBM when not strictly needed.
> OK -- I did get the code simplicity part[*]. Albeit my concern is that last
> point: if there's anything fundamentally affecting DMA performance then
> any SMMU user would see it even if they don't care at all about DBM (i.e. regular
> baremetal/non-vm iommu usage).
> 

I can switch the SMMUv3 one to the always-enabled DBM bit.

> [*] It was how I had this initially PoC-ed. And really all IOMMU drivers dirty tracking
> could be simplified to be always-enabled, and start/stop is essentially flushing/clearing
> dirties. Albeit I like that this is only really used (by hardware) when needed and any
> other DMA user isn't affected.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29 23:51     ` Baolu Lu
@ 2022-05-02 11:57       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 11:57 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm, iommu

On 4/30/22 00:51, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>> +			    struct iommu_domain *domain, bool enable)
>> +{
>> +	struct iommu_domain *dom;
>> +	unsigned long index;
>> +	int ret = -EOPNOTSUPP;
>> +
>> +	down_write(&iopt->iova_rwsem);
>> +	if (!domain) {
>> +		down_write(&iopt->domains_rwsem);
>> +		xa_for_each(&iopt->domains, index, dom) {
>> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
>> +			if (ret < 0)
>> +				break;
> 
> Do you need to roll back to the original state before return failure?
> Partial domains have already had dirty bit tracking enabled.
> 
Yeap, will fix the unwinding for the next iteration.

>> +		}
>> +		up_write(&iopt->domains_rwsem);
>> +	} else {
>> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
>> +	}
>> +
>> +	up_write(&iopt->iova_rwsem);
>> +	return ret;
>> +}
> 
> Best regards,
> baolu
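
For the unwinding, something along these lines is what I have in mind -- a
rough, untested sketch (the helper name is only for illustration; in practice
the loop stays inside iopt_set_dirty_tracking(), with the caller still holding
iopt->domains_rwsem):

static int iopt_set_dirty_tracking_all(struct io_pagetable *iopt, bool enable)
{
	struct iommu_domain *dom;
	unsigned long index, failed;
	int ret = 0;

	xa_for_each(&iopt->domains, index, dom) {
		ret = iommu_set_dirty_tracking(dom, iopt, enable);
		if (ret < 0) {
			failed = index;
			goto out_revert;
		}
	}
	return 0;

out_revert:
	/* Roll back the domains that were already switched */
	xa_for_each(&iopt->domains, index, dom) {
		if (index >= failed)
			break;
		iommu_set_dirty_tracking(dom, iopt, !enable);
	}
	return ret;
}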


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-30  4:11     ` Baolu Lu
@ 2022-05-02 12:06       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 12:06 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm, iommu

On 4/30/22 05:11, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add an IO pagetable API iopt_read_and_clear_dirty_data() that
>> performs the reading of dirty IOPTEs for a given IOVA range and
>> then copying back to userspace from each area-internal bitmap.
>>
>> Underneath it uses the IOMMU equivalent API which will read the
>> dirty bits, as well as atomically clearing the IOPTE dirty bit
>> and flushing the IOTLB at the end. The dirty bitmaps pass an
>> iotlb_gather to allow batching the dirty-bit updates.
>>
>> Most of the complexity, though, is in the handling of the user
>> bitmaps to avoid copies back and forth. The bitmap user addresses
>> need to be iterated through, pinned and then passing the pages
>> into iommu core. The amount of bitmap data passed at a time for a
>> read_and_clear_dirty() is 1 page worth of pinned base page
>> pointers. That equates to 16M bits, or rather 64G of data that
>> can be returned as 'dirtied'. The flush the IOTLB at the end of
>> the whole scanned IOVA range, to defer as much as possible the
>> potential DMA performance penalty.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommufd/io_pagetable.c    | 169 ++++++++++++++++++++++++
>>   drivers/iommu/iommufd/iommufd_private.h |  44 ++++++
>>   2 files changed, 213 insertions(+)
>>
>> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
>> index f4609ef369e0..835b5040fce9 100644
>> --- a/drivers/iommu/iommufd/io_pagetable.c
>> +++ b/drivers/iommu/iommufd/io_pagetable.c
>> @@ -14,6 +14,7 @@
>>   #include <linux/err.h>
>>   #include <linux/slab.h>
>>   #include <linux/errno.h>
>> +#include <uapi/linux/iommufd.h>
>>   
>>   #include "io_pagetable.h"
>>   
>> @@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>   	return ret;
>>   }
>>   
>> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
>> +			    struct iommufd_dirty_data *bitmap)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +	unsigned long bitmap_len;
>> +
>> +	bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
>> +
>> +	import_single_range(WRITE, bitmap->data, bitmap_len,
>> +			    &iter->bitmap_iov, &iter->bitmap_iter);
>> +	iter->iova = bitmap->iova;
>> +
>> +	/* Can record up to 64G at a time */
>> +	dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
>> +
>> +	return !dirty->pages ? -ENOMEM : 0;
>> +}
>> +
>> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->pages) {
>> +		free_page((unsigned long) dirty->pages);
>> +		dirty->pages = NULL;
>> +	}
>> +}
>> +
>> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
>> +{
>> +	return iov_iter_count(&iter->bitmap_iter) > 0;
>> +}
>> +
>> +static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
>> +{
>> +	unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
>> +
>> +	left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
>> +
>> +	return left;
>> +}
>> +
>> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
>> +{
>> +	unsigned long left = iommufd_dirty_iter_bytes(iter);
>> +
>> +	return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
>> +}
>> +
>> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
>> +{
>> +	unsigned long skip = iter->bitmap_iter.iov_offset;
>> +
>> +	return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
>> +}
>> +
>> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
>> +{
>> +	iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
>> +}
>> +
>> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->npages)
>> +		unpin_user_pages(dirty->pages, dirty->npages);
>> +}
>> +
>> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +	unsigned long npages;
>> +	unsigned long ret;
>> +	void *addr;
>> +
>> +	addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
>> +	npages = iov_iter_npages(&iter->bitmap_iter,
>> +				 PAGE_SIZE / sizeof(struct page *));
>> +
>> +	ret = pin_user_pages_fast((unsigned long) addr, npages,
>> +				  FOLL_WRITE, dirty->pages);
>> +	if (ret <= 0)
>> +		return -EINVAL;
>> +
>> +	dirty->npages = ret;
>> +	dirty->iova = iommufd_dirty_iova(iter);
>> +	dirty->start_offset = offset_in_page(addr);
>> +	return 0;
>> +}
>> +
>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +				      struct iommufd_dirty_data *bitmap)
> 
> This looks more like a helper in the iommu core. How about
> 
> 	iommufd_read_clear_domain_dirty()?
> 
Heh, I guess that's more accurate naming indeed. I can switch to that.

>> +{
>> +	const struct iommu_domain_ops *ops = domain->ops;
>> +	struct iommu_iotlb_gather gather;
>> +	struct iommufd_dirty_iter iter;
>> +	int ret = 0;
>> +
>> +	if (!ops || !ops->read_and_clear_dirty)
>> +		return -EOPNOTSUPP;
>> +
>> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
>> +				__ffs(bitmap->page_size), &gather);
>> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
>> +	if (ret)
>> +		return -ENOMEM;
>> +
>> +	for (; iommufd_dirty_iter_done(&iter);
>> +	     iommufd_dirty_iter_advance(&iter)) {
>> +		ret = iommufd_dirty_iter_get(&iter);
>> +		if (ret)
>> +			break;
>> +
>> +		ret = ops->read_and_clear_dirty(domain,
>> +			iommufd_dirty_iova(&iter),
>> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
>> +
>> +		iommufd_dirty_iter_put(&iter);
>> +
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	iommu_iotlb_sync(domain, &gather);
>> +	iommufd_dirty_iter_free(&iter);
>> +
>> +	return ret;
>> +}
>> +
>> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
>> +				   struct iommu_domain *domain,
>> +				   struct iommufd_dirty_data *bitmap)
>> +{
>> +	unsigned long iova, length, iova_end;
>> +	struct iommu_domain *dom;
>> +	struct iopt_area *area;
>> +	unsigned long index;
>> +	int ret = -EOPNOTSUPP;
>> +
>> +	iova = bitmap->iova;
>> +	length = bitmap->length - 1;
>> +	if (check_add_overflow(iova, length, &iova_end))
>> +		return -EOVERFLOW;
>> +
>> +	down_read(&iopt->iova_rwsem);
>> +	area = iopt_find_exact_area(iopt, iova, iova_end);
>> +	if (!area) {
>> +		up_read(&iopt->iova_rwsem);
>> +		return -ENOENT;
>> +	}
>> +
>> +	if (!domain) {
>> +		down_read(&iopt->domains_rwsem);
>> +		xa_for_each(&iopt->domains, index, dom) {
>> +			ret = iommu_read_and_clear_dirty(dom, bitmap);
> 
> Perhaps use @domain directly, hence no need the @dom?
> 
> 	xa_for_each(&iopt->domains, index, domain) {
> 		ret = iommu_read_and_clear_dirty(domain, bitmap);
> 
Yeap.

>> +			if (ret)
>> +				break;
>> +		}
>> +		up_read(&iopt->domains_rwsem);
>> +	} else {
>> +		ret = iommu_read_and_clear_dirty(domain, bitmap);
>> +	}
>> +
>> +	up_read(&iopt->iova_rwsem);
>> +	return ret;
>> +}
>> +
>>   struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
>>   				  unsigned long *start_byte,
>>   				  unsigned long length)
>> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> index d00ef3b785c5..4c12b4a8f1a6 100644
>> --- a/drivers/iommu/iommufd/iommufd_private.h
>> +++ b/drivers/iommu/iommufd/iommufd_private.h
>> @@ -8,6 +8,8 @@
>>   #include <linux/xarray.h>
>>   #include <linux/refcount.h>
>>   #include <linux/uaccess.h>
>> +#include <linux/iommu.h>
>> +#include <linux/uio.h>
>>   
>>   struct iommu_domain;
>>   struct iommu_group;
>> @@ -49,8 +51,50 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>>   		    unsigned long length);
>>   int iopt_unmap_all(struct io_pagetable *iopt);
>>   
>> +struct iommufd_dirty_data {
>> +	unsigned long iova;
>> +	unsigned long length;
>> +	unsigned long page_size;
>> +	unsigned long *data;
>> +};
> 
> How about adding some comments around this struct? Any alingment
> requirement for iova/length? What does the @data stand for?
> 
I'll add them.

Albeit this structure eventually gets moved to the iommu core later in
the series when we add the UAPI, and there it has some comments documenting it.

I don't cover the alignment though, but it has the same restrictions
as IOAS map/unmap (iopt_alignment essentially), which is the smallest page size
supported by the IOMMU hw.
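
Something like this is what I have in mind for the comments (illustrative
wording only; the field names are the ones already in the patch):

struct iommufd_dirty_data {
	unsigned long iova;       /* Base IOVA of the range to read dirty bits for */
	unsigned long length;     /* Length of the IOVA range, in bytes */
	unsigned long page_size;  /* Granularity, in bytes, that each bitmap bit represents */
	unsigned long *data;      /* Userspace address of the bitmap to be filled */
};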

>> +
>>   int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>   			    struct iommu_domain *domain, bool enable);
>> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
>> +				   struct iommu_domain *domain,
>> +				   struct iommufd_dirty_data *bitmap);
>> +
>> +struct iommufd_dirty_iter {
>> +	struct iommu_dirty_bitmap dirty;
>> +	struct iovec bitmap_iov;
>> +	struct iov_iter bitmap_iter;
>> +	unsigned long iova;
>> +};
> 
> Same here.
> 
Yes, this one deserves some comments.

Most of it is state for gup/pup and for iterating over the bitmap user addresses,
so that iommu_dirty_bitmap_record() only needs to work with KVAs.
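
Roughly (again, illustrative wording only):

struct iommufd_dirty_iter {
	struct iommu_dirty_bitmap dirty; /* pinned bitmap pages + per-chunk IOVA state for the driver */
	struct iovec bitmap_iov;         /* single-range iovec covering the user bitmap */
	struct iov_iter bitmap_iter;     /* iterator tracking progress through that bitmap */
	unsigned long iova;              /* IOVA corresponding to the start of the bitmap */
};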

>> +
>> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter);
>> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter);
>> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
>> +			    struct iommufd_dirty_data *bitmap);
>> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter);
>> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter);
>> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter);
>> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter);
>> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter);
>> +static inline unsigned long dirty_bitmap_bytes(unsigned long nr_pages)
>> +{
>> +	return (ALIGN(nr_pages, BITS_PER_TYPE(u64)) / BITS_PER_BYTE);
>> +}
>> +
>> +/*
>> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
>> + * further casts to signed integer for unaligned multi-bit operation,
>> + * __bitmap_set().
>> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>> + * system.
>> + */
>> +#define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
>> +#define DIRTY_BITMAP_SIZE_MAX   dirty_bitmap_bytes(DIRTY_BITMAP_PAGES_MAX)
>>   
>>   int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>>   		      unsigned long npages, struct page **out_pages, bool write);
> 
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs
  2022-04-30  5:12     ` Baolu Lu
@ 2022-05-02 12:22       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 12:22 UTC (permalink / raw)
  To: Baolu Lu, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 4/30/22 06:12, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> Today, the dirty state is lost and the page wouldn't be migrated to
>> destination potentially leading the guest into error.
>>
>> Add an unmap API that reads the dirty bit and sets it in the
>> user passed bitmap. This unmap iommu API tackles a potentially
>> racy update to the dirty bit *when* doing DMA on a iova that is
>> being unmapped at the same time.
>>
>> The new unmap_read_dirty/unmap_pages_read_dirty does not replace
>> the unmap pages, but rather only when explicit called with an dirty
>> bitmap data passed in.
>>
>> It could be said that the guest is buggy and rather than a special unmap
>> path tackling the theoretical race ... it would suffice fetching the
>> dirty bits (with GET_DIRTY_IOVA), and then unmap the IOVA.
> 
> I am not sure whether this API could solve the race.
> 

Yeah, it doesn't fully solve the race as DMA can still potentially
occur until the IOMMU needs to re-walk the page tables (i.e. after the IOTLB flush).


> size_t iommu_unmap(struct iommu_domain *domain,
>                     unsigned long iova, size_t size)
> {
>          struct iommu_iotlb_gather iotlb_gather;
>          size_t ret;
> 
>          iommu_iotlb_gather_init(&iotlb_gather);
>          ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
>          iommu_iotlb_sync(domain, &iotlb_gather);
> 
>          return ret;
> }
> 
> The PTEs are cleared before iotlb invalidation. What if a DMA write
> happens after PTE clearing and before the iotlb invalidation with the
> PTE happening to be cached?


Yeap. Jason/Robin also reiterated similarly.

To fully handle this we need to force the PTEs read-only, and check the dirty bit
afterwards. So perhaps if we want to go to the extent of fully stopping DMA -- which none
of the unmap APIs ever guarantee -- we need more of a write-protect API that optionally
fetches the dirties. And then the unmap remains as is (prior to this series).

Now whether this race is worth solving isn't clear (bearing in mind that solving it will add
a lot of overhead), and git/mailing-list archaeology doesn't answer whether this
was ever useful in practice :(
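
Conceptually, the race-free variant would be something like the below. This is
only a sketch: iopt_write_protect() is a hypothetical helper that doesn't exist
in the series, and the extra IOTLB flush plus locking are elided.

static int iopt_unmap_read_dirty_safe(struct io_pagetable *iopt,
				      struct iommu_domain *domain,
				      struct iommufd_dirty_data *bitmap)
{
	int ret;

	/* 1) Make the range read-only so no new dirty bits can be set */
	ret = iopt_write_protect(iopt, domain, bitmap->iova, bitmap->length);
	if (ret)
		return ret;

	/* 2) With writes blocked (and the IOTLB flushed), snapshot the dirty bits */
	ret = iopt_read_and_clear_dirty_data(iopt, domain, bitmap);
	if (ret)
		return ret;

	/* 3) Tear the mappings down with a plain unmap */
	return iopt_unmap_iova(iopt, bitmap->iova, bitmap->length);
}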

	Joao

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-30  6:12     ` Baolu Lu
@ 2022-05-02 12:24       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 12:24 UTC (permalink / raw)
  To: Baolu Lu, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 4/30/22 07:12, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
>>   	}
>>   }
>>   
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					  bool enable)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	unsigned long flags;
>> +	int ret = -EINVAL;
> 
> 	if (domain_use_first_level(dmar_domain))
> 		return -EOPNOTSUPP;
> 
Will add.

>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	if (list_empty(&dmar_domain->devices)) {
>> +		spin_unlock_irqrestore(&device_domain_lock, flags);
>> +		return ret;
>> +	}
> 
> I agreed with Kevin's suggestion in his reply.
> 
/me nods

>> +
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		if (!info->dev || (info->domain != dmar_domain))
>> +			continue;
> 
> This check is redundant.
> 

I'll drop it.

>> +
>> +		/* Dirty tracking is second-stage level SM only */
>> +		if ((info->domain && domain_use_first_level(info->domain)) ||
>> +		    !ecap_slads(info->iommu->ecap) ||
>> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
>> +			ret = -EOPNOTSUPP;
>> +			continue;
> 
> Perhaps break and return -EOPNOTSUPP directly here? We are not able to
> support a mixed mode, right?
> 
Correct, I should return early here.

>> +		}
>> +
>> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
>> +						     info->dev, PASID_RID2PASID,
>> +						     enable);
>> +		if (ret)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	/*
>> +	 * We need to flush context TLB and IOTLB with any cached translations
>> +	 * to force the incoming DMA requests for have its IOTLB entries tagged
>> +	 * with A/D bits
>> +	 */
>> +	intel_flush_iotlb_all(domain);
>> +	return ret;
>> +}
> 
> Best regards,
> baolu
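
Folding the above in (and leaving the empty-devices handling aside until I pick
up Kevin's suggestion), the helper would end up roughly like this -- untested
sketch:

static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
					  bool enable)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	struct device_domain_info *info;
	unsigned long flags;
	int ret = 0;

	/* Dirty tracking is second-stage level SM only */
	if (domain_use_first_level(dmar_domain))
		return -EOPNOTSUPP;

	spin_lock_irqsave(&device_domain_lock, flags);
	list_for_each_entry(info, &dmar_domain->devices, link) {
		if (!intel_iommu_sm || !sm_supported(info->iommu) ||
		    !ecap_slads(info->iommu->ecap)) {
			ret = -EOPNOTSUPP;
			break;
		}

		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
						       info->dev, PASID_RID2PASID,
						       enable);
		if (ret)
			break;
	}
	spin_unlock_irqrestore(&device_domain_lock, flags);

	/*
	 * Flush any cached translations so that incoming DMA requests get
	 * their IOTLB entries tagged with A/D bits.
	 */
	intel_flush_iotlb_all(domain);
	return ret;
}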

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29  5:45   ` Tian, Kevin
@ 2022-05-02 18:11     ` Alex Williamson
  -1 siblings, 0 replies; 209+ messages in thread
From: Alex Williamson @ 2022-05-02 18:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Cornelia Huck, kvm

On Fri, 29 Apr 2022 05:45:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > unmap. This case is specific for non-nested vIOMMU case where an
> > erronous guest (or device) DMAing to an address being unmapped at the
> > same time.  
> 
> an erroneous attempt like above cannot anticipate which DMAs can
> succeed in that window thus the end behavior is undefined. For an
> undefined behavior nothing will be broken by losing some bits dirtied
> in the window between reading back dirty bits of the range and
> actually calling unmap. From guest p.o.v. all those are black-box
> hardware logic to serve a virtual iotlb invalidation request which just
> cannot be completed in one cycle.
> 
> Hence in reality probably this is not required except to meet vfio
> compat requirement. Just in concept returning dirty bits at unmap
> is more accurate.
> 
> I'm slightly inclined to abandon it in iommufd uAPI.

Sorry, I'm not following why an unmap with returned dirty bitmap
operation is specific to a vIOMMU case, or in fact indicative of some
sort of erroneous, racy behavior of guest or device.  We need the
flexibility to support memory hot-unplug operations during migration,
but even in the vIOMMU case, isn't it fair for the VMM to ask whether a
device dirtied the range being unmapped?  This was implemented as a
single operation specifically to avoid races where ongoing access may be
available after retrieving a snapshot of the bitmap.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-02 18:11     ` Alex Williamson
@ 2022-05-02 18:52       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-02 18:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Martins, Joao, iommu, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Cornelia Huck, kvm

On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> On Fri, 29 Apr 2022 05:45:20 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > From: Joao Martins <joao.m.martins@oracle.com>
> > >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > > unmap. This case is specific for non-nested vIOMMU case where an
> > > erronous guest (or device) DMAing to an address being unmapped at the
> > > same time.  
> > 
> > an erroneous attempt like above cannot anticipate which DMAs can
> > succeed in that window thus the end behavior is undefined. For an
> > undefined behavior nothing will be broken by losing some bits dirtied
> > in the window between reading back dirty bits of the range and
> > actually calling unmap. From guest p.o.v. all those are black-box
> > hardware logic to serve a virtual iotlb invalidation request which just
> > cannot be completed in one cycle.
> > 
> > Hence in reality probably this is not required except to meet vfio
> > compat requirement. Just in concept returning dirty bits at unmap
> > is more accurate.
> > 
> > I'm slightly inclined to abandon it in iommufd uAPI.
> 
> Sorry, I'm not following why an unmap with returned dirty bitmap
> operation is specific to a vIOMMU case, or in fact indicative of some
> sort of erroneous, racy behavior of guest or device.

It is being compared against the alternative which is to explicitly
query dirty then do a normal unmap as two system calls and permit a
race.

The only case with any difference is if the guest is racing DMA with
the unmap - in which case it is already indeterminate for the guest if
the DMA will be completed or not. 

e.g. in the vIOMMU case, if the guest races DMA with unmap then we are
already fine with throwing away that DMA because that is how the race
resolves during non-migration situations, so resolving it as throwing
away the DMA during migration is OK too.

> We need the flexibility to support memory hot-unplug operations
> during migration,

I would have thought that hot-unplug during migration would simply
discard all the data - how does it use the dirty bitmap?

> This was implemented as a single operation specifically to avoid
> races where ongoing access may be available after retrieving a
> snapshot of the bitmap.  Thanks,

The issue is the cost.

On a real iommu eliminating the race is expensive as we have to write
protect the pages before query dirty, which seems to be an extra IOTLB
flush.

It is not clear if paying this cost to become atomic is actually
something any use case needs.

So, I suggest we think about a 3rd op 'write protect and clear
dirties' that will be followed by a normal unmap - the extra op will
have the extra overhead and userspace can decide if it wants to pay or
not vs the non-atomic read dirties operation. And let's have a use case
where this must be atomic before we implement it.

The downside is we lose a little bit of efficiency by unbundling
these steps, the upside is that it doesn't require quite as many
special iommu_domain/etc paths.

(Also Joao, you should probably have a read and do not clear dirty
operation with the idea that the next operation will be unmap - then
maybe we can avoid IOTLB flushing..)
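
Roughly, the split could look like this - a sketch only, with made-up
names rather than the actual kAPI:

#include <stdbool.h>
#include <stddef.h>

struct iommu_domain;            /* from <linux/iommu.h> */
struct iommu_dirty_bitmap;      /* stand-in for whatever carries the bits */

struct iommu_dirty_ops_sketch {
        /* turn dirty tracking on/off for the whole domain */
        int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);
        /* walk the IOPTEs in [iova, iova + size) and record dirty bits;
         * a hypothetical NO_CLEAR flag would skip clearing them so that a
         * following unmap can batch a single IOTLB flush for both steps */
        int (*read_dirty)(struct iommu_domain *domain, unsigned long iova,
                          size_t size, unsigned long flags,
                          struct iommu_dirty_bitmap *dirty);
        /* the 3rd op: write-protect the range and collect dirties, to be
         * followed by a normal unmap when userspace wants atomicity */
        int (*wrprotect_and_read_dirty)(struct iommu_domain *domain,
                                        unsigned long iova, size_t size,
                                        struct iommu_dirty_bitmap *dirty);
};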

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-02 18:52       ` Jason Gunthorpe via iommu
@ 2022-05-03 10:48         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-03 10:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

On 5/2/22 19:52, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>> On Fri, 29 Apr 2022 05:45:20 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>> erronous guest (or device) DMAing to an address being unmapped at the
>>>> same time.  
>>>
>>> an erroneous attempt like above cannot anticipate which DMAs can
>>> succeed in that window thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From guest p.o.v. all those are black-box
>>> hardware logic to serve a virtual iotlb invalidation request which just
>>> cannot be completed in one cycle.
>>>
>>> Hence in reality probably this is not required except to meet vfio
>>> compat requirement. Just in concept returning dirty bits at unmap
>>> is more accurate.
>>>
>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>
>> Sorry, I'm not following why an unmap with returned dirty bitmap
>> operation is specific to a vIOMMU case, or in fact indicative of some
>> sort of erroneous, racy behavior of guest or device.
> 
> It is being compared against the alternative which is to explicitly
> query dirty then do a normal unmap as two system calls and permit a
> race.
> 
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest if
> the DMA will be completed or not. 
> 
> eg on the vIOMMU case if the guest races DMA with unmap then we are
> already fine with throwing away that DMA because that is how the race
> resolves during non-migration situations, so resovling it as throwing
> away the DMA during migration is OK too.
> 

Exactly.

Even the current unmap (ignoring dirties) isn't race-free, and DMA could still be
happening between clearing the PTE and the IOTLB flush.

The code in this series *attempted* to tackle races against hw IOMMU updates
to the A/D bits at the same time we are clearing the IOPTEs. But it didn't fully
address the race with DMA.

The current code (IIUC) just assumes it is dirty if it is pinned and DMA mapped,
so maybe it avoided some of these fundamental questions...

So really the comparison is whether we care about fixing the race *during unmap* --
for a range the device shouldn't be DMA-ing to in the first place -- such that we
need to go out of our way to block DMA writes, then fetch dirties and then unmap.
Or whether we can fetch dirties and then unmap as two separate operations.

>> We need the flexibility to support memory hot-unplug operations
>> during migration,
> 
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
> 

hmmm I don't follow either -- why would we care about hot-unplugged
memory being dirty? Unless Alex is thinking that the guest would take the
initiative in hot-unplugging+hot-plugging and expect the same data to
be there, pmem style...?

>> This was implemented as a single operation specifically to avoid
>> races where ongoing access may be available after retrieving a
>> snapshot of the bitmap.  Thanks,
> 
> The issue is the cost.
> 
> On a real iommu elminating the race is expensive as we have to write
> protect the pages before query dirty, which seems to be an extra IOTLB
> flush.
> 

... and that is only the DMA performance part affecting the endpoint
device. In software, there's also the extra overhead of walking the IOMMU
pagetables twice. So it's like unmap being 2x more expensive.


> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
> 
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra oveheard and userspace can decide if it wants to pay or
> not vs the non-atomic read dirties operation. And lets have a use case
> where this must be atomic before we implement it..
> 

Definitely, I am happy to implement it if there's a use-case. But
I am not sure there's one right now, aside from theory? Have we
seen issues that would otherwise require this?

> The downside is we loose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
> 
> (Also Joao, you should probably have a read and do not clear dirty
> operation with the idea that the next operation will be unmap - then
> maybe we can avoid IOTLB flushing..)

Yes, that's a great idea. I am thinking of adding a regular @flags field to
the GET_DIRTY_IOVA and iommu domain op argument counterpart.

Albeit, from the iommu kAPI side, at the end of the day this primitive is an IO
pagetable walker helper which checks/manipulates some of the IOPTE
special bits and marshals their state into a bitmap. Extra ::flags values could
select other access bits, avoid clearing said bits, or more, should we want to
make it more future-proof for extensions.
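
For the sake of discussion, the shape of it could be something like the
below - purely illustrative, not the final uAPI, and the field names are
made up:

#include <linux/types.h>

struct iommu_get_dirty_iova_sketch {
        __u32 size;              /* sizeof(struct ...), for extensibility */
        __u32 hwpt_id;           /* the hw_pagetable to walk */
        __aligned_u64 flags;     /* e.g. a hypothetical *_NO_CLEAR bit that
                                  * leaves the IOPTE dirty bits untouched */
        __aligned_u64 iova;      /* start of the range to scan */
        __aligned_u64 length;    /* length of the range */
        __aligned_u64 page_size; /* granularity of each bit in the bitmap */
        __aligned_u64 data;      /* user pointer to the dirty bitmap */
};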

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 11:05       ` Joao Martins
@ 2022-05-05  7:25         ` Shameerali Kolothum Thodi via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-05-05  7:25 UTC (permalink / raw)
  To: Joao Martins, Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm, iommu, jiangkunkun



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 29 April 2022 12:05
> To: Tian, Kevin <kevin.tian@intel.com>
> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen
> <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Eric Auger
> <eric.auger@redhat.com>; Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> kvm@vger.kernel.org; iommu@lists.linux-foundation.org
> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> On 4/29/22 09:28, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Friday, April 29, 2022 5:09 AM
> >>
> >> Similar to .read_and_clear_dirty() use the page table
> >> walker helper functions and set DBM|RDONLY bit, thus
> >> switching the IOPTE to writeable-clean.
> >
> > this should not be one-off if the operation needs to be
> > applied to IOPTE. Say a map request comes right after
> > set_dirty_tracking() is called. If it's agreed to remove
> > the range op then smmu driver should record the tracking
> > status internally and then apply the modifier to all the new
> > mappings automatically before dirty tracking is disabled.
> > Otherwise the same logic needs to be kept in iommufd to
> > call set_dirty_tracking_range() explicitly for every new
> > iopt_area created within the tracking window.
> 
> Gah, I totally missed that by mistake. New mappings aren't
> carrying over the "DBM is set". This needs a new io-pgtable
> quirk added post dirty-tracking toggling.
> 
> I can adjust, but I am at odds on including this in a future
> iteration given that I can't really test any of this stuff.
> Might drop the driver until I have hardware/emulation I can
> use (or maybe others can take over this). It was included
> for revising the iommu core ops and whether iommufd was
> affected by it.

[+Kunkun Jiang]. I think he is now looking into this and might have
a test setup to verify this.

Thanks,
Shameer



^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29 12:38       ` Jason Gunthorpe via iommu
@ 2022-05-05  7:40         ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-05  7:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Martins, Joao
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 29, 2022 8:39 PM
> 
> > >> * There's no capabilities API in IOMMUFD, and in this RFC each vendor
> tracks
> > >
> > > there was discussion adding device capability uAPI somewhere.
> > >
> > ack let me know if there was snippets to the conversation as I seem to have
> missed that.
> 
> It was just discussion pending something we actually needed to report.
> 
> Would be a very simple ioctl taking in the device ID and filling a
> struct of stuff.
> 
> > > probably this can be reported as a device cap as supporting of dirty bit is
> > > an immutable property of the iommu serving that device.
> 
> It is an easier fit to read it out of the iommu_domain after device
> attach though - since we don't need to build new kernel infrastructure
> to query it from a device.
> 
> > > Userspace can
> > > enable dirty tracking on a hwpt if all attached devices claim the support
> > > and the kernel will do the same verification.
> >
> > Sorry to be dense but this is not up to 'devices' given they take no
> > part in the tracking?  I guess by 'devices' you mean the software
> > idea of it i.e. the iommu context created for attaching a said
> > physical device, not the physical device itself.
> 
> Indeed, an hwpt represents an iommu_domain and if the iommu_domain has
> dirty tracking ops set then that is an inherent property of the domain
> and does not suddenly go away when a new device is attached.
> 

Conceptually this is an IOMMU property rather than a domain property.
The two are equivalent only if the iommu driver registers dirty
tracking ops only when all IOMMUs in the platform support the
capability, i.e. managing this IOMMU property in a global way.

But the global way conflicts with the ongoing direction of making
iommu capabilities truly per-IOMMU (though I'm not sure whether
heterogeneity would exist for dirty tracking). Following that trend,
a domain property is not inherent, as it is meaningless if no device is
attached at all.

From this angle IMHO it's more reasonable to report this IOMMU
property to userspace via a device capability. If all devices attached
to a hwpt claim IOMMU dirty tracking capability, the user can call
set_dirty_tracking() on the hwpt object. Once dirty tracking is
enabled on a hwpt, further attaching a device which doesn't claim
this capability is simply rejected.
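
i.e. something along these lines (pseudo-code sketch; all the names below
are stand-ins, not real kAPI):

/* stand-in types just to illustrate the checks */
struct dev_caps_sketch   { unsigned int iommu_dirty_tracking : 1; };
struct hwpt_state_sketch { unsigned int dirty_tracking_on : 1; };

static int hwpt_attach_check_sketch(const struct hwpt_state_sketch *hwpt,
                                    const struct dev_caps_sketch *dev)
{
        /* userspace may only enable tracking if every attached device's
         * IOMMU claims it; once enabled, attaching a non-capable device
         * is rejected */
        if (hwpt->dirty_tracking_on && !dev->iommu_dirty_tracking)
                return -1;      /* stand-in for -EINVAL */
        return 0;
}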

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-02 18:52       ` Jason Gunthorpe via iommu
@ 2022-05-05  7:42         ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-05  7:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Martins, Joao, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, May 3, 2022 2:53 AM
> 
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> > On Fri, 29 Apr 2022 05:45:20 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > From: Joao Martins <joao.m.martins@oracle.com>
> > > >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > > > unmap. This case is specific for non-nested vIOMMU case where an
> > > > erronous guest (or device) DMAing to an address being unmapped at
> the
> > > > same time.
> > >
> > > an erroneous attempt like above cannot anticipate which DMAs can
> > > succeed in that window thus the end behavior is undefined. For an
> > > undefined behavior nothing will be broken by losing some bits dirtied
> > > in the window between reading back dirty bits of the range and
> > > actually calling unmap. From guest p.o.v. all those are black-box
> > > hardware logic to serve a virtual iotlb invalidation request which just
> > > cannot be completed in one cycle.
> > >
> > > Hence in reality probably this is not required except to meet vfio
> > > compat requirement. Just in concept returning dirty bits at unmap
> > > is more accurate.
> > >
> > > I'm slightly inclined to abandon it in iommufd uAPI.
> >
> > Sorry, I'm not following why an unmap with returned dirty bitmap
> > operation is specific to a vIOMMU case, or in fact indicative of some
> > sort of erroneous, racy behavior of guest or device.
> 
> It is being compared against the alternative which is to explicitly
> query dirty then do a normal unmap as two system calls and permit a
> race.
> 
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest if
> the DMA will be completed or not.
> 
> eg on the vIOMMU case if the guest races DMA with unmap then we are
> already fine with throwing away that DMA because that is how the race
> resolves during non-migration situations, so resovling it as throwing
> away the DMA during migration is OK too.
> 
> > We need the flexibility to support memory hot-unplug operations
> > during migration,
> 
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
> 
> > This was implemented as a single operation specifically to avoid
> > races where ongoing access may be available after retrieving a
> > snapshot of the bitmap.  Thanks,
> 
> The issue is the cost.
> 
> On a real iommu elminating the race is expensive as we have to write
> protect the pages before query dirty, which seems to be an extra IOTLB
> flush.
> 
> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
> 
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra oveheard and userspace can decide if it wants to pay or
> not vs the non-atomic read dirties operation. And lets have a use case
> where this must be atomic before we implement it..

and write-protection also relies on the support of I/O page fault...

> 
> The downside is we loose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
> 
> (Also Joao, you should probably have a read and do not clear dirty
> operation with the idea that the next operation will be unmap - then
> maybe we can avoid IOTLB flushing..)
> 
> Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-05-05  7:25         ` Shameerali Kolothum Thodi via iommu
@ 2022-05-05  9:52           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-05  9:52 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm, iommu, jiangkunkun,
	Tian, Kevin

On 5/5/22 08:25, Shameerali Kolothum Thodi wrote:
>> -----Original Message-----
>> From: Joao Martins [mailto:joao.m.martins@oracle.com]
>> Sent: 29 April 2022 12:05
>> To: Tian, Kevin <kevin.tian@intel.com>
>> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
>> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
>> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
>> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
>> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen
>> <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Eric Auger
>> <eric.auger@redhat.com>; Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
>> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
>> kvm@vger.kernel.org; iommu@lists.linux-foundation.org
>> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
>> set_dirty_tracking_range() support
>>
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
> 
> [+Kunkun Jiang]. I think he is now looking into this and might have
> a test setup to verify this.

I'll keep him CC'ed on the next iterations. Thanks!

FWIW, this should change a bit on the next iteration (it gets simpler)
by always enabling DBM from the start. SMMUv3 ::set_dirty_tracking() becomes
a simpler function that tests quirks (i.e. DBM set) and whatnot, and calls
read_and_clear_dirty() without a bitmap argument to clear dirties.
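
In rough terms something like the below - a sketch only; the op name comes
from this series but the signature and types here are assumed, not the
actual driver code:

#include <stddef.h>     /* NULL */

/* illustrative stand-in for the io-pgtable dirty walker */
struct io_pgtable_dirty_sketch {
        /* bitmap == NULL would mean "clear the dirty bits, don't report" */
        int (*read_and_clear_dirty)(void *cookie, unsigned long iova,
                                    unsigned long size, void *bitmap);
};

static int smmu_set_dirty_tracking_sketch(struct io_pgtable_dirty_sketch *pt,
                                          void *cookie, unsigned long iova,
                                          unsigned long size)
{
        /* with DBM always enabled, "start tracking" reduces to starting
         * from a clean slate: clear stale dirties without marshalling
         * them into a bitmap */
        return pt->read_and_clear_dirty(cookie, iova, size, NULL);
}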

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05  7:42         ` Tian, Kevin
@ 2022-05-05 10:06           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-05 10:06 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

On 5/5/22 08:42, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Tuesday, May 3, 2022 2:53 AM
>>
>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>> On Fri, 29 Apr 2022 05:45:20 +0000
>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>>> erronous guest (or device) DMAing to an address being unmapped at
>> the
>>>>> same time.
>>>>
>>>> an erroneous attempt like above cannot anticipate which DMAs can
>>>> succeed in that window thus the end behavior is undefined. For an
>>>> undefined behavior nothing will be broken by losing some bits dirtied
>>>> in the window between reading back dirty bits of the range and
>>>> actually calling unmap. From guest p.o.v. all those are black-box
>>>> hardware logic to serve a virtual iotlb invalidation request which just
>>>> cannot be completed in one cycle.
>>>>
>>>> Hence in reality probably this is not required except to meet vfio
>>>> compat requirement. Just in concept returning dirty bits at unmap
>>>> is more accurate.
>>>>
>>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>>
>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>> sort of erroneous, racy behavior of guest or device.
>>
>> It is being compared against the alternative which is to explicitly
>> query dirty then do a normal unmap as two system calls and permit a
>> race.
>>
>> The only case with any difference is if the guest is racing DMA with
>> the unmap - in which case it is already indeterminate for the guest if
>> the DMA will be completed or not.
>>
>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>> already fine with throwing away that DMA because that is how the race
>> resolves during non-migration situations, so resovling it as throwing
>> away the DMA during migration is OK too.
>>
>>> We need the flexibility to support memory hot-unplug operations
>>> during migration,
>>
>> I would have thought that hotplug during migration would simply
>> discard all the data - how does it use the dirty bitmap?
>>
>>> This was implemented as a single operation specifically to avoid
>>> races where ongoing access may be available after retrieving a
>>> snapshot of the bitmap.  Thanks,
>>
>> The issue is the cost.
>>
>> On a real iommu elminating the race is expensive as we have to write
>> protect the pages before query dirty, which seems to be an extra IOTLB
>> flush.
>>
>> It is not clear if paying this cost to become atomic is actually
>> something any use case needs.
>>
>> So, I suggest we think about a 3rd op 'write protect and clear
>> dirties' that will be followed by a normal unmap - the extra op will
>> have the extra oveheard and userspace can decide if it wants to pay or
>> not vs the non-atomic read dirties operation. And lets have a use case
>> where this must be atomic before we implement it..
> 
> and write-protection also relies on the support of I/O page fault...
> 
/I think/ all IOMMUs in this series have supported permission/unrecoverable
I/O page faults for a long time, IIUC.

The earlier suggestion was just to discard the I/O page fault after
write-protection happens. FWIW, some IOMMUs also support suppressing
the event notification (like AMD).

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 10:06           ` Joao Martins
@ 2022-05-05 11:03             ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-05 11:03 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, May 5, 2022 6:07 PM
> 
> On 5/5/22 08:42, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Tuesday, May 3, 2022 2:53 AM
> >>
> >> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> >>> On Fri, 29 Apr 2022 05:45:20 +0000
> >>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >>>>> unmap. This case is specific for non-nested vIOMMU case where an
> >>>>> erronous guest (or device) DMAing to an address being unmapped at
> >> the
> >>>>> same time.
> >>>>
> >>>> an erroneous attempt like above cannot anticipate which DMAs can
> >>>> succeed in that window thus the end behavior is undefined. For an
> >>>> undefined behavior nothing will be broken by losing some bits dirtied
> >>>> in the window between reading back dirty bits of the range and
> >>>> actually calling unmap. From guest p.o.v. all those are black-box
> >>>> hardware logic to serve a virtual iotlb invalidation request which just
> >>>> cannot be completed in one cycle.
> >>>>
> >>>> Hence in reality probably this is not required except to meet vfio
> >>>> compat requirement. Just in concept returning dirty bits at unmap
> >>>> is more accurate.
> >>>>
> >>>> I'm slightly inclined to abandon it in iommufd uAPI.
> >>>
> >>> Sorry, I'm not following why an unmap with returned dirty bitmap
> >>> operation is specific to a vIOMMU case, or in fact indicative of some
> >>> sort of erroneous, racy behavior of guest or device.
> >>
> >> It is being compared against the alternative which is to explicitly
> >> query dirty then do a normal unmap as two system calls and permit a
> >> race.
> >>
> >> The only case with any difference is if the guest is racing DMA with
> >> the unmap - in which case it is already indeterminate for the guest if
> >> the DMA will be completed or not.
> >>
> >> eg on the vIOMMU case if the guest races DMA with unmap then we are
> >> already fine with throwing away that DMA because that is how the race
> >> resolves during non-migration situations, so resovling it as throwing
> >> away the DMA during migration is OK too.
> >>
> >>> We need the flexibility to support memory hot-unplug operations
> >>> during migration,
> >>
> >> I would have thought that hotplug during migration would simply
> >> discard all the data - how does it use the dirty bitmap?
> >>
> >>> This was implemented as a single operation specifically to avoid
> >>> races where ongoing access may be available after retrieving a
> >>> snapshot of the bitmap.  Thanks,
> >>
> >> The issue is the cost.
> >>
> >> On a real iommu elminating the race is expensive as we have to write
> >> protect the pages before query dirty, which seems to be an extra IOTLB
> >> flush.
> >>
> >> It is not clear if paying this cost to become atomic is actually
> >> something any use case needs.
> >>
> >> So, I suggest we think about a 3rd op 'write protect and clear
> >> dirties' that will be followed by a normal unmap - the extra op will
> >> have the extra oveheard and userspace can decide if it wants to pay or
> >> not vs the non-atomic read dirties operation. And lets have a use case
> >> where this must be atomic before we implement it..
> >
> > and write-protection also relies on the support of I/O page fault...
> >
> /I think/ all IOMMUs in this series already support permission/unrecoverable
> I/O page faults for a long time IIUC.
> 
> The earlier suggestion was just to discard the I/O page fault after
> write-protection happens. fwiw, some IOMMUs also support suppressing
> the event notification (like AMD).

iiuc the purpose of 'write-protection' here is to capture in-flight dirty pages
in the said race window until the unmap and IOTLB invalidation are completed.
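
As a rough sketch of that window, the non-atomic sequence being compared
looks like the below (the ioctl names are placeholders for illustration,
not the uAPI proposed in this series):

    /* snapshot (and clear) the dirty bits for the range */
    ioctl(iommufd, IOMMUFD_GET_DIRTY_BITMAP /* placeholder */, &range);

    /*
     * race window: the device can still write into 'range' here, and
     * those writes land in IOPTE dirty bits that the snapshot above has
     * already consumed.
     */

    /* tear down the mapping; DMA reliably stops only after the IOTLB flush */
    ioctl(iommufd, IOMMUFD_UNMAP /* placeholder */, &range);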

*unrecoverable* faults are not expected to be used in a feature path,
as the occurrence of such faults may lead to a severe reaction in iommu
drivers, e.g. completely blocking DMA from the device causing such faults.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 11:03             ` Tian, Kevin
@ 2022-05-05 11:50               ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-05 11:50 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

On 5/5/22 12:03, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Thursday, May 5, 2022 6:07 PM
>>
>> On 5/5/22 08:42, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, May 3, 2022 2:53 AM
>>>>
>>>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>>>> On Fri, 29 Apr 2022 05:45:20 +0000
>>>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>>>>> erronous guest (or device) DMAing to an address being unmapped at
>>>> the
>>>>>>> same time.
>>>>>>
>>>>>> an erroneous attempt like above cannot anticipate which DMAs can
>>>>>> succeed in that window thus the end behavior is undefined. For an
>>>>>> undefined behavior nothing will be broken by losing some bits dirtied
>>>>>> in the window between reading back dirty bits of the range and
>>>>>> actually calling unmap. From guest p.o.v. all those are black-box
>>>>>> hardware logic to serve a virtual iotlb invalidation request which just
>>>>>> cannot be completed in one cycle.
>>>>>>
>>>>>> Hence in reality probably this is not required except to meet vfio
>>>>>> compat requirement. Just in concept returning dirty bits at unmap
>>>>>> is more accurate.
>>>>>>
>>>>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>>>>
>>>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>>>> sort of erroneous, racy behavior of guest or device.
>>>>
>>>> It is being compared against the alternative which is to explicitly
>>>> query dirty then do a normal unmap as two system calls and permit a
>>>> race.
>>>>
>>>> The only case with any difference is if the guest is racing DMA with
>>>> the unmap - in which case it is already indeterminate for the guest if
>>>> the DMA will be completed or not.
>>>>
>>>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>>>> already fine with throwing away that DMA because that is how the race
>>>> resolves during non-migration situations, so resovling it as throwing
>>>> away the DMA during migration is OK too.
>>>>
>>>>> We need the flexibility to support memory hot-unplug operations
>>>>> during migration,
>>>>
>>>> I would have thought that hotplug during migration would simply
>>>> discard all the data - how does it use the dirty bitmap?
>>>>
>>>>> This was implemented as a single operation specifically to avoid
>>>>> races where ongoing access may be available after retrieving a
>>>>> snapshot of the bitmap.  Thanks,
>>>>
>>>> The issue is the cost.
>>>>
>>>> On a real iommu elminating the race is expensive as we have to write
>>>> protect the pages before query dirty, which seems to be an extra IOTLB
>>>> flush.
>>>>
>>>> It is not clear if paying this cost to become atomic is actually
>>>> something any use case needs.
>>>>
>>>> So, I suggest we think about a 3rd op 'write protect and clear
>>>> dirties' that will be followed by a normal unmap - the extra op will
>>>> have the extra oveheard and userspace can decide if it wants to pay or
>>>> not vs the non-atomic read dirties operation. And lets have a use case
>>>> where this must be atomic before we implement it..
>>>
>>> and write-protection also relies on the support of I/O page fault...
>>>
>> /I think/ all IOMMUs in this series already support permission/unrecoverable
>> I/O page faults for a long time IIUC.
>>
>> The earlier suggestion was just to discard the I/O page fault after
>> write-protection happens. fwiw, some IOMMUs also support suppressing
>> the event notification (like AMD).
> 
> iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> in the said race window until unmap and iotlb is invalidated is completed.
> 
But then we depend on PRS being there on the device, because without it, DMA to
a read-only IOVA is aborted at the target prior to the page fault, thus the page
is not going to be dirty anyway.

> *unrecoverable* faults are not expected to be used in a feature path
> as occurrence of such faults may lead to severe reaction in iommu
> drivers e.g. completely block DMA from the device causing such faults.

Unless I totally misunderstood ... the latter is actually what we were suggesting
here /in the context of unmapping a GIOVA/(*).

The wrprotect() was there to ensure we get an atomic dirty state of the IOVA range
afterwards, by blocking DMA (as opposed to sort of mediating DMA). The I/O page fault is
not supposed to happen unless there's rogue DMA AIUI.
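
A minimal sketch of that atomic flavour, assuming a hypothetical
'write-protect and clear dirties' op as suggested above (none of these
names are real uAPI):

    /*
     * 1) hypothetical op: write-protect the range and read back + clear the
     *    dirty bits accumulated so far.  After its IOTLB flush the device
     *    can no longer dirty these pages behind the bitmap's back.
     */
    ioctl(iommufd, IOMMUFD_WRPROTECT_READ_DIRTY /* hypothetical */, &args);

    /*
     * 2) normal unmap: any DMA still hitting the range takes a permission
     *    fault instead of silently dirtying memory.
     */
    ioctl(iommufd, IOMMUFD_UNMAP /* hypothetical */, &args);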

TBH, the same could be said for a normal DMA unmap, as that does not make any sort
of guarantee of stopping DMA until the IOTLB flush happens.

(*) Although I am not saying the use-case of wrprotect() and mediating dirty pages you
mention isn't useful. I guess it is, in a world where we want to support post-copy
migration with VFs, which would require some form of PRI (via the PF?) of the migratable
VF. I was just trying to point out that this is in the context of unmapping an IOVA.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 11:03             ` Tian, Kevin
@ 2022-05-05 13:55               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-05 13:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, Alex Williamson, iommu, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Cornelia Huck, kvm

On Thu, May 05, 2022 at 11:03:18AM +0000, Tian, Kevin wrote:

> iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> in the said race window until unmap and iotlb is invalidated is completed.

No, the purpose is to perform "unmap" without destroying the dirty bit
in the process.
 
If an IOMMU architecture has a way to render the page unmapped and
flush back the dirty bit rather than destroy it, then it doesn't require
a write protect pass.
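
In kAPI terms that would be something along the lines of the sketch below;
the signature is invented here for illustration and is not the op proposed
in this series:

    /*
     * Unmap an IOVA range and, on IOMMUs that can do it, hand back the
     * dirty state of the IOPTEs being torn down instead of losing it,
     * so no prior write-protect pass is needed.
     */
    size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
                                  unsigned long iova, size_t size,
                                  unsigned long *dirty_bitmap /* invented */);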

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05  7:40         ` Tian, Kevin
@ 2022-05-05 14:07           ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-05 14:07 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
 
> In concept this is an iommu property instead of a domain property.

Not really, domains shouldn't be changing behaviors once they are
created. If a domain supports dirty tracking and I attach a new device
then it still must support dirty tracking.

I suppose we may need something here, because we need to control when
domains are re-used if they don't have the right properties, in case
the system iommus are discontiguous somehow.

ie iommufd should be able to assert that dirty tracking is desired and
an existing non-dirty tracking capable domain will not be
automatically re-used.

We don't really have the right infrastructure to do this currently.

> From this angle IMHO it's more reasonable to report this IOMMU
> property to userspace via a device capability. If all devices attached
> to a hwpt claim IOMMU dirty tracking capability, the user can call
> set_dirty_tracking() on the hwpt object. 

Inherent domain properties need to be immutable or, at least one-way,
like enforced coherent, or it just all stops making any kind of sense.

> Once dirty tracking is enabled on a hwpt, further attaching a device
> which doesn't claim this capability is simply rejected.

It would be OK to do as enforced coherent does and flip a domain
permanently into dirty-tracking enabled, or to specify a flag at domain
creation time.
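
A hedged sketch of the two shapes this could take, with names invented
purely for illustration:

    /* a) decide at creation time; the property is then immutable */
    hwpt = iommufd_hwpt_alloc(ioas, IOMMUFD_HWPT_DIRTY_TRACKING /* invented flag */);

    /*
     * b) one-way flip, like enforced coherency: once flipped on, the domain
     *    never loses the property, so a later attach of a device whose IOMMU
     *    cannot track dirty bits is rejected.
     */
    iommufd_hwpt_enforce_dirty_tracking(hwpt /* invented helper */);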

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 11:50               ` Joao Martins
@ 2022-05-06  3:14                 ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-06  3:14 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, May 5, 2022 7:51 PM
> 
> On 5/5/22 12:03, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Thursday, May 5, 2022 6:07 PM
> >>
> >> On 5/5/22 08:42, Tian, Kevin wrote:
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Tuesday, May 3, 2022 2:53 AM
> >>>>
> >>>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> >>>>> On Fri, 29 Apr 2022 05:45:20 +0000
> >>>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >>>>>>> unmap. This case is specific for non-nested vIOMMU case where an
> >>>>>>> erronous guest (or device) DMAing to an address being unmapped
> at
> >>>> the
> >>>>>>> same time.
> >>>>>>
> >>>>>> an erroneous attempt like above cannot anticipate which DMAs can
> >>>>>> succeed in that window thus the end behavior is undefined. For an
> >>>>>> undefined behavior nothing will be broken by losing some bits dirtied
> >>>>>> in the window between reading back dirty bits of the range and
> >>>>>> actually calling unmap. From guest p.o.v. all those are black-box
> >>>>>> hardware logic to serve a virtual iotlb invalidation request which just
> >>>>>> cannot be completed in one cycle.
> >>>>>>
> >>>>>> Hence in reality probably this is not required except to meet vfio
> >>>>>> compat requirement. Just in concept returning dirty bits at unmap
> >>>>>> is more accurate.
> >>>>>>
> >>>>>> I'm slightly inclined to abandon it in iommufd uAPI.
> >>>>>
> >>>>> Sorry, I'm not following why an unmap with returned dirty bitmap
> >>>>> operation is specific to a vIOMMU case, or in fact indicative of some
> >>>>> sort of erroneous, racy behavior of guest or device.
> >>>>
> >>>> It is being compared against the alternative which is to explicitly
> >>>> query dirty then do a normal unmap as two system calls and permit a
> >>>> race.
> >>>>
> >>>> The only case with any difference is if the guest is racing DMA with
> >>>> the unmap - in which case it is already indeterminate for the guest if
> >>>> the DMA will be completed or not.
> >>>>
> >>>> eg on the vIOMMU case if the guest races DMA with unmap then we
> are
> >>>> already fine with throwing away that DMA because that is how the race
> >>>> resolves during non-migration situations, so resovling it as throwing
> >>>> away the DMA during migration is OK too.
> >>>>
> >>>>> We need the flexibility to support memory hot-unplug operations
> >>>>> during migration,
> >>>>
> >>>> I would have thought that hotplug during migration would simply
> >>>> discard all the data - how does it use the dirty bitmap?
> >>>>
> >>>>> This was implemented as a single operation specifically to avoid
> >>>>> races where ongoing access may be available after retrieving a
> >>>>> snapshot of the bitmap.  Thanks,
> >>>>
> >>>> The issue is the cost.
> >>>>
> >>>> On a real iommu elminating the race is expensive as we have to write
> >>>> protect the pages before query dirty, which seems to be an extra IOTLB
> >>>> flush.
> >>>>
> >>>> It is not clear if paying this cost to become atomic is actually
> >>>> something any use case needs.
> >>>>
> >>>> So, I suggest we think about a 3rd op 'write protect and clear
> >>>> dirties' that will be followed by a normal unmap - the extra op will
> >>>> have the extra oveheard and userspace can decide if it wants to pay or
> >>>> not vs the non-atomic read dirties operation. And lets have a use case
> >>>> where this must be atomic before we implement it..
> >>>
> >>> and write-protection also relies on the support of I/O page fault...
> >>>
> >> /I think/ all IOMMUs in this series already support
> permission/unrecoverable
> >> I/O page faults for a long time IIUC.
> >>
> >> The earlier suggestion was just to discard the I/O page fault after
> >> write-protection happens. fwiw, some IOMMUs also support suppressing
> >> the event notification (like AMD).
> >
> > iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> > in the said race window until unmap and iotlb is invalidated is completed.
> >
> But then we depend on PRS being there on the device, because without it,
> DMA is
> aborted on the target on a read-only IOVA prior to the page fault, thus the
> page
> is not going to be dirty anyways.
> 
> > *unrecoverable* faults are not expected to be used in a feature path
> > as occurrence of such faults may lead to severe reaction in iommu
> > drivers e.g. completely block DMA from the device causing such faults.
> 
> Unless I totally misunderstood ... the later is actually what we were
> suggesting
> here /in the context of unmaping an GIOVA/(*).
> 
> The wrprotect() was there to ensure we get an atomic dirty state of the IOVA
> range
> afterwards, by blocking DMA (as opposed to sort of mediating DMA). The I/O
> page fault is
> not supposed to happen unless there's rogue DMA AIUI.

You are right. It's me misunderstanding the proposal here. 😊

> 
> TBH, the same could be said for normal DMA unmap as that does not make
> any sort of
> guarantees of stopping DMA until the IOTLB flush happens.
> 
> (*) Although I am not saying the use-case of wrprotect() and mediating dirty
> pages you say
> isn't useful. I guess it is in a world where we want support post-copy
> migration with VFs,
> which would require some form of PRI (via the PF?) of the migratable VF. I
> was just trying
> to differentiate that this in the context of unmapping an IOVA.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 13:55               ` Jason Gunthorpe via iommu
@ 2022-05-06  3:17                 ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-06  3:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Alex Williamson, iommu, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Cornelia Huck, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 5, 2022 9:55 PM
> 
> On Thu, May 05, 2022 at 11:03:18AM +0000, Tian, Kevin wrote:
> 
> > iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> > in the said race window until unmap and iotlb is invalidated is completed.
> 
> No, the purpose is to perform "unmap" without destroying the dirty bit
> in the process.
> 
> If an IOMMU architecture has a way to render the page unmaped and
> flush back the dirty bit/not destroy then it doesn't require a write
> protect pass.
> 

Yes, I see the point now. As you said, let's consider it only when
there is a real use case requiring such atomicity.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 14:07           ` Jason Gunthorpe via iommu
@ 2022-05-06  3:51             ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-06  3:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 5, 2022 10:08 PM
> 
> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> 
> > In concept this is an iommu property instead of a domain property.
> 
> Not really, domains shouldn't be changing behaviors once they are
> created. If a domain supports dirty tracking and I attach a new device
> then it still must support dirty tracking.

That sort of suggests that userspace should specify whether a domain
supports dirty tracking when it's created. But how does userspace
know that it should create the domain in this way in the first place?
Live migration is triggered on demand and it may not happen in the
lifetime of a VM.

And if the user always creates the domain to allow dirty tracking by default,
how does it know that a failed attach is due to missing dirty tracking support
in the IOMMU, so that it can create another domain which disables dirty
tracking and retry the attach?

In any case IMHO having a device capability still sounds appealing even
in the above model, so userspace can create the domain with the right
property based on a potential list of devices to be attached. Once the
domain is created, further attached devices must be compatible with the
domain property.
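
As a sketch of that model from userspace's point of view (every helper,
flag and ioctl name below is illustrative only, not an existing interface):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* decide the hwpt property up front from a per-device capability */
    static bool all_devices_support_dirty(int iommufd, const uint32_t *dev_ids,
                                          size_t ndevs)
    {
            for (size_t i = 0; i < ndevs; i++) {
                    uint64_t caps = 0;

                    /* invented per-device capability query */
                    if (device_get_iommu_caps(iommufd, dev_ids[i], &caps) ||
                        !(caps & IOMMUFD_CAP_DIRTY_TRACKING))
                            return false;
            }
            return true;
    }

    /* the hwpt is then created with (or without) the dirty-tracking property,
     * and a later attach of an incompatible device is simply rejected */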

> 
> I suppose we may need something here because we need to control when
> domains are re-used if they don't have the right properties in case
> the system iommu's are discontiguous somehow.
> 
> ie iommufd should be able to assert that dirty tracking is desired and
> an existing non-dirty tracking capable domain will not be
> automatically re-used.
> 
> We don't really have the right infrastructure to do this currently.
> 
> > From this angle IMHO it's more reasonable to report this IOMMU
> > property to userspace via a device capability. If all devices attached
> > to a hwpt claim IOMMU dirty tracking capability, the user can call
> > set_dirty_tracking() on the hwpt object.
> 
> Inherent domain properties need to be immutable or, at least one-way,
> like enforced coherent, or it just all stops making any kind of sense.
> 
> > Once dirty tracking is enabled on a hwpt, further attaching a device
> > which doesn't claim this capability is simply rejected.
> 
> It would be OK to do as enforced coherent does as flip a domain
> permanently into dirty-tracking enabled, or specify a flag at domain
> creation time.
> 

Either way I think a device capability is useful for the user to decide
whether to do the one-way flip or to specify a flag at domain
creation.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-06  3:51             ` Tian, Kevin
@ 2022-05-06 11:46               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-06 11:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, May 5, 2022 10:08 PM
> > 
> > On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> > 
> > > In concept this is an iommu property instead of a domain property.
> > 
> > Not really, domains shouldn't be changing behaviors once they are
> > created. If a domain supports dirty tracking and I attach a new device
> > then it still must support dirty tracking.
> 
> That sort of suggests that userspace should specify whether a domain
> supports dirty tracking when it's created. But how does userspace
> know that it should create the domain in this way in the first place? 
> live migration is triggered on demand and it may not happen in the
> lifetime of a VM.

The best you could do is to look at the devices being plugged in at VM
startup, and if they all support live migration then request dirty
tracking, otherwise don't.

However, it costs nothing to have dirty tracking as long as all iommus
support it in the system - which seems to be the normal case today.

We should just always turn it on at this point. 

> and if the user always creates domain to allow dirty tracking by default,
> how does it know a failed attach is due to missing dirty tracking support
> by the IOMMU and then creates another domain which disables dirty
> tracking and retry-attach again?

The automatic logic is complicated for sure; if you had a device flag
it would have to figure it out that way.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-06 11:46               ` Jason Gunthorpe via iommu
@ 2022-05-10  1:38                 ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-10  1:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 6, 2022 7:46 PM
> 
> On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, May 5, 2022 10:08 PM
> > >
> > > On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> > >
> > > > In concept this is an iommu property instead of a domain property.
> > >
> > > Not really, domains shouldn't be changing behaviors once they are
> > > created. If a domain supports dirty tracking and I attach a new device
> > > then it still must support dirty tracking.
> >
> > That sort of suggests that userspace should specify whether a domain
> > supports dirty tracking when it's created. But how does userspace
> > know that it should create the domain in this way in the first place?
> > live migration is triggered on demand and it may not happen in the
> > lifetime of a VM.
> 
> The best you could do is to look at the devices being plugged in at VM
> startup, and if they all support live migration then request dirty
> tracking, otherwise don't.

Yes, this is how a device capability can help.

> 
> However, tt costs nothing to have dirty tracking as long as all iommus
> support it in the system - which seems to be the normal case today.
> 
> We should just always turn it on at this point.

Then we still need a way to report "all iommus support it in the system"
to userspace, since many old systems don't support it at all. If we all
agree that a device capability flag would be helpful on this front (like
you also said below), we can probably start building the initial skeleton
with that in mind?

> 
> > and if the user always creates domain to allow dirty tracking by default,
> > how does it know a failed attach is due to missing dirty tracking support
> > by the IOMMU and then creates another domain which disables dirty
> > tracking and retry-attach again?
> 
> The automatic logic is complicated for sure, if you had a device flag
> it would have to figure it out that way
> 

Yes. That is the model in my mind.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10  1:38                 ` Tian, Kevin
@ 2022-05-10 11:50                   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-10 11:50 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

On 5/10/22 02:38, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, May 6, 2022 7:46 PM
>>
>> On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Thursday, May 5, 2022 10:08 PM
>>>>
>>>> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
>>>>
>>>>> In concept this is an iommu property instead of a domain property.
>>>>
>>>> Not really, domains shouldn't be changing behaviors once they are
>>>> created. If a domain supports dirty tracking and I attach a new device
>>>> then it still must support dirty tracking.
>>>
>>> That sort of suggests that userspace should specify whether a domain
>>> supports dirty tracking when it's created. But how does userspace
>>> know that it should create the domain in this way in the first place?
>>> live migration is triggered on demand and it may not happen in the
>>> lifetime of a VM.
>>
>> The best you could do is to look at the devices being plugged in at VM
>> startup, and if they all support live migration then request dirty
>> tracking, otherwise don't.
> 
> Yes, this is how a device capability can help.
> 
>>
>> However, tt costs nothing to have dirty tracking as long as all iommus
>> support it in the system - which seems to be the normal case today.
>>
>> We should just always turn it on at this point.
> 
> Then still need a way to report " all iommus support it in the system"
> to userspace since many old systems don't support it at all. If we all
> agree that a device capability flag would be helpful on this front (like
> you also said below), probably can start building the initial skeleton
> with that in mind?
> 

This would capture device-specific and maybe iommu-instance features, but
there's a slightly odd semantic here. Nothing really depends on the device
to support any of this; it is rather the IOMMU instance that sits below the
device, which is independent of the device's own capabilities. PRI, on the
other hand, would be a perfect fit for a device capability (?), but conveying
dirty tracking over a device capability would be a convenience rather than an
exact hw representation.

Thinking out loud if we go with a device/iommu capability [to see if this matches
what people have in mind or not]: we would add a dirty-tracking feature bit via the
existing kAPI for iommu device features (e.g. IOMMU_DEV_FEAT_AD), and on iommufd we
would maybe add an IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl which would take a u64
dev_id as input (from the returned vfio-pci BIND_IOMMUFD @out_dev_id) and a u64
features as an output bitmap of synthetic feature bits, with IOMMUFD_FEATURE_AD being
the only one we query for now (and IOMMUFD_FEATURE_{SVA,IOPF} as potential future
candidates). Qemu would then, at start of day, check whether /all devices/ support it;
it would still do the blind set tracking, but bail out preemptively if any device's
IOMMU doesn't support dirty tracking. I don't think we have any case today where we
have to deal with different IOMMU instances that have different features.

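For concreteness, this is roughly how I picture that ioctl (the names, layout and
bit assignments below are all made up at this point, just a sketch):

struct iommufd_dev_get_iommu_features {
	__u32 size;
	__u32 flags;
	__u64 dev_id;		/* in: @out_dev_id from vfio-pci BIND_IOMMUFD */
	__u64 features;		/* out: bitmap of synthetic feature bits */
};

#define IOMMUFD_FEATURE_AD	(1ULL << 0)	/* IOMMU access/dirty tracking */
/* potential future bits: IOMMUFD_FEATURE_SVA, IOMMUFD_FEATURE_IOPF, ... */
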
Either that, or as discussed in the beginning, perhaps add an iommufd (or iommufd hwpt)
ioctl call (e.g. IOMMUFD_CMD_CAP) taking an input value (e.g. subop IOMMU_FEATURES) which
would give us a structure of things (e.g. for the IOMMU_FEATURES subop, the featureset
bitmap common to all iommu instances). This would answer the 'all iommus support it in
the system' question. Albeit the device one might have more concrete longevity if there
are further plans aside from dirty tracking.

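Again just to illustrate (every name here is invented for the sake of discussion):

struct iommufd_cap {
	__u32 size;
	__u32 subop;		/* in: e.g. IOMMU_FEATURES */
	__u64 val;		/* out: for IOMMU_FEATURES, the featureset bitmap
				 * common to all iommu instances in the system */
};
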
>>
>>> and if the user always creates domain to allow dirty tracking by default,
>>> how does it know a failed attach is due to missing dirty tracking support
>>> by the IOMMU and then creates another domain which disables dirty
>>> tracking and retry-attach again?
>>
>> The automatic logic is complicated for sure, if you had a device flag
>> it would have to figure it out that way
>>
> 
> Yes. That is the model in my mind.
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10  1:38                 ` Tian, Kevin
@ 2022-05-10 13:46                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-05-10 13:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, kvm, Will Deacon,
	Cornelia Huck, iommu, Alex Williamson, Martins, Joao,
	David Woodhouse, Robin Murphy

On Tue, May 10, 2022 at 01:38:26AM +0000, Tian, Kevin wrote:

> > However, it costs nothing to have dirty tracking as long as all iommus
> > support it in the system - which seems to be the normal case today.
> > 
> > We should just always turn it on at this point.
> 
> Then still need a way to report " all iommus support it in the system"
> to userspace since many old systems don't support it at all. 

Userspace can query the iommu_domain directly, or 'try and fail' to
turn on tracking.

A device capability flag is useless without a control knob to request
a domain is created with tracking, and we don't have that, or a reason
to add that.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10 13:46                   ` Jason Gunthorpe
@ 2022-05-11  1:10                     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-11  1:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, May 10, 2022 9:47 PM
> 
> On Tue, May 10, 2022 at 01:38:26AM +0000, Tian, Kevin wrote:
> 
> > > However, it costs nothing to have dirty tracking as long as all iommus
> > > support it in the system - which seems to be the normal case today.
> > >
> > > We should just always turn it on at this point.
> >
> > Then still need a way to report " all iommus support it in the system"
> > to userspace since many old systems don't support it at all.
> 
> Userspace can query the iommu_domain directly, or 'try and fail' to
> turn on tracking.
> 
> A device capability flag is useless without a control knob to request
> a domain is created with tracking, and we don't have that, or a reason
> to add that.
> 

I'm getting confused by your last comment. A capability flag has to be
accompanied by a control knob, which iiuc is what you advocated
in the earlier discussion, i.e. specifying the tracking property when creating
the domain. In this case the flag assists userspace in deciding
whether to set the property.

Not sure whether we argued past each other, but here is another
attempt.

In general I saw three options here:

a) 'try and fail' when creating the domain. It succeeds only when
all iommus support tracking;

b) capability reported on the iommu domain. The capability is reported true
only when all iommus support tracking. This allows the domain property
to be set after the domain is created. But there is not much gain in doing
so compared to a).

c) capability reported on the device. Future compatible for heterogeneous
platforms. The domain property is specified at domain creation and domains
can have different properties based on the tracking capability of the attached
devices.

I'm inclined towards c) as it is more aligned with Robin's cleanup effort on
iommu_capable() and iommu_present() in the iommu layer, which
moves away from a global manner to a per-device style. Along with 
that direction I guess we want to discourage adding more APIs that
assume the 'all iommus support a certain capability' thing?

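To make the difference concrete, a very rough userspace-side sketch (pseudo-code;
the ioctl and flag names are either made up or taken from Joao's earlier mail,
purely for illustration):

	/* a) 'try and fail': attempt to turn tracking on and cope with failure */
	ret = ioctl(hwpt_fd, IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING, &set);
	if (ret < 0)
		migration_supported = false;	/* no dirty tracking available */

	/* c) query the capability per device first, and request tracking at
	 * domain creation only when every device's IOMMU reports support */
	ret = ioctl(iommufd, IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES, &feat);
	if (!ret && (feat.features & IOMMUFD_FEATURE_AD))
		alloc.flags |= IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
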
Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10 11:50                   ` Joao Martins
@ 2022-05-11  1:17                     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-11  1:17 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Tuesday, May 10, 2022 7:51 PM
> 
> On 5/10/22 02:38, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Friday, May 6, 2022 7:46 PM
> >>
> >> On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Thursday, May 5, 2022 10:08 PM
> >>>>
> >>>> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> >>>>
> >>>>> In concept this is an iommu property instead of a domain property.
> >>>>
> >>>> Not really, domains shouldn't be changing behaviors once they are
> >>>> created. If a domain supports dirty tracking and I attach a new device
> >>>> then it still must support dirty tracking.
> >>>
> >>> That sort of suggests that userspace should specify whether a domain
> >>> supports dirty tracking when it's created. But how does userspace
> >>> know that it should create the domain in this way in the first place?
> >>> live migration is triggered on demand and it may not happen in the
> >>> lifetime of a VM.
> >>
> >> The best you could do is to look at the devices being plugged in at VM
> >> startup, and if they all support live migration then request dirty
> >> tracking, otherwise don't.
> >
> > Yes, this is how a device capability can help.
> >
> >>
> >> However, it costs nothing to have dirty tracking as long as all iommus
> >> support it in the system - which seems to be the normal case today.
> >>
> >> We should just always turn it on at this point.
> >
> > Then still need a way to report " all iommus support it in the system"
> > to userspace since many old systems don't support it at all. If we all
> > agree that a device capability flag would be helpful on this front (like
> > you also said below), probably can start building the initial skeleton
> > with that in mind?
> >
> 
> This would capture device-specific and maybe iommu-instance features, but
> there's some tiny bit odd semantic here. There's nothing that
> depends on the device to support any of this, but rather the IOMMU instance
> that sits
> below the device which is independent of device-own capabilities e.g. PRI on
> the other
> hand would be a perfect fit for a device capability (?), but dirty tracking
> conveying over a device capability would be a convenience rather than an
> exact
> hw representation.

It is sort of getting a certain iommu capability for a given device, which is
the direction the iommu kAPI is moving toward.

> 
> Thinking out loud if we are going as a device/iommu capability [to see if this
> matches
> what people have or not in mind]: we would add dirty-tracking feature bit via
> the existent
> kAPI for iommu device features (e.g. IOMMU_DEV_FEAT_AD) and on
> iommufd we would maybe add
> an IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl which would have an
> u64 dev_id as input (from
> the returned vfio-pci BIND_IOMMUFD @out_dev_id) and u64 features as an
> output bitmap of
> synthetic feature bits, having IOMMUFD_FEATURE_AD the only one we query
> (and
> IOMMUFD_FEATURE_{SVA,IOPF} as potentially future candidates). Qemu
> would then at start of
> day would check if /all devices/ support it and it would then still do the blind
> set
> tracking, but bail out preemptively if any of device-iommu don't support
> dirty-tracking. I
> don't think we have any case today for having to deal with different IOMMU
> instances that
> have different features.

This heterogeneity already exists today. On Intel platforms not all IOMMUs
support force snooping. I believe ARM has a similar situation, which is why
Robin is refactoring the bus-oriented iommu_capable() etc. to be device-oriented.

I'm not aware of such heterogeneity for dirty tracking in particular today. But
who knows whether it won't happen in the future? I just feel that aligning the iommufd
uAPI with the iommu kAPI for capability reporting might be more future proof here.

> 
> Either that or as discussed in the beginning perhaps add an iommufd (or
> iommufd hwpt one)
> ioctl  call (e.g.IOMMUFD_CMD_CAP) via a input value (e.g. subop
> IOMMU_FEATURES) which
> would gives us a structure of things (e.g. for the IOMMU_FEATURES subop
> the common
> featureset bitmap in all iommu instances). This would give the 'all iommus
> support it in
> the system'. Albeit the device one might have more concrete longevity if
> there's further
> plans aside from dirty tracking.
> 
> >>
> >>> and if the user always creates domain to allow dirty tracking by default,
> >>> how does it know a failed attach is due to missing dirty tracking support
> >>> by the IOMMU and then creates another domain which disables dirty
> >>> tracking and retry-attach again?
> >>
> >> The automatic logic is complicated for sure, if you had a device flag
> >> it would have to figure it out that way
> >>
> >
> > Yes. That is the model in my mind.
> >
> > Thanks
> > Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-04-28 21:09   ` Joao Martins
@ 2022-05-31 11:34     ` Suravee Suthikulpanit
  -1 siblings, 0 replies; 209+ messages in thread
From: Suravee Suthikulpanit via iommu @ 2022-05-31 11:34 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, Jason Gunthorpe,
	David Woodhouse, Robin Murphy

Joao,

On 4/29/22 4:09 AM, Joao Martins wrote:
> .....
> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					bool enable)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct iommu_dev_data *dev_data;
> +	bool dom_flush = false;
> +
> +	if (!amd_iommu_had_support)
> +		return -EOPNOTSUPP;
> +
> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {

Since we iterate through the device list for the domain, we would need to
call spin_lock_irqsave(&pdomain->lock, flags) here.

> +		struct amd_iommu *iommu;
> +		u64 pte_root;
> +
> +		iommu = amd_iommu_rlookup_table[dev_data->devid];
> +		pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
> +
> +		/* No change? */
> +		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
> +			continue;
> +
> +		pte_root = (enable ?
> +			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
> +
> +		/* Flush device DTE */
> +		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
> +		device_flush_dte(dev_data);
> +		dom_flush = true;
> +	}
> +
> +	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
> +	if (dom_flush) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&pdomain->lock, flags);
> +		amd_iommu_domain_flush_tlb_pde(pdomain);
> +		amd_iommu_domain_flush_complete(pdomain);
> +		spin_unlock_irqrestore(&pdomain->lock, flags);
> +	}

And call spin_unlock_irqrestore(&pdomain->lock, flags); here.
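
In other words, something along these lines (untested, just to illustrate the
suggested lock placement; the inner locking around the flush would then go away):

	unsigned long flags;

	spin_lock_irqsave(&pdomain->lock, flags);

	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		/* ... DTE_FLAG_HAD update and device_flush_dte() as above ... */
	}

	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
	if (dom_flush) {
		amd_iommu_domain_flush_tlb_pde(pdomain);
		amd_iommu_domain_flush_complete(pdomain);
	}

	spin_unlock_irqrestore(&pdomain->lock, flags);

	return 0;
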
> +
> +	return 0;
> +}
> +
> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct iommu_dev_data *dev_data;
> +	u64 dte;
> +

Also call spin_lock_irqsave(&pdomain->lock, flags) here

> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> +		dte = amd_iommu_dev_table[dev_data->devid].data[0];
> +		if (!(dte & DTE_FLAG_HAD))
> +			return false;
> +	}
> +

And call spin_unlock_irqrestore(&pdomain->lock, flags) here.
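
With the lock held, the early return inside the loop would need to change as
well; a rough sketch:

	struct iommu_dev_data *dev_data;
	unsigned long flags;
	bool ret = true;

	spin_lock_irqsave(&pdomain->lock, flags);

	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		if (!(amd_iommu_dev_table[dev_data->devid].data[0] & DTE_FLAG_HAD)) {
			ret = false;
			break;
		}
	}

	spin_unlock_irqrestore(&pdomain->lock, flags);

	return ret;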

> +	return true;
> +}
> +
> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +					  unsigned long iova, size_t size,
> +					  struct iommu_dirty_bitmap *dirty)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
> +
> +	if (!amd_iommu_get_dirty_tracking(domain))
> +		return -EOPNOTSUPP;
> +
> +	if (!ops || !ops->read_and_clear_dirty)
> +		return -ENODEV;

We should move this check before the amd_iommu_get_dirty_tracking().

Best Regards,
Suravee

> +
> +	return ops->read_and_clear_dirty(ops, iova, size, dirty);
> +}
> +
> +
>   static void amd_iommu_get_resv_regions(struct device *dev,
>   				       struct list_head *head)
>   {
> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
>   		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>   		.iotlb_sync	= amd_iommu_iotlb_sync,
>   		.free		= amd_iommu_domain_free,
> +		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
> +		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>   	}
>   };
>   

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-05-31 11:34     ` Suravee Suthikulpanit
@ 2022-05-31 12:15       ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-05-31 12:15 UTC (permalink / raw)
  To: Suravee Suthikulpanit, Joao Martins, iommu
  Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

Hi Suravee,

On 2022/5/31 19:34, Suravee Suthikulpanit wrote:
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> .....
>> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +                    bool enable)
>> +{
>> +    struct protection_domain *pdomain = to_pdomain(domain);
>> +    struct iommu_dev_data *dev_data;
>> +    bool dom_flush = false;
>> +
>> +    if (!amd_iommu_had_support)
>> +        return -EOPNOTSUPP;
>> +
>> +    list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> 
> Since we iterate through device list for the domain, we would need to
> call spin_lock_irqsave(&pdomain->lock, flags) here.

Not related, just out of curiosity: does it really need to disable interrupts
while holding this lock? Is there any case where this list would be traversed
in interrupt context? Perhaps I missed something?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
  2022-04-28 21:09   ` Joao Martins
@ 2022-05-31 12:39     ` Suravee Suthikulpanit via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Suravee Suthikulpanit @ 2022-05-31 12:39 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jean-Philippe Brucker,
	Keqian Zhu, Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm



On 4/29/22 4:09 AM, Joao Martins wrote:
> AMD implementation of unmap_read_dirty() is pretty simple as
> mostly reuses unmap code with the extra addition of marshalling
> the dirty bit into the bitmap as it walks the to-be-unmapped
> IOPTE.
> 
> Extra care is taken though, to switch over to cmpxchg as opposed
> to a non-serialized store to the PTE and testing the dirty bit
> only set until cmpxchg succeeds to set to 0.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/amd/io_pgtable.c | 44 +++++++++++++++++++++++++++++-----
>   drivers/iommu/amd/iommu.c      | 22 +++++++++++++++++
>   2 files changed, 60 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> index 8325ef193093..1868c3b58e6d 100644
> --- a/drivers/iommu/amd/io_pgtable.c
> +++ b/drivers/iommu/amd/io_pgtable.c
> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist)
>   	free_sub_pt(pt, mode, freelist);
>   }
>   
> +static bool free_pte_dirty(u64 *pte, u64 pteval)

Nitpick: since we free and clear the dirty bit, should we change
the function name to free_clear_pte_dirty()?

> +{
> +	bool dirty = false;
> +
> +	while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))

We should use 0ULL instead of 0.

> +		dirty = true;
> +
> +	return dirty;
> +}
> +

Actually, what do you think about enhancing the current free_clear_pte()
to also handle the dirty check as well?

>   /*
>    * Generic mapping functions. It maps a physical address into a DMA
>    * address space. It allocates the page table pages if necessary.
> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova,
>   	return ret;
>   }
>   
> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
> -				      unsigned long iova,
> -				      size_t size,
> -				      struct iommu_iotlb_gather *gather)
> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
> +					   unsigned long iova,
> +					   size_t size,
> +					   struct iommu_iotlb_gather *gather,
> +					   struct iommu_dirty_bitmap *dirty)
>   {
>   	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>   	unsigned long long unmapped;
> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>   	while (unmapped < size) {
>   		pte = fetch_pte(pgtable, iova, &unmap_size);
>   		if (pte) {
> -			int i, count;
> +			unsigned long i, count;
> +			bool pte_dirty = false;
>   
>   			count = PAGE_SIZE_PTE_COUNT(unmap_size);
>   			for (i = 0; i < count; i++)
> -				pte[i] = 0ULL;
> +				pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
> +

Actually, what if we change the existing free_clear_pte() to free_and_clear_dirty_pte(),
and incorporate the logic for reading the dirty bit there?

> ...
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 0a86392b2367..a8fcb6e9a684 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
>   	return r;
>   }
>   
> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
> +					 unsigned long iova, size_t page_size,
> +					 struct iommu_iotlb_gather *gather,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct protection_domain *domain = to_pdomain(dom);
> +	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
> +	size_t r;
> +
> +	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
> +	    (domain->iop.mode == PAGE_MODE_NONE))
> +		return 0;
> +
> +	r = (ops->unmap_read_dirty) ?
> +		ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
> +
> +	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
> +
> +	return r;
> +}
> +

Instead of creating a new function, what if we enhance the current amd_iommu_unmap()
to also handle the read-dirty part as well (e.g. via a common __amd_iommu_unmap_read_dirty()),
so that both amd_iommu_unmap() and amd_iommu_unmap_read_dirty() can call
__amd_iommu_unmap_read_dirty()?
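
I.e. roughly (a sketch only; it assumes amd_iommu_unmap()'s existing body is the
dirty == NULL case of the same logic):

static size_t __amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
					   unsigned long iova, size_t page_size,
					   struct iommu_iotlb_gather *gather,
					   struct iommu_dirty_bitmap *dirty)
{
	struct protection_domain *domain = to_pdomain(dom);
	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
	size_t r;

	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
	    (domain->iop.mode == PAGE_MODE_NONE))
		return 0;

	if (dirty)
		r = ops->unmap_read_dirty ?
			ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
	else
		r = ops->unmap ? ops->unmap(ops, iova, page_size, gather) : 0;

	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);

	return r;
}

static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
			      size_t page_size, struct iommu_iotlb_gather *gather)
{
	return __amd_iommu_unmap_read_dirty(dom, iova, page_size, gather, NULL);
}

static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
					 unsigned long iova, size_t page_size,
					 struct iommu_iotlb_gather *gather,
					 struct iommu_dirty_bitmap *dirty)
{
	return __amd_iommu_unmap_read_dirty(dom, iova, page_size, gather, dirty);
}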

Best Regards,
Suravee

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-05-31 11:34     ` Suravee Suthikulpanit
@ 2022-05-31 15:22       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-31 15:22 UTC (permalink / raw)
  To: Suravee Suthikulpanit, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jean-Philippe Brucker,
	Keqian Zhu, Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

On 5/31/22 12:34, Suravee Suthikulpanit wrote:
> Joao,
> 
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> .....
>> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					bool enable)
>> +{
>> +	struct protection_domain *pdomain = to_pdomain(domain);
>> +	struct iommu_dev_data *dev_data;
>> +	bool dom_flush = false;
>> +
>> +	if (!amd_iommu_had_support)
>> +		return -EOPNOTSUPP;
>> +
>> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> 
> Since we iterate through device list for the domain, we would need to
> call spin_lock_irqsave(&pdomain->lock, flags) here.
> 
Ugh, yes. Will fix.

>> +		struct amd_iommu *iommu;
>> +		u64 pte_root;
>> +
>> +		iommu = amd_iommu_rlookup_table[dev_data->devid];
>> +		pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
>> +
>> +		/* No change? */
>> +		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
>> +			continue;
>> +
>> +		pte_root = (enable ?
>> +			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
>> +
>> +		/* Flush device DTE */
>> +		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
>> +		device_flush_dte(dev_data);
>> +		dom_flush = true;
>> +	}
>> +
>> +	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
>> +	if (dom_flush) {
>> +		unsigned long flags;
>> +
>> +		spin_lock_irqsave(&pdomain->lock, flags);
>> +		amd_iommu_domain_flush_tlb_pde(pdomain);
>> +		amd_iommu_domain_flush_complete(pdomain);
>> +		spin_unlock_irqrestore(&pdomain->lock, flags);
>> +	}
> 
> And call spin_unlock_irqrestore(&pdomain->lock, flags); here.

ack

Additionally, something that I am thinking of for v2 is to have a
@had bool field in iommu_dev_data. That would align better with the
rest of the amd iommu code rather than introducing this pattern of
using the hardware location of the PTE roots. Let me know if you disagree.

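Roughly (a sketch only; the field name is up for bikeshedding):

	/* in struct iommu_dev_data */
	bool had;	/* hardware access/dirty (DTE_FLAG_HAD) currently enabled */

	/* ... and amd_iommu_get_dirty_tracking() would then simply do: */
	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		if (!dev_data->had)
			return false;
	}
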
>> +
>> +	return 0;
>> +}
>> +
>> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
>> +{
>> +	struct protection_domain *pdomain = to_pdomain(domain);
>> +	struct iommu_dev_data *dev_data;
>> +	u64 dte;
>> +
> 
> Also call spin_lock_irqsave(&pdomain->lock, flags) here
> 
ack
>> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
>> +		dte = amd_iommu_dev_table[dev_data->devid].data[0];
>> +		if (!(dte & DTE_FLAG_HAD))
>> +			return false;
>> +	}
>> +
> 
> And call spin_unlock_irqsave(&pdomain->lock, flags) here
> 
ack

Same comment as above applies, and the @dte checking would then be
replaced by simply checking this new field.

>> +	return true;
>> +}
>> +
>> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +					  unsigned long iova, size_t size,
>> +					  struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct protection_domain *pdomain = to_pdomain(domain);
>> +	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
>> +
>> +	if (!amd_iommu_get_dirty_tracking(domain))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (!ops || !ops->read_and_clear_dirty)
>> +		return -ENODEV;
> 
> We move this check before the amd_iommu_get_dirty_tracking().
> 

Yeap, better fail earlier.

> Best Regards,
> Suravee
> 
>> +
>> +	return ops->read_and_clear_dirty(ops, iova, size, dirty);
>> +}
>> +
>> +
>>   static void amd_iommu_get_resv_regions(struct device *dev,
>>   				       struct list_head *head)
>>   {
>> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
>>   		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>>   		.iotlb_sync	= amd_iommu_iotlb_sync,
>>   		.free		= amd_iommu_domain_free,
>> +		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
>> +		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>>   	}
>>   };
>>   

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
  2022-05-31 12:39     ` Suravee Suthikulpanit via iommu
@ 2022-05-31 15:51       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-31 15:51 UTC (permalink / raw)
  To: Suravee Suthikulpanit, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jean-Philippe Brucker,
	Keqian Zhu, Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

On 5/31/22 13:39, Suravee Suthikulpanit wrote:
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> AMD implementation of unmap_read_dirty() is pretty simple as
>> mostly reuses unmap code with the extra addition of marshalling
>> the dirty bit into the bitmap as it walks the to-be-unmapped
>> IOPTE.
>>
>> Extra care is taken though, to switch over to cmpxchg as opposed
>> to a non-serialized store to the PTE and testing the dirty bit
>> only set until cmpxchg succeeds to set to 0.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/amd/io_pgtable.c | 44 +++++++++++++++++++++++++++++-----
>>   drivers/iommu/amd/iommu.c      | 22 +++++++++++++++++
>>   2 files changed, 60 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
>> index 8325ef193093..1868c3b58e6d 100644
>> --- a/drivers/iommu/amd/io_pgtable.c
>> +++ b/drivers/iommu/amd/io_pgtable.c
>> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist)
>>   	free_sub_pt(pt, mode, freelist);
>>   }
>>   
>> +static bool free_pte_dirty(u64 *pte, u64 pteval)
> 
> Nitpick: Since we free and clearing the dirty bit, should we change
> the function name to free_clear_pte_dirty()?
> 

We free and *read* the dirty bit. It just so happens that we clear the dirty
bit and every other one in the process. Just to make it clear that I am not
clearing the dirty bit explicitly (like read_and_clear_dirty() does).

>> +{
>> +	bool dirty = false;
>> +
>> +	while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
> 
> We should use 0ULL instead of 0.
>

ack.

>> +		dirty = true;
>> +
>> +	return dirty;
>> +}
>> +
> 
> Actually, what do you think if we enhance the current free_clear_pte()
> to also handle the check dirty as well?
> 
See further below, about dropping this patch.

>>   /*
>>    * Generic mapping functions. It maps a physical address into a DMA
>>    * address space. It allocates the page table pages if necessary.
>> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova,
>>   	return ret;
>>   }
>>   
>> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> -				      unsigned long iova,
>> -				      size_t size,
>> -				      struct iommu_iotlb_gather *gather)
>> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> +					   unsigned long iova,
>> +					   size_t size,
>> +					   struct iommu_iotlb_gather *gather,
>> +					   struct iommu_dirty_bitmap *dirty)
>>   {
>>   	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>   	unsigned long long unmapped;
>> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>>   	while (unmapped < size) {
>>   		pte = fetch_pte(pgtable, iova, &unmap_size);
>>   		if (pte) {
>> -			int i, count;
>> +			unsigned long i, count;
>> +			bool pte_dirty = false;
>>   
>>   			count = PAGE_SIZE_PTE_COUNT(unmap_size);
>>   			for (i = 0; i < count; i++)
>> -				pte[i] = 0ULL;
>> +				pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
>> +
> 
> Actually, what if we change the existing free_clear_pte() to free_and_clear_dirty_pte(),
> and incorporate the logic for
> 
Likewise, but otherwise it would be a good idea.

>> ...
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 0a86392b2367..a8fcb6e9a684 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
>>   	return r;
>>   }
>>   
>> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
>> +					 unsigned long iova, size_t page_size,
>> +					 struct iommu_iotlb_gather *gather,
>> +					 struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct protection_domain *domain = to_pdomain(dom);
>> +	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
>> +	size_t r;
>> +
>> +	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
>> +	    (domain->iop.mode == PAGE_MODE_NONE))
>> +		return 0;
>> +
>> +	r = (ops->unmap_read_dirty) ?
>> +		ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
>> +
>> +	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
>> +
>> +	return r;
>> +}
>> +
> 
> Instead of creating a new function, what if we enhance the current amd_iommu_unmap()
> to also handle read dirty part as well (e.g. __amd_iommu_unmap_read_dirty()), and
> then both amd_iommu_unmap() and amd_iommu_unmap_read_dirty() can call
> the __amd_iommu_unmap_read_dirty()?

Yes, if we were to keep this one.

I am actually dropping this patch (and the whole unmap_read_dirty additions).
The unmap_read_dirty() will be replaced either by having userspace do get_dirty_iova()
before the unmap(), or by keeping the uAPI in iommufd but implementing it as a read_dirty()
followed by a regular unmap, without the special IOMMU unmap path. See this thread that
starts here:

	https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/

But essentially, the proposed unmap_read_dirty primitive isn't fully race free, as it only
tackles races against the IOMMU updating the IOPTE. DMA could still be happening between the
time I clear the PTE and when I do the IOMMU TLB flush (think vIOMMU use cases). Eliminating
the race fully is expensive, requiring an extra TLB flush + IOPT walk in addition to the unmap
one (we would essentially double the cost). The thinking is that an alternative new primitive
would instead write-protect the IOVA (thus blocking DMA), flush the IOTLB, and only then read
out the dirty bits, with the unmap itself being a regular unmap. For now I won't be adding
this, as it is not clear that any use case really needs it.
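
A rough sketch of what that alternative sequence could look like (all helper names here are
made up purely for illustration; none of this exists in the series):

	/* 1) write-protect the range so the device can no longer mark it dirty */
	iopt_wrprotect_iova(iopt, iova, length);		/* hypothetical */
	/* 2) flush the IOTLB so in-flight translations are dropped */
	iommu_flush_iotlb_all(domain);
	/* 3) dirty bits are now stable; read them out into the user bitmap */
	iopt_read_and_clear_dirty(iopt, iova, length, bitmap);	/* hypothetical */
	/* 4) the unmap itself is just the regular unmap path */
	iopt_unmap_iova(iopt, iova, length);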

	Joao

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-11  1:10                     ` Tian, Kevin
  (?)
@ 2022-07-12 18:34                     ` Joao Martins
  2022-07-21 14:24                       ` Jason Gunthorpe
  -1 siblings, 1 reply; 209+ messages in thread
From: Joao Martins @ 2022-07-12 18:34 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

On 5/11/22 02:10, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Tuesday, May 10, 2022 9:47 PM
>>
>> On Tue, May 10, 2022 at 01:38:26AM +0000, Tian, Kevin wrote:
>>
>>>> However, tt costs nothing to have dirty tracking as long as all iommus
>>>> support it in the system - which seems to be the normal case today.
>>>>
>>>> We should just always turn it on at this point.
>>>
>>> Then still need a way to report " all iommus support it in the system"
>>> to userspace since many old systems don't support it at all.
>>
>> Userspace can query the iommu_domain directly, or 'try and fail' to
>> turn on tracking.
>>
>> A device capability flag is useless without a control knob to request
>> a domain is created with tracking, and we don't have that, or a reason
>> to add that.
>>
> 
> I'm getting confused on your last comment. A capability flag has to
> accompany with a control knob which iiuc is what you advocated
> in earlier discussion i.e. specifying the tracking property when creating
> the domain. In this case the flag assists the userspace in deciding
> whether to set the property.
> 
> Not sure whether we argued pass each other but here is another attempt.
> 
> In general I saw three options here:
> 
> a) 'try and fail' when creating the domain. It succeeds only when
> all iommus support tracking;
> 
> b) capability reported on iommu domain. The capability is reported true
> only when all iommus support tracking. This allows domain property
> to be set after domain is created. But there is no much gain of doing
> so when comparing to a).
> 
> c) capability reported on device. future compatible for heterogenous
> platform. domain property is specified at domain creation and domains
> can have different properties based on tracking capability of attached
> devices.
> 
> I'm inclined to c) as it is more aligned to Robin's cleanup effort on
> iommu_capable() and iommu_present() in the iommu layer which
> moves away from global manner to per-device style. Along with 
> that direction I guess we want to discourage adding more APIs
> assuming 'all iommus supporting certain capability' thing?
> 

Not sure where we left off on this one, so hopefully this is just for my own
clarification on what we see as the path forward.

I have a slight inclination towards option b), because VMMs with IOMMU
dirty tracking only care about what an IOMMU domain (its set of devices) can do.
Migration shouldn't even be attempted if one of the devices in the IOMMU
domain doesn't support it. That said, it seems we will need something like c) for
other use cases that depend on PCIe endpoint support (like PRS).

a) is what we have in the RFC, and it has the same semantics as b), except that b) adds
an explicit query API rather than an implicit failure when one of the
devices in the iommu_domain doesn't support it.

Here's an interface sketch for b) and c).

Kevin seems to be inclined into c); how about you Jason?

For b):

+
+/**
+ * enum iommufd_dirty_status_flags - Flags for dirty tracking status
+ */
+enum iommufd_dirty_status_flags {
+       IOMMU_DIRTY_TRACKING_DISABLED = 0,
+       IOMMU_DIRTY_TRACKING_ENABLED = 1 << 0,
+       IOMMU_DIRTY_TRACKING_SUPPORTED = 1 << 1,
+       IOMMU_DIRTY_TRACKING_UNSUPPORTED = 1 << 2,
+};
+
+/**
+ * struct iommu_hwpt_get_dirty - ioctl(IOMMU_HWPT_GET_DIRTY)
+ * @size: sizeof(struct iommu_hwpt_get_dirty)
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ * @out_status: status of dirty tracking support (see iommu_dirty_status_flags)
+ *
+ * Get dirty tracking status on an HW pagetable.
+ */
+struct iommu_hwpt_get_dirty {
+       __u32 size;
+       __u32 hwpt_id;
+       __u16 out_status;
+       __u16 __reserved;
+};
+#define IOMMU_HWPT_GET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY)

The IOMMU implementation reports whether dirty tracking is enabled/disabled and
supported/unsupported for the set of devices in the iommu domain. After dirty tracking is
enabled we are supposed to fail device attach for any IOMMU that doesn't support it; that
is supposed to happen anyway, regardless of which approach we pick.
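
From the VMM side the check would then be roughly (a minimal sketch against the layout
proposed above; error handling kept to a minimum):

static bool hwpt_dirty_supported(int iommufd, __u32 hwpt_id)
{
	struct iommu_hwpt_get_dirty get = {
		.size = sizeof(get),
		.hwpt_id = hwpt_id,
	};

	if (ioctl(iommufd, IOMMU_HWPT_GET_DIRTY, &get))
		return false;

	return get.out_status & IOMMU_DIRTY_TRACKING_SUPPORTED;
}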

For c):

+
+/**
+ * enum iommufd_device_caps
+ * @IOMMU_CAP_DIRTY_TRACKING: IOMMU device support for dirty tracking
+ */
+enum iommufd_device_caps {
+       IOMMUFD_CAP_DIRTY_TRACKING = 1 << 0,
+};
+
+/*
+ * struct iommu_device_caps - ioctl(IOMMU_DEVICE_GET_CAPS)
+ * @size: sizeof(struct iommu_device_caps)
+ * @dev_id: the device to query
+ * @caps: IOMMU capabilities of the device
+ */
+struct iommu_device_caps {
+       __u32 size;
+       __u32 dev_id;
+       __aligned_u64 caps;
+};
+#define IOMMU_DEVICE_GET_CAPS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DEVICE_GET_CAPS)

Returns a hardware-agnostic view of the IOMMU 'capabilities' of the device. @dev_id
is supposed to be an iommufd_device object id. The VMM is supposed to store the dev_ids,
iterate over them, and check every one for dirty tracking support prior to set_dirty.
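
For c) that iteration would be something like (again just a sketch against the layout above):

static bool devices_support_dirty(int iommufd, __u32 *dev_ids, unsigned int ndevs)
{
	unsigned int i;

	for (i = 0; i < ndevs; i++) {
		struct iommu_device_caps caps = {
			.size = sizeof(caps),
			.dev_id = dev_ids[i],
		};

		if (ioctl(iommufd, IOMMU_DEVICE_GET_CAPS, &caps) ||
		    !(caps.caps & IOMMUFD_CAP_DIRTY_TRACKING))
			return false;
	}

	return true;
}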

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-07-12 18:34                     ` Joao Martins
@ 2022-07-21 14:24                       ` Jason Gunthorpe
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-07-21 14:24 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Tue, Jul 12, 2022 at 07:34:48PM +0100, Joao Martins wrote:

> > In general I saw three options here:
> > 
> > a) 'try and fail' when creating the domain. It succeeds only when
> > all iommus support tracking;
> > 
> > b) capability reported on iommu domain. The capability is reported true
> > only when all iommus support tracking. This allows domain property
> > to be set after domain is created. But there is no much gain of doing
> > so when comparing to a).
> > 
> > c) capability reported on device. future compatible for heterogenous
> > platform. domain property is specified at domain creation and domains
> > can have different properties based on tracking capability of attached
> > devices.
> > 
> > I'm inclined to c) as it is more aligned to Robin's cleanup effort on
> > iommu_capable() and iommu_present() in the iommu layer which
> > moves away from global manner to per-device style. Along with 
> > that direction I guess we want to discourage adding more APIs
> > assuming 'all iommus supporting certain capability' thing?
> > 
> 
> Not sure where we are left off on this one, so hopefully just for my own
> clarification on what we see is the path forward.

I prefer we stick to the APIs we know we already need.

We need an API to create an iommu_domain for a device with a bunch of
parameters. "I want dirty tracking" is a very reasonable parameter to
put here. This can support "try and fail" if we want.
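
Purely as an illustration of that "try and fail" flow (the ioctl, struct and flag names
below are hypothetical; this command doesn't exist in this series yet):

	struct iommu_hwpt_alloc alloc = {		/* hypothetical ioctl/struct */
		.size = sizeof(alloc),
		.dev_id = dev_id,
		.flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING,	/* hypothetical flag */
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc)) {
		/* no dirty tracking available; retry without it and skip migration */
		alloc.flags = 0;
		if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
			return -1;
	}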

We certainly need "c"; somehow the userspace needs to know what inputs
the create domain call will accept - minimally it needs to know which
IOMMU driver is under the device so it knows how to ask for a user
space page table. This could also report other parameters that are
interesting, like "device could support dirty tracking".

Having the domain set to dirty tracking also means that it will refuse
to attach to any device that doesn't have dirty tracking support (eg
if there is non-uniformity among the iommus) - this composes well with
the EMEDIUMTYPE work.

So, the only change I would make to your proposal is to move this to the
create domain command, which we should really pull out of one of the HW
branches and lock down RSN.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-05-05  9:52           ` Joao Martins
  (?)
@ 2022-08-29  9:59           ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29  9:59 UTC (permalink / raw)
  To: Joao Martins
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm, iommu, jiangkunkun,
	Tian, Kevin



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 05 May 2022 10:53
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>; David
> Woodhouse <dwmw2@infradead.org>; Lu Baolu <baolu.lu@linux.intel.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
> Yishai Hadas <yishaih@nvidia.com>; Eric Auger <eric.auger@redhat.com>;
> Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> kvm@vger.kernel.org; iommu@lists.linux-foundation.org; jiangkunkun
> <jiangkunkun@huawei.com>; Tian, Kevin <kevin.tian@intel.com>
> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> On 5/5/22 08:25, Shameerali Kolothum Thodi wrote:
> >> -----Original Message-----
> >> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> >> Sent: 29 April 2022 12:05
> >> To: Tian, Kevin <kevin.tian@intel.com>
> >> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
> >> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> >> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> >> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> >> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> >> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> >> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen
> >> <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Eric Auger
> >> <eric.auger@redhat.com>; Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
> >> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> >> kvm@vger.kernel.org; iommu@lists.linux-foundation.org
> >> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> >> set_dirty_tracking_range() support
> >>
> >> On 4/29/22 09:28, Tian, Kevin wrote:
> >>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>> Sent: Friday, April 29, 2022 5:09 AM
> >>>>
> >>>> Similar to .read_and_clear_dirty() use the page table
> >>>> walker helper functions and set DBM|RDONLY bit, thus
> >>>> switching the IOPTE to writeable-clean.
> >>>
> >>> this should not be one-off if the operation needs to be
> >>> applied to IOPTE. Say a map request comes right after
> >>> set_dirty_tracking() is called. If it's agreed to remove
> >>> the range op then smmu driver should record the tracking
> >>> status internally and then apply the modifier to all the new
> >>> mappings automatically before dirty tracking is disabled.
> >>> Otherwise the same logic needs to be kept in iommufd to
> >>> call set_dirty_tracking_range() explicitly for every new
> >>> iopt_area created within the tracking window.
> >>
> >> Gah, I totally missed that by mistake. New mappings aren't
> >> carrying over the "DBM is set". This needs a new io-pgtable
> >> quirk added post dirty-tracking toggling.
> >>
> >> I can adjust, but I am at odds on including this in a future
> >> iteration given that I can't really test any of this stuff.
> >> Might drop the driver until I have hardware/emulation I can
> >> use (or maybe others can take over this). It was included
> >> for revising the iommu core ops and whether iommufd was
> >> affected by it.
> >
> > [+Kunkun Jiang]. I think he is now looking into this and might have
> > a test setup to verify this.
> 
> I'll keep him CC'ed next iterations. Thanks!
> 
> FWIW, this should change a bit on the next iteration (simpler)
> by always enabling DBM from the start. SMMUv3 ::set_dirty_tracking()
> becomes a simpler function that tests quirks (i.e. DBM set) and what not,
> and calls read_and_clear_dirty() without a bitmap argument to clear dirties.

Hi Joao,

Hope we will soon have a revised spin of this series. In the meantime, I tried to
hack the QEMU vSMMUv3 to emulate the support required to test this, as
access to hardware is very limited. I managed to put together a just-enough setup
to cover the ARM side of this series. Based on the test coverage I had
and going through the code, please see my comments on a few of the
patches in this series. Hope it will be helpful when you attempt a re-spin.

Thanks,
Shameer
 

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  2022-04-28 21:09   ` Joao Martins
  (?)
@ 2022-08-29  9:59   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29  9:59 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm,
	jiangkunkun



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:09
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; jiangkunkun
> <jiangkunkun@huawei.com>
> Subject: [PATCH RFC 14/19] iommu/arm-smmu-v3: Add
> read_and_clear_dirty() support
> 
> .read_and_clear_dirty() IOMMU domain op takes care of
> reading the dirty bits (i.e. PTE has both DBM and AP[2] set)
> and marshalling into a bitmap of a given page size.
> 
> While reading the dirty bits we also clear the PTE AP[2]
> bit to mark it as writable-clean.
> 
> Structure it in a way that the IOPTE walker is generic,
> and so we pass a function pointer over what to do on a per-PTE
> basis. This is useful for a followup patch where we supply an
> io-pgtable op to enable DBM when starting/stopping dirty tracking.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Co-developed-by: Kunkun Jiang <jiangkunkun@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 ++++++
>  drivers/iommu/io-pgtable-arm.c              | 102
> +++++++++++++++++++-
>  2 files changed, 128 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 4dba53bde2e3..232057d20197 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2743,6 +2743,32 @@ static int arm_smmu_enable_nesting(struct
> iommu_domain *domain)
>  	return ret;
>  }
> 
> +static int arm_smmu_read_and_clear_dirty(struct iommu_domain
> *domain,
> +					 unsigned long iova, size_t size,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct arm_smmu_domain *smmu_domain =
> to_smmu_domain(domain);
> +	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	int ret;
> +
> +	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
> +	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
> +		return -ENODEV;
> +
> +	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
> +		return -EINVAL;
> +
> +	if (!ops || !ops->read_and_clear_dirty) {
> +		pr_err_once("io-pgtable don't support dirty tracking\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = ops->read_and_clear_dirty(ops, iova, size, dirty);
> +
> +	return ret;
> +}
> +
>  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args
> *args)
>  {
>  	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2871,6 +2897,7 @@ static struct iommu_ops arm_smmu_ops = {
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
>  		.enable_nesting		= arm_smmu_enable_nesting,
>  		.free			= arm_smmu_domain_free,
> +		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
>  	}
>  };
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c
> b/drivers/iommu/io-pgtable-arm.c
> index 94ff319ae8ac..3c99028d315a 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -75,6 +75,7 @@
> 
>  #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
>  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
> +#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
>  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
>  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
>  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> @@ -84,7 +85,7 @@
> 
>  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
>  /* Ignore the contiguous bit for block splitting */
> -#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> +#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)13) << 51)
>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK
> |	\
>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
>  /* Software bit for solving coherency races */
> @@ -93,6 +94,9 @@
>  /* Stage-1 PTE */
>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
>  #define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
> +#define ARM_LPAE_PTE_AP_RDONLY_BIT	7
> +#define ARM_LPAE_PTE_AP_WRITABLE	(ARM_LPAE_PTE_AP_RDONLY | \
> +					 ARM_LPAE_PTE_DBM)
>  #define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
>  #define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
> 
> @@ -737,6 +741,101 @@ static phys_addr_t arm_lpae_iova_to_phys(struct
> io_pgtable_ops *ops,
>  	return iopte_to_paddr(pte, data) | iova;
>  }
> 
> +static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
> +					   arm_lpae_iopte *ptep, void *opaque)
> +{
> +	struct iommu_dirty_bitmap *dirty = opaque;
> +	arm_lpae_iopte pte;
> +
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if (pte & ARM_LPAE_PTE_AP_WRITABLE)
> +		return 0;

We might have set ARM_LPAE_PTE_DBM already. So does the above need to be:
if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE) ?
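
i.e. the states the walker has to tell apart would roughly be (a sketch, assuming the driver
sets DBM on tracked mappings and the usual DBM behaviour where hardware clears AP[2] on a
write; the check itself is the suggested one above):

	/*
	 * DBM=1, AP[2]=1 -> writable-clean, nothing to record
	 * DBM=1, AP[2]=0 -> HW cleared AP[2] on a write, i.e. dirty
	 * DBM=0          -> not tracked, caught by the DBM check below
	 */
	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE)
		return 0;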

Thanks,
Shameer

> +
> +	if (!(pte & ARM_LPAE_PTE_DBM))
> +		return 0;
> +
> +	iommu_dirty_bitmap_record(dirty, iova, size);
> +	set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
> +	return 0;
> +}
> +
> +static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
> +				 unsigned long iova, size_t size,
> +				 int lvl, arm_lpae_iopte *ptep,
> +				 int (*fn)(unsigned long iova, size_t size,
> +					   arm_lpae_iopte *pte, void *opaque),
> +				 void *opaque)
> +{
> +	arm_lpae_iopte pte;
> +	struct io_pgtable *iop = &data->iop;
> +	size_t base, next_size;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(!fn))
> +		return -EINVAL;
> +
> +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> +		return -EINVAL;
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
> +		if (iopte_leaf(pte, lvl, iop->fmt))
> +			return fn(iova, size, ptep, opaque);
> +
> +		/* Current level is table, traverse next level */
> +		next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
> +		ptep = iopte_deref(pte, data);
> +		for (base = 0; base < size; base += next_size) {
> +			ret = __arm_lpae_iopte_walk(data, iova + base,
> +						    next_size, lvl + 1, ptep,
> +						    fn, opaque);
> +			if (ret)
> +				return ret;
> +		}
> +		return 0;
> +	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
> +		return fn(iova, size, ptep, opaque);
> +	}
> +
> +	/* Keep on walkin */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_iopte_walk(data, iova, size, lvl + 1, ptep,
> +				     fn, opaque);
> +}
> +
> +static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
> +					 unsigned long iova, size_t size,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return -EINVAL;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext))
> +		return -EINVAL;
> +
> +	if (data->iop.fmt != ARM_64_LPAE_S1 &&
> +	    data->iop.fmt != ARM_32_LPAE_S1)
> +		return -EINVAL;
> +
> +	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
> +				     __arm_lpae_read_and_clear_dirty, dirty);
> +}
> +
>  static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>  {
>  	unsigned long granule, page_sizes;
> @@ -817,6 +916,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>  		.unmap		= arm_lpae_unmap,
>  		.unmap_pages	= arm_lpae_unmap_pages,
>  		.iova_to_phys	= arm_lpae_iova_to_phys,
> +		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
>  	};
> 
>  	return data;
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-28 21:09   ` Joao Martins
  (?)
  (?)
@ 2022-08-29 10:00   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29 10:00 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:09
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org
> Subject: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> Similar to .read_and_clear_dirty() use the page table
> walker helper functions and set DBM|RDONLY bit, thus
> switching the IOPTE to writeable-clean.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 ++++++++++++
>  drivers/iommu/io-pgtable-arm.c              | 52
> +++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 232057d20197..1ca72fcca930 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2769,6 +2769,34 @@ static int
> arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
>  	return ret;
>  }
> 
> +static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
> +				       unsigned long iova, size_t size,
> +				       struct iommu_iotlb_gather *iotlb_gather,
> +				       bool enabled)
> +{
> +	struct arm_smmu_domain *smmu_domain =
> to_smmu_domain(domain);
> +	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	int ret;
> +
> +	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
> +	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
> +		return -ENODEV;
> +
> +	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
> +		return -EINVAL;
> +
> +	if (!ops || !ops->set_dirty_tracking) {
> +		pr_err_once("io-pgtable don't support dirty tracking\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = ops->set_dirty_tracking(ops, iova, size, enabled);
> +	iommu_iotlb_gather_add_range(iotlb_gather, iova, size);
> +
> +	return ret;
> +}
> +
>  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args
> *args)
>  {
>  	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2898,6 +2926,7 @@ static struct iommu_ops arm_smmu_ops = {
>  		.enable_nesting		= arm_smmu_enable_nesting,
>  		.free			= arm_smmu_domain_free,
>  		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
> +		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
>  	}
>  };
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c
> b/drivers/iommu/io-pgtable-arm.c
> index 3c99028d315a..361410aa836c 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -76,6 +76,7 @@
>  #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
>  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
>  #define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
> +#define ARM_LPAE_PTE_DBM_BIT		51
>  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
>  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
>  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> @@ -836,6 +837,56 @@ static int arm_lpae_read_and_clear_dirty(struct
> io_pgtable_ops *ops,
>  				     __arm_lpae_read_and_clear_dirty, dirty);
>  }
> 
> +static int __arm_lpae_set_dirty_modifier(unsigned long iova, size_t size,
> +					 arm_lpae_iopte *ptep, void *opaque)
> +{
> +	bool enabled = *((bool *) opaque);
> +	arm_lpae_iopte pte;
> +
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) ==
> ARM_LPAE_PTE_AP_RDONLY)
> +		return -EINVAL;
> +
> +	if (!(enabled ^ !(pte & ARM_LPAE_PTE_DBM)))
> +		return 0;

Does the above need to be a double negation?

if (!(enabled ^ !!(pte & ARM_LPAE_PTE_DBM)))

Thanks,
Shameer

> +
> +	pte = enabled ? pte | (ARM_LPAE_PTE_DBM |
> ARM_LPAE_PTE_AP_RDONLY) :
> +		pte & ~(ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY);
> +
> +	WRITE_ONCE(*ptep, pte);
> +	return 0;
> +}
> +
> +
> +static int arm_lpae_set_dirty_tracking(struct io_pgtable_ops *ops,
> +				       unsigned long iova, size_t size,
> +				       bool enabled)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return -EINVAL;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext))
> +		return -EINVAL;
> +
> +	if (data->iop.fmt != ARM_64_LPAE_S1 &&
> +	    data->iop.fmt != ARM_32_LPAE_S1)
> +		return -EINVAL;
> +
> +	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
> +				     __arm_lpae_set_dirty_modifier, &enabled);
> +}
> +
>  static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>  {
>  	unsigned long granule, page_sizes;
> @@ -917,6 +968,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>  		.unmap_pages	= arm_lpae_unmap_pages,
>  		.iova_to_phys	= arm_lpae_iova_to_phys,
>  		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
> +		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
>  	};
> 
>  	return data;
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-28 21:09   ` Joao Martins
  (?)
  (?)
@ 2022-08-29 10:00   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29 10:00 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm,
	jiangkunkun



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:10
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; jiangkunkun
> <jiangkunkun@huawei.com>
> Subject: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1
> with io-pgtable mapping
> 
> From: Kunkun Jiang <jiangkunkun@huawei.com>
> 
> As nested mode is not upstreamed now, we just aim to support dirty log
> tracking for stage1 with io-pgtable mapping (meaning SVA mappings are not
> supported). If HTTU is supported, we enable HA/HD bits in the SMMU CD and
> transfer ARM_HD quirk to io-pgtable.
> 
> We additionally filter out HD|HA if not supported. The CD.HD bit is not
> particularly useful unless we toggle the DBM bit in the PTE entries.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> [joaomart: Convey HD|HA bits over to the context descriptor and update commit message]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  include/linux/io-pgtable.h                  |  1 +
>  3 files changed, 15 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 1ca72fcca930..5f728f8f20a2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct
> arm_smmu_domain *smmu_domain, int ssid,
>  		 * this substream's traffic
>  		 */
>  	} else { /* (1) and (2) */
> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
> +		u64 tcr = cd->tcr;
> +
>  		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>  		cdptr[2] = 0;
>  		cdptr[3] = cpu_to_le64(cd->mair);
> 
> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> +
>  		/*
>  		 * STE is live, and the SMMU might read dwords of this CD in any
>  		 * order. Ensure that it observes valid values before reading @@
> -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct
> arm_smmu_domain *smmu_domain,
>  			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>  			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>  			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
>  			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>  	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
> 
> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct
> iommu_domain *domain,
>  		.iommu_dev	= smmu->dev,
>  	};
> 
> +	if (smmu->features & ARM_SMMU_FEAT_HD)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

Setting these quirk bits requires updating the check in arm_64_lpae_alloc_pgtable_s1()
in drivers/iommu/io-pgtable-arm.c
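
i.e. roughly something like the below in arm_64_lpae_alloc_pgtable_s1() (sketch only; the
exact set of quirks accepted there, including the new BBML ones, may differ):

	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
			    IO_PGTABLE_QUIRK_NON_STRICT |
			    IO_PGTABLE_QUIRK_ARM_TTBR1 |
			    IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
			    IO_PGTABLE_QUIRK_ARM_HD))
		return NULL;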

Thanks,
Shameer

>  	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>  		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>  	else if (smmu->features & ARM_SMMU_FEAT_BBML2) diff --git
> a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index e15750be1d95..ff32242f2fdb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -292,6 +292,9 @@
>  #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>  #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
> 
> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
> +
>  #define CTXDESC_CD_0_AA64		(1UL << 41)
>  #define CTXDESC_CD_0_S			(1UL << 44)
>  #define CTXDESC_CD_0_R			(1UL << 45)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h index
> d7626ca67dbf..a11902ae9cf1 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>  	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>  	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>  	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
> 
>  	unsigned long			quirks;
>  	unsigned long			pgsize_bitmap;
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09   ` Joao Martins
                     ` (2 preceding siblings ...)
  (?)
@ 2022-08-29 10:01   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29 10:01 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:09
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org
> Subject: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
> 
> Add an io_pagetable kernel API to toggle dirty tracking:
> 
> * iopt_set_dirty_tracking(iopt, [domain], state)
> 
> It receives either NULL (which means all domains) or an
> iommu_domain. The intended caller of this is via the hw_pagetable
> object that is created on device attach, which passes an
> iommu_domain. For now, the all-domains is left for vfio-compat.
> 
> The hw protection domain dirty control is favored over the IOVA-range
> alternative. For the latter, it iterates over all IOVA areas and calls
> iommu domain op to enable/disable for the range.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/io_pagetable.c    | 71
> +++++++++++++++++++++++++
>  drivers/iommu/iommufd/iommufd_private.h |  3 ++
>  2 files changed, 74 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/io_pagetable.c
> b/drivers/iommu/iommufd/io_pagetable.c
> index f9f3b06946bf..f4609ef369e0 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -276,6 +276,77 @@ int iopt_map_user_pages(struct io_pagetable *iopt,
> unsigned long *iova,
>  	return 0;
>  }
> 
> +static int __set_dirty_tracking_range_locked(struct iommu_domain
> *domain,
> +					     struct io_pagetable *iopt,
> +					     bool enable)
> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iopt_area *area;
> +	int ret = -EOPNOTSUPP;
> +	unsigned long iova;
> +	size_t size;
> +
> +	iommu_iotlb_gather_init(&gather);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		iova = iopt_area_iova(area);
> +		size = iopt_area_last_iova(area) - iova;

   size = iopt_area_last_iova(area) - iova + 1;  ?

Thanks,
Shameer
> +
> +		if (ops->set_dirty_tracking_range) {
> +			ret = ops->set_dirty_tracking_range(domain, iova,
> +							    size, &gather,
> +							    enable);
> +			if (ret < 0)
> +				break;
> +		}
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +
> +	return ret;
> +}
> +
> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
> +				    struct io_pagetable *iopt, bool enable)
> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	int ret = -EOPNOTSUPP;
> +
> +	if (ops->set_dirty_tracking)
> +		ret = ops->set_dirty_tracking(domain, enable);
> +	else if (ops->set_dirty_tracking_range)
> +		ret = __set_dirty_tracking_range_locked(domain, iopt,
> +							enable);
> +
> +	return ret;
> +}
> +
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	struct iommu_domain *dom;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (!domain) {
> +		down_write(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
> +			if (ret < 0)
> +				break;
> +		}
> +		up_write(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
> +	}
> +
> +	up_write(&iopt->iova_rwsem);
> +	return ret;
> +}
> +
>  struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long
> iova,
>  				  unsigned long *start_byte,
>  				  unsigned long length)
> diff --git a/drivers/iommu/iommufd/iommufd_private.h
> b/drivers/iommu/iommufd/iommufd_private.h
> index f55654278ac4..d00ef3b785c5 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt,
> unsigned long iova,
>  		    unsigned long length);
>  int iopt_unmap_all(struct io_pagetable *iopt);
> 
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable);
> +
>  int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>  		      unsigned long npages, struct page **out_pages, bool write);
>  void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

end of thread, other threads:[~2022-08-29 10:02 UTC | newest]

Thread overview: 209+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-28 21:09 [PATCH RFC 00/19] IOMMUFD Dirty Tracking Joao Martins
2022-04-28 21:09 ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  7:54   ` Tian, Kevin
2022-04-29  7:54     ` Tian, Kevin
2022-04-29 10:44     ` Joao Martins
2022-04-29 10:44       ` Joao Martins
2022-04-29 12:08   ` Jason Gunthorpe
2022-04-29 12:08     ` Jason Gunthorpe via iommu
2022-04-29 14:26     ` Joao Martins
2022-04-29 14:26       ` Joao Martins
2022-04-29 14:35       ` Jason Gunthorpe
2022-04-29 14:35         ` Jason Gunthorpe via iommu
2022-04-29 13:40   ` Baolu Lu
2022-04-29 13:40     ` Baolu Lu
2022-04-29 15:27     ` Joao Martins
2022-04-29 15:27       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  8:07   ` Tian, Kevin
2022-04-29  8:07     ` Tian, Kevin
2022-04-29 10:48     ` Joao Martins
2022-04-29 10:48       ` Joao Martins
2022-04-29 11:56     ` Jason Gunthorpe
2022-04-29 11:56       ` Jason Gunthorpe via iommu
2022-04-29 14:28       ` Joao Martins
2022-04-29 14:28         ` Joao Martins
2022-04-29 23:51   ` Baolu Lu
2022-04-29 23:51     ` Baolu Lu
2022-05-02 11:57     ` Joao Martins
2022-05-02 11:57       ` Joao Martins
2022-08-29 10:01   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 03/19] iommufd: Dirty tracking data support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  8:12   ` Tian, Kevin
2022-04-29  8:12     ` Tian, Kevin
2022-04-29 10:54     ` Joao Martins
2022-04-29 10:54       ` Joao Martins
2022-04-29 12:09       ` Jason Gunthorpe
2022-04-29 12:09         ` Jason Gunthorpe via iommu
2022-04-29 14:33         ` Joao Martins
2022-04-29 14:33           ` Joao Martins
2022-04-30  4:11   ` Baolu Lu
2022-04-30  4:11     ` Baolu Lu
2022-05-02 12:06     ` Joao Martins
2022-05-02 12:06       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-30  5:12   ` Baolu Lu
2022-04-30  5:12     ` Baolu Lu
2022-05-02 12:22     ` Joao Martins
2022-05-02 12:22       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova() Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 12:14   ` Jason Gunthorpe
2022-04-29 12:14     ` Jason Gunthorpe via iommu
2022-04-29 14:36     ` Joao Martins
2022-04-29 14:36       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 06/19] iommufd: Dirty tracking IOCTLs for the hw_pagetable Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 12:19   ` Jason Gunthorpe
2022-04-29 12:19     ` Jason Gunthorpe via iommu
2022-04-29 14:27     ` Joao Martins
2022-04-29 14:27       ` Joao Martins
2022-04-29 14:36       ` Jason Gunthorpe via iommu
2022-04-29 14:36         ` Jason Gunthorpe
2022-04-29 14:52         ` Joao Martins
2022-04-29 14:52           ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-05-31 11:34   ` Suravee Suthikulpanit via iommu
2022-05-31 11:34     ` Suravee Suthikulpanit
2022-05-31 12:15     ` Baolu Lu
2022-05-31 12:15       ` Baolu Lu
2022-05-31 15:22     ` Joao Martins
2022-05-31 15:22       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-05-31 12:39   ` Suravee Suthikulpanit
2022-05-31 12:39     ` Suravee Suthikulpanit via iommu
2022-05-31 15:51     ` Joao Martins
2022-05-31 15:51       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 11/19] iommu/amd: Print access/dirty bits if supported Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 12/19] iommu/arm-smmu-v3: Add feature detection for HTTU Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 11:11   ` Robin Murphy
2022-04-29 11:11     ` Robin Murphy
2022-04-29 11:54     ` Joao Martins
2022-04-29 11:54       ` Joao Martins
2022-04-29 12:26       ` Robin Murphy
2022-04-29 12:26         ` Robin Murphy
2022-04-29 14:34         ` Joao Martins
2022-04-29 14:34           ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-08-29  9:59   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  8:28   ` Tian, Kevin
2022-04-29  8:28     ` Tian, Kevin
2022-04-29 11:05     ` Joao Martins
2022-04-29 11:05       ` Joao Martins
2022-04-29 11:19       ` Robin Murphy
2022-04-29 11:19         ` Robin Murphy
2022-04-29 12:06         ` Joao Martins
2022-04-29 12:06           ` Joao Martins
2022-04-29 12:23           ` Jason Gunthorpe
2022-04-29 12:23             ` Jason Gunthorpe via iommu
2022-04-29 14:45             ` Joao Martins
2022-04-29 14:45               ` Joao Martins
2022-04-29 16:11               ` Jason Gunthorpe
2022-04-29 16:11                 ` Jason Gunthorpe via iommu
2022-04-29 16:40                 ` Joao Martins
2022-04-29 16:40                   ` Joao Martins
2022-04-29 16:46                   ` Jason Gunthorpe
2022-04-29 16:46                     ` Jason Gunthorpe via iommu
2022-04-29 19:20                   ` Robin Murphy
2022-04-29 19:20                     ` Robin Murphy
2022-05-02 11:52                     ` Joao Martins
2022-05-02 11:52                       ` Joao Martins
2022-05-02 11:57                       ` Joao Martins
2022-05-02 11:57                         ` Joao Martins
2022-05-05  7:25       ` Shameerali Kolothum Thodi
2022-05-05  7:25         ` Shameerali Kolothum Thodi via iommu
2022-05-05  9:52         ` Joao Martins
2022-05-05  9:52           ` Joao Martins
2022-08-29  9:59           ` Shameerali Kolothum Thodi
2022-08-29 10:00   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 11:35   ` Robin Murphy
2022-04-29 11:35     ` Robin Murphy
2022-04-29 12:10     ` Joao Martins
2022-04-29 12:10       ` Joao Martins
2022-04-29 12:46       ` Robin Murphy
2022-04-29 12:46         ` Robin Murphy
2022-08-29 10:00   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 11:53   ` Robin Murphy
2022-04-29 11:53     ` Robin Murphy
2022-04-28 21:09 ` [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  9:03   ` Tian, Kevin
2022-04-29  9:03     ` Tian, Kevin
2022-04-29 11:20     ` Joao Martins
2022-04-29 11:20       ` Joao Martins
2022-04-30  6:12   ` Baolu Lu
2022-04-30  6:12     ` Baolu Lu
2022-05-02 12:24     ` Joao Martins
2022-05-02 12:24       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 19/19] iommu/intel: Add unmap_read_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  5:45 ` [PATCH RFC 00/19] IOMMUFD Dirty Tracking Tian, Kevin
2022-04-29  5:45   ` Tian, Kevin
2022-04-29 10:27   ` Joao Martins
2022-04-29 10:27     ` Joao Martins
2022-04-29 12:38     ` Jason Gunthorpe
2022-04-29 12:38       ` Jason Gunthorpe via iommu
2022-04-29 15:20       ` Joao Martins
2022-04-29 15:20         ` Joao Martins
2022-05-05  7:40       ` Tian, Kevin
2022-05-05  7:40         ` Tian, Kevin
2022-05-05 14:07         ` Jason Gunthorpe
2022-05-05 14:07           ` Jason Gunthorpe via iommu
2022-05-06  3:51           ` Tian, Kevin
2022-05-06  3:51             ` Tian, Kevin
2022-05-06 11:46             ` Jason Gunthorpe
2022-05-06 11:46               ` Jason Gunthorpe via iommu
2022-05-10  1:38               ` Tian, Kevin
2022-05-10  1:38                 ` Tian, Kevin
2022-05-10 11:50                 ` Joao Martins
2022-05-10 11:50                   ` Joao Martins
2022-05-11  1:17                   ` Tian, Kevin
2022-05-11  1:17                     ` Tian, Kevin
2022-05-10 13:46                 ` Jason Gunthorpe via iommu
2022-05-10 13:46                   ` Jason Gunthorpe
2022-05-11  1:10                   ` Tian, Kevin
2022-05-11  1:10                     ` Tian, Kevin
2022-07-12 18:34                     ` Joao Martins
2022-07-21 14:24                       ` Jason Gunthorpe
2022-05-02 18:11   ` Alex Williamson
2022-05-02 18:11     ` Alex Williamson
2022-05-02 18:52     ` Jason Gunthorpe
2022-05-02 18:52       ` Jason Gunthorpe via iommu
2022-05-03 10:48       ` Joao Martins
2022-05-03 10:48         ` Joao Martins
2022-05-05  7:42       ` Tian, Kevin
2022-05-05  7:42         ` Tian, Kevin
2022-05-05 10:06         ` Joao Martins
2022-05-05 10:06           ` Joao Martins
2022-05-05 11:03           ` Tian, Kevin
2022-05-05 11:03             ` Tian, Kevin
2022-05-05 11:50             ` Joao Martins
2022-05-05 11:50               ` Joao Martins
2022-05-06  3:14               ` Tian, Kevin
2022-05-06  3:14                 ` Tian, Kevin
2022-05-05 13:55             ` Jason Gunthorpe
2022-05-05 13:55               ` Jason Gunthorpe via iommu
2022-05-06  3:17               ` Tian, Kevin
2022-05-06  3:17                 ` Tian, Kevin
