* [PATCH RFC 00/19] IOMMUFD Dirty Tracking
@ 2022-04-28 21:09 ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Presented herewith is a series that extends IOMMUFD with IOMMU
hardware support for the dirty bit in IOPTEs.

Today, AMD Milan (which has been out for a year now) supports it, while ARM
SMMUv3.2+ and VT-D rev3.x are expected to eventually come along.
The intended use-case is to support Live Migration with SR-IOV, with IOMMUs
that support it. Yishai Hadas will soon be submitting an RFC that covers the
PCI device dirty tracker via vfio.

At a quick glance, IOMMUFD lets the userspace VMM create an IOAS with a
set of IOVA ranges mapped to some physical memory, composing an IO
pagetable. This is then attached to a particular device, consequently
creating the protection domain to share a common IO page table
representing the endpoint's DMA-addressable guest address space.
(Hopefully I am not twisting the terminology here.) The resultant object
is a hw_pagetable object which represents the iommu_domain
object that will be directly manipulated. For more background on
IOMMUFD have a look at these two series[0][1] on the kernel and qemu
consumption respectively. The IOMMUFD UAPI, kAPI and the iommu core
kAPI are then extended to provide:

 1) Enabling or disabling dirty tracking on the iommu_domain. Modelled
as the most common case of changing hardware protection domain control
bits, plus the ARM-specific case of having to enable the per-PTE DBM control
bit. The 'real' tracking of whether dirty tracking is enabled or not is
stored in the vendor IOMMU, hence no new fields are added to iommufd
pagetable structures.

 2) Reading the IO PTEs and marshalling their dirtiness into a bitmap. The
bitmap thus describes the IOVAs that got written by the device (see the
sketch after item 3 below). While performing the marshalling, vendors also
need to clear the dirty bits from the IOPTEs and allow the kAPI caller to
batch the much needed IOTLB flush.
There's no copy of bitmaps to userspace-backed memory; everything is
zero-copy based. So far this is a test-and-clear kind of interface, given
that the IOPT walk is going to be expensive. It occurred to me to separate
the readout of dirty from the clearing of dirty in the IOPTEs.
I haven't opted for that, given that it would mean two lengthy IOPTE
walks and felt counter-performant.

 3) Unmapping an IOVA range while returning its dirty bits prior to
unmap. This case is specific to the non-nested vIOMMU case, where an
erroneous guest (or device) might be DMAing to an address while it is
being unmapped at the same time.
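
To make (2) concrete, here is a minimal sketch of the IOVA-to-bitmap mapping
assumed throughout the series: one bit per IO page of tracked IOVA space. The
helper name below is illustrative only and not part of the proposed kAPI:

    #include <linux/bitmap.h>
    #include <linux/minmax.h>

    /* Mark [iova, iova + length) as dirty in a bitmap starting at base_iova. */
    static void mark_iova_dirty(unsigned long *bitmap, unsigned long base_iova,
                                unsigned int pgshift, unsigned long iova,
                                unsigned long length)
    {
            unsigned long nbits = max(1UL, length >> pgshift);
            unsigned long offset = (iova - base_iova) >> pgshift;

            bitmap_set(bitmap, offset, nbits);
    }

iommufd pins the user-provided bitmap pages and hands them to the vendor
driver, which records dirty IOVAs with exactly this kind of arithmetic
(patch 1 adds iommu_dirty_bitmap_record() for that purpose).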

[See also the general remarks at the end, specifically the one regarding
 probing dirty tracking via a dedicated iommufd cap ioctl.]

The series is organized as follows:

* Patches 1-3: Take care of the iommu domain operations to be added and
extend the iommufd io-pagetable to set/clear dirty tracking, as well as
to read the dirty bits from the vendor pagetables. The idea is to abstract
iommu vendors from any notion of how bitmaps are stored or propagated back to
the caller, as well as allowing control/batching over the IOTLB flush. So
there's a data structure and a helper that only tell the upper layer that
an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walk
the bitmap user memory, and kmap them as needed. The IOMMU vendor just deals
with a 'dirty bitmap state' and records an IOVA as dirty in the
vendor IOMMU implementation.

* Patches 4-5: Add the new unmap domain op that returns whether the IOVA
got dirtied. I separated this from the rest of the set, as I am still
questioning the need for this API and whether this race fundamentally
needs to be handled. I guess the thinking is that live-migration
should be guest foolproof, but it is unclear how often the race happens
in practice to deem this a necessary unmap variant. Perhaps it might be
enough to fetch the dirty bits prior to the unmap? Feedback appreciated.

* Patches 6-8: Add the UAPIs for IOMMUFD, vfio-compat and selftests.
We should discuss whether to include the vfio-compat or not, given how
vfio-type1-iommu perpetually dirties any IOVA, which here I am replacing
with the IOMMU hw support. I haven't implemented the perpetual dirtying,
given its lack of usefulness over an IOMMU-backed implementation (or so
I think). The selftests mainly test the principal workflow; more corner
cases still need to be added.

Note: Given that there's no capability reporting for the new APIs, page
sizes, etc., a userspace app using the native IOMMUFD API will get
-EOPNOTSUPP when dirty tracking is not supported by the IOMMU hardware.

For completeness, and most importantly to make sure the new IOMMU core ops
capture the hardware building blocks, implementations were written for all
the IOMMUs that will eventually get IOMMU A/D support. So the second half
of the series presents *proof of concept* implementations for these IOMMUs:

* Patches 9-11: AMD IOMMU implementation, particularly for those having
HDSup support. Tested with a Qemu amd-iommu with HDSup emulated,
and also on an AMD Milan server IOMMU.

* Patches 12-17: Adapt the past series from Keqian Zhu[2], reworked
to do the dynamic set/clear dirty tracking and to implicitly clear
dirty bits on the readout. Given the lack of hardware and the difficulty
of getting this into an emulated SMMUv3 (given the dependency on the PE HTTU
and BBML2, IIUC), this is only compile tested. Hopefully I am not
getting the attribution wrong.

* Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu-based
intel-iommu with SSADS/SLADS emulation support.

To help testing/prototyping, qemu iommu emulation bits were written
to increase coverage of this code and hopefully make this more broadly
available to fellow contributors/devs. A separate series is submitted right
after this one, covering the Qemu IOMMUFD extensions for dirty tracking
alongside the A/D-bit emulation for its x86 iommus. Meanwhile it's also on
github (https://github.com/jpemartins/qemu/commits/iommufd).

Remarks / Observations:

* There's no capabilities API in IOMMUFD, so in this RFC each vendor checks
for support inside each of the newly added ops. Initially I was thinking of
having a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather than
bailing out with -EOPNOTSUPP), as well as a get_dirty_tracking
iommu-core API. On the UAPI, perhaps it might be better to have a single API
for capabilities in general (similar to KVM), at its simplest a subop
where the necessary info is conveyed on a per-subop basis?
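
Purely to illustrate the idea being floated here (the struct and field names
below are hypothetical and do not exist in this series or in any current
UAPI), such a generic capability ioctl with subops could look roughly like:

    #include <linux/types.h>

    /* Hypothetical sketch, for discussion only -- not part of this series. */
    struct iommu_hwpt_get_capabilities {
            __u32 size;   /* in: sizeof(struct iommu_hwpt_get_capabilities) */
            __u32 subop;  /* in: which capability to query, e.g. dirty tracking */
            __u64 flags;  /* out: subop-specific capability flags */
            __u64 data;   /* out: subop-specific info, e.g. supported
                           *      dirty-tracking granularities */
    };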

* The UAPI/kAPI could be generalized in the next iteration to also cover the
Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
It wasn't done, as I was not aware of a use-case. I am wondering
if the access bits could be used to do some form of zero page detection
(to just send the pages that got touched), although dirty bits could be
used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE
walk and marshalling into bitmaps, as well as the necessary IOTLB flush
batching, are all the same. The focus is on the dirty bit given that the
dirtiness IOVA feedback is used to select the pages that need to be
transferred to the destination while migration is happening.
Sidebar: Sadly, there are a lot fewer clever tricks that can be
done (compared to the CPU/KVM) without having the PCI device cooperate
(things like userfaultfd, wrprotect, etc would turn into nefarious IOMMU
perm faults and device DMA target aborts).
If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
bits, we can instead have the ioctls be named after
HWPT_SET_TRACKING() and add another argument which selects which bits to
enable tracking for (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
Likewise for the read_and_clear(), as all PTE bits follow the same logic
as dirty. Happy to readjust if folks think it is worthwhile.

* IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
only care about the first stage of IOMMU pagetables for hypervisors i.e.
tracking dirty GPAs (and not caring about dirty GIOVAs).

* Dirty bit tracking alone is not enough. Large IO pages tend to be the norm
when DMA mapping large ranges of IOVA space, when really the VMM wants the
smallest granularity possible to track (i.e. host base pages). A separate bit
of work will need to take care of demoting IOPTE page sizes at guest-runtime
to increase/decrease the dirty tracking granularity, likely in the form of an
IOAS page-size demote/promote within a previously mapped IOVA range.

Feedback is very much appreciated!

[0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
[1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
[2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@huawei.com/

	Joao

TODOs:
* More selftests for large/small iopte sizes;
* Better vIOMMU+VFIO testing (AMD doesn't support it);
* Performance efficiency of GET_DIRTY_IOVA in various workloads;
* Testing with a live migratable VF;

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3: Add feature detection for HTTU

Joao Martins (16):
  iommu: Add iommu_domain ops for dirty tracking
  iommufd: Dirty tracking for io_pagetable
  iommufd: Dirty tracking data support
  iommu: Add an unmap API that returns dirtied IOPTEs
  iommufd: Add a dirty bitmap to iopt_unmap_iova()
  iommufd: Dirty tracking IOCTLs for the hw_pagetable
  iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  iommufd: Add a test for dirty tracking ioctls
  iommu/amd: Access/Dirty bit support in IOPTEs
  iommu/amd: Add unmap_read_dirty() support
  iommu/amd: Print access/dirty bits if supported
  iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  iommu/arm-smmu-v3: Add unmap_read_dirty() support
  iommu/intel: Access/Dirty bit support for SL domains
  iommu/intel: Add unmap_read_dirty() support

Kunkun Jiang (2):
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

 drivers/iommu/amd/amd_iommu.h               |   1 +
 drivers/iommu/amd/amd_iommu_types.h         |  11 +
 drivers/iommu/amd/init.c                    |  12 +-
 drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
 drivers/iommu/amd/iommu.c                   |  99 ++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
 drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
 drivers/iommu/intel/pasid.c                 |  76 ++++++
 drivers/iommu/intel/pasid.h                 |   7 +
 drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
 drivers/iommu/iommu.c                       |  71 +++++-
 drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
 drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
 drivers/iommu/iommufd/io_pagetable.h        |   3 +-
 drivers/iommu/iommufd/ioas.c                |  35 ++-
 drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
 drivers/iommu/iommufd/iommufd_test.h        |   9 +
 drivers/iommu/iommufd/main.c                |   9 +
 drivers/iommu/iommufd/pages.c               |  79 +++++-
 drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
 drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
 include/linux/intel-iommu.h                 |  30 +++
 include/linux/io-pgtable.h                  |  20 ++
 include/linux/iommu.h                       |  64 +++++
 include/uapi/linux/iommufd.h                |  78 ++++++
 tools/testing/selftests/iommu/Makefile      |   1 +
 tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
 28 files changed, 2047 insertions(+), 75 deletions(-)

-- 
2.17.2


* [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add to the iommu domain operations a set of callbacks to
perform dirty tracking, particularly to start and stop
tracking, and finally to test and clear the dirty data.

Drivers are expected to dynamically change their hw protection
domain bits to toggle the tracking and flush some form of
control state structure that stands in the IOVA translation
path.

For reading and clearing dirty data, on all IOMMUs a transition
of any of the PTE access bits (Access, Dirty) implies flushing
the IOTLB to invalidate any stale IOTLB data regarding whether
or not the IOMMU should update said PTEs. The iommu core APIs
introduce a new structure for storing the dirties, although vendor
IOMMUs implementing .read_and_clear_dirty() just use
iommu_dirty_bitmap_record() to set the memory storing the dirties.
The underlying tracking/iteration of user bitmap memory is instead
done by iommufd, which takes care of initializing the dirty bitmap
*prior* to passing it to the IOMMU domain op.

This holds for the currently/to-be-supported IOMMUs with dirty
tracking support, particularly because the tracking is part of
the first stage tables and part of address translation. Below
it is described how each hardware deals with the hardware protection
domain control bits, to justify the added iommu core APIs. The
vendor IOMMU implementations will also explain in more detail
the dirty bit usage/clearing in the IOPTEs.

* x86 AMD:

For AMD, the dirty tracking control bits live in the Device Table entry,
and updating it must be followed by flushing the Device IOTLB. On AMD[1],
section "2.2.1 Updating Shared Tables", e.g.

> Each table can also have its contents cached by the IOMMU or
> peripheral IOTLBs. Therefore, after
> updating a table entry that can be cached, system software must
> send the IOMMU an appropriate
> invalidate command. Information in the peripheral IOTLBs must
> also be invalidated.

There's no mention of the particular bits that are cached or
not, but fetching a dev entry is part of address translation
as also depicted, so invalidate the device table to make
sure the next translations fetch a DTE entry with the HD bits set.

* x86 Intel (rev3.0+):

Likewise[2], set the SSADE bit in the scalable-mode PASID table entry
to enable Access/Dirty bits in the second stage page table. See the manual,
particularly "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
Considerations":

> When modifying root-entries, scalable-mode root-entries,
> context-entries, or scalable-mode context entries:
> Software must serially invalidate the context-cache,
> PASID-cache (if applicable), and the IOTLB. The serialization is
> required since hardware may utilize information from the
> context-caches (e.g., Domain-ID) to tag new entries inserted to
> the PASID-cache and IOTLB for processing in-flight requests.
> Section 6.5 describes the invalidation operations.

And also Table 23 ("Guidance to Software for Invalidations") in
"6.5.3.3 Guidance to Software for Invalidations" explicitly mentions:

> SSADE transition from 0 to 1 in a scalable-mode PASID-table
> entry with PGTT value of Second-stage or Nested

* ARM SMMUv3.2:

SMMUv3.2 needs the dirty bit toggled in the CD (or S2CD) descriptor,
followed by flushing/invalidating the IOMMU dev IOTLB.

Reference[0]: SMMU spec, "5.4.1 CD notes",

> The following CD fields are permitted to be cached as part of a
> translation or TLB entry, and alteration requires
> invalidation of any TLB entry that might have cached these
> fields, in addition to CD structure cache invalidation:
>
> ...
> HA, HD
> ...

The ARM SMMUv3 case is a tad different from its x86
counterparts, though. Rather than changing *only* the IOMMU domain device
entry to enable dirty tracking (and having a dedicated bit for dirtiness in
the IOPTE), ARM instead uses a dirty-bit modifier which is separately
enabled and changes the *existing* meaning of the access bits (for ro/rw),
to the point that marking the access bit read-only with the
dirty-bit-modifier enabled doesn't trigger a perm IO page fault.

In practice this means that changing the iommu context entry isn't enough,
and is in fact mostly useless IIUC (and can always be enabled). Dirtying
is only really enabled when the DBM pte bit is set (with the
CD.HD bit as a prereq).

To capture this h/w construct, an iommu core API is added which enables
dirty tracking on an IOVA range rather than on a device/context entry.
iommufd picks one or the other; the IOMMUFD core will favour the
device-context op, falling back to the IOVA-range alternative.

[0] https://developer.arm.com/documentation/ihi0070/latest
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://cdrdv2.intel.com/v1/dl/getContent/671081

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
 include/linux/io-pgtable.h |  6 +++++
 include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0c42ece25854..d18b9ddbcce4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/export.h>
 #include <linux/slab.h>
+#include <linux/highmem.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
 #include <linux/idr.h>
@@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
 	return user;
 }
 EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
+
+unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
+				       unsigned long iova, unsigned long length)
+{
+	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
+
+	nbits = max(1UL, length >> dirty->pgshift);
+	offset = (iova - dirty->iova) >> dirty->pgshift;
+	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
+	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
+	start_offset = dirty->start_offset;
+
+	while (nbits > 0) {
+		kaddr = kmap(dirty->pages[idx]) + start_offset;
+		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
+		bitmap_set(kaddr, offset, size);
+		kunmap(dirty->pages[idx]);
+		start_offset = offset = 0;
+		nbits -= size;
+		idx++;
+	}
+
+	if (dirty->gather)
+		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
+
+	return nbits;
+}
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 86af6f0a00a2..82b39925c21f 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -165,6 +165,12 @@ struct io_pgtable_ops {
 			      struct iommu_iotlb_gather *gather);
 	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
 				    unsigned long iova);
+	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
+				  unsigned long iova, size_t size,
+				  bool enabled);
+	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
+				    unsigned long iova, size_t size,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6ef2df258673..ca076365d77b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
 	bool			queued;
 };
 
+/**
+ * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
+ *
+ * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
+ * @pgshift: Page granularity of the bitmap
+ * @gather: Range information for a pending IOTLB flush
+ * @start_offset: Offset of the first user page
+ * @pages: User pages representing the bitmap region
+ * @npages: Number of user pages pinned
+ */
+struct iommu_dirty_bitmap {
+	unsigned long iova;
+	unsigned long pgshift;
+	struct iommu_iotlb_gather *gather;
+	unsigned long start_offset;
+	unsigned long npages;
+	struct page **pages;
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -275,6 +294,13 @@ struct iommu_ops {
  * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
+ * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
+ * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
+ *                            an iommu domain
+ * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
+ *                        into a bitmap, with a bit represented as a page.
+ *                        Reads the dirty PTE bits and clears it from IO
+ *                        pagetables.
  */
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
@@ -305,6 +331,15 @@ struct iommu_domain_ops {
 				  unsigned long quirks);
 
 	void (*free)(struct iommu_domain *domain);
+
+	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
+	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
+					unsigned long iova, size_t size,
+					struct iommu_iotlb_gather *iotlb_gather,
+					bool enabled);
+	int (*read_and_clear_dirty)(struct iommu_domain *domain,
+				    unsigned long iova, size_t size,
+				    struct iommu_dirty_bitmap *dirty);
 };
 
 /**
@@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
 extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
 			      unsigned long iova, int flags);
 
+unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
+				       unsigned long iova, unsigned long length);
+
+static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
+					   unsigned long base,
+					   unsigned long pgshift,
+					   struct iommu_iotlb_gather *gather)
+{
+	memset(dirty, 0, sizeof(*dirty));
+	dirty->iova = base;
+	dirty->pgshift = pgshift;
+	dirty->gather = gather;
+
+	if (gather)
+		iommu_iotlb_gather_init(dirty->gather);
+}
+
 static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	if (domain->ops->flush_iotlb_all)
-- 
2.17.2
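
For illustration only, here is a rough sketch of how a vendor driver might
implement the new .read_and_clear_dirty() op on top of the helpers above.
foo_iopte_test_and_clear_dirty() is a made-up stand-in for the vendor's own
IOPTE walk; only iommu_dirty_bitmap_record() and struct iommu_dirty_bitmap
come from this patch:

    /* Hypothetical vendor implementation of .read_and_clear_dirty(). */
    static int foo_read_and_clear_dirty(struct iommu_domain *domain,
                                        unsigned long iova, size_t size,
                                        struct iommu_dirty_bitmap *dirty)
    {
            unsigned long end = iova + size;
            unsigned long pgsize;

            while (iova < end) {
                    /*
                     * Stand-in for the vendor IOPTE walk: returns true (and
                     * clears the IOPTE dirty bit) if the IOPTE covering @iova
                     * was dirty, and always reports the covering IOPTE page
                     * size in @pgsize so the walk can advance.
                     */
                    if (foo_iopte_test_and_clear_dirty(domain, iova, &pgsize))
                            iommu_dirty_bitmap_record(dirty, iova, pgsize);

                    iova += pgsize;
            }

            /* The IOTLB flush is batched via dirty->gather by the caller. */
            return 0;
    }

The caller (iommufd) initializes the bitmap with iommu_dirty_bitmap_init()
and syncs the gather once the whole IOVA range has been walked.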


* [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add an io_pagetable kernel API to toggle dirty tracking:

* iopt_set_dirty_tracking(iopt, [domain], state)

It receives either NULL (which means all domains) or an
iommu_domain. The intended caller of this is the hw_pagetable
object that is created on device attach, which passes an
iommu_domain. For now, the all-domains case is left for vfio-compat.

The hw protection domain dirty control is favored over the IOVA-range
alternative. For the latter, it iterates over all IOVA areas and calls
the iommu domain op to enable/disable it for each range.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 71 +++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  3 ++
 2 files changed, 74 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f9f3b06946bf..f4609ef369e0 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -276,6 +276,77 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
 	return 0;
 }
 
+static int __set_dirty_tracking_range_locked(struct iommu_domain *domain,
+					     struct io_pagetable *iopt,
+					     bool enable)
+{
+	const struct iommu_domain_ops *ops = domain->ops;
+	struct iommu_iotlb_gather gather;
+	struct iopt_area *area;
+	int ret = -EOPNOTSUPP;
+	unsigned long iova;
+	size_t size;
+
+	iommu_iotlb_gather_init(&gather);
+
+	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
+	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
+		iova = iopt_area_iova(area);
+		size = iopt_area_last_iova(area) - iova;
+
+		if (ops->set_dirty_tracking_range) {
+			ret = ops->set_dirty_tracking_range(domain, iova,
+							    size, &gather,
+							    enable);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	iommu_iotlb_sync(domain, &gather);
+
+	return ret;
+}
+
+static int iommu_set_dirty_tracking(struct iommu_domain *domain,
+				    struct io_pagetable *iopt, bool enable)
+{
+	const struct iommu_domain_ops *ops = domain->ops;
+	int ret = -EOPNOTSUPP;
+
+	if (ops->set_dirty_tracking)
+		ret = ops->set_dirty_tracking(domain, enable);
+	else if (ops->set_dirty_tracking_range)
+		ret = __set_dirty_tracking_range_locked(domain, iopt,
+							enable);
+
+	return ret;
+}
+
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+			    struct iommu_domain *domain, bool enable)
+{
+	struct iommu_domain *dom;
+	unsigned long index;
+	int ret = -EOPNOTSUPP;
+
+	down_write(&iopt->iova_rwsem);
+	if (!domain) {
+		down_write(&iopt->domains_rwsem);
+		xa_for_each(&iopt->domains, index, dom) {
+			ret = iommu_set_dirty_tracking(dom, iopt, enable);
+			if (ret < 0)
+				break;
+		}
+		up_write(&iopt->domains_rwsem);
+	} else {
+		ret = iommu_set_dirty_tracking(domain, iopt, enable);
+	}
+
+	up_write(&iopt->iova_rwsem);
+	return ret;
+}
+
 struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 				  unsigned long *start_byte,
 				  unsigned long length)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index f55654278ac4..d00ef3b785c5 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
+int iopt_set_dirty_tracking(struct io_pagetable *iopt,
+			    struct iommu_domain *domain, bool enable);
+
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long npages, struct page **out_pages, bool write);
 void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
-- 
2.17.2
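
As a quick usage illustration (a sketch only; where exactly the iopt and
domain pointers come from is glossed over here), the hw_pagetable attach
path passes its own iommu_domain, while vfio-compat can pass NULL to toggle
every domain in the io_pagetable:

    /* Enable dirty tracking on the domain backing one hw_pagetable. */
    ret = iopt_set_dirty_tracking(iopt, domain, true);

    /* vfio-compat style: disable tracking across all attached domains. */
    ret = iopt_set_dirty_tracking(iopt, NULL, false);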


* [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add an IO pagetable API, iopt_read_and_clear_dirty_data(), that
performs the reading of dirty IOPTEs for a given IOVA range and
then copies it back to userspace from each area-internal bitmap.

Underneath it uses the equivalent IOMMU API, which reads the
dirty bits, as well as atomically clearing the IOPTE dirty bit
and flushing the IOTLB at the end. The dirty bitmaps pass an
iotlb_gather to allow batching the dirty-bit updates.

Most of the complexity, though, is in the handling of the user
bitmaps to avoid copies back and forth. The bitmap user addresses
need to be iterated through, pinned and then passed
into the iommu core. The amount of bitmap data passed at a time for a
read_and_clear_dirty() is 1 page worth of pinned base page
pointers. That equates to 16M bits, or rather 64G of IOVA space that
can be reported as 'dirtied' (see the arithmetic below). The IOTLB is
flushed at the end of the whole scanned IOVA range, to defer as much
as possible the potential DMA performance penalty.
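
For reference, the arithmetic behind the 64G figure, assuming 4K base pages
and 8-byte page pointers:

    /*
     * 1 pinned page of page pointers = PAGE_SIZE / sizeof(struct page *)
     *                                = 4096 / 8            = 512 bitmap pages
     * 512 bitmap pages               = 512 * 4096 * 8 bits = 16M bits
     * 16M bits, one per 4K IO page   = 16M * 4K            = 64G of IOVA space
     */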

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 169 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  44 ++++++
 2 files changed, 213 insertions(+)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f4609ef369e0..835b5040fce9 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -14,6 +14,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
+#include <uapi/linux/iommufd.h>
 
 #include "io_pagetable.h"
 
@@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 	return ret;
 }
 
+int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
+			    struct iommufd_dirty_data *bitmap)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+	unsigned long bitmap_len;
+
+	bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
+
+	import_single_range(WRITE, bitmap->data, bitmap_len,
+			    &iter->bitmap_iov, &iter->bitmap_iter);
+	iter->iova = bitmap->iova;
+
+	/* Can record up to 64G at a time */
+	dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
+
+	return !dirty->pages ? -ENOMEM : 0;
+}
+
+void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+	if (dirty->pages) {
+		free_page((unsigned long) dirty->pages);
+		dirty->pages = NULL;
+	}
+}
+
+bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
+{
+	return iov_iter_count(&iter->bitmap_iter) > 0;
+}
+
+static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
+{
+	unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
+
+	left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
+
+	return left;
+}
+
+unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
+{
+	unsigned long left = iommufd_dirty_iter_bytes(iter);
+
+	return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
+}
+
+unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
+{
+	unsigned long skip = iter->bitmap_iter.iov_offset;
+
+	return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
+}
+
+void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
+{
+	iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
+}
+
+void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+
+	if (dirty->npages)
+		unpin_user_pages(dirty->pages, dirty->npages);
+}
+
+int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
+{
+	struct iommu_dirty_bitmap *dirty = &iter->dirty;
+	unsigned long npages;
+	unsigned long ret;
+	void *addr;
+
+	addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
+	npages = iov_iter_npages(&iter->bitmap_iter,
+				 PAGE_SIZE / sizeof(struct page *));
+
+	ret = pin_user_pages_fast((unsigned long) addr, npages,
+				  FOLL_WRITE, dirty->pages);
+	if (ret <= 0)
+		return -EINVAL;
+
+	dirty->npages = ret;
+	dirty->iova = iommufd_dirty_iova(iter);
+	dirty->start_offset = offset_in_page(addr);
+	return 0;
+}
+
+static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
+				      struct iommufd_dirty_data *bitmap)
+{
+	const struct iommu_domain_ops *ops = domain->ops;
+	struct iommu_iotlb_gather gather;
+	struct iommufd_dirty_iter iter;
+	int ret = 0;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -EOPNOTSUPP;
+
+	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
+				__ffs(bitmap->page_size), &gather);
+	ret = iommufd_dirty_iter_init(&iter, bitmap);
+	if (ret)
+		return -ENOMEM;
+
+	for (; iommufd_dirty_iter_done(&iter);
+	     iommufd_dirty_iter_advance(&iter)) {
+		ret = iommufd_dirty_iter_get(&iter);
+		if (ret)
+			break;
+
+		ret = ops->read_and_clear_dirty(domain,
+			iommufd_dirty_iova(&iter),
+			iommufd_dirty_iova_length(&iter), &iter.dirty);
+
+		iommufd_dirty_iter_put(&iter);
+
+		if (ret)
+			break;
+	}
+
+	iommu_iotlb_sync(domain, &gather);
+	iommufd_dirty_iter_free(&iter);
+
+	return ret;
+}
+
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   struct iommufd_dirty_data *bitmap)
+{
+	unsigned long iova, length, iova_end;
+	struct iommu_domain *dom;
+	struct iopt_area *area;
+	unsigned long index;
+	int ret = -EOPNOTSUPP;
+
+	iova = bitmap->iova;
+	length = bitmap->length - 1;
+	if (check_add_overflow(iova, length, &iova_end))
+		return -EOVERFLOW;
+
+	down_read(&iopt->iova_rwsem);
+	area = iopt_find_exact_area(iopt, iova, iova_end);
+	if (!area) {
+		up_read(&iopt->iova_rwsem);
+		return -ENOENT;
+	}
+
+	if (!domain) {
+		down_read(&iopt->domains_rwsem);
+		xa_for_each(&iopt->domains, index, dom) {
+			ret = iommu_read_and_clear_dirty(dom, bitmap);
+			if (ret)
+				break;
+		}
+		up_read(&iopt->domains_rwsem);
+	} else {
+		ret = iommu_read_and_clear_dirty(domain, bitmap);
+	}
+
+	up_read(&iopt->iova_rwsem);
+	return ret;
+}
+
 struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 				  unsigned long *start_byte,
 				  unsigned long length)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index d00ef3b785c5..4c12b4a8f1a6 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -8,6 +8,8 @@
 #include <linux/xarray.h>
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
+#include <linux/iommu.h>
+#include <linux/uio.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -49,8 +51,50 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 		    unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
+struct iommufd_dirty_data {
+	unsigned long iova;
+	unsigned long length;
+	unsigned long page_size;
+	unsigned long *data;
+};
+
 int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 			    struct iommu_domain *domain, bool enable);
+int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
+				   struct iommu_domain *domain,
+				   struct iommufd_dirty_data *bitmap);
+
+struct iommufd_dirty_iter {
+	struct iommu_dirty_bitmap dirty;
+	struct iovec bitmap_iov;
+	struct iov_iter bitmap_iter;
+	unsigned long iova;
+};
+
+void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter);
+int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter);
+int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
+			    struct iommufd_dirty_data *bitmap);
+void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter);
+bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter);
+void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter);
+unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter);
+unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter);
+static inline unsigned long dirty_bitmap_bytes(unsigned long nr_pages)
+{
+	return (ALIGN(nr_pages, BITS_PER_TYPE(u64)) / BITS_PER_BYTE);
+}
+
+/*
+ * Input argument of number of bits to bitmap_set() is unsigned integer, which
+ * further casts to signed integer for unaligned multi-bit operation,
+ * __bitmap_set().
+ * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
+ * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
+ * system.
+ */
+#define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
+#define DIRTY_BITMAP_SIZE_MAX   dirty_bitmap_bytes(DIRTY_BITMAP_PAGES_MAX)
 
 int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
 		      unsigned long npages, struct page **out_pages, bool write);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Today, the dirty state is lost on unmap and the page would not be
migrated to the destination, potentially leading the guest into
error.

Add an unmap API that reads the dirty bit and sets it in the
user-passed bitmap. This unmap iommu API tackles a potentially racy
update to the dirty bit *when* DMA is done to an IOVA that is being
unmapped at the same time.

The new unmap_read_dirty/unmap_pages_read_dirty ops do not replace
the existing unmap ops; they are only used when explicitly called
with dirty bitmap data passed in.

It could be argued that such a guest is buggy and that, rather than a
special unmap path tackling this theoretical race, it would suffice
to fetch the dirty bits (with GET_DIRTY_IOVA) and then unmap the IOVA.
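
For illustration, a minimal caller-side sketch in kernel context; the
helper name unmap_collect_dirty() is hypothetical and not part of the
patch, only iommu_unmap() and iommu_unmap_read_dirty() are real:

	/*
	 * Sketch only (kernel context, linux/iommu.h): unmap an IOVA
	 * range, optionally collecting the dirty state of the pages
	 * being torn down. With dirty == NULL this degenerates to the
	 * plain iommu_unmap().
	 */
	static size_t unmap_collect_dirty(struct iommu_domain *domain,
					  unsigned long iova, size_t size,
					  struct iommu_dirty_bitmap *dirty)
	{
		if (dirty)
			return iommu_unmap_read_dirty(domain, iova, size, dirty);

		return iommu_unmap(domain, iova, size);
	}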

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommu.c      | 43 +++++++++++++++++++++++++++++++-------
 include/linux/io-pgtable.h | 10 +++++++++
 include/linux/iommu.h      | 12 +++++++++++
 3 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d18b9ddbcce4..cc04263709ee 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2289,12 +2289,25 @@ EXPORT_SYMBOL_GPL(iommu_map_atomic);
 
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
 				  unsigned long iova, size_t size,
-				  struct iommu_iotlb_gather *iotlb_gather)
+				  struct iommu_iotlb_gather *iotlb_gather,
+				  struct iommu_dirty_bitmap *dirty)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
 	size_t pgsize, count;
 
 	pgsize = iommu_pgsize(domain, iova, iova, size, &count);
+
+	if (dirty) {
+		if (!ops->unmap_read_dirty && !ops->unmap_pages_read_dirty)
+			return 0;
+
+		return ops->unmap_pages_read_dirty ?
+		       ops->unmap_pages_read_dirty(domain, iova, pgsize,
+						   count, iotlb_gather, dirty) :
+		       ops->unmap_read_dirty(domain, iova, pgsize,
+					     iotlb_gather, dirty);
+	}
+
 	return ops->unmap_pages ?
 	       ops->unmap_pages(domain, iova, pgsize, count, iotlb_gather) :
 	       ops->unmap(domain, iova, pgsize, iotlb_gather);
@@ -2302,7 +2315,8 @@ static size_t __iommu_unmap_pages(struct iommu_domain *domain,
 
 static size_t __iommu_unmap(struct iommu_domain *domain,
 			    unsigned long iova, size_t size,
-			    struct iommu_iotlb_gather *iotlb_gather)
+			    struct iommu_iotlb_gather *iotlb_gather,
+			    struct iommu_dirty_bitmap *dirty)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
 	size_t unmapped_page, unmapped = 0;
@@ -2337,9 +2351,8 @@ static size_t __iommu_unmap(struct iommu_domain *domain,
 	 * or we hit an area that isn't mapped.
 	 */
 	while (unmapped < size) {
-		unmapped_page = __iommu_unmap_pages(domain, iova,
-						    size - unmapped,
-						    iotlb_gather);
+		unmapped_page = __iommu_unmap_pages(domain, iova, size - unmapped,
+						    iotlb_gather, dirty);
 		if (!unmapped_page)
 			break;
 
@@ -2361,18 +2374,34 @@ size_t iommu_unmap(struct iommu_domain *domain,
 	size_t ret;
 
 	iommu_iotlb_gather_init(&iotlb_gather);
-	ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
+	ret = __iommu_unmap(domain, iova, size, &iotlb_gather, NULL);
 	iommu_iotlb_sync(domain, &iotlb_gather);
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
+size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
+			      unsigned long iova, size_t size,
+			      struct iommu_dirty_bitmap *dirty)
+{
+	struct iommu_iotlb_gather iotlb_gather;
+	size_t ret;
+
+	iommu_iotlb_gather_init(&iotlb_gather);
+	ret = __iommu_unmap(domain, iova, size, &iotlb_gather, dirty);
+	iommu_iotlb_sync(domain, &iotlb_gather);
+
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_unmap_read_dirty);
+
 size_t iommu_unmap_fast(struct iommu_domain *domain,
 			unsigned long iova, size_t size,
 			struct iommu_iotlb_gather *iotlb_gather)
 {
-	return __iommu_unmap(domain, iova, size, iotlb_gather);
+	return __iommu_unmap(domain, iova, size, iotlb_gather, NULL);
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 82b39925c21f..c2ebfe037f5d 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -171,6 +171,16 @@ struct io_pgtable_ops {
 	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
 				    unsigned long iova, size_t size,
 				    struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_read_dirty)(struct io_pgtable_ops *ops,
+				   unsigned long iova,
+				   size_t size,
+				   struct iommu_iotlb_gather *gather,
+				   struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_pages_read_dirty)(struct io_pgtable_ops *ops,
+					 unsigned long iova,
+					 size_t pgsize, size_t pgcount,
+					 struct iommu_iotlb_gather *gather,
+					 struct iommu_dirty_bitmap *dirty);
 };
 
 /**
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ca076365d77b..7c66b4e00556 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -340,6 +340,15 @@ struct iommu_domain_ops {
 	int (*read_and_clear_dirty)(struct iommu_domain *domain,
 				    unsigned long iova, size_t size,
 				    struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_read_dirty)(struct iommu_domain *domain,
+				   unsigned long iova, size_t size,
+				   struct iommu_iotlb_gather *iotlb_gather,
+				   struct iommu_dirty_bitmap *dirty);
+	size_t (*unmap_pages_read_dirty)(struct iommu_domain *domain,
+					 unsigned long iova,
+					 size_t pgsize, size_t pgcount,
+					 struct iommu_iotlb_gather *iotlb_gather,
+					 struct iommu_dirty_bitmap *dirty);
 };
 
 /**
@@ -463,6 +472,9 @@ extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
 			    phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
+extern size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
+				     unsigned long iova, size_t size,
+				     struct iommu_dirty_bitmap *dirty);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
 			       unsigned long iova, size_t size,
 			       struct iommu_iotlb_gather *iotlb_gather);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add an argument to the kAPI that unmaps an IOVA from the attached
domains, so that it can also receive a bitmap.

When an iommufd_dirty_data bitmap is passed, the special dirty unmap
(iommu_unmap_read_dirty()) is called instead. The bitmap data is
iterated in IOVA chunks, similarly to read_and_clear_dirty(), using
the previously added iommufd_dirty_iter* helper functions.
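
For illustration, a minimal kernel-context sketch of driving the new
argument; the wrapper unmap_and_collect() and its parameters are
hypothetical and the 4K page size is an assumption, only
iopt_unmap_iova() and struct iommufd_dirty_data come from this
series:

	/*
	 * Sketch only: @iopt, @iova, @length and @udata are placeholders
	 * supplied by the caller; @udata is the userspace address of a
	 * precisely sized bitmap, or NULL to keep the previous
	 * plain-unmap behaviour.
	 */
	static int unmap_and_collect(struct io_pagetable *iopt,
				     unsigned long iova, unsigned long length,
				     unsigned long *udata)
	{
		struct iommufd_dirty_data dirty = {
			.iova = iova,
			.length = length,
			.page_size = SZ_4K,	/* assumes 4K IOMMU pgsize */
			.data = udata,
		};

		return iopt_unmap_iova(iopt, iova, length,
				       udata ? &dirty : NULL);
	}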

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/io_pagetable.c    | 13 ++--
 drivers/iommu/iommufd/io_pagetable.h    |  3 +-
 drivers/iommu/iommufd/ioas.c            |  2 +-
 drivers/iommu/iommufd/iommufd_private.h |  4 +-
 drivers/iommu/iommufd/pages.c           | 79 +++++++++++++++++++++----
 drivers/iommu/iommufd/vfio_compat.c     |  2 +-
 6 files changed, 80 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 835b5040fce9..6f4117c629d4 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -542,13 +542,14 @@ struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
 }
 
 static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
-			     struct iopt_pages *pages)
+			     struct iopt_pages *pages,
+			     struct iommufd_dirty_data *bitmap)
 {
 	/* Drivers have to unpin on notification. */
 	if (WARN_ON(atomic_read(&area->num_users)))
 		return -EBUSY;
 
-	iopt_area_unfill_domains(area, pages);
+	iopt_area_unfill_domains(area, pages, bitmap);
 	WARN_ON(atomic_read(&area->num_users));
 	iopt_abort_area(area);
 	iopt_put_pages(pages);
@@ -560,12 +561,13 @@ static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,
  * @iopt: io_pagetable to act on
  * @iova: Starting iova to unmap
  * @length: Number of bytes to unmap
+ * @bitmap: Bitmap of dirtied IOVAs
  *
  * The requested range must exactly match an existing range.
  * Splitting/truncating IOVA mappings is not allowed.
  */
 int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-		    unsigned long length)
+		    unsigned long length, struct iommufd_dirty_data *bitmap)
 {
 	struct iopt_pages *pages;
 	struct iopt_area *area;
@@ -590,7 +592,8 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
 	area->pages = NULL;
 	up_write(&iopt->iova_rwsem);
 
-	rc = __iopt_unmap_iova(iopt, area, pages);
+	rc = __iopt_unmap_iova(iopt, area, pages, bitmap);
+
 	up_read(&iopt->domains_rwsem);
 	return rc;
 }
@@ -614,7 +617,7 @@ int iopt_unmap_all(struct io_pagetable *iopt)
 		area->pages = NULL;
 		up_write(&iopt->iova_rwsem);
 
-		rc = __iopt_unmap_iova(iopt, area, pages);
+		rc = __iopt_unmap_iova(iopt, area, pages, NULL);
 		if (rc)
 			goto out_unlock_domains;
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index c8b6a60ff24c..c8baab25ab08 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -48,7 +48,8 @@ struct iopt_area {
 };
 
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
-void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages,
+			      struct iommufd_dirty_data *bitmap);
 
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 48149988c84b..19d6591aa005 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,7 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 			rc = -EOVERFLOW;
 			goto out_put;
 		}
-		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length);
+		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
 	}
 
 out_put:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 4c12b4a8f1a6..3e3a97f623a1 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -47,8 +47,6 @@ int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
 int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
 		   unsigned long *dst_iova, unsigned long start_byte,
 		   unsigned long length, int iommu_prot, unsigned int flags);
-int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-		    unsigned long length);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
 struct iommufd_dirty_data {
@@ -63,6 +61,8 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
 				   struct iommu_domain *domain,
 				   struct iommufd_dirty_data *bitmap);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+		    unsigned long length, struct iommufd_dirty_data *bitmap);
 
 struct iommufd_dirty_iter {
 	struct iommu_dirty_bitmap dirty;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 3fd39e0201f5..722c77cbbe3a 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -144,16 +144,64 @@ static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
+					  unsigned long iova, size_t size,
+					  struct iommufd_dirty_data *bitmap,
+					  struct iommufd_dirty_iter *iter)
+{
+	size_t ret = 0;
+
+	ret = iommufd_dirty_iter_init(iter, bitmap);
+	WARN_ON(ret);
+
+	for (; iommufd_dirty_iter_done(iter);
+	     iommufd_dirty_iter_advance(iter)) {
+		ret = iommufd_dirty_iter_get(iter);
+		if (ret < 0)
+			break;
+
+		ret = iommu_unmap_read_dirty(domain,
+			iommufd_dirty_iova(iter),
+			iommufd_dirty_iova_length(iter), &iter->dirty);
+
+		iommufd_dirty_iter_put(iter);
+
+		/*
+		 * It is a logic error in this code or a driver bug
+		 * if the IOMMU unmaps something other than exactly
+		 * as requested.
+		 */
+		if (ret != size) {
+			WARN_ONCE(1, "unmapped %ld instead of %ld", ret, size);
+			break;
+		}
+	}
+
+	iommufd_dirty_iter_free(iter);
+}
+
 static void iopt_area_unmap_domain_range(struct iopt_area *area,
 					 struct iommu_domain *domain,
 					 unsigned long start_index,
-					 unsigned long last_index)
+					 unsigned long last_index,
+					 struct iommufd_dirty_data *bitmap)
 {
 	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
 
-	iommu_unmap_nofail(domain, start_iova,
-			   iopt_area_index_to_iova_last(area, last_index) -
-				   start_iova + 1);
+	if (bitmap) {
+		struct iommufd_dirty_iter iter;
+
+		iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
+					__ffs(bitmap->page_size), NULL);
+
+		iommu_unmap_read_dirty_nofail(domain, start_iova,
+			iopt_area_index_to_iova_last(area, last_index) -
+					   start_iova + 1, bitmap, &iter);
+	} else {
+		iommu_unmap_nofail(domain, start_iova,
+				   iopt_area_index_to_iova_last(area, last_index) -
+					   start_iova + 1);
+	}
 }
 
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
@@ -808,7 +856,8 @@ static bool interval_tree_fully_covers_area(struct rb_root_cached *root,
 static void __iopt_area_unfill_domain(struct iopt_area *area,
 				      struct iopt_pages *pages,
 				      struct iommu_domain *domain,
-				      unsigned long last_index)
+				      unsigned long last_index,
+				      struct iommufd_dirty_data *bitmap)
 {
 	unsigned long unmapped_index = iopt_area_index(area);
 	unsigned long cur_index = unmapped_index;
@@ -821,7 +870,8 @@ static void __iopt_area_unfill_domain(struct iopt_area *area,
 	if (interval_tree_fully_covers_area(&pages->domains_itree, area) ||
 	    interval_tree_fully_covers_area(&pages->users_itree, area)) {
 		iopt_area_unmap_domain_range(area, domain,
-					     iopt_area_index(area), last_index);
+					     iopt_area_index(area),
+					     last_index, bitmap);
 		return;
 	}
 
@@ -837,7 +887,7 @@ static void __iopt_area_unfill_domain(struct iopt_area *area,
 		batch_from_domain(&batch, domain, area, cur_index, last_index);
 		cur_index += batch.total_pfns;
 		iopt_area_unmap_domain_range(area, domain, unmapped_index,
-					     cur_index - 1);
+					     cur_index - 1, bitmap);
 		unmapped_index = cur_index;
 		iopt_pages_unpin(pages, &batch, batch_index, cur_index - 1);
 		batch_clear(&batch);
@@ -852,7 +902,8 @@ static void iopt_area_unfill_partial_domain(struct iopt_area *area,
 					    unsigned long end_index)
 {
 	if (end_index != iopt_area_index(area))
-		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+		__iopt_area_unfill_domain(area, pages, domain,
+					  end_index - 1, NULL);
 }
 
 /**
@@ -891,7 +942,7 @@ void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
 			     struct iommu_domain *domain)
 {
 	__iopt_area_unfill_domain(area, pages, domain,
-				  iopt_area_last_index(area));
+				  iopt_area_last_index(area), NULL);
 }
 
 /**
@@ -1004,7 +1055,7 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 			if (end_index != iopt_area_index(area))
 				iopt_area_unmap_domain_range(
 					area, domain, iopt_area_index(area),
-					end_index - 1);
+					end_index - 1, NULL);
 		} else {
 			iopt_area_unfill_partial_domain(area, pages, domain,
 							end_index);
@@ -1025,7 +1076,8 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
  * Called during area destruction. This unmaps the iova's covered by all the
  * area's domains and releases the PFNs.
  */
-void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages,
+			      struct iommufd_dirty_data *bitmap)
 {
 	struct io_pagetable *iopt = area->iopt;
 	struct iommu_domain *domain;
@@ -1041,10 +1093,11 @@ void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
 		if (domain != area->storage_domain)
 			iopt_area_unmap_domain_range(
 				area, domain, iopt_area_index(area),
-				iopt_area_last_index(area));
+				iopt_area_last_index(area), bitmap);
 
 	interval_tree_remove(&area->pages_node, &pages->domains_itree);
-	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	__iopt_area_unfill_domain(area, pages, area->storage_domain,
+				  iopt_area_last_index(area), bitmap);
 	area->storage_domain = NULL;
 out_unlock:
 	mutex_unlock(&pages->mutex);
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index 5b196de00ff9..dbe39404a105 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -148,7 +148,7 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
 		rc = iopt_unmap_all(&ioas->iopt);
 	else
-		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size);
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size, NULL);
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 06/19] iommufd: Dirty tracking IOCTLs for the hw_pagetable
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Every IOMMU driver should be able to implement the needed
iommu domain ops to perform dirty tracking.

Connect a hw_pagetable to the IOMMU core dirty tracking ops. This
exposes all of the functionality for the UAPI:

- Enable/disable dirty tracking on an IOMMU domain (hw_pagetable id)
- Read the dirtied IOVAs (which clears the IOPTE dirty bits under the hood)
- Unmap and get the dirtied IOVAs

In doing so the previously internal iommufd_dirty_data structure is
moved over as the UAPI intermediate structure for representing iommufd
dirty bitmaps.

Contrary to past incarnations, the IOVA range to be scanned or
unmapped is tied to the bitmap size, and thus puts the burden on the
application to make sure it passes a precisely sized bitmap address,
as opposed to allowing base_iova != iova; this simplifies things
further.
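
For illustration, a rough sketch of the intended userspace flow. The
struct, field and flag names below are taken from the kernel-side
handlers in this patch, but the exact UAPI layout (including the
assumed size-prefix convention and field types) lives in the
include/uapi/linux/iommufd.h hunk, so treat this strictly as a guess
rather than the definitive usage:

	#include <sys/ioctl.h>
	#include <linux/iommufd.h>	/* as extended by this patch */

	/* Hypothetical helper: enable tracking, then read one range. */
	static int track_and_read_dirty(int iommufd, __u32 hwpt_id,
					__u64 iova, __u64 length,
					unsigned long *bitmap_buf)
	{
		struct iommu_hwpt_set_dirty set = {
			.size = sizeof(set),	/* assumed size-prefixed */
			.hwpt_id = hwpt_id,
			.flags = IOMMU_DIRTY_TRACKING_ENABLED,
		};
		struct iommu_hwpt_get_dirty_iova get = {
			.size = sizeof(get),	/* assumed size-prefixed */
			.hwpt_id = hwpt_id,
			.bitmap = {
				.iova = iova,		/* must match an area exactly */
				.length = length,
				.page_size = 4096,	/* smallest supported pgsize */
				.data = bitmap_buf,	/* precisely sized bitmap */
			},
		};

		if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &set))
			return -1;

		/* reads and clears the IOPTE dirty bits for the range */
		return ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get);
	}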

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 79 +++++++++++++++++++++++++
 drivers/iommu/iommufd/ioas.c            | 33 +++++++++++
 drivers/iommu/iommufd/iommufd_private.h | 22 ++++---
 drivers/iommu/iommufd/main.c            |  9 +++
 include/uapi/linux/iommufd.h            | 78 ++++++++++++++++++++++++
 5 files changed, 214 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index bafd7d07918b..943bcc3898a4 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
  */
 #include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
 
 #include "iommufd_private.h"
 
@@ -140,3 +141,81 @@ void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
 	}
 	iommufd_object_destroy_user(ictx, &hwpt->obj);
 }
+
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_set_dirty *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = -EOPNOTSUPP;
+	bool enable;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	ioas = hwpt->ioas;
+	enable = cmd->flags & IOMMU_DIRTY_TRACKING_ENABLED;
+
+	rc = iopt_set_dirty_tracking(&ioas->iopt, hwpt->domain, enable);
+
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+			     struct iommufd_dirty_data *bitmap)
+{
+	unsigned long pgshift, npages;
+	size_t iommu_pgsize;
+	int rc = -EINVAL;
+	u64 bitmap_size;
+
+	pgshift = __ffs(bitmap->page_size);
+	npages = bitmap->length >> pgshift;
+	bitmap_size = dirty_bitmap_bytes(npages);
+
+	if (!npages || (bitmap_size > DIRTY_BITMAP_SIZE_MAX))
+		return rc;
+
+	if (!access_ok((void __user *) bitmap->data, bitmap_size))
+		return rc;
+
+	iommu_pgsize = 1 << __ffs(ioas->iopt.iova_alignment);
+
+	/* allow only smallest supported pgsize */
+	if (bitmap->page_size != iommu_pgsize)
+		return rc;
+
+	if (bitmap->iova & (iommu_pgsize - 1))
+		return rc;
+
+	if (!bitmap->length || bitmap->length & (iommu_pgsize - 1))
+		return rc;
+
+	return 0;
+}
+
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hwpt_get_dirty_iova *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct iommufd_ioas *ioas;
+	int rc = -EOPNOTSUPP;
+
+	hwpt = iommufd_get_hwpt(ucmd, cmd->hwpt_id);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	ioas = hwpt->ioas;
+	rc = iommufd_check_iova_range(ioas, &cmd->bitmap);
+	if (rc)
+		goto out_put;
+
+	rc = iopt_read_and_clear_dirty_data(&ioas->iopt, hwpt->domain,
+					    &cmd->bitmap);
+
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 19d6591aa005..50bef46bc0bb 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -243,6 +243,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 			rc = -EOVERFLOW;
 			goto out_put;
 		}
+
 		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length, NULL);
 	}
 
@@ -250,3 +251,35 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
+
+int iommufd_ioas_unmap_dirty(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_ioas_unmap_dirty *cmd = ucmd->cmd;
+	struct iommufd_dirty_data *bitmap;
+	struct iommufd_ioas *ioas;
+	int rc;
+
+	ioas = iommufd_get_ioas(ucmd, cmd->ioas_id);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	/* The bitmaps would be gigantic */
+	bitmap = &cmd->bitmap;
+	if (bitmap->iova == 0 && bitmap->length == U64_MAX)
+		return -EINVAL;
+
+	if (bitmap->iova >= ULONG_MAX || bitmap->length >= ULONG_MAX) {
+		rc = -EOVERFLOW;
+		goto out_put;
+	}
+
+	rc = iommufd_check_iova_range(ioas, bitmap);
+	if (rc)
+		goto out_put;
+
+	rc = iopt_unmap_iova(&ioas->iopt, bitmap->iova, bitmap->length, bitmap);
+
+out_put:
+	iommufd_put_object(&ioas->obj);
+	return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3e3a97f623a1..68c77cf4793f 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -10,6 +10,7 @@
 #include <linux/uaccess.h>
 #include <linux/iommu.h>
 #include <linux/uio.h>
+#include <uapi/linux/iommufd.h>
 
 struct iommu_domain;
 struct iommu_group;
@@ -49,13 +50,6 @@ int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
 		   unsigned long length, int iommu_prot, unsigned int flags);
 int iopt_unmap_all(struct io_pagetable *iopt);
 
-struct iommufd_dirty_data {
-	unsigned long iova;
-	unsigned long length;
-	unsigned long page_size;
-	unsigned long *data;
-};
-
 int iopt_set_dirty_tracking(struct io_pagetable *iopt,
 			    struct iommu_domain *domain, bool enable);
 int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
@@ -244,7 +238,10 @@ int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
 int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_unmap_dirty(struct iommufd_ucmd *ucmd);
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
+int iommufd_check_iova_range(struct iommufd_ioas *ioas,
+			     struct iommufd_dirty_data *bitmap);
 
 /*
  * A HW pagetable is called an iommu_domain inside the kernel. This user object
@@ -263,6 +260,17 @@ struct iommufd_hw_pagetable {
 	struct list_head devices;
 };
 
+static inline struct iommufd_hw_pagetable *iommufd_get_hwpt(
+					struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_HW_PAGETABLE),
+			    struct iommufd_hw_pagetable, obj);
+}
+int iommufd_hwpt_set_dirty(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_get_dirty_iova(struct iommufd_ucmd *ucmd);
+int iommufd_hwpt_unmap_dirty(struct iommufd_ucmd *ucmd);
+
 struct iommufd_hw_pagetable *
 iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
 			     struct device *dev);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 0e34426eec9f..4785fc9f4fb3 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -192,7 +192,10 @@ union ucmd_buffer {
 	struct iommu_ioas_iova_ranges iova_ranges;
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
+	struct iommu_ioas_unmap_dirty unmap_dirty;
 	struct iommu_destroy destroy;
+	struct iommu_hwpt_set_dirty set_dirty;
+	struct iommu_hwpt_get_dirty_iova get_dirty_iova;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -226,8 +229,14 @@ static struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
 		 length),
+	IOCTL_OP(IOMMU_IOAS_UNMAP_DIRTY, iommufd_ioas_unmap_dirty,
+		 struct iommu_ioas_unmap_dirty, bitmap.data),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
+	IOCTL_OP(IOMMU_HWPT_SET_DIRTY, iommufd_hwpt_set_dirty,
+		 struct iommu_hwpt_set_dirty, __reserved),
+	IOCTL_OP(IOMMU_HWPT_GET_DIRTY_IOVA, iommufd_hwpt_get_dirty_iova,
+		 struct iommu_hwpt_get_dirty_iova, bitmap.data),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 2c0f5ced4173..01c5da7a1ab7 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -43,6 +43,9 @@ enum {
 	IOMMUFD_CMD_IOAS_COPY,
 	IOMMUFD_CMD_IOAS_UNMAP,
 	IOMMUFD_CMD_VFIO_IOAS,
+	IOMMUFD_CMD_HWPT_SET_DIRTY,
+	IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA,
+	IOMMUFD_CMD_IOAS_UNMAP_DIRTY,
 };
 
 /**
@@ -220,4 +223,79 @@ struct iommu_vfio_ioas {
 	__u16 __reserved;
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+
+/**
+ * enum iommufd_set_dirty_flags - Flags for steering dirty tracking
+ * @IOMMU_DIRTY_TRACKING_DISABLED: Disables dirty tracking
+ * @IOMMU_DIRTY_TRACKING_ENABLED: Enables dirty tracking
+ */
+enum iommufd_set_dirty_flags {
+	IOMMU_DIRTY_TRACKING_DISABLED = 0,
+	IOMMU_DIRTY_TRACKING_ENABLED = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)
+ * @size: sizeof(struct iommu_hwpt_set_dirty)
+ * @flags: Flags to control dirty tracking status.
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ *
+ * Toggle dirty tracking on an HW pagetable.
+ */
+struct iommu_hwpt_set_dirty {
+	__u32 size;
+	__u32 flags;
+	__u32 hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
+
+/**
+ * struct iommufd_dirty_data - Dirty IOVA tracking bitmap
+ * @iova: base IOVA of the bitmap
+ * @length: length of the IOVA range covered by the bitmap
+ * @page_size: page size granularity of each bit in the bitmap
+ * @data: bitmap where to set the dirty bits. Each bit in the bitmap
+ * represents one page_size unit of IOVA, counted from the base @iova.
+ * Checking whether a given IOVA is dirty, with bit = (iova - @iova) / page_size:
+ *
+ *  data[bit / 64] & (1ULL << (bit % 64))
+ */
+struct iommufd_dirty_data {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 *data;
+};
+
+/**
+ * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
+ * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
+ * @bitmap: Bitmap of the range of IOVA to read out
+ */
+struct iommu_hwpt_get_dirty_iova {
+	__u32 size;
+	__u32 hwpt_id;
+	struct iommufd_dirty_data bitmap;
+};
+#define IOMMU_HWPT_GET_DIRTY_IOVA _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA)
+
+/**
+ * struct iommu_ioas_unmap_dirty - ioctl(IOMMU_IOAS_UNMAP_DIRTY)
+ * @size: sizeof(struct iommu_ioas_unmap_dirty)
+ * @ioas_id: IOAS ID to unmap the mapping of
+ * @bitmap: Dirty bitmap of the IOVA range to unmap
+ *
+ * Unmap an IOVA range and return a bitmap of the dirty bits.
+ * The iova/length must exactly match a range used with
+ * IOMMU_IOAS_PAGETABLE_MAP. Unlike IOMMU_IOAS_UNMAP, the 0 & U64_MAX
+ * unmap-all form is rejected here, as the bitmap would be unbounded.
+ */
+struct iommu_ioas_unmap_dirty {
+	__u32 size;
+	__u32 ioas_id;
+	struct iommufd_dirty_data bitmap;
+};
+#define IOMMU_IOAS_UNMAP_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP_DIRTY)
+
 #endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add the corresponding APIs for performing VFIO dirty tracking,
particularly the VFIO_IOMMU_DIRTY_PAGES ioctl sub-commands:
* VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocate
				     the @dirty_bitmap area
* VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and free
				    the @dirty_bitmap area
* VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch the dirty bitmap while
					  dirty tracking is active.

Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION capability, reporting
the domain's supported page size as iopt::iova_alignment and the
maximum dirty bitmap size as the same limit VFIO uses. Compared to the
VFIO type1 iommu, perpetual dirtying is not implemented; userspace gets
-EOPNOTSUPP for it, which today's userspace already handles.

Move the iommufd_get_pagesizes() definition ahead of the unmap path so
that the iommufd_vfio_unmap_dma() dirty support can validate the user
bitmap page size against the IOPT page size.
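
For context, this mirrors how existing VFIO userspace (e.g. QEMU)
drives the type1 dirty tracking UAPI that the compat layer now
services. A rough sketch of the start + get-bitmap sequence, where
container_fd, iova, size, pgsize and bitmap_buf are placeholders:

  struct vfio_iommu_type1_dirty_bitmap start = {
          .argsz = sizeof(start),
          .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
  };
  struct vfio_iommu_type1_dirty_bitmap *get;
  struct vfio_iommu_type1_dirty_bitmap_get *range;
  size_t argsz = sizeof(*get) + sizeof(*range);

  /* Start dirty tracking on the (compat) container */
  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &start);

  /* Fetch the bitmap for [iova, iova + size) at the smallest pgsize */
  get = calloc(1, argsz);
  range = (struct vfio_iommu_type1_dirty_bitmap_get *)get->data;
  get->argsz = argsz;
  get->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
  range->iova = iova;
  range->size = size;
  range->bitmap.pgsize = pgsize;
  range->bitmap.size = ((size / pgsize) + 7) / 8;
  range->bitmap.data = (__u64 *)bitmap_buf;
  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, get);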

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
 1 file changed, 209 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index dbe39404a105..2802f49cc10d 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -56,6 +56,16 @@ create_compat_ioas(struct iommufd_ctx *ictx)
 	return ioas;
 }
 
+static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
+{
+	/* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
+	 * the high bits too, and we need to decide if we should report that
+	 * iommufd supports less than PAGE_SIZE alignment or stick to strict
+	 * compatibility. qemu only cares about the first set bit.
+	 */
+	return ioas->iopt.iova_alignment;
+}
+
 int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd)
 {
 	struct iommu_vfio_ioas *cmd = ucmd->cmd;
@@ -130,9 +140,14 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
 				  void __user *arg)
 {
 	size_t minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
-	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL;
+	u32 supported_flags = VFIO_DMA_UNMAP_FLAG_ALL |
+		VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+	struct iommufd_dirty_data dirty, *dirtyp = NULL;
 	struct vfio_iommu_type1_dma_unmap unmap;
+	struct vfio_bitmap bitmap;
 	struct iommufd_ioas *ioas;
+	unsigned long pgshift;
+	size_t pgsize;
 	int rc;
 
 	if (copy_from_user(&unmap, arg, minsz))
@@ -141,14 +156,53 @@ static int iommufd_vfio_unmap_dma(struct iommufd_ctx *ictx, unsigned int cmd,
 	if (unmap.argsz < minsz || unmap.flags & ~supported_flags)
 		return -EINVAL;
 
+	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
+		unsigned long npages;
+
+		if (copy_from_user(&bitmap,
+				   (void __user *)(arg + minsz),
+				   sizeof(bitmap)))
+			return -EFAULT;
+
+		if (!access_ok((void __user *)bitmap.data, bitmap.size))
+			return -EINVAL;
+
+		pgshift = __ffs(bitmap.pgsize);
+		npages = unmap.size >> pgshift;
+
+		if (!npages || !bitmap.size ||
+		    (bitmap.size > DIRTY_BITMAP_SIZE_MAX) ||
+		    (bitmap.size < dirty_bitmap_bytes(npages)))
+			return -EINVAL;
+
+		dirty.iova = unmap.iova;
+		dirty.length = unmap.size;
+		dirty.data = bitmap.data;
+		dirty.page_size = 1 << pgshift;
+		dirtyp = &dirty;
+	}
+
 	ioas = get_compat_ioas(ictx);
 	if (IS_ERR(ioas))
 		return PTR_ERR(ioas);
 
+	pgshift = __ffs(iommufd_get_pagesizes(ioas));
+	pgsize = (size_t)1 << pgshift;
+
+	/* When dirty tracking is enabled, allow only min supported pgsize */
+	if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+	    (bitmap.pgsize != pgsize)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_ALL)
 		rc = iopt_unmap_all(&ioas->iopt);
 	else
-		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size, NULL);
+		rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size,
+				     dirtyp);
+
+out_put:
 	iommufd_put_object(&ioas->obj);
 	return rc;
 }
@@ -222,16 +276,6 @@ static int iommufd_vfio_set_iommu(struct iommufd_ctx *ictx, unsigned long type)
 	return 0;
 }
 
-static u64 iommufd_get_pagesizes(struct iommufd_ioas *ioas)
-{
-	/* FIXME: See vfio_update_pgsize_bitmap(), for compat this should return
-	 * the high bits too, and we need to decide if we should report that
-	 * iommufd supports less than PAGE_SIZE alignment or stick to strict
-	 * compatibility. qemu only cares about the first set bit.
-	 */
-	return ioas->iopt.iova_alignment;
-}
-
 static int iommufd_fill_cap_iova(struct iommufd_ioas *ioas,
 				 struct vfio_info_cap_header __user *cur,
 				 size_t avail)
@@ -289,6 +333,26 @@ static int iommufd_fill_cap_dma_avail(struct iommufd_ioas *ioas,
 	return sizeof(cap_dma);
 }
 
+static int iommufd_fill_cap_migration(struct iommufd_ioas *ioas,
+				      struct vfio_info_cap_header __user *cur,
+				      size_t avail)
+{
+	struct vfio_iommu_type1_info_cap_migration cap_mig = {
+		.header = {
+			.id = VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION,
+			.version = 1,
+		},
+		.flags = 0,
+		.pgsize_bitmap = (size_t) 1 << __ffs(iommufd_get_pagesizes(ioas)),
+		.max_dirty_bitmap_size = DIRTY_BITMAP_SIZE_MAX,
+	};
+
+	if (avail >= sizeof(cap_mig) &&
+	    copy_to_user(cur, &cap_mig, sizeof(cap_mig)))
+		return -EFAULT;
+	return sizeof(cap_mig);
+}
+
 static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
 				       void __user *arg)
 {
@@ -298,6 +362,7 @@ static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
 	static const fill_cap_fn fill_fns[] = {
 		iommufd_fill_cap_iova,
 		iommufd_fill_cap_dma_avail,
+		iommufd_fill_cap_migration,
 	};
 	size_t minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
 	struct vfio_info_cap_header __user *last_cap = NULL;
@@ -364,6 +429,137 @@ static int iommufd_vfio_iommu_get_info(struct iommufd_ctx *ictx,
 	return rc;
 }
 
+static int iommufd_vfio_dirty_pages_start(struct iommufd_ctx *ictx,
+				struct vfio_iommu_type1_dirty_bitmap *dirty)
+{
+	struct iommufd_ioas *ioas;
+	int ret = -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	ret = iopt_set_dirty_tracking(&ioas->iopt, NULL, true);
+
+	iommufd_put_object(&ioas->obj);
+
+	return ret;
+}
+
+static int iommufd_vfio_dirty_pages_stop(struct iommufd_ctx *ictx,
+				struct vfio_iommu_type1_dirty_bitmap *dirty)
+{
+	struct iommufd_ioas *ioas;
+	int ret;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	ret = iopt_set_dirty_tracking(&ioas->iopt, NULL, false);
+
+	iommufd_put_object(&ioas->obj);
+
+	return ret;
+}
+
+static int iommufd_vfio_dirty_pages_get_bitmap(struct iommufd_ctx *ictx,
+				struct vfio_iommu_type1_dirty_bitmap_get *range)
+{
+	struct iommufd_dirty_data bitmap;
+	uint64_t npages, bitmap_size;
+	struct iommufd_ioas *ioas;
+	unsigned long pgshift;
+	size_t iommu_pgsize;
+	int ret = -EINVAL;
+
+	ioas = get_compat_ioas(ictx);
+	if (IS_ERR(ioas))
+		return PTR_ERR(ioas);
+
+	down_read(&ioas->iopt.iova_rwsem);
+	pgshift = __ffs(range->bitmap.pgsize);
+	npages = range->size >> pgshift;
+	bitmap_size = range->bitmap.size;
+
+	if (!npages || !bitmap_size || (bitmap_size > DIRTY_BITMAP_SIZE_MAX) ||
+	    (bitmap_size < dirty_bitmap_bytes(npages)))
+		goto out_put;
+
+	iommu_pgsize = 1 << __ffs(iommufd_get_pagesizes(ioas));
+
+	/* allow only smallest supported pgsize */
+	if (range->bitmap.pgsize != iommu_pgsize)
+		goto out_put;
+
+	if (range->iova & (iommu_pgsize - 1))
+		goto out_put;
+
+	if (!range->size || range->size & (iommu_pgsize - 1))
+		goto out_put;
+
+	bitmap.iova = range->iova;
+	bitmap.length = range->size;
+	bitmap.data = range->bitmap.data;
+	bitmap.page_size = 1 << pgshift;
+
+	ret = iopt_read_and_clear_dirty_data(&ioas->iopt, NULL, &bitmap);
+
+out_put:
+	up_read(&ioas->iopt.iova_rwsem);
+	iommufd_put_object(&ioas->obj);
+	return ret;
+}
+
+static int iommufd_vfio_dirty_pages(struct iommufd_ctx *ictx, unsigned int cmd,
+				    void __user *arg)
+{
+	size_t minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap, flags);
+	struct vfio_iommu_type1_dirty_bitmap dirty;
+	u32 supported_flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
+			VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
+			VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+	int ret = 0;
+
+	if (copy_from_user(&dirty, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (dirty.argsz < minsz || dirty.flags & ~supported_flags)
+		return -EINVAL;
+
+	/* only one flag should be set at a time */
+	if (__ffs(dirty.flags) != __fls(dirty.flags))
+		return -EINVAL;
+
+	if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
+		ret = iommufd_vfio_dirty_pages_start(ictx, &dirty);
+	} else if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
+		ret = iommufd_vfio_dirty_pages_stop(ictx, &dirty);
+	} else if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
+		struct vfio_iommu_type1_dirty_bitmap_get range;
+		size_t data_size = dirty.argsz - minsz;
+
+		if (!data_size || data_size < sizeof(range))
+			return -EINVAL;
+
+		if (copy_from_user(&range, (void __user *)(arg + minsz),
+				   sizeof(range)))
+			return -EFAULT;
+
+		if (range.iova + range.size < range.iova)
+			return -EINVAL;
+
+		if (!access_ok((void __user *)range.bitmap.data,
+			       range.bitmap.size))
+			return -EINVAL;
+
+		ret = iommufd_vfio_dirty_pages_get_bitmap(ictx, &range);
+	}
+
+	return ret;
+}
+
+
 /* FIXME TODO:
 PowerPC SPAPR only:
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
@@ -394,6 +590,7 @@ int iommufd_vfio_ioctl(struct iommufd_ctx *ictx, unsigned int cmd,
 	case VFIO_IOMMU_UNMAP_DMA:
 		return iommufd_vfio_unmap_dma(ictx, cmd, uarg);
 	case VFIO_IOMMU_DIRTY_PAGES:
+		return iommufd_vfio_dirty_pages(ictx, cmd, uarg);
 	default:
 		return -ENOIOCTLCMD;
 	}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Add a new test ioctl for simulating dirty IOVAs in the mock domain,
and implement the mock iommu domain ops that provide dirty tracking
support.

The selftest exercises the main workflow of:

1) Setting/Clearing dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
3) Unmap and read dirty back
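
Checking the results in steps 2) and 3) boils down to mapping an IOVA
back to its bit in the user bitmap, relative to the base IOVA the
bitmap was requested with. A small helper sketch (iova_is_dirty() is
illustrative, not part of the patch):

  static bool iova_is_dirty(__u64 *bitmap, __u64 base_iova, __u64 iova,
                            __u64 page_size)
  {
          __u64 bit = (iova - base_iova) / page_size;

          return bitmap[bit / 64] & (1ULL << (bit % 64));
  }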

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iommufd_test.h    |   9 ++
 drivers/iommu/iommufd/selftest.c        | 137 +++++++++++++++++++++++-
 tools/testing/selftests/iommu/Makefile  |   1 +
 tools/testing/selftests/iommu/iommufd.c | 135 +++++++++++++++++++++++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index d22ef484af1a..90dafa513078 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -14,6 +14,7 @@ enum {
 	IOMMU_TEST_OP_MD_CHECK_REFS,
 	IOMMU_TEST_OP_ACCESS_PAGES,
 	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+	IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -57,6 +58,14 @@ struct iommu_test_cmd {
 		struct {
 			__u32 limit;
 		} memory_limit;
+		struct {
+			__u32 flags;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 page_size;
+			__aligned_u64 uptr;
+			__aligned_u64 out_nr_dirty;
+		} dirty;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a665719b493e..b02309722436 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+	MOCK_DIRTY_TRACK = 1,
 	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
 	/*
@@ -25,9 +26,11 @@ enum {
 	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
 	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
 	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 struct mock_iommu_domain {
+	unsigned long flags;
 	struct iommu_domain domain;
 	struct xarray pfns;
 };
@@ -133,7 +136,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 
 		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
 			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-			WARN_ON(!ent);
+
 			/*
 			 * iommufd generates unmaps that must be a strict
 			 * superset of the map's performend So every starting
@@ -143,12 +146,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 			 * passed to map_pages
 			 */
 			if (first) {
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_START_IOVA));
 				first = false;
 			}
 			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_LAST_IOVA));
 
 			iova += MOCK_IO_PAGE_SIZE;
@@ -171,6 +174,75 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
 	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = mock->flags;
+
+	/* No change? */
+	if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+		return -EINVAL;
+
+	flags = (enable ?
+		 flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+	mock->flags = flags;
+	return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+	void *ent, *old;
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK))
+		return -EINVAL;
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+		ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
+		if (ent &&
+		    (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)) {
+			unsigned long val;
+
+			/* Clear dirty */
+			val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			iommu_dirty_bitmap_record(dirty, cur, MOCK_IO_PAGE_SIZE);
+		}
+	}
+
+	return 0;
+}
+
+static size_t mock_domain_unmap_read_dirty(struct iommu_domain *domain,
+					   unsigned long iova, size_t page_size,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(page_size != MOCK_IO_PAGE_SIZE);
+
+	ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	if (ent && (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA) &&
+	    (mock->flags & MOCK_DIRTY_TRACK))
+		iommu_dirty_bitmap_record(dirty, iova, page_size);
+
+	return ent ? page_size : 0;
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -181,6 +253,9 @@ static const struct iommu_ops mock_ops = {
 			.map_pages = mock_domain_map_pages,
 			.unmap_pages = mock_domain_unmap_pages,
 			.iova_to_phys = mock_domain_iova_to_phys,
+			.set_dirty_tracking = mock_domain_set_dirty_tracking,
+			.read_and_clear_dirty = mock_domain_read_and_clear_dirty,
+			.unmap_read_dirty = mock_domain_unmap_read_dirty,
 		},
 };
 
@@ -442,6 +517,56 @@ static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
 	return rc;
 }
 
+static int iommufd_test_dirty(struct iommufd_ucmd *ucmd,
+			      unsigned int mockpt_id, unsigned long iova,
+			      size_t length, unsigned long page_size,
+			      void __user *uptr, u32 flags)
+{
+	unsigned long i, max = length / page_size;
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc, count = 0;
+
+	if (iova % page_size || length % page_size ||
+	    (uintptr_t)uptr % page_size)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * page_size;
+		void *ent, *old;
+
+		if (!test_bit(i, (unsigned long *) uptr))
+			continue;
+
+		ent = xa_load(&mock->pfns, cur / page_size);
+		if (ent) {
+			unsigned long val;
+
+			val = xa_to_value(ent) | MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / page_size,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			count++;
+		}
+	}
+
+	cmd->dirty.out_nr_dirty = count;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 void iommufd_selftest_destroy(struct iommufd_object *obj)
 {
 	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
@@ -486,6 +611,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 			cmd->access_pages.length,
 			u64_to_user_ptr(cmd->access_pages.uptr),
 			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_DIRTY:
+		return iommufd_test_dirty(
+			ucmd, cmd->id, cmd->dirty.iova,
+			cmd->dirty.length, cmd->dirty.page_size,
+			u64_to_user_ptr(cmd->dirty.uptr),
+			cmd->dirty.flags);
 	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
 		iommufd_test_memory_limit = cmd->memory_limit.limit;
 		return 0;
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
index 7bc38b3beaeb..48d4dcf11506 100644
--- a/tools/testing/selftests/iommu/Makefile
+++ b/tools/testing/selftests/iommu/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../tools/include/
 CFLAGS += -I../../../../include/uapi/
 CFLAGS += -I../../../../include/
 
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed94..3a494f7958f4 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -13,13 +13,18 @@
 #define __EXPORTED_HEADERS__
 #include <linux/iommufd.h>
 #include <linux/vfio.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 #include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+#define BITS_PER_BYTE 8
 
 static void *buffer;
+static void *bitmap;
 
 static unsigned long PAGE_SIZE;
 static unsigned long HUGEPAGE_SIZE;
 static unsigned long BUFFER_SIZE;
+static unsigned long BITMAP_SIZE;
 
 #define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
 
@@ -52,6 +57,10 @@ static __attribute__((constructor)) void setup_sizes(void)
 	BUFFER_SIZE = PAGE_SIZE * 16;
 	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
 	assert(rc || buffer || (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+
+	BITMAP_SIZE = BUFFER_SIZE / MOCK_PAGE_SIZE / BITS_PER_BYTE;
+	rc = posix_memalign(&bitmap, PAGE_SIZE, BUFFER_SIZE);
+	assert(rc || bitmap || (uintptr_t)bitmap % PAGE_SIZE == 0);
 }
 
 /*
@@ -546,6 +555,132 @@ TEST_F(iommufd_ioas, iova_ranges)
 	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
 }
 
+TEST_F(iommufd_ioas, dirty)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_hwpt_set_dirty set_dirty_cmd = {
+		.size = sizeof(set_dirty_cmd),
+		.flags = IOMMU_DIRTY_TRACKING_ENABLED,
+		.hwpt_id = self->ioas_id,
+	};
+	struct iommu_test_cmd dirty_cmd = {
+		.size = sizeof(dirty_cmd),
+		.op = IOMMU_TEST_OP_DIRTY,
+		.id = self->ioas_id,
+		.dirty = { .iova = MOCK_APERTURE_START,
+			   .length = BUFFER_SIZE,
+			   .page_size = MOCK_PAGE_SIZE,
+			   .uptr = (uintptr_t)bitmap },
+	};
+	struct iommu_hwpt_get_dirty_iova get_dirty_cmd = {
+		.size = sizeof(get_dirty_cmd),
+		.hwpt_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		}
+	};
+	struct iommu_ioas_unmap_dirty unmap_dirty_cmd = {
+		.size = sizeof(unmap_dirty_cmd),
+		.ioas_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		},
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+	unsigned long i, count, nbits = BITMAP_SIZE * BITS_PER_BYTE;
+
+	/* Toggle dirty with a domain and a single map */
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	set_dirty_cmd.hwpt_id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+
+	dirty_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	get_dirty_cmd.hwpt_id = mock_cmd.id;
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* Should be all zeroes */
+	for (i = 0; i < nbits; i++)
+		ASSERT_EQ(0, test_bit(i, (unsigned long *) bitmap));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_IOAS_UNMAP_DIRTY, &unmap_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	set_dirty_cmd.flags = IOMMU_DIRTY_TRACKING_DISABLED;
+	ASSERT_EQ(0,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	destroy_cmd.id = mock_cmd.mock_domain.device_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
 TEST_F(iommufd_ioas, access)
 {
 	struct iommu_ioas_map map_cmd = {
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls
@ 2022-04-28 21:09   ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Add a new test ioctl for simulating dirty IOVAs in the mock domain,
and implement the mock iommu domain ops that provide dirty tracking
support.

The selftest exercises the main workflow of:

1) Setting/Clearing dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
3) Unmap and read dirty back

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/iommufd/iommufd_test.h    |   9 ++
 drivers/iommu/iommufd/selftest.c        | 137 +++++++++++++++++++++++-
 tools/testing/selftests/iommu/Makefile  |   1 +
 tools/testing/selftests/iommu/iommufd.c | 135 +++++++++++++++++++++++
 4 files changed, 279 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index d22ef484af1a..90dafa513078 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -14,6 +14,7 @@ enum {
 	IOMMU_TEST_OP_MD_CHECK_REFS,
 	IOMMU_TEST_OP_ACCESS_PAGES,
 	IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT,
+	IOMMU_TEST_OP_DIRTY,
 };
 
 enum {
@@ -57,6 +58,14 @@ struct iommu_test_cmd {
 		struct {
 			__u32 limit;
 		} memory_limit;
+		struct {
+			__u32 flags;
+			__aligned_u64 iova;
+			__aligned_u64 length;
+			__aligned_u64 page_size;
+			__aligned_u64 uptr;
+			__aligned_u64 out_nr_dirty;
+		} dirty;
 	};
 	__u32 last;
 };
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a665719b493e..b02309722436 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
 size_t iommufd_test_memory_limit = 65536;
 
 enum {
+	MOCK_DIRTY_TRACK = 1,
 	MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,
 
 	/*
@@ -25,9 +26,11 @@ enum {
 	_MOCK_PFN_START = MOCK_PFN_MASK + 1,
 	MOCK_PFN_START_IOVA = _MOCK_PFN_START,
 	MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,
+	MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,
 };
 
 struct mock_iommu_domain {
+	unsigned long flags;
 	struct iommu_domain domain;
 	struct xarray pfns;
 };
@@ -133,7 +136,7 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 
 		for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {
 			ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
-			WARN_ON(!ent);
+
 			/*
 			 * iommufd generates unmaps that must be a strict
 			 * superset of the map's performend So every starting
@@ -143,12 +146,12 @@ static size_t mock_domain_unmap_pages(struct iommu_domain *domain,
 			 * passed to map_pages
 			 */
 			if (first) {
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_START_IOVA));
 				first = false;
 			}
 			if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)
-				WARN_ON(!(xa_to_value(ent) &
+				WARN_ON(ent && !(xa_to_value(ent) &
 					  MOCK_PFN_LAST_IOVA));
 
 			iova += MOCK_IO_PAGE_SIZE;
@@ -171,6 +174,75 @@ static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,
 	return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;
 }
 
+static int mock_domain_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long flags = mock->flags;
+
+	/* No change? */
+	if (!(enable ^ !!(flags & MOCK_DIRTY_TRACK)))
+		return -EINVAL;
+
+	flags = (enable ?
+		 flags | MOCK_DIRTY_TRACK : flags & ~MOCK_DIRTY_TRACK);
+
+	mock->flags = flags;
+	return 0;
+}
+
+static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	unsigned long i, max = size / MOCK_IO_PAGE_SIZE;
+	void *ent, *old;
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK))
+		return -EINVAL;
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * MOCK_IO_PAGE_SIZE;
+
+		ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);
+		if (ent &&
+		    (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA)) {
+			unsigned long val;
+
+			/* Clear dirty */
+			val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			iommu_dirty_bitmap_record(dirty, cur, MOCK_IO_PAGE_SIZE);
+		}
+	}
+
+	return 0;
+}
+
+static size_t mock_domain_unmap_read_dirty(struct iommu_domain *domain,
+					   unsigned long iova, size_t page_size,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
+{
+	struct mock_iommu_domain *mock =
+		container_of(domain, struct mock_iommu_domain, domain);
+	void *ent;
+
+	WARN_ON(page_size != MOCK_IO_PAGE_SIZE);
+
+	ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);
+	if (ent && (xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA) &&
+	    (mock->flags & MOCK_DIRTY_TRACK))
+		iommu_dirty_bitmap_record(dirty, iova, page_size);
+
+	return ent ? page_size : 0;
+}
+
 static const struct iommu_ops mock_ops = {
 	.owner = THIS_MODULE,
 	.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
@@ -181,6 +253,9 @@ static const struct iommu_ops mock_ops = {
 			.map_pages = mock_domain_map_pages,
 			.unmap_pages = mock_domain_unmap_pages,
 			.iova_to_phys = mock_domain_iova_to_phys,
+			.set_dirty_tracking = mock_domain_set_dirty_tracking,
+			.read_and_clear_dirty = mock_domain_read_and_clear_dirty,
+			.unmap_read_dirty = mock_domain_unmap_read_dirty,
 		},
 };
 
@@ -442,6 +517,56 @@ static int iommufd_test_access_pages(struct iommufd_ucmd *ucmd,
 	return rc;
 }
 
+static int iommufd_test_dirty(struct iommufd_ucmd *ucmd,
+			      unsigned int mockpt_id, unsigned long iova,
+			      size_t length, unsigned long page_size,
+			      void __user *uptr, u32 flags)
+{
+	unsigned long i, max = length / page_size;
+	struct iommu_test_cmd *cmd = ucmd->cmd;
+	struct iommufd_hw_pagetable *hwpt;
+	struct mock_iommu_domain *mock;
+	int rc, count = 0;
+
+	if (iova % page_size || length % page_size ||
+	    (uintptr_t)uptr % page_size)
+		return -EINVAL;
+
+	hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);
+	if (IS_ERR(hwpt))
+		return PTR_ERR(hwpt);
+
+	if (!(mock->flags & MOCK_DIRTY_TRACK)) {
+		rc = -EINVAL;
+		goto out_put;
+	}
+
+	for (i = 0; i < max; i++) {
+		unsigned long cur = iova + i * page_size;
+		void *ent, *old;
+
+		if (!test_bit(i, (unsigned long *) uptr))
+			continue;
+
+		ent = xa_load(&mock->pfns, cur / page_size);
+		if (ent) {
+			unsigned long val;
+
+			val = xa_to_value(ent) | MOCK_PFN_DIRTY_IOVA;
+			old = xa_store(&mock->pfns, cur / page_size,
+				       xa_mk_value(val), GFP_KERNEL);
+			WARN_ON_ONCE(ent != old);
+			count++;
+		}
+	}
+
+	cmd->dirty.out_nr_dirty = count;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+	iommufd_put_object(&hwpt->obj);
+	return rc;
+}
+
 void iommufd_selftest_destroy(struct iommufd_object *obj)
 {
 	struct selftest_obj *sobj = container_of(obj, struct selftest_obj, obj);
@@ -486,6 +611,12 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
 			cmd->access_pages.length,
 			u64_to_user_ptr(cmd->access_pages.uptr),
 			cmd->access_pages.flags);
+	case IOMMU_TEST_OP_DIRTY:
+		return iommufd_test_dirty(
+			ucmd, cmd->id, cmd->dirty.iova,
+			cmd->dirty.length, cmd->dirty.page_size,
+			u64_to_user_ptr(cmd->dirty.uptr),
+			cmd->dirty.flags);
 	case IOMMU_TEST_OP_SET_TEMP_MEMORY_LIMIT:
 		iommufd_test_memory_limit = cmd->memory_limit.limit;
 		return 0;
diff --git a/tools/testing/selftests/iommu/Makefile b/tools/testing/selftests/iommu/Makefile
index 7bc38b3beaeb..48d4dcf11506 100644
--- a/tools/testing/selftests/iommu/Makefile
+++ b/tools/testing/selftests/iommu/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += -I../../../../tools/include/
 CFLAGS += -I../../../../include/uapi/
 CFLAGS += -I../../../../include/
 
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed94..3a494f7958f4 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -13,13 +13,18 @@
 #define __EXPORTED_HEADERS__
 #include <linux/iommufd.h>
 #include <linux/vfio.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 #include "../../../../drivers/iommu/iommufd/iommufd_test.h"
+#define BITS_PER_BYTE 8
 
 static void *buffer;
+static void *bitmap;
 
 static unsigned long PAGE_SIZE;
 static unsigned long HUGEPAGE_SIZE;
 static unsigned long BUFFER_SIZE;
+static unsigned long BITMAP_SIZE;
 
 #define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
 
@@ -52,6 +57,10 @@ static __attribute__((constructor)) void setup_sizes(void)
 	BUFFER_SIZE = PAGE_SIZE * 16;
 	rc = posix_memalign(&buffer, HUGEPAGE_SIZE, BUFFER_SIZE);
 	assert(rc || buffer || (uintptr_t)buffer % HUGEPAGE_SIZE == 0);
+
+	BITMAP_SIZE = BUFFER_SIZE / MOCK_PAGE_SIZE / BITS_PER_BYTE;
+	rc = posix_memalign(&bitmap, PAGE_SIZE, BITMAP_SIZE);
+	assert(rc || bitmap || (uintptr_t)bitmap % PAGE_SIZE == 0);
 }
 
 /*
@@ -546,6 +555,132 @@ TEST_F(iommufd_ioas, iova_ranges)
 	EXPECT_EQ(0, cmd->out_valid_iovas[1].last);
 }
 
+TEST_F(iommufd_ioas, dirty)
+{
+	struct iommu_ioas_map map_cmd = {
+		.size = sizeof(map_cmd),
+		.flags = IOMMU_IOAS_MAP_FIXED_IOVA,
+		.ioas_id = self->ioas_id,
+		.user_va = (uintptr_t)buffer,
+		.length = BUFFER_SIZE,
+		.iova = MOCK_APERTURE_START,
+	};
+	struct iommu_test_cmd mock_cmd = {
+		.size = sizeof(mock_cmd),
+		.op = IOMMU_TEST_OP_MOCK_DOMAIN,
+		.id = self->ioas_id,
+	};
+	struct iommu_hwpt_set_dirty set_dirty_cmd = {
+		.size = sizeof(set_dirty_cmd),
+		.flags = IOMMU_DIRTY_TRACKING_ENABLED,
+		.hwpt_id = self->ioas_id,
+	};
+	struct iommu_test_cmd dirty_cmd = {
+		.size = sizeof(dirty_cmd),
+		.op = IOMMU_TEST_OP_DIRTY,
+		.id = self->ioas_id,
+		.dirty = { .iova = MOCK_APERTURE_START,
+			   .length = BUFFER_SIZE,
+			   .page_size = MOCK_PAGE_SIZE,
+			   .uptr = (uintptr_t)bitmap },
+	};
+	struct iommu_hwpt_get_dirty_iova get_dirty_cmd = {
+		.size = sizeof(get_dirty_cmd),
+		.hwpt_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		}
+	};
+	struct iommu_ioas_unmap_dirty unmap_dirty_cmd = {
+		.size = sizeof(unmap_dirty_cmd),
+		.ioas_id = self->ioas_id,
+		.bitmap = {
+			.iova = MOCK_APERTURE_START,
+			.length = BUFFER_SIZE,
+			.page_size = MOCK_PAGE_SIZE,
+			.data = (__u64 *)bitmap,
+		},
+	};
+	struct iommu_destroy destroy_cmd = { .size = sizeof(destroy_cmd) };
+	unsigned long i, count, nbits = BITMAP_SIZE * BITS_PER_BYTE;
+
+	/* Toggle dirty with a domain and a single map */
+	ASSERT_EQ(0, ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_MOCK_DOMAIN),
+			   &mock_cmd));
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_MAP, &map_cmd));
+
+	set_dirty_cmd.hwpt_id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		  ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+
+	dirty_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	get_dirty_cmd.hwpt_id = mock_cmd.id;
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_cmd));
+
+	/* Should be all zeroes */
+	for (i = 0; i < nbits; i++)
+		ASSERT_EQ(0, test_bit(i, (unsigned long *) bitmap));
+
+	/* Mark all even bits as dirty in the mock domain */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		if (!(i % 2))
+			set_bit(i, (unsigned long *) bitmap);
+	ASSERT_EQ(count, BITMAP_SIZE * BITS_PER_BYTE / 2);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_DIRTY),
+			&dirty_cmd));
+	ASSERT_EQ(BITMAP_SIZE * BITS_PER_BYTE / 2,
+		  dirty_cmd.dirty.out_nr_dirty);
+
+	memset(bitmap, 0, BITMAP_SIZE);
+	ASSERT_EQ(0,
+		  ioctl(self->fd, IOMMU_IOAS_UNMAP_DIRTY, &unmap_dirty_cmd));
+
+	/* All even bits should be dirty */
+	for (count = 0, i = 0; i < nbits; count += !(i%2), i++)
+		ASSERT_EQ(!(i % 2), test_bit(i, (unsigned long *) bitmap));
+	ASSERT_EQ(count, dirty_cmd.dirty.out_nr_dirty);
+
+	set_dirty_cmd.flags = IOMMU_DIRTY_TRACKING_DISABLED;
+	ASSERT_EQ(0,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+	EXPECT_ERRNO(EINVAL,
+		     ioctl(self->fd, IOMMU_HWPT_SET_DIRTY, &set_dirty_cmd));
+
+	destroy_cmd.id = mock_cmd.mock_domain.device_id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+	destroy_cmd.id = mock_cmd.id;
+	ASSERT_EQ(0, ioctl(self->fd, IOMMU_DESTROY, &destroy_cmd));
+}
+
 TEST_F(iommufd_ioas, access)
 {
 	struct iommu_ioas_map map_cmd = {
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

The IOMMU advertises Access/Dirty bits if the extended feature register
reports it. Relevant AMD IOMMU SDM ref [0],
"1.3.8 Enhanced Support for Access and Dirty Bits".

To enable it, set the DTE flags in bits 7 and 8 to enable access, or
access+dirty. With that, the IOMMU starts marking the D and A flags on
every memory request or ATS translation request. It is up to the VMM to
steer whether dirty tracking is enabled or not, rather than the IOMMU
doing so unconditionally. Relevant AMD IOMMU SDM ref [0], "Table 7.
Device Table Entry (DTE) Field Definitions", particularly the entry
"HAD".

Toggling it on and off is relatively simple: it amounts to setting two
bits in the DTE and flushing the device DTE cache.

To read out what has been dirtied, use the existing AMD io-pgtable
support, walking the pagetables over each IOVA with fetch_pte(). The
IOTLB flushing is left to the caller (much like unmap), and
iommu_dirty_bitmap_record() is what adds the page ranges to invalidate.
This allows the caller to batch the flush over a big span of IOVA
space, without the IOMMU having to guess when to flush.
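
In rough pseudo-C, the expected calling pattern is the following (a
sketch only; how struct iommu_dirty_bitmap is wired to the user bitmap
and to the IOTLB gather is defined by the iommufd patches earlier in
the series):

	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;	/* refers to the user bitmap and @gather */

	iommu_iotlb_gather_init(&gather);

	/* each call records dirty IOVA ranges via iommu_dirty_bitmap_record()
	 * and accumulates the ranges that will need invalidation */
	domain->ops->read_and_clear_dirty(domain, iova, length, &dirty);

	/* one IOTLB flush for the whole harvested span */
	iommu_iotlb_sync(domain, &gather);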

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how the IOMMU hardware updates the dirty bit, and what
it expects from its subsequent clearing by the CPU, see:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

Quoting the SDM:

"The setting of accessed and dirty status bits in the page tables is
visible to both the CPU and the peripheral when sharing guest page
tables. The IOMMU interlocked operations to update A and D bits must be
64-bit operations and naturally aligned on a 64-bit boundary"

... and for the IOMMU update sequence of the Dirty bit, it essentially
states:

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write
intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the
atomic lock with no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

Access/Dirty bit readout also needs to consider the non-default page
sizes (aka replicated PTEs, as mentioned by the manual), as AMD
supports all power-of-two page sizes (except 512G) even though the
underlying IOTLB mappings are restricted to the same ones as supported
by the CPU (4K, 2M, 1G). It makes one wonder whether AMD_IOMMU_PGSIZES
ought to avoid advertising non-default page sizes at all when creating
an UNMANAGED domain, or when dirty tracking is being toggled on.
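
As a concrete example, a 32K mapping is encoded as eight replicated 4K
PTEs, so the readout has to OR the dirty state of all of them; roughly
(a sketch of the readout side of pte_test_and_clear_dirty() below):

	u64 *ptep = fetch_pte(pgtable, iova, &pgsize);	/* pgsize == SZ_32K */
	unsigned long i, count = PAGE_SIZE_PTE_COUNT(pgsize);	/* == 8 */
	bool dirty = false;

	for (i = 0; i < count; i++)
		dirty |= !!(READ_ONCE(ptep[i]) & IOMMU_PTE_HD);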

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/amd/amd_iommu.h       |  1 +
 drivers/iommu/amd/amd_iommu_types.h | 11 +++++
 drivers/iommu/amd/init.c            |  8 ++-
 drivers/iommu/amd/io_pgtable.c      | 56 +++++++++++++++++++++
 drivers/iommu/amd/iommu.c           | 77 +++++++++++++++++++++++++++++
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..2f16ad8f7514 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -34,6 +34,7 @@ extern int amd_iommu_reenable(int);
 extern int amd_iommu_enable_faulting(void);
 extern int amd_iommu_guest_ir;
 extern enum io_pgtable_fmt amd_iommu_pgtable;
+extern bool amd_iommu_had_support;
 
 /* IOMMUv2 specific functions */
 struct iommu_domain;
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 47108ed44fbb..c1eba8fce4bb 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -93,7 +93,9 @@
 #define FEATURE_HE		(1ULL<<8)
 #define FEATURE_PC		(1ULL<<9)
 #define FEATURE_GAM_VAPIC	(1ULL<<21)
+#define FEATURE_HASUP		(1ULL<<49)
 #define FEATURE_EPHSUP		(1ULL<<50)
+#define FEATURE_HDSUP		(1ULL<<52)
 #define FEATURE_SNP		(1ULL<<63)
 
 #define FEATURE_PASID_SHIFT	32
@@ -197,6 +199,7 @@
 /* macros and definitions for device table entries */
 #define DEV_ENTRY_VALID         0x00
 #define DEV_ENTRY_TRANSLATION   0x01
+#define DEV_ENTRY_HAD           0x07
 #define DEV_ENTRY_PPR           0x34
 #define DEV_ENTRY_IR            0x3d
 #define DEV_ENTRY_IW            0x3e
@@ -350,10 +353,16 @@
 #define PTE_LEVEL_PAGE_SIZE(level)			\
 	(1ULL << (12 + (9 * (level))))
 
+/*
+ * The IOPTE dirty bit
+ */
+#define IOMMU_PTE_HD_BIT (6)
+
 /*
  * Bit value definition for I/O PTE fields
  */
 #define IOMMU_PTE_PR (1ULL << 0)
+#define IOMMU_PTE_HD (1ULL << IOMMU_PTE_HD_BIT)
 #define IOMMU_PTE_U  (1ULL << 59)
 #define IOMMU_PTE_FC (1ULL << 60)
 #define IOMMU_PTE_IR (1ULL << 61)
@@ -364,6 +373,7 @@
  */
 #define DTE_FLAG_V  (1ULL << 0)
 #define DTE_FLAG_TV (1ULL << 1)
+#define DTE_FLAG_HAD (3ULL << 7)
 #define DTE_FLAG_IR (1ULL << 61)
 #define DTE_FLAG_IW (1ULL << 62)
 
@@ -390,6 +400,7 @@
 
 #define IOMMU_PAGE_MASK (((1ULL << 52) - 1) & ~0xfffULL)
 #define IOMMU_PTE_PRESENT(pte) ((pte) & IOMMU_PTE_PR)
+#define IOMMU_PTE_DIRTY(pte) ((pte) & IOMMU_PTE_HD)
 #define IOMMU_PTE_PAGE(pte) (iommu_phys_to_virt((pte) & IOMMU_PAGE_MASK))
 #define IOMMU_PTE_MODE(pte) (((pte) >> 9) & 0x07)
 
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b4a798c7b347..27f2cf61d0c6 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -149,6 +149,7 @@ struct ivmd_header {
 
 bool amd_iommu_dump;
 bool amd_iommu_irq_remap __read_mostly;
+bool amd_iommu_had_support __read_mostly;
 
 enum io_pgtable_fmt amd_iommu_pgtable = AMD_IOMMU_V1;
 
@@ -1986,8 +1987,13 @@ static int __init amd_iommu_init_pci(void)
 	for_each_iommu(iommu)
 		iommu_flush_all_caches(iommu);
 
-	if (!ret)
+	if (!ret) {
+		if (check_feature_on_all_iommus(FEATURE_HASUP) &&
+		    check_feature_on_all_iommus(FEATURE_HDSUP))
+			amd_iommu_had_support = true;
+
 		print_iommu_info();
+	}
 
 out:
 	return ret;
diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 6608d1717574..8325ef193093 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -478,6 +478,61 @@ static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned lo
 	return (__pte & ~offset_mask) | (iova & offset_mask);
 }
 
+static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size)
+{
+	bool dirty = false;
+	int i, count;
+
+	/*
+	 * 2.2.3.2 Host Dirty Support
+	 * When a non-default page size is used , software must OR the
+	 * Dirty bits in all of the replicated host PTEs used to map
+	 * the page. The IOMMU does not guarantee the Dirty bits are
+	 * set in all of the replicated PTEs. Any portion of the page
+	 * may have been written even if the Dirty bit is set in only
+	 * one of the replicated PTEs.
+	 */
+	count = PAGE_SIZE_PTE_COUNT(size);
+	for (i = 0; i < count; i++)
+		if (test_and_clear_bit(IOMMU_PTE_HD_BIT,
+					(unsigned long *) &ptep[i]))
+			dirty = true;
+
+	return dirty;
+}
+
+static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
+					 unsigned long iova, size_t size,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
+	unsigned long end = iova + size - 1;
+
+	do {
+		unsigned long pgsize = 0;
+		u64 *ptep, pte;
+
+		ptep = fetch_pte(pgtable, iova, &pgsize);
+		if (ptep)
+			pte = READ_ONCE(*ptep);
+		if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
+			pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
+			iova += pgsize;
+			continue;
+		}
+
+		/*
+		 * Mark the whole IOVA range as dirty even if only one of
+		 * the replicated PTEs were marked dirty.
+		 */
+		if (pte_test_and_clear_dirty(ptep, pgsize))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
 /*
  * ----------------------------------------------------
  */
@@ -519,6 +574,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	pgtable->iop.ops.map          = iommu_v1_map_page;
 	pgtable->iop.ops.unmap        = iommu_v1_unmap_page;
 	pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
+	pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
 
 	return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a1ada7bff44e..0a86392b2367 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2169,6 +2169,81 @@ static bool amd_iommu_capable(enum iommu_cap cap)
 	return false;
 }
 
+static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					bool enable)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct iommu_dev_data *dev_data;
+	bool dom_flush = false;
+
+	if (!amd_iommu_had_support)
+		return -EOPNOTSUPP;
+
+	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+		struct amd_iommu *iommu;
+		u64 pte_root;
+
+		iommu = amd_iommu_rlookup_table[dev_data->devid];
+		pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
+
+		/* No change? */
+		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
+			continue;
+
+		pte_root = (enable ?
+			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
+
+		/* Flush device DTE */
+		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
+		device_flush_dte(dev_data);
+		dom_flush = true;
+	}
+
+	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
+	if (dom_flush) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&pdomain->lock, flags);
+		amd_iommu_domain_flush_tlb_pde(pdomain);
+		amd_iommu_domain_flush_complete(pdomain);
+		spin_unlock_irqrestore(&pdomain->lock, flags);
+	}
+
+	return 0;
+}
+
+static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct iommu_dev_data *dev_data;
+	u64 dte;
+
+	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
+		dte = amd_iommu_dev_table[dev_data->devid].data[0];
+		if (!(dte & DTE_FLAG_HAD))
+			return false;
+	}
+
+	return true;
+}
+
+static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					  unsigned long iova, size_t size,
+					  struct iommu_dirty_bitmap *dirty)
+{
+	struct protection_domain *pdomain = to_pdomain(domain);
+	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
+
+	if (!amd_iommu_get_dirty_tracking(domain))
+		return -EOPNOTSUPP;
+
+	if (!ops || !ops->read_and_clear_dirty)
+		return -ENODEV;
+
+	return ops->read_and_clear_dirty(ops, iova, size, dirty);
+}
+
+
 static void amd_iommu_get_resv_regions(struct device *dev,
 				       struct list_head *head)
 {
@@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
 		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
 		.iotlb_sync	= amd_iommu_iotlb_sync,
 		.free		= amd_iommu_domain_free,
+		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
+		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
 	}
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

The AMD implementation of unmap_read_dirty() is pretty simple, as it
mostly reuses the unmap code with the extra addition of marshalling the
dirty bit into the bitmap as it walks the to-be-unmapped IOPTEs.

Extra care is taken, though, to switch over to cmpxchg as opposed to a
non-serialized store to the PTE, and to test the dirty bit only on the
value that cmpxchg actually cleared to 0.
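
That is, instead of a plain "pte[i] = 0" (which could lose a Dirty
update the IOMMU performs concurrently), the clearing is done along
these lines (a sketch of the intent, not the literal hunk below):

	u64 old = READ_ONCE(*pte);

	/* retry until we are the ones that cleared the PTE */
	while (cmpxchg64(pte, old, 0ULL) != old)
		old = READ_ONCE(*pte);

	dirty = IOMMU_PTE_DIRTY(old);	/* dirty state of the value we cleared */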

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/amd/io_pgtable.c | 44 +++++++++++++++++++++++++++++-----
 drivers/iommu/amd/iommu.c      | 22 +++++++++++++++++
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 8325ef193093..1868c3b58e6d 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist)
 	free_sub_pt(pt, mode, freelist);
 }
 
+static bool free_pte_dirty(u64 *pte, u64 pteval)
+{
+	bool dirty = false;
+
+	while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
+		dirty = true;
+
+	return dirty;
+}
+
 /*
  * Generic mapping functions. It maps a physical address into a DMA
  * address space. It allocates the page table pages if necessary.
@@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova,
 	return ret;
 }
 
-static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
-				      unsigned long iova,
-				      size_t size,
-				      struct iommu_iotlb_gather *gather)
+static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+					   unsigned long iova,
+					   size_t size,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
 {
 	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
 	unsigned long long unmapped;
@@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
 	while (unmapped < size) {
 		pte = fetch_pte(pgtable, iova, &unmap_size);
 		if (pte) {
-			int i, count;
+			unsigned long i, count;
+			bool pte_dirty = false;
 
 			count = PAGE_SIZE_PTE_COUNT(unmap_size);
 			for (i = 0; i < count; i++)
-				pte[i] = 0ULL;
+				pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
+
+			if (unlikely(pte_dirty && dirty))
+				iommu_dirty_bitmap_record(dirty, iova, unmap_size);
 		}
 
 		iova = (iova & ~(unmap_size - 1)) + unmap_size;
@@ -461,6 +476,22 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
 	return unmapped;
 }
 
+static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
+					 unsigned long iova,
+					 size_t size,
+					 struct iommu_iotlb_gather *gather)
+{
+	return __iommu_v1_unmap_page(ops, iova, size, gather, NULL);
+}
+
+static unsigned long iommu_v1_unmap_page_read_dirty(struct io_pgtable_ops *ops,
+				unsigned long iova, size_t size,
+				struct iommu_iotlb_gather *gather,
+				struct iommu_dirty_bitmap *dirty)
+{
+	return __iommu_v1_unmap_page(ops, iova, size, gather, dirty);
+}
+
 static phys_addr_t iommu_v1_iova_to_phys(struct io_pgtable_ops *ops, unsigned long iova)
 {
 	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
@@ -575,6 +606,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	pgtable->iop.ops.unmap        = iommu_v1_unmap_page;
 	pgtable->iop.ops.iova_to_phys = iommu_v1_iova_to_phys;
 	pgtable->iop.ops.read_and_clear_dirty = iommu_v1_read_and_clear_dirty;
+	pgtable->iop.ops.unmap_read_dirty = iommu_v1_unmap_page_read_dirty;
 
 	return &pgtable->iop;
 }
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 0a86392b2367..a8fcb6e9a684 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 	return r;
 }
 
+static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
+					 unsigned long iova, size_t page_size,
+					 struct iommu_iotlb_gather *gather,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct protection_domain *domain = to_pdomain(dom);
+	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+	size_t r;
+
+	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
+	    (domain->iop.mode == PAGE_MODE_NONE))
+		return 0;
+
+	r = (ops->unmap_read_dirty) ?
+		ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
+
+	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+	return r;
+}
+
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
 					  dma_addr_t iova)
 {
@@ -2370,6 +2391,7 @@ const struct iommu_ops amd_iommu_ops = {
 		.free		= amd_iommu_domain_free,
 		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
 		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
+		.unmap_read_dirty = amd_iommu_unmap_read_dirty,
 	}
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 11/19] iommu/amd: Print access/dirty bits if supported
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Print the feature, much like other kernel-supported features.

One can still probe its actual hw support via sysfs, regardless
of what the kernel does.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/amd/init.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 27f2cf61d0c6..c410d127eb58 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1936,6 +1936,10 @@ static void print_iommu_info(void)
 
 			if (iommu->features & FEATURE_GAM_VAPIC)
 				pr_cont(" GA_vAPIC");
+			if (iommu->features & FEATURE_HASUP)
+				pr_cont(" HASup");
+			if (iommu->features & FEATURE_HDSUP)
+				pr_cont(" HDSup");
 
 			pr_cont("\n");
 		}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 12/19] iommu/arm-smmu-v3: Add feature detection for HTTU
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

From: Jean-Philippe Brucker <jean-philippe@linaro.org>

If the SMMU supports it and the kernel was built with HTTU support,
probe for Hardware Translation Table Update (HTTU) support, which
essentially enables hardware updates of the access and dirty flags.

Probe and set the smmu::features bits for Hardware Dirty and Hardware
Access. This is in preparation for enabling it in the stage-1 format
context descriptors.
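
Illustrative only (the actual users come in follow-up patches of this
series): code that wants to rely on hardware dirty updates would then
gate on the new feature bit, e.g.:

	if (!(smmu->features & ARM_SMMU_FEAT_HD))
		return -EOPNOTSUPP;	/* no hardware Dirty (HTTU HD) support */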

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
[joaomart: Change commit message to reflect the underlying changes]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 ++++
 2 files changed, 37 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fd49282c03a3..14609ece4e33 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3424,6 +3424,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	return 0;
 }
 
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+	u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
+	u32 features = 0;
+
+	switch (FIELD_GET(IDR0_HTTU, reg)) {
+	case IDR0_HTTU_ACCESS_DIRTY:
+		features |= ARM_SMMU_FEAT_HD;
+		fallthrough;
+	case IDR0_HTTU_ACCESS:
+		features |= ARM_SMMU_FEAT_HA;
+	}
+
+	if (smmu->dev->of_node)
+		smmu->features |= features;
+	else if (features != fw_features)
+		/* ACPI IORT sets the HTTU bits */
+		dev_warn(smmu->dev,
+			 "IDR0.HTTU overridden by FW configuration (0x%x)\n",
+			 fw_features);
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
 	u32 reg;
@@ -3484,6 +3506,8 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 			smmu->features |= ARM_SMMU_FEAT_E2H;
 	}
 
+	arm_smmu_get_httu(smmu, reg);
+
 	/*
 	 * The coherency feature as set by FW is used in preference to the ID
 	 * register, but warn on mismatch.
@@ -3669,6 +3693,14 @@ static int arm_smmu_device_acpi_probe(struct platform_device *pdev,
 	if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
 		smmu->features |= ARM_SMMU_FEAT_COHERENCY;
 
+	switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+	case IDR0_HTTU_ACCESS_DIRTY:
+		smmu->features |= ARM_SMMU_FEAT_HD;
+		fallthrough;
+	case IDR0_HTTU_ACCESS:
+		smmu->features |= ARM_SMMU_FEAT_HA;
+	}
+
 	return 0;
 }
 #else
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cd48590ada30..1487a80fdf1b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,9 @@
 #define IDR0_ASID16			(1 << 12)
 #define IDR0_ATS			(1 << 10)
 #define IDR0_HYP			(1 << 9)
+#define IDR0_HTTU			GENMASK(7, 6)
+#define IDR0_HTTU_ACCESS		1
+#define IDR0_HTTU_ACCESS_DIRTY		2
 #define IDR0_COHACC			(1 << 4)
 #define IDR0_TTF			GENMASK(3, 2)
 #define IDR0_TTF_AARCH64		2
@@ -639,6 +642,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_BTM		(1 << 16)
 #define ARM_SMMU_FEAT_SVA		(1 << 17)
 #define ARM_SMMU_FEAT_E2H		(1 << 18)
+#define ARM_SMMU_FEAT_HA		(1 << 19)
+#define ARM_SMMU_FEAT_HD		(1 << 20)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

From: Kunkun Jiang <jiangkunkun@huawei.com>

Detect the BBML feature and, if the SMMU supports it, transfer the
BBMLx quirk to io-pgtable.

BBML1 still requires marking the PTE nT prior to performing a
translation table update, while BBML2 requires neither break-before-make
nor setting the PTE nT bit. Dirty tracking needs to clear the dirty bit,
so checking for BBML2 tells us whether that prerequisite is met. See the
SMMUv3.2 manual, sections "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)"
and "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)".

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[joaomart: massage commit message with the need to have BBML quirk
 and add the Quirk io-pgtable flags]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++++++
 include/linux/io-pgtable.h                  |  3 +++
 3 files changed, 28 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 14609ece4e33..4dba53bde2e3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2203,6 +2203,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_BBML1)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
+	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
+
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops)
 		return -ENOMEM;
@@ -3591,6 +3596,20 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	/* IDR3 */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+	switch (FIELD_GET(IDR3_BBML, reg)) {
+	case IDR3_BBML0:
+		break;
+	case IDR3_BBML1:
+		smmu->features |= ARM_SMMU_FEAT_BBML1;
+		break;
+	case IDR3_BBML2:
+		smmu->features |= ARM_SMMU_FEAT_BBML2;
+		break;
+	default:
+		dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
+		return -ENXIO;
+	}
+
 	if (FIELD_GET(IDR3_RIL, reg))
 		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1487a80fdf1b..e15750be1d95 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -54,6 +54,10 @@
 #define IDR1_SIDSIZE			GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3			0xc
+#define IDR3_BBML			GENMASK(12, 11)
+#define IDR3_BBML0			0
+#define IDR3_BBML1			1
+#define IDR3_BBML2			2
 #define IDR3_RIL			(1 << 10)
 
 #define ARM_SMMU_IDR5			0x14
@@ -644,6 +648,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H		(1 << 18)
 #define ARM_SMMU_FEAT_HA		(1 << 19)
 #define ARM_SMMU_FEAT_HD		(1 << 20)
+#define ARM_SMMU_FEAT_BBML1		(1 << 21)
+#define ARM_SMMU_FEAT_BBML2		(1 << 22)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index c2ebfe037f5d..d7626ca67dbf 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -85,6 +85,9 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_ARM_MTK_EXT	BIT(3)
 	#define IO_PGTABLE_QUIRK_ARM_TTBR1	BIT(5)
 	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
+	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
+	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
+
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
-- 
2.17.2
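
A standalone sketch (not from the posted patch) of how the IDR3.BBML
field drives the quirk selection done in arm_smmu_domain_finalise();
the field layout and levels mirror the patch, while the helper and
mask names here are illustrative:

#include <stdint.h>
#include <stdio.h>

#define IDR3_BBML_SHIFT		11
#define IDR3_BBML_MASK		(0x3u << IDR3_BBML_SHIFT)	/* bits [12:11] */

#define QUIRK_BBML1		(1u << 7)
#define QUIRK_BBML2		(1u << 8)

/* Returns the io-pgtable quirk to request, or 0 when BBML is absent. */
static uint32_t bbml_quirk(uint32_t idr3)
{
	switch ((idr3 & IDR3_BBML_MASK) >> IDR3_BBML_SHIFT) {
	case 1:
		return QUIRK_BBML1;	/* nT bit still needed on updates */
	case 2:
		return QUIRK_BBML2;	/* live updates without break-before-make */
	default:
		return 0;
	}
}

int main(void)
{
	uint32_t idr3 = 2u << IDR3_BBML_SHIFT;	/* pretend the SMMU reports BBML2 */

	printf("quirk = %#x\n", bbml_quirk(idr3));
	return 0;
}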


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

The .read_and_clear_dirty() IOMMU domain op takes care of reading the
dirty bits (i.e. the PTE has DBM set and AP[2] clear) and marshalling
them into a bitmap at a given page granularity.

While reading the dirty bits we also set the PTE AP[2] bit, marking
the entry as writable-clean again.

Structure it in a way that the IOPTE walker is generic, passing a
function pointer that decides what to do on a per-PTE basis. This is
useful for a followup patch where we supply an io-pgtable op to enable
DBM when starting/stopping dirty tracking.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Co-developed-by: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 ++++++
 drivers/iommu/io-pgtable-arm.c              | 102 +++++++++++++++++++-
 2 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4dba53bde2e3..232057d20197 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2743,6 +2743,32 @@ static int arm_smmu_enable_nesting(struct iommu_domain *domain)
 	return ret;
 }
 
+static int arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
+					 unsigned long iova, size_t size,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	int ret;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
+		return -ENODEV;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+		return -EINVAL;
+
+	if (!ops || !ops->read_and_clear_dirty) {
+		pr_err_once("io-pgtable don't support dirty tracking\n");
+		return -ENODEV;
+	}
+
+	ret = ops->read_and_clear_dirty(ops, iova, size, dirty);
+
+	return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2871,6 +2897,7 @@ static struct iommu_ops arm_smmu_ops = {
 		.iova_to_phys		= arm_smmu_iova_to_phys,
 		.enable_nesting		= arm_smmu_enable_nesting,
 		.free			= arm_smmu_domain_free,
+		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
 	}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 94ff319ae8ac..3c99028d315a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -75,6 +75,7 @@
 
 #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
@@ -84,7 +85,7 @@
 
 #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
 /* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
+#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)13) << 51)
 #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
 					 ARM_LPAE_PTE_ATTR_HI_MASK)
 /* Software bit for solving coherency races */
@@ -93,6 +94,9 @@
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY_BIT	7
+#define ARM_LPAE_PTE_AP_WRITABLE	(ARM_LPAE_PTE_AP_RDONLY | \
+					 ARM_LPAE_PTE_DBM)
 #define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
 #define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
 
@@ -737,6 +741,101 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
 	return iopte_to_paddr(pte, data) | iova;
 }
 
+static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
+					   arm_lpae_iopte *ptep, void *opaque)
+{
+	struct iommu_dirty_bitmap *dirty = opaque;
+	arm_lpae_iopte pte;
+
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if (pte & ARM_LPAE_PTE_AP_WRITABLE)
+		return 0;
+
+	if (!(pte & ARM_LPAE_PTE_DBM))
+		return 0;
+
+	iommu_dirty_bitmap_record(dirty, iova, size);
+	set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
+	return 0;
+}
+
+static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
+				 unsigned long iova, size_t size,
+				 int lvl, arm_lpae_iopte *ptep,
+				 int (*fn)(unsigned long iova, size_t size,
+					   arm_lpae_iopte *pte, void *opaque),
+				 void *opaque)
+{
+	arm_lpae_iopte pte;
+	struct io_pgtable *iop = &data->iop;
+	size_t base, next_size;
+	int ret;
+
+	if (WARN_ON_ONCE(!fn))
+		return -EINVAL;
+
+	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+		return -EINVAL;
+
+	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+		if (iopte_leaf(pte, lvl, iop->fmt))
+			return fn(iova, size, ptep, opaque);
+
+		/* Current level is table, traverse next level */
+		next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+		ptep = iopte_deref(pte, data);
+		for (base = 0; base < size; base += next_size) {
+			ret = __arm_lpae_iopte_walk(data, iova + base,
+						    next_size, lvl + 1, ptep,
+						    fn, opaque);
+			if (ret)
+				return ret;
+		}
+		return 0;
+	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
+		return fn(iova, size, ptep, opaque);
+	}
+
+	/* Keep on walkin */
+	ptep = iopte_deref(pte, data);
+	return __arm_lpae_iopte_walk(data, iova, size, lvl + 1, ptep,
+				     fn, opaque);
+}
+
+static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
+					 unsigned long iova, size_t size,
+					 struct iommu_dirty_bitmap *dirty)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return -EINVAL;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return -EINVAL;
+
+	if (data->iop.fmt != ARM_64_LPAE_S1 &&
+	    data->iop.fmt != ARM_32_LPAE_S1)
+		return -EINVAL;
+
+	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
+				     __arm_lpae_read_and_clear_dirty, dirty);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -817,6 +916,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.unmap		= arm_lpae_unmap,
 		.unmap_pages	= arm_lpae_unmap_pages,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
+		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
 	};
 
 	return data;
-- 
2.17.2
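
A standalone sketch (not from the posted patch) of the DBM/AP[2] state
that the walker above records: with DBM set, a clear AP[2] means the
device wrote through the page (dirty), and setting AP[2] puts it back
to writable-clean. Bit positions mirror ARM_LPAE_PTE_DBM and
ARM_LPAE_PTE_AP_RDONLY; the helper names are illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t iopte;

#define PTE_DBM		(1ULL << 51)	/* Dirty Bit Modifier */
#define PTE_AP_RDONLY	(1ULL << 7)	/* AP[2]: set = read-only / writable-clean */

static bool pte_is_dirty(iopte pte)
{
	return (pte & PTE_DBM) && !(pte & PTE_AP_RDONLY);
}

/* Report dirtiness, then flip the entry back to writable-clean. */
static bool read_and_clear_dirty(iopte *ptep)
{
	bool dirty = pte_is_dirty(*ptep);

	if (dirty)
		*ptep |= PTE_AP_RDONLY;
	return dirty;
}

int main(void)
{
	iopte pte = PTE_DBM;	/* DBM set, AP[2] clear: hardware marked it dirty */

	printf("first read:  %d\n", read_and_clear_dirty(&pte));	/* 1 */
	printf("second read: %d\n", read_and_clear_dirty(&pte));	/* 0 */
	return 0;
}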


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Similarly to .read_and_clear_dirty(), use the page table walker helper
functions and set the DBM|RDONLY bits, thus switching the IOPTE to
writable-clean.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 ++++++++++++
 drivers/iommu/io-pgtable-arm.c              | 52 +++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 232057d20197..1ca72fcca930 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2769,6 +2769,34 @@ static int arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
 	return ret;
 }
 
+static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
+				       unsigned long iova, size_t size,
+				       struct iommu_iotlb_gather *iotlb_gather,
+				       bool enabled)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	int ret;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
+	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
+		return -ENODEV;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+		return -EINVAL;
+
+	if (!ops || !ops->set_dirty_tracking) {
+		pr_err_once("io-pgtable don't support dirty tracking\n");
+		return -ENODEV;
+	}
+
+	ret = ops->set_dirty_tracking(ops, iova, size, enabled);
+	iommu_iotlb_gather_add_range(iotlb_gather, iova, size);
+
+	return ret;
+}
+
 static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2898,6 +2926,7 @@ static struct iommu_ops arm_smmu_ops = {
 		.enable_nesting		= arm_smmu_enable_nesting,
 		.free			= arm_smmu_domain_free,
 		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
+		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
 	}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 3c99028d315a..361410aa836c 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -76,6 +76,7 @@
 #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
 #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
 #define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
+#define ARM_LPAE_PTE_DBM_BIT		51
 #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
 #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
 #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
@@ -836,6 +837,56 @@ static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
 				     __arm_lpae_read_and_clear_dirty, dirty);
 }
 
+static int __arm_lpae_set_dirty_modifier(unsigned long iova, size_t size,
+					 arm_lpae_iopte *ptep, void *opaque)
+{
+	bool enabled = *((bool *) opaque);
+	arm_lpae_iopte pte;
+
+	pte = READ_ONCE(*ptep);
+	if (WARN_ON(!pte))
+		return -EINVAL;
+
+	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_RDONLY)
+		return -EINVAL;
+
+	if (!(enabled ^ !(pte & ARM_LPAE_PTE_DBM)))
+		return 0;
+
+	pte = enabled ? pte | (ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY) :
+		pte & ~(ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY);
+
+	WRITE_ONCE(*ptep, pte);
+	return 0;
+}
+
+
+static int arm_lpae_set_dirty_tracking(struct io_pgtable_ops *ops,
+				       unsigned long iova, size_t size,
+				       bool enabled)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	long iaext = (s64)iova >> cfg->ias;
+
+	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
+		return -EINVAL;
+
+	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+		iaext = ~iaext;
+	if (WARN_ON(iaext))
+		return -EINVAL;
+
+	if (data->iop.fmt != ARM_64_LPAE_S1 &&
+	    data->iop.fmt != ARM_32_LPAE_S1)
+		return -EINVAL;
+
+	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
+				     __arm_lpae_set_dirty_modifier, &enabled);
+}
+
 static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 {
 	unsigned long granule, page_sizes;
@@ -917,6 +968,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.unmap_pages	= arm_lpae_unmap_pages,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
+		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
 	};
 
 	return data;
-- 
2.17.2
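
A simplified standalone sketch (not from the posted patch) of the
per-PTE modifier: enabling tracking sets DBM|AP[2] so the entry starts
out writable-clean, disabling clears both. The real
__arm_lpae_set_dirty_modifier() additionally rejects genuinely
read-only entries and skips no-op transitions; the bit positions mirror
the patch and the helper name is illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t iopte;

#define PTE_DBM		(1ULL << 51)
#define PTE_AP_RDONLY	(1ULL << 7)

static void set_dirty_modifier(iopte *ptep, bool enable)
{
	if (enable)
		*ptep |= PTE_DBM | PTE_AP_RDONLY;	/* writable-clean; HW clears AP[2] on write */
	else
		*ptep &= ~(PTE_DBM | PTE_AP_RDONLY);	/* back to a plain writable entry */
}

int main(void)
{
	iopte pte = 0;	/* other attribute bits elided for the sketch */

	set_dirty_modifier(&pte, true);
	printf("enabled:  DBM=%d AP[2]=%d\n",
	       !!(pte & PTE_DBM), !!(pte & PTE_AP_RDONLY));
	set_dirty_modifier(&pte, false);
	printf("disabled: DBM=%d AP[2]=%d\n",
	       !!(pte & PTE_DBM), !!(pte & PTE_AP_RDONLY));
	return 0;
}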


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

From: Kunkun Jiang <jiangkunkun@huawei.com>

As nested mode is not upstreamed yet, we only aim to support dirty log
tracking for stage-1 with io-pgtable mapping (meaning SVA mappings are
not supported). If HTTU is supported, we enable the HA/HD bits in the
SMMU CD and transfer the ARM_HD quirk to io-pgtable.

We additionally filter out HD|HA if they are not supported. The CD.HD
bit is not particularly useful unless we toggle the DBM bit in the PTE
entries.

Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[joaomart:Convey HD|HA bits over to the context descriptor
 and update commit message]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
 include/linux/io-pgtable.h                  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1ca72fcca930..5f728f8f20a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
 		 * this substream's traffic
 		 */
 	} else { /* (1) and (2) */
+		struct arm_smmu_device *smmu = smmu_domain->smmu;
+		u64 tcr = cd->tcr;
+
 		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
 		cdptr[2] = 0;
 		cdptr[3] = cpu_to_le64(cd->mair);
 
+		if (!(smmu->features & ARM_SMMU_FEAT_HD))
+			tcr &= ~CTXDESC_CD_0_TCR_HD;
+		if (!(smmu->features & ARM_SMMU_FEAT_HA))
+			tcr &= ~CTXDESC_CD_0_TCR_HA;
+
 		/*
 		 * STE is live, and the SMMU might read dwords of this CD in any
 		 * order. Ensure that it observes valid values before reading
@@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
 			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
 			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
+			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
 			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
 	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
 
@@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};
 
+	if (smmu->features & ARM_SMMU_FEAT_HD)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
 	if (smmu->features & ARM_SMMU_FEAT_BBML1)
 		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
 	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e15750be1d95..ff32242f2fdb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -292,6 +292,9 @@
 #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
 #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
 
+#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
+#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
+
 #define CTXDESC_CD_0_AA64		(1UL << 41)
 #define CTXDESC_CD_0_S			(1UL << 44)
 #define CTXDESC_CD_0_R			(1UL << 45)
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index d7626ca67dbf..a11902ae9cf1 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -87,6 +87,7 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
 	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
 	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
+	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
 
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
-- 
2.17.2
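
A standalone sketch (not from the posted patch) of the CD.TCR filtering
above: HA/HD are requested unconditionally when building the context
descriptor and then masked off if the SMMU did not advertise HTTU. The
bit positions mirror CTXDESC_CD_0_TCR_HA/HD and ARM_SMMU_FEAT_HA/HD;
the function name is illustrative:

#include <stdint.h>
#include <stdio.h>

#define FEAT_HA		(1u << 19)
#define FEAT_HD		(1u << 20)
#define CD_0_TCR_HD	(1ULL << 42)
#define CD_0_TCR_HA	(1ULL << 43)

static uint64_t filter_httu(uint64_t tcr, uint32_t features)
{
	if (!(features & FEAT_HD))
		tcr &= ~CD_0_TCR_HD;
	if (!(features & FEAT_HA))
		tcr &= ~CD_0_TCR_HA;
	return tcr;
}

int main(void)
{
	uint64_t tcr = CD_0_TCR_HA | CD_0_TCR_HD;	/* requested unconditionally */

	/* SMMU only advertises HA: HD must not reach the CD. */
	printf("tcr = %#llx\n", (unsigned long long)filter_httu(tcr, FEAT_HA));
	return 0;
}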


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, Alex Williamson, Joao Martins,
	David Woodhouse, Robin Murphy

Mostly reuse the existing unmap code, with the extra addition of
marshalling dirty bits into a bitmap at a given page granularity. To
tackle the race with hardware dirty-bit updates, switch away from a
plain store to a cmpxchg() and, once it succeeds, check whether the
IOVA was dirtied.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++
 drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++----
 2 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5f728f8f20a2..d1fb757056cc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2499,6 +2499,22 @@ static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long io
 	return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
 }
 
+static size_t arm_smmu_unmap_pages_read_dirty(struct iommu_domain *domain,
+					      unsigned long iova, size_t pgsize,
+					      size_t pgcount,
+					      struct iommu_iotlb_gather *gather,
+					      struct iommu_dirty_bitmap *dirty)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+
+	if (!ops)
+		return 0;
+
+	return ops->unmap_pages_read_dirty(ops, iova, pgsize, pgcount,
+					   gather, dirty);
+}
+
 static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2938,6 +2954,7 @@ static struct iommu_ops arm_smmu_ops = {
 		.free			= arm_smmu_domain_free,
 		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
 		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
+		.unmap_pages_read_dirty	= arm_smmu_unmap_pages_read_dirty,
 	}
 };
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 361410aa836c..143ee7d73f88 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -259,10 +259,30 @@ static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cf
 		__arm_lpae_sync_pte(ptep, 1, cfg);
 }
 
+static bool __arm_lpae_clear_dirty_pte(arm_lpae_iopte *ptep,
+				       struct io_pgtable_cfg *cfg)
+{
+	arm_lpae_iopte tmp;
+	bool dirty = false;
+
+	do {
+		tmp = cmpxchg64(ptep, *ptep, 0);
+		if ((tmp & ARM_LPAE_PTE_DBM) &&
+		    !(tmp & ARM_LPAE_PTE_AP_RDONLY))
+			dirty = true;
+	} while (tmp);
+
+	if (!cfg->coherent_walk)
+		__arm_lpae_sync_pte(ptep, 1, cfg);
+
+	return dirty;
+}
+
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			       struct iommu_iotlb_gather *gather,
 			       unsigned long iova, size_t size, size_t pgcount,
-			       int lvl, arm_lpae_iopte *ptep);
+			       int lvl, arm_lpae_iopte *ptep,
+			       struct iommu_dirty_bitmap *dirty);
 
 static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 				phys_addr_t paddr, arm_lpae_iopte prot,
@@ -306,8 +326,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 			size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
 
 			tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
+
+			/*
+			 * No need for dirty bitmap as arm_lpae_init_pte() is
+			 * only called from __arm_lpae_map()
+			 */
 			if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
-					     lvl, tblp) != sz) {
+					     lvl, tblp, NULL) != sz) {
 				WARN_ON(1);
 				return -EINVAL;
 			}
@@ -564,7 +589,8 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 				       struct iommu_iotlb_gather *gather,
 				       unsigned long iova, size_t size,
 				       arm_lpae_iopte blk_pte, int lvl,
-				       arm_lpae_iopte *ptep, size_t pgcount)
+				       arm_lpae_iopte *ptep, size_t pgcount,
+				       struct iommu_dirty_bitmap *dirty)
 {
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 	arm_lpae_iopte pte, *tablep;
@@ -617,13 +643,15 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 		return num_entries * size;
 	}
 
-	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl, tablep);
+	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
+				lvl, tablep, dirty);
 }
 
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			       struct iommu_iotlb_gather *gather,
 			       unsigned long iova, size_t size, size_t pgcount,
-			       int lvl, arm_lpae_iopte *ptep)
+			       int lvl, arm_lpae_iopte *ptep,
+			       struct iommu_dirty_bitmap *dirty)
 {
 	arm_lpae_iopte pte;
 	struct io_pgtable *iop = &data->iop;
@@ -649,7 +677,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			if (WARN_ON(!pte))
 				break;
 
-			__arm_lpae_clear_pte(ptep, &iop->cfg);
+			if (likely(!dirty))
+				__arm_lpae_clear_pte(ptep, &iop->cfg);
+			else if (__arm_lpae_clear_dirty_pte(ptep, &iop->cfg))
+				iommu_dirty_bitmap_record(dirty, iova, size);
+
 
 			if (!iopte_leaf(pte, lvl, iop->fmt)) {
 				/* Also flush any partial walks */
@@ -671,17 +703,20 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		 * minus the part we want to unmap
 		 */
 		return arm_lpae_split_blk_unmap(data, gather, iova, size, pte,
-						lvl + 1, ptep, pgcount);
+						lvl + 1, ptep, pgcount, dirty);
 	}
 
 	/* Keep on walkin' */
 	ptep = iopte_deref(pte, data);
-	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
+	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
+				lvl + 1, ptep, dirty);
 }
 
-static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
-				   size_t pgsize, size_t pgcount,
-				   struct iommu_iotlb_gather *gather)
+static size_t __arm_lpae_unmap_pages(struct io_pgtable_ops *ops,
+				     unsigned long iova,
+				     size_t pgsize, size_t pgcount,
+				     struct iommu_iotlb_gather *gather,
+				     struct iommu_dirty_bitmap *dirty)
 {
 	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
@@ -697,13 +732,29 @@ static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iov
 		return 0;
 
 	return __arm_lpae_unmap(data, gather, iova, pgsize, pgcount,
-				data->start_level, ptep);
+				data->start_level, ptep, dirty);
+}
+
+static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
+				   size_t pgsize, size_t pgcount,
+				   struct iommu_iotlb_gather *gather)
+{
+	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, NULL);
 }
 
 static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
 			     size_t size, struct iommu_iotlb_gather *gather)
 {
-	return arm_lpae_unmap_pages(ops, iova, size, 1, gather);
+	return __arm_lpae_unmap_pages(ops, iova, size, 1, gather, NULL);
+}
+
+static size_t arm_lpae_unmap_pages_read_dirty(struct io_pgtable_ops *ops,
+					      unsigned long iova,
+					      size_t pgsize, size_t pgcount,
+					      struct iommu_iotlb_gather *gather,
+					      struct iommu_dirty_bitmap *dirty)
+{
+	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, dirty);
 }
 
 static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
@@ -969,6 +1020,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
 		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
+		.unmap_pages_read_dirty     = arm_lpae_unmap_pages_read_dirty,
 	};
 
 	return data;
-- 
2.17.2
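
A standalone sketch (not from the posted patch) of the unmap-time dirty
readout: the entry is torn down atomically and the old value tells us
whether hardware dirtied it in the meantime. A C11 atomic exchange
stands in here for the kernel's cmpxchg64() retry loop; the bit
positions mirror the patch and the helper name is illustrative:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_DBM		(1ULL << 51)
#define PTE_AP_RDONLY	(1ULL << 7)

/* Zero the entry and report whether it was dirty (DBM set, AP[2] clear). */
static bool clear_pte_read_dirty(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_exchange(ptep, 0);

	return (old & PTE_DBM) && !(old & PTE_AP_RDONLY);
}

int main(void)
{
	_Atomic uint64_t pte = PTE_DBM;	/* dirtied by the device before unmap */

	printf("unmap saw dirty: %d\n", clear_pte_read_dirty(&pte));
	printf("unmap saw dirty: %d\n", clear_pte_read_dirty(&pte));	/* already zero */
	return 0;
}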


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

The IOMMU advertises Access/Dirty bits if the extended capability
DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
stage table, though, has no bit for advertising it unless referenced
via a scalable-mode PASID entry. Relevant Intel IOMMU SDM refs are
"3.6.2 Accessed, Extended Accessed, and Dirty Flags" for the first
stage table and "3.7.2 Accessed and Dirty Flags" for the second stage
table.

To enable it, scalable mode for the second-stage table is required, so
limit the use of the dirty bit to scalable mode and discard DMAR
domains configured for first stage. To use SSADS, we set bit 9 (SSADE)
in the scalable-mode PASID table entry. When doing so, flush all iommu
caches. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

The dirty bit of the second-stage PTE sits in the same position (bit 9).
The IOTLB caches some attributes, including dirtiness, when SSADE is
enabled, so we also need to flush the IOTLB to make sure the IOMMU
attempts to set the dirty bit again. The relevant manual chapter on
hardware translation is chapter 6, with special mention of:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
We probably shouldn't be as aggressive as flushing everything; this
needs checking against hardware (and the invalidation guidance) to
understand what exactly needs flushing.
---
 drivers/iommu/intel/iommu.c | 109 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c |  76 +++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h |   7 +++
 include/linux/intel-iommu.h |  14 +++++
 4 files changed, 206 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ce33f85c72ab..92af43f27241 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 	}
 }
 
+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
+					  bool enable)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	unsigned long flags;
+	int ret = -EINVAL;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	if (list_empty(&dmar_domain->devices)) {
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		return ret;
+	}
+
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		if (!info->dev || (info->domain != dmar_domain))
+			continue;
+
+		/* Dirty tracking is second-stage level SM only */
+		if ((info->domain && domain_use_first_level(info->domain)) ||
+		    !ecap_slads(info->iommu->ecap) ||
+		    !sm_supported(info->iommu) || !intel_iommu_sm) {
+			ret = -EOPNOTSUPP;
+			continue;
+		}
+
+		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
+						     info->dev, PASID_RID2PASID,
+						     enable);
+		if (ret)
+			break;
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	/*
+	 * We need to flush the context TLB and IOTLB with any cached
+	 * translations to force incoming DMA requests to have their IOTLB
+	 * entries tagged with A/D bits
+	 */
+	intel_flush_iotlb_all(domain);
+	return ret;
+}
+
+static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &dmar_domain->devices, link) {
+		if (!info->dev || (info->domain != dmar_domain))
+			continue;
+
+		/* Dirty tracking is second-stage level SM only */
+		if ((info->domain && domain_use_first_level(info->domain)) ||
+		    !ecap_slads(info->iommu->ecap) ||
+		    !sm_supported(info->iommu) || !intel_iommu_sm) {
+			ret = -EOPNOTSUPP;
+			continue;
+		}
+
+		if (!intel_pasid_dirty_tracking_enabled(info->iommu, info->domain,
+						 info->dev, PASID_RID2PASID)) {
+			ret = -EINVAL;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	return ret;
+}
+
+static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
+					    unsigned long iova, size_t size,
+					    struct iommu_dirty_bitmap *dirty)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long end = iova + size - 1;
+	unsigned long pgsize;
+	int ret;
+
+	ret = intel_iommu_get_dirty_tracking(domain);
+	if (ret)
+		return ret;
+
+	do {
+		struct dma_pte *pte;
+		int lvl = 0;
+
+		pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl);
+		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
+		if (!pte || !dma_pte_present(pte)) {
+			iova += pgsize;
+			continue;
+		}
+
+		/* The PTE was dirtied by the device, record it in the bitmap */
+		if (dma_sl_pte_test_and_clear_dirty(pte))
+			iommu_dirty_bitmap_record(dirty, iova, pgsize);
+		iova += pgsize;
+	} while (iova < end);
+
+	return 0;
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.domain_alloc		= intel_iommu_domain_alloc,
@@ -5119,6 +5226,8 @@ const struct iommu_ops intel_iommu_ops = {
 		.iotlb_sync		= intel_iommu_tlb_sync,
 		.iova_to_phys		= intel_iommu_iova_to_phys,
 		.free			= intel_iommu_domain_free,
+		.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
+		.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
 	}
 };
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 10fb82ea467d..90c7e018bc5c 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -331,6 +331,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
 }
 
+static inline u64 pasid_get_bits(u64 *ptr)
+{
+	return READ_ONCE(*ptr);
+}
+
 /*
  * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
  * PASID entry.
@@ -389,6 +394,36 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe)
 	pasid_set_bits(&pe->val[0], 1 << 1, 0);
 }
 
+/*
+ * Enable second level A/D bits by setting the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_set_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
+}
+
+/*
+ * Disable second level A/D bits by clearing the SLADE (Second Level
+ * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
+ * entry.
+ */
+static inline void pasid_clear_ssade(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 9, 0);
+}
+
+/*
+ * Check whether second level A/D bits are enabled, i.e. whether the
+ * SLADE (Second Level Access Dirty Enable) field (Bit 9) of a
+ * scalable mode PASID entry is set.
+ */
+static inline bool pasid_get_ssade(struct pasid_entry *pe)
+{
+	return pasid_get_bits(&pe->val[0]) & (1 << 9);
+}
+
 /*
  * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a
  * scalable mode PASID entry.
@@ -725,6 +760,47 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	return 0;
 }
 
+/*
+ * Set up dirty tracking on a second only translation type.
+ */
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled)
+{
+	struct pasid_entry *pte;
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		return -ENODEV;
+	}
+
+	if (enabled)
+		pasid_set_ssade(pte);
+	else
+		pasid_clear_ssade(pte);
+	return 0;
+}
+
+/*
+ * Check if dirty tracking is enabled on a second only translation type.
+ */
+bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
+					struct dmar_domain *domain,
+					struct device *dev, u32 pasid)
+{
+	struct pasid_entry *pte;
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte) {
+		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		return false;
+	}
+
+	return pasid_get_ssade(pte);
+}
+
 /*
  * Set up the scalable mode pasid entry for passthrough translation type.
  */
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index ab4408c824a5..3dab86017228 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -115,6 +115,13 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
+int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
+				     struct dmar_domain *domain,
+				     struct device *dev, u32 pasid,
+				     bool enabled);
+bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
+					struct dmar_domain *domain,
+					struct device *dev, u32 pasid);
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 5cfda90b2cca..1328d1805197 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -47,6 +47,9 @@
 #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
 #define DMA_FL_PTE_XD		BIT_ULL(63)
 
+#define DMA_SL_PTE_DIRTY_BIT	9
+#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
+
 #define ADDR_WIDTH_5LEVEL	(57)
 #define ADDR_WIDTH_4LEVEL	(48)
 
@@ -677,6 +680,17 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 	return (pte->val & 3) != 0;
 }
 
+static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
+{
+	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
+}
+
+static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
+{
+	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
+				  (unsigned long *)&pte->val);
+}
+
 static inline bool dma_pte_superpage(struct dma_pte *pte)
 {
 	return (pte->val & DMA_PTE_LARGE_PAGE);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* [PATCH RFC 19/19] iommu/intel: Add unmap_read_dirty() support
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-28 21:09   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-28 21:09 UTC (permalink / raw)
  To: iommu
  Cc: Joao Martins, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

Similar to the other IOMMUs, base unmap_read_dirty() on how unmap()
works, with the exception of a non-racy clear of the PTE so that it
can return whether the entry was dirty or not.
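
The requirement boils down to an atomic swap that zaps the PTE and
reports what was previously there; a minimal sketch (the patch below
spells this out with a cmpxchg64() loop instead):

static inline bool example_clear_pte_report_dirty(struct dma_pte *pte)
{
	/* Atomically zero the PTE and test the dirty bit of the old value */
	return xchg(&pte->val, 0ULL) & DMA_SL_PTE_DIRTY;
}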

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/iommu/intel/iommu.c | 43 ++++++++++++++++++++++++++++---------
 include/linux/intel-iommu.h | 16 ++++++++++++++
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 92af43f27241..e80e98f5202b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1317,7 +1317,8 @@ static void dma_pte_list_pagetables(struct dmar_domain *domain,
 static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 				struct dma_pte *pte, unsigned long pfn,
 				unsigned long start_pfn, unsigned long last_pfn,
-				struct list_head *freelist)
+				struct list_head *freelist,
+				struct iommu_dirty_bitmap *dirty)
 {
 	struct dma_pte *first_pte = NULL, *last_pte = NULL;
 
@@ -1338,7 +1339,11 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 			if (level > 1 && !dma_pte_superpage(pte))
 				dma_pte_list_pagetables(domain, level - 1, pte, freelist);
 
-			dma_clear_pte(pte);
+			if (dma_clear_pte_dirty(pte) && dirty)
+				iommu_dirty_bitmap_record(dirty,
+					pfn << VTD_PAGE_SHIFT,
+					level_size(level) << VTD_PAGE_SHIFT);
+
 			if (!first_pte)
 				first_pte = pte;
 			last_pte = pte;
@@ -1347,7 +1352,7 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
 			dma_pte_clear_level(domain, level - 1,
 					    phys_to_virt(dma_pte_addr(pte)),
 					    level_pfn, start_pfn, last_pfn,
-					    freelist);
+					    freelist, dirty);
 		}
 next:
 		pfn = level_pfn + level_size(level);
@@ -1362,7 +1367,8 @@ static void dma_pte_clear_level(struct dmar_domain *domain, int level,
    the page tables, and may have cached the intermediate levels. The
    pages can only be freed after the IOTLB flush has been done. */
 static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
-			 unsigned long last_pfn, struct list_head *freelist)
+			 unsigned long last_pfn, struct list_head *freelist,
+			 struct iommu_dirty_bitmap *dirty)
 {
 	BUG_ON(!domain_pfn_supported(domain, start_pfn));
 	BUG_ON(!domain_pfn_supported(domain, last_pfn));
@@ -1370,7 +1376,8 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 
 	/* we don't need lock here; nobody else touches the iova range */
 	dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
-			    domain->pgd, 0, start_pfn, last_pfn, freelist);
+			    domain->pgd, 0, start_pfn, last_pfn, freelist,
+			    dirty);
 
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
@@ -2031,7 +2038,8 @@ static void domain_exit(struct dmar_domain *domain)
 	if (domain->pgd) {
 		LIST_HEAD(freelist);
 
-		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
+		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist,
+			     NULL);
 		put_pages_list(&freelist);
 	}
 
@@ -4125,7 +4133,8 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
 			struct intel_iommu *iommu;
 			LIST_HEAD(freelist);
 
-			domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);
+			domain_unmap(si_domain, start_vpfn, last_vpfn,
+				     &freelist, NULL);
 
 			rcu_read_lock();
 			for_each_active_iommu(iommu, drhd)
@@ -4737,7 +4746,8 @@ static int intel_iommu_map_pages(struct iommu_domain *domain,
 
 static size_t intel_iommu_unmap(struct iommu_domain *domain,
 				unsigned long iova, size_t size,
-				struct iommu_iotlb_gather *gather)
+				struct iommu_iotlb_gather *gather,
+				struct iommu_dirty_bitmap *dirty)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	unsigned long start_pfn, last_pfn;
@@ -4753,7 +4763,7 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	start_pfn = iova >> VTD_PAGE_SHIFT;
 	last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
 
-	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist);
+	domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist, dirty);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
@@ -4771,7 +4781,19 @@ static size_t intel_iommu_unmap_pages(struct iommu_domain *domain,
 	unsigned long pgshift = __ffs(pgsize);
 	size_t size = pgcount << pgshift;
 
-	return intel_iommu_unmap(domain, iova, size, gather);
+	return intel_iommu_unmap(domain, iova, size, gather, NULL);
+}
+
+static size_t intel_iommu_unmap_read_dirty(struct iommu_domain *domain,
+					   unsigned long iova,
+					   size_t pgsize, size_t pgcount,
+					   struct iommu_iotlb_gather *gather,
+					   struct iommu_dirty_bitmap *dirty)
+{
+	unsigned long pgshift = __ffs(pgsize);
+	size_t size = pgcount << pgshift;
+
+	return intel_iommu_unmap(domain, iova, size, gather, dirty);
 }
 
 static void intel_iommu_tlb_sync(struct iommu_domain *domain,
@@ -5228,6 +5250,7 @@ const struct iommu_ops intel_iommu_ops = {
 		.free			= intel_iommu_domain_free,
 		.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
 		.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
+		.unmap_pages_read_dirty = intel_iommu_unmap_read_dirty,
 	}
 };
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 1328d1805197..c7f0801ccba6 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -664,6 +664,22 @@ static inline void dma_clear_pte(struct dma_pte *pte)
 	pte->val = 0;
 }
 
+static inline bool dma_clear_pte_dirty(struct dma_pte *pte)
+{
+	bool dirty = false;
+	u64 val;
+
+	val = READ_ONCE(pte->val);
+
+	do {
+		val = cmpxchg64(&pte->val, val, 0);
+		if (val & DMA_SL_PTE_DIRTY)
+			dirty = true;
+	} while (val);
+
+	return dirty;
+}
+
 static inline u64 dma_pte_addr(struct dma_pte *pte)
 {
 #ifdef CONFIG_64BIT
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-28 21:09 ` Joao Martins
@ 2022-04-29  5:45   ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  5:45 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Presented herewith is a series that extends IOMMUFD to have IOMMU
> hardware support for dirty bit in the IOPTEs.
> 
> Today, AMD Milan (which been out for a year now) supports it while ARM
> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
> The intended use-case is to support Live Migration with SR-IOV, with

this should not be restricted to SR-IOV.

> IOMMUs
> that support it. Yishai Hadas will be soon submiting an RFC that covers the
> PCI device dirty tracker via vfio.
> 
> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
> set of a IOVA ranges mapped to some physical memory composing an IO
> pagetable. This is then attached to a particular device, consequently
> creating the protection domain to share a common IO page table
> representing the endporint DMA-addressable guest address space.
> (Hopefully I am not twisting the terminology here) The resultant object

Just remove VMM/guest/... since iommufd is not specific to virtualization. 

> is an hw_pagetable object which represents the iommu_domain
> object that will be directly manipulated. For more background on
> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
> kAPI is then extended to provide:
> 
>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
> as the most common case of changing hardware protection domain control

didn't get what 'most common case' here tries to explain

> bits, and ARM specific case of having to enable the per-PTE DBM control
> bit. The 'real' tracking of whether dirty tracking is enabled or not is
> stored in the vendor IOMMU, hence no new fields are added to iommufd
> pagetable structures.
> 
>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
> thus describe the IOVAs that got written by the device. While performing
> the marshalling also vendors need to clear the dirty bits from IOPTE and

s/vendors/iommu drivers/ 

> allow the kAPI caller to batch the much needed IOTLB flush.
> There's no copy of bitmaps to userspace backed memory, all is zerocopy
> based. So far this is a test-and-clear kind of interface given that the
> IOPT walk is going to be expensive. It occured to me to separate
> the readout of dirty, and the clearing of dirty from IOPTEs.
> I haven't opted for that one, given that it would mean two lenghty IOPTE
> walks and felt counter-performant.

me too. that doesn't feel like a performant way.

> 
>  3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific for non-nested vIOMMU case where an
> erronous guest (or device) DMAing to an address being unmapped at the
> same time.

An erroneous attempt like the above cannot anticipate which DMAs will
succeed in that window, so the end behavior is undefined. For an
undefined behavior nothing will be broken by losing some bits dirtied
in the window between reading back dirty bits of the range and
actually calling unmap. From guest p.o.v. all those are black-box
hardware logic to serve a virtual iotlb invalidation request which just
cannot be completed in one cycle.

Hence in reality this is probably not required except to meet the vfio
compat requirement. Conceptually, though, returning dirty bits at unmap
is more accurate.

I'm slightly inclined to abandon it in iommufd uAPI.
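
To make that alternative concrete, it amounts to calling the two
primitives back to back; a sketch with illustrative naming and no
error handling:

static size_t unmap_after_read_dirty(struct iommu_domain *domain,
				     unsigned long iova, size_t pgsize,
				     size_t pgcount,
				     struct iommu_iotlb_gather *gather,
				     struct iommu_dirty_bitmap *dirty)
{
	/* Harvest dirty state for the range first (best effort)... */
	domain->ops->read_and_clear_dirty(domain, iova, pgsize * pgcount,
					  dirty);

	/* ...then tear the mappings down through the regular unmap path */
	return domain->ops->unmap_pages(domain, iova, pgsize, pgcount, gather);
}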

> 
> [See at the end too, on general remarks, specifically the one regarding
>  probing dirty tracking via a dedicated iommufd cap ioctl]
> 
> The series is organized as follows:
> 
> * Patches 1-3: Takes care of the iommu domain operations to be added and
> extends iommufd io-pagetable to set/clear dirty tracking, as well as
> reading the dirty bits from the vendor pagetables. The idea is to abstract
> iommu vendors from any idea of how bitmaps are stored or propagated
> back to
> the caller, as well as allowing control/batching over IOTLB flush. So
> there's a data structure and an helper that only tells the upper layer that
> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking

why do we need another pinning here? any page mapped in iommu page
table is supposed to have been pinned already...

> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
> just has
> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
> vendor IOMMU implementor.
> 
> * Patches 4-5: Adds the new unmap domain op that returns whether the
> IOVA
> got dirtied. I separated this from the rest of the set, as I am still
> questioning the need for this API and whether this race needs to be
> fundamentally be handled. I guess the thinking is that live-migration
> should be guest foolproof, but how much the race happens in pratice to
> deem this as a necessary unmap variant. Perhaps maybe it might be enough
> fetching dirty bits prior to the unmap? Feedback appreciated.

I think so as aforementioned.

> 
> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
> We should discuss whether to include the vfio-compat or not. Given how
> vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
> with the IOMMU hw support. I haven't implemented the perpectual dirtying
> given his lack of usefullness over an IOMMU-backed implementation (or so
> I think). The selftests, test mainly the principal workflow, still needs
> to get added more corner cases.

Or, put another way, could we keep vfio-compat as type1 does today, i.e.
restricting iommu dirty tracking to the iommufd native uAPI only?

> 
> Note: Given that there's no capability for new APIs, or page sizes or
> etc, the userspace app using IOMMUFD native API would gather -
> EOPNOTSUPP
> when dirty tracking is not supported by the IOMMU hardware.
> 
> For completeness and most importantly to make sure the new IOMMU core
> ops
> capture the hardware blocks, all the IOMMUs that will eventually get IOMMU
> A/D
> support were implemented. So the next half of the series presents *proof of
> concept* implementations for IOMMUs:
> 
> * Patches 9-11: AMD IOMMU implementation, particularly on those having
> HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
> and also on a AMD Milan server IOMMU.
> 
> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
> to do the dynamic set/clear dirty tracking, and immplicitly clearing
> dirty bits on the readout. Given the lack of hardware and difficulty
> to get this in an emulated SMMUv3 (given the dependency on the PE HTTU
> and BBML2, IIUC) then this is only compiled tested. Hopefully I am not
> getting the attribution wrong.
> 
> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
> based intel-iommu with SSADS/SLADS emulation support.
> 
> To help testing/prototypization, qemu iommu emulation bits were written
> to increase coverage of this code and hopefully make this more broadly
> available for fellow contributors/devs. A separate series is submitted right
> after this covering the Qemu IOMMUFD extensions for dirty tracking,
> alongside
> its x86 iommus emulation A/D bits. Meanwhile it's also on github
> (https://github.com/jpemartins/qemu/commits/iommufd)
> 
> Remarks / Observations:
> 
> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks

there was discussion about adding a device capability uAPI somewhere.

> what has access in each of the newly added ops. Initially I was thinking to
> have a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather
> than
> bailing out with EOPNOTSUP) as well as an get_dirty_tracking
> iommu-core API. On the UAPI, perhaps it might be better to have a single API
> for capabilities in general (similar to KVM)  and at the simplest is a subop
> where the necessary info is conveyed on a per-subop basis?

Probably this can be reported as a device cap, as support for the dirty
bit is an immutable property of the iommu serving that device. Userspace
can enable dirty tracking on a hwpt if all attached devices claim the
support and the kernel will do the same verification.

btw do we still want to keep vfio type1 behavior as the fallback i.e. mark
all pinned pages as dirty when iommu dirty support is missing? From uAPI
naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
implementation like vfio type1.
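
Something along these lines, where the structure layout and the
capability flag are placeholders rather than existing code:

static bool example_hwpt_supports_dirty(struct iommufd_hw_pagetable *hwpt)
{
	struct iommufd_device *idev;

	/* every attached device must sit behind an IOMMU with the cap */
	list_for_each_entry(idev, &hwpt->devices, devices_item)
		if (!device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING))
			return false;

	return true;
}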

> 
> * The UAPI/kAPI could be generalized over the next iteration to also cover
> Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
> It wasn't done, as I was not aware of a use-case. I am wondering
> if the access-bits could be used to do some form of zero page detection
> (to just send the pages that got touched), although dirty-bits could be
> used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE

I'm not a fan of adding support for uncertain usages. Compared to this
I'd give higher priority to large page break-down, as without it it's
hard to find a real-world deployment for this work. 😊

> walk and marshalling into bitmaps as well as the necessary IOTLB flush
> batching are all the same. The focus is on dirty bit given that the
> dirtyness IOVA feedback is used to select the pages that need to be
> transfered
> to the destination while migration is happening.
> Sidebar: Sadly, there's a lot less clever possible tricks that can be
> done (compared to the CPU/KVM) without having the PCI device cooperate
> (like
> userfaultfd, wrprotect, etc as those would turn into nepharious IOMMU
> perm faults and devices with DMA target aborts).
> If folks thing the UAPI/iommu-kAPI should be agnostic to any PTE A/D
> bits, we can instead have the ioctls be named after
> HWPT_SET_TRACKING() and add another argument which asks which bits to
> enabling tracking
> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
> Likewise for the read_and_clear() as all PTE bits follow the same logic
> as dirty. Happy to readjust if folks think it is worthwhile.
> 
> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
> only care about the first stage of IOMMU pagetables for hypervisors i.e.
> tracking dirty GPAs (and not caring about dirty GIOVAs).

Hypervisor uses second-stage while guest manages first-stage in nesting.

> 
> * Dirty bit tracking only, is not enough. Large IO pages tend to be the norm
> when DMA mapping large ranges of IOVA space, when really the VMM wants
> the
> smallest granularity possible to track(i.e. host base pages). A separate bit
> of work will need to take care demoting IOPTE page sizes at guest-runtime to
> increase/decrease the dirty tracking granularity, likely under the form of a
> IOAS demote/promote page-size within a previously mapped IOVA range.
> 
> Feedback is very much appreciated!

Thanks for the work!

> 
> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
> iommufd_jgg@nvidia.com/
> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-
> zhukeqian1@huawei.com/
> 
> 	Joao
> 
> TODOs:
> * More selftests for large/small iopte sizes;
> * Better vIOMMU+VFIO testing (AMD doesn't support it);
> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
> * Testing with a live migrateable VF;
> 
> Jean-Philippe Brucker (1):
>   iommu/arm-smmu-v3: Add feature detection for HTTU
> 
> Joao Martins (16):
>   iommu: Add iommu_domain ops for dirty tracking
>   iommufd: Dirty tracking for io_pagetable
>   iommufd: Dirty tracking data support
>   iommu: Add an unmap API that returns dirtied IOPTEs
>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>   iommufd: Add a test for dirty tracking ioctls
>   iommu/amd: Access/Dirty bit support in IOPTEs
>   iommu/amd: Add unmap_read_dirty() support
>   iommu/amd: Print access/dirty bits if supported
>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>   iommu/intel: Access/Dirty bit support for SL domains
>   iommu/intel: Add unmap_read_dirty() support
> 
> Kunkun Jiang (2):
>   iommu/arm-smmu-v3: Add feature detection for BBML
>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
> 
>  drivers/iommu/amd/amd_iommu.h               |   1 +
>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>  drivers/iommu/amd/init.c                    |  12 +-
>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>  drivers/iommu/intel/pasid.h                 |   7 +
>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>  drivers/iommu/iommu.c                       |  71 +++++-
>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>  drivers/iommu/iommufd/main.c                |   9 +
>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>  include/linux/intel-iommu.h                 |  30 +++
>  include/linux/io-pgtable.h                  |  20 ++
>  include/linux/iommu.h                       |  64 +++++
>  include/uapi/linux/iommufd.h                |  78 ++++++
>  tools/testing/selftests/iommu/Makefile      |   1 +
>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>  28 files changed, 2047 insertions(+), 75 deletions(-)
> 
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
@ 2022-04-29  5:45   ` Tian, Kevin
  0 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  5:45 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, Martins, Joao,
	David Woodhouse, Robin Murphy

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Presented herewith is a series that extends IOMMUFD to have IOMMU
> hardware support for dirty bit in the IOPTEs.
> 
> Today, AMD Milan (which been out for a year now) supports it while ARM
> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
> The intended use-case is to support Live Migration with SR-IOV, with

this should not be restricted to SR-IOV.

> IOMMUs
> that support it. Yishai Hadas will be soon submiting an RFC that covers the
> PCI device dirty tracker via vfio.
> 
> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
> set of a IOVA ranges mapped to some physical memory composing an IO
> pagetable. This is then attached to a particular device, consequently
> creating the protection domain to share a common IO page table
> representing the endporint DMA-addressable guest address space.
> (Hopefully I am not twisting the terminology here) The resultant object

Just remove VMM/guest/... since iommufd is not specific to virtualization. 

> is an hw_pagetable object which represents the iommu_domain
> object that will be directly manipulated. For more background on
> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
> kAPI is then extended to provide:
> 
>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
> as the most common case of changing hardware protection domain control

didn't get what 'most common case' here tries to explain

> bits, and ARM specific case of having to enable the per-PTE DBM control
> bit. The 'real' tracking of whether dirty tracking is enabled or not is
> stored in the vendor IOMMU, hence no new fields are added to iommufd
> pagetable structures.
> 
>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
> thus describe the IOVAs that got written by the device. While performing
> the marshalling also vendors need to clear the dirty bits from IOPTE and

s/vendors/iommu drivers/ 

> allow the kAPI caller to batch the much needed IOTLB flush.
> There's no copy of bitmaps to userspace backed memory, all is zerocopy
> based. So far this is a test-and-clear kind of interface given that the
> IOPT walk is going to be expensive. It occured to me to separate
> the readout of dirty, and the clearing of dirty from IOPTEs.
> I haven't opted for that one, given that it would mean two lenghty IOPTE
> walks and felt counter-performant.

me too. that doesn't feel like a performant way.

> 
>  3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific for non-nested vIOMMU case where an
> erronous guest (or device) DMAing to an address being unmapped at the
> same time.

An erroneous attempt like the above cannot anticipate which DMAs will
succeed in that window, so the end behavior is undefined. For an
undefined behavior nothing will be broken by losing some bits dirtied
in the window between reading back dirty bits of the range and
actually calling unmap. From guest p.o.v. all those are black-box
hardware logic to serve a virtual iotlb invalidation request which just
cannot be completed in one cycle.

Hence in reality this is probably not required except to meet the vfio
compat requirement. Conceptually, though, returning dirty bits at unmap
is more accurate.

I'm slightly inclined to abandon it in iommufd uAPI.

> 
> [See at the end too, on general remarks, specifically the one regarding
>  probing dirty tracking via a dedicated iommufd cap ioctl]
> 
> The series is organized as follows:
> 
> * Patches 1-3: Takes care of the iommu domain operations to be added and
> extends iommufd io-pagetable to set/clear dirty tracking, as well as
> reading the dirty bits from the vendor pagetables. The idea is to abstract
> iommu vendors from any idea of how bitmaps are stored or propagated
> back to
> the caller, as well as allowing control/batching over IOTLB flush. So
> there's a data structure and an helper that only tells the upper layer that
> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking

why do we need another pinning here? any page mapped in iommu page
table is supposed to have been pinned already...

> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
> just has
> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
> vendor IOMMU implementor.
> 
> * Patches 4-5: Adds the new unmap domain op that returns whether the
> IOVA
> got dirtied. I separated this from the rest of the set, as I am still
> questioning the need for this API and whether this race needs to be
> fundamentally be handled. I guess the thinking is that live-migration
> should be guest foolproof, but how much the race happens in pratice to
> deem this as a necessary unmap variant. Perhaps maybe it might be enough
> fetching dirty bits prior to the unmap? Feedback appreciated.

I think so as aforementioned.

> 
> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
> We should discuss whether to include the vfio-compat or not. Given how
> vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
> with the IOMMU hw support. I haven't implemented the perpectual dirtying
> given his lack of usefullness over an IOMMU-backed implementation (or so
> I think). The selftests, test mainly the principal workflow, still needs
> to get added more corner cases.

Or, put another way, could we keep vfio-compat as type1 does today, i.e.
restricting iommu dirty tracking to the iommufd native uAPI only?

> 
> Note: Given that there's no capability for new APIs, or page sizes or
> etc, the userspace app using IOMMUFD native API would gather -
> EOPNOTSUPP
> when dirty tracking is not supported by the IOMMU hardware.
> 
> For completeness and most importantly to make sure the new IOMMU core
> ops
> capture the hardware blocks, all the IOMMUs that will eventually get IOMMU
> A/D
> support were implemented. So the next half of the series presents *proof of
> concept* implementations for IOMMUs:
> 
> * Patches 9-11: AMD IOMMU implementation, particularly on those having
> HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
> and also on a AMD Milan server IOMMU.
> 
> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
> to do the dynamic set/clear dirty tracking, and immplicitly clearing
> dirty bits on the readout. Given the lack of hardware and difficulty
> to get this in an emulated SMMUv3 (given the dependency on the PE HTTU
> and BBML2, IIUC) then this is only compiled tested. Hopefully I am not
> getting the attribution wrong.
> 
> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
> based intel-iommu with SSADS/SLADS emulation support.
> 
> To help testing/prototypization, qemu iommu emulation bits were written
> to increase coverage of this code and hopefully make this more broadly
> available for fellow contributors/devs. A separate series is submitted right
> after this covering the Qemu IOMMUFD extensions for dirty tracking,
> alongside
> its x86 iommus emulation A/D bits. Meanwhile it's also on github
> (https://github.com/jpemartins/qemu/commits/iommufd)
> 
> Remarks / Observations:
> 
> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks

there was discussion about adding a device capability uAPI somewhere.

> what it has access to in each of the newly added ops. Initially I was
> thinking to have a HWPT_GET_DIRTY to probe how dirty tracking is supported
> (rather than bailing out with EOPNOTSUPP) as well as a get_dirty_tracking
> iommu-core API. On the UAPI, perhaps it might be better to have a single
> API for capabilities in general (similar to KVM), at its simplest a subop
> where the necessary info is conveyed on a per-subop basis?

probably this can be reported as a device cap, as support for the dirty bit
is an immutable property of the iommu serving that device. Userspace can
enable dirty tracking on a hwpt if all attached devices claim the support,
and the kernel will do the same verification.
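
A minimal sketch of that verification at enable time (names are made up
here; the capability flag, the hwpt device list and the helper are all
assumptions, not something this series defines):

/*
 * Hypothetical sketch: only enable dirty tracking on a hw_pagetable when
 * every attached device's IOMMU claims support for it.
 * IOMMU_CAP_DIRTY_TRACKING and the hwpt->devices list are assumed names.
 */
static int iommufd_hwpt_enable_dirty(struct iommufd_hw_pagetable *hwpt)
{
	struct iommufd_device *idev;

	list_for_each_entry(idev, &hwpt->devices, devices_item) {
		if (!iommu_capable(idev->dev->bus, IOMMU_CAP_DIRTY_TRACKING))
			return -EOPNOTSUPP;
	}

	return iopt_set_dirty_tracking(&hwpt->ioas->iopt, hwpt->domain, true);
}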

btw do we still want to keep the vfio type1 behavior as the fallback, i.e.
mark all pinned pages as dirty when iommu dirty support is missing? From a
uAPI naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
implementation like vfio type1.

> 
> * The UAPI/kAPI could be generalized over the next iteration to also cover
> the Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
> It wasn't done, as I was not aware of a use-case. I am wondering
> if the access-bits could be used to do some form of zero page detection
> (to just send the pages that got touched), although dirty-bits could be
> used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE

I'm not a fan of adding support for uncertain usages. Compared to this
I'd give higher priority to large page break-down, as w/o it it's hard to
find real-world deployment of this work. 😊

> walk and marshalling into bitmaps, as well as the necessary IOTLB flush
> batching, are all the same. The focus is on the dirty bit given that the
> IOVA dirtiness feedback is used to select the pages that need to be
> transferred to the destination while migration is happening.
> Sidebar: Sadly, there are far fewer clever tricks that can be
> done (compared to the CPU/KVM) without having the PCI device cooperate
> (like userfaultfd, wrprotect, etc, as those would turn into nefarious
> IOMMU perm faults and devices with DMA target aborts).
> If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
> bits, we can instead have the ioctls be named after
> HWPT_SET_TRACKING() and add another argument which selects which bits to
> enable tracking for
> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
> Likewise for the read_and_clear(), as all PTE bits follow the same logic
> as dirty. Happy to readjust if folks think it is worthwhile.
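
For illustration, such a generalized ioctl could look roughly like the
sketch below; only the HWPT_SET_TRACKING() name and the IOMMUFD_* selectors
come from the paragraph above, the struct layout and field names are made
up:

/* Hypothetical uAPI sketch of the HWPT_SET_TRACKING() idea above. */
enum iommufd_tracking_bits {
	IOMMUFD_DIRTY		= 1 << 0,
	IOMMUFD_ACCESS		= 1 << 1,
	IOMMUFD_ACCESS_NONCPU	= 1 << 2,
};

struct iommu_hwpt_set_tracking {
	__u32 size;	/* sizeof(struct iommu_hwpt_set_tracking) */
	__u32 hwpt_id;	/* hw_pagetable object to operate on */
	__u32 tracking;	/* which PTE bits to track, IOMMUFD_* above */
	__u32 enable;	/* 1 = start tracking, 0 = stop */
};

The read_and_clear() side would presumably take the same 'tracking'
selector.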
> 
> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
> only care about the first stage of IOMMU pagetables for hypervisors i.e.
> tracking dirty GPAs (and not caring about dirty GIOVAs).

The hypervisor uses the second-stage while the guest manages the first-stage in nesting.

> 
> * Dirty bit tracking alone is not enough. Large IO pages tend to be the
> norm when DMA mapping large ranges of IOVA space, when really the VMM
> wants the smallest granularity possible to track (i.e. host base pages).
> A separate bit of work will need to take care of demoting IOPTE page sizes
> at guest runtime to increase/decrease the dirty tracking granularity,
> likely in the form of an IOAS demote/promote page-size operation within a
> previously mapped IOVA range.
> 
> Feedback is very much appreciated!

Thanks for the work!

> 
> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@huawei.com/
> 
> 	Joao
> 
> TODOs:
> * More selftests for large/small iopte sizes;
> * Better vIOMMU+VFIO testing (AMD doesn't support it);
> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
> * Testing with a live migrateable VF;
> 
> Jean-Philippe Brucker (1):
>   iommu/arm-smmu-v3: Add feature detection for HTTU
> 
> Joao Martins (16):
>   iommu: Add iommu_domain ops for dirty tracking
>   iommufd: Dirty tracking for io_pagetable
>   iommufd: Dirty tracking data support
>   iommu: Add an unmap API that returns dirtied IOPTEs
>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>   iommufd: Add a test for dirty tracking ioctls
>   iommu/amd: Access/Dirty bit support in IOPTEs
>   iommu/amd: Add unmap_read_dirty() support
>   iommu/amd: Print access/dirty bits if supported
>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>   iommu/intel: Access/Dirty bit support for SL domains
>   iommu/intel: Add unmap_read_dirty() support
> 
> Kunkun Jiang (2):
>   iommu/arm-smmu-v3: Add feature detection for BBML
>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
> 
>  drivers/iommu/amd/amd_iommu.h               |   1 +
>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>  drivers/iommu/amd/init.c                    |  12 +-
>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>  drivers/iommu/intel/pasid.h                 |   7 +
>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>  drivers/iommu/iommu.c                       |  71 +++++-
>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>  drivers/iommu/iommufd/main.c                |   9 +
>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>  include/linux/intel-iommu.h                 |  30 +++
>  include/linux/io-pgtable.h                  |  20 ++
>  include/linux/iommu.h                       |  64 +++++
>  include/uapi/linux/iommufd.h                |  78 ++++++
>  tools/testing/selftests/iommu/Makefile      |   1 +
>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>  28 files changed, 2047 insertions(+), 75 deletions(-)
> 
> --
> 2.17.2

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  7:54     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  7:54 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Add to iommu domain operations a set of callbacks to
> perform dirty tracking, particularly to start and stop
> tracking and finally to test and clear the dirty data.

to be consistent with other context, s/test/read/

> 
> Drivers are expected to dynamically change their hw protection
> domain bits to toggle the tracking and flush some form of

'hw protection domain bits' sounds a bit weird. what about
just using 'translation structures'?

> control state structure that stands in the IOVA translation
> path.
> 
> For reading and clearing dirty data, in all IOMMUs a transition
> of any of the PTE access bits (Access, Dirty) implies flushing
> the IOTLB to invalidate any stale IOTLB data about whether
> or not the IOMMU should update the said PTEs. The iommu core APIs
> introduce a new structure for storing the dirties, albeit vendor
> IOMMUs implementing .read_and_clear_dirty() just use

s/vendor IOMMUs/iommu drivers/

btw, according to past history on the iommu mailing list it sounds like
'vendor' is not a welcome term in the kernel, while there are
many occurrences of it in this series.

[...]
> Although, the ARM SMMUv3 case is a tad different from its x86
> counterparts. Rather than changing *only* the IOMMU domain device entry
> to enable dirty tracking (and having a dedicated bit for dirtiness in the
> IOPTE), ARM instead uses a dirty-bit modifier which is separately enabled,
> and changes the *existing* meaning of the access bits (for ro/rw), to the
> point that marking the access bit read-only with the dirty-bit modifier
> enabled doesn't trigger a perm IO page fault.
> 
> In practice this means that changing the iommu context isn't enough
> and is in fact mostly useless IIUC (and can always be enabled). Dirtying
> is only really enabled when the DBM pte bit is enabled (with the
> CD.HD bit as a prereq).
> 
> To capture this h/w construct an iommu core API is added which enables
> dirty tracking on an IOVA range rather than a device/context entry.
> iommufd picks one or the other, and the IOMMUFD core will favour the
> device-context op, falling back to the IOVA-range alternative.

Above doesn't convince me on the necessity of introducing two ops
here. Even for ARM it can accept a per-domain op and then walk the
page table to manipulate any modifier for existing mappings. It
doesn't matter whether it sets one bit in the context entry or multiple
bits in the page table.

[...]
> +

Missing comment for this function.

> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap
> *dirty,
> +				       unsigned long iova, unsigned long length)
> +{
> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> +
> +	nbits = max(1UL, length >> dirty->pgshift);
> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> +	start_offset = dirty->start_offset;

could you elaborate on the purpose of dirty->start_offset? Why doesn't
dirty->iova start at offset 0 of the bitmap?

> +
> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> +		bitmap_set(kaddr, offset, size);
> +		kunmap(dirty->pages[idx]);

what about the overhead of kmap/kunmap when it's done for every
dirtied page (as done in patch 18)?
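
One way the overhead could perhaps be reduced (a sketch only; the
cur_idx/cur_kaddr fields would be hypothetical additions to
iommu_dirty_bitmap, and kmap_local_page() is used instead of kmap()) is to
keep the current bitmap page mapped across consecutive records and only
remap when the page index changes:

/*
 * Sketch: cache the mapping of the current bitmap page so that
 * consecutive iommu_dirty_bitmap_record() calls hitting the same page
 * skip the map/unmap cycle. cur_idx/cur_kaddr are hypothetical fields.
 */
static unsigned long *dirty_bitmap_get_kaddr(struct iommu_dirty_bitmap *dirty,
					     unsigned long idx)
{
	if (dirty->cur_kaddr && dirty->cur_idx == idx)
		return dirty->cur_kaddr;

	if (dirty->cur_kaddr)
		kunmap_local(dirty->cur_kaddr);

	dirty->cur_kaddr = kmap_local_page(dirty->pages[idx]);
	dirty->cur_idx = idx;
	return dirty->cur_kaddr;
}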

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  8:07     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  8:07 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> +static int __set_dirty_tracking_range_locked(struct iommu_domain
> *domain,

I suppose anything using iommu_domain as the first argument should
be put in the iommu layer. Here it's more reasonable to use iopt
as the first argument, or simply merge this with the next function.

> +					     struct io_pagetable *iopt,
> +					     bool enable)
> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iopt_area *area;
> +	int ret = -EOPNOTSUPP;
> +	unsigned long iova;
> +	size_t size;
> +
> +	iommu_iotlb_gather_init(&gather);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {

how is this different from leaving the iommu driver to walk the page table
and poke the modifier bit for all present PTEs? As commented on the last
patch, this may allow removing the range op completely.

> +		iova = iopt_area_iova(area);
> +		size = iopt_area_last_iova(area) - iova;
> +
> +		if (ops->set_dirty_tracking_range) {
> +			ret = ops->set_dirty_tracking_range(domain, iova,
> +							    size, &gather,
> +							    enable);
> +			if (ret < 0)
> +				break;
> +		}
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +
> +	return ret;
> +}
> +
> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
> +				    struct io_pagetable *iopt, bool enable)

similarly rename to __iopt_set_dirty_tracking() and use iopt as the
leading argument.

> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	int ret = -EOPNOTSUPP;
> +
> +	if (ops->set_dirty_tracking)
> +		ret = ops->set_dirty_tracking(domain, enable);
> +	else if (ops->set_dirty_tracking_range)
> +		ret = __set_dirty_tracking_range_locked(domain, iopt,
> +							enable);
> +
> +	return ret;
> +}
> +
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	struct iommu_domain *dom;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (!domain) {
> +		down_write(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
> +			if (ret < 0)
> +				break;
> +		}
> +		up_write(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
> +	}
> +
> +	up_write(&iopt->iova_rwsem);
> +	return ret;
> +}
> +
>  struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long
> iova,
>  				  unsigned long *start_byte,
>  				  unsigned long length)
> diff --git a/drivers/iommu/iommufd/iommufd_private.h
> b/drivers/iommu/iommufd/iommufd_private.h
> index f55654278ac4..d00ef3b785c5 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt,
> unsigned long iova,
>  		    unsigned long length);
>  int iopt_unmap_all(struct io_pagetable *iopt);
> 
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable);
> +
>  int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>  		      unsigned long npages, struct page **out_pages, bool
> write);
>  void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  8:12     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  8:12 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
[...]
> +
> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +				      struct iommufd_dirty_data *bitmap)

At a glance this function and all the previous helpers don't rely on any
iommufd objects, except that the new structures are named
iommufd_xxx.

I wonder whether moving all of them to the iommu layer would make
more sense here.

> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iommufd_dirty_iter iter;
> +	int ret = 0;
> +
> +	if (!ops || !ops->read_and_clear_dirty)
> +		return -EOPNOTSUPP;
> +
> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
> +				__ffs(bitmap->page_size), &gather);
> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
> +	if (ret)
> +		return -ENOMEM;
> +
> +	for (; iommufd_dirty_iter_done(&iter);
> +	     iommufd_dirty_iter_advance(&iter)) {
> +		ret = iommufd_dirty_iter_get(&iter);
> +		if (ret)
> +			break;
> +
> +		ret = ops->read_and_clear_dirty(domain,
> +			iommufd_dirty_iova(&iter),
> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
> +
> +		iommufd_dirty_iter_put(&iter);
> +
> +		if (ret)
> +			break;
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +	iommufd_dirty_iter_free(&iter);
> +
> +	return ret;
> +}
> +

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  8:28     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  8:28 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:09 AM
> 
> Similar to .read_and_clear_dirty() use the page table
> walker helper functions and set the DBM|RDONLY bits, thus
> switching the IOPTE to writeable-clean.

this should not be a one-off if the operation needs to be
applied to the IOPTEs. Say a map request comes right after
set_dirty_tracking() is called. If it's agreed to remove
the range op then the smmu driver should record the tracking
status internally and then apply the modifier to all new
mappings automatically until dirty tracking is disabled.
Otherwise the same logic needs to be kept in iommufd to
call set_dirty_tracking_range() explicitly for every new
iopt_area created within the tracking window.
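
A rough sketch of the first option (the 'enable_dirty' field and the
IOMMU_DIRTY_TRACK prot modifier are hypothetical names; the io-pgtable map
callback is the existing one):

/*
 * Sketch: remember the tracking state in the SMMU domain and fold the
 * DBM/writeable-clean modifier into the prot of any mapping created
 * while tracking is enabled.
 */
static int arm_smmu_map_tracked(struct arm_smmu_domain *smmu_domain,
				unsigned long iova, phys_addr_t paddr,
				size_t size, int prot, gfp_t gfp)
{
	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;

	if (READ_ONCE(smmu_domain->enable_dirty))	/* hypothetical field */
		prot |= IOMMU_DIRTY_TRACK;		/* hypothetical flag */

	return ops->map(ops, iova, paddr, size, prot, gfp);
}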

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29  9:03     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-04-29  9:03 UTC (permalink / raw)
  To: Martins, Joao, iommu
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Friday, April 29, 2022 5:10 AM
> 
> IOMMU advertises Access/Dirty bits if the extended capability
> DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
> stage table, though, has no bit for advertising it, unless referenced via

first-stage is compatible with the CPU page table thus a/d bit support is
implied. But for dirty tracking I'm fine with only supporting it
with the second-stage, as the first-stage will be used only for the guest
in the nesting case (though in concept the first-stage could also be used
for IOVA when nesting is disabled, there is no plan to do so on Intel
platforms).

> a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
> table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
> stage table "3.7.2 Accessed and Dirty Flags".
> 
> To enable it, scalable-mode for the second-stage table is required,
> so limit the use of the dirty-bit to scalable-mode and discard the
> first-stage configured DMAR domains. To use SSADS, we set a bit in

above is inaccurate. dirty bit is only supported in scalable mode so
there is no limit per se.

> the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When

"To use SSADS, we set bit 9 (SSADE) in the scalable-mode PASID table
entry"

> doing so, flush all iommu caches. Relevant SDM refs:
> 
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>  Table 23. Guidance to Software for Invalidations"
> 
> Dirty bit on the PTE is located in the same location (bit 9). The IOTLB

I'm not sure what information 'same location' here tries to convey...

> caches some attributes when SSADE is enabled and dirty-ness information,

be direct that the dirty bit is cached in IOTLB thus any change of that
bit requires flushing IOTLB

> so we also need to flush the IOTLB to make sure the IOMMU attempts to set
> the dirty bit again. The relevant manual coverage of the hardware
> translation is chapter 6, with special mention of:
> 
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Probably shouldn't be as aggressive as to flush all; needs
> checking with hardware (and the invalidations guidance) to understand
> what exactly needs flushing.

yes, definitely not required to flush all. You can follow table 23
for software guidance for invalidations.

> ---
>  drivers/iommu/intel/iommu.c | 109
> ++++++++++++++++++++++++++++++++++++
>  drivers/iommu/intel/pasid.c |  76 +++++++++++++++++++++++++
>  drivers/iommu/intel/pasid.h |   7 +++
>  include/linux/intel-iommu.h |  14 +++++
>  4 files changed, 206 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index ce33f85c72ab..92af43f27241 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct
> iommu_domain *domain,
>  	}
>  }
> 
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	int ret = -EINVAL;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	if (list_empty(&dmar_domain->devices)) {
> +		spin_unlock_irqrestore(&device_domain_lock, flags);
> +		return ret;
> +	}

or return success here and just don't set any dirty bitmap in
read_and_clear_dirty()?

btw I think every iommu driver needs to record the tracking status,
so that later, if a device which doesn't claim dirty tracking support is
attached to a domain which already has dirty_tracking enabled,
the attach request can be rejected, once the capability
uAPI is introduced.
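
Something along these lines, perhaps (a sketch; the 'dirty_tracking' flag
in dmar_domain is an assumed addition, ecap_slads() as used in the patch):

/*
 * Sketch of the attach-time check: refuse attaching a device whose IOMMU
 * lacks SSADS to a domain that already has dirty tracking enabled.
 */
static int domain_check_dirty_tracking(struct dmar_domain *dmar_domain,
				       struct intel_iommu *iommu)
{
	if (dmar_domain->dirty_tracking && !ecap_slads(iommu->ecap))
		return -EINVAL;

	return 0;
}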

> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		if (!info->dev || (info->domain != dmar_domain))
> +			continue;

why would there be a device linked under a dmar_domain but its
internal domain pointer doesn't point to that dmar_domain?

> +
> +		/* Dirty tracking is second-stage level SM only */
> +		if ((info->domain && domain_use_first_level(info->domain))
> ||
> +		    !ecap_slads(info->iommu->ecap) ||
> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {

sm_supported() already covers the check on intel_iommu_sm.

> +			ret = -EOPNOTSUPP;
> +			continue;
> +		}
> +
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info-
> >domain,
> +						     info->dev,
> PASID_RID2PASID,
> +						     enable);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	/*
> +	 * We need to flush context TLB and IOTLB with any cached
> translations
> +	 * to force the incoming DMA requests for have its IOTLB entries
> tagged
> +	 * with A/D bits
> +	 */
> +	intel_flush_iotlb_all(domain);
> +	return ret;
> +}
> +
> +static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		if (!info->dev || (info->domain != dmar_domain))
> +			continue;
> +
> +		/* Dirty tracking is second-stage level SM only */
> +		if ((info->domain && domain_use_first_level(info->domain))
> ||
> +		    !ecap_slads(info->iommu->ecap) ||
> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
> +			ret = -EOPNOTSUPP;
> +			continue;
> +		}
> +
> +		if (!intel_pasid_dirty_tracking_enabled(info->iommu, info-
> >domain,
> +						 info->dev, PASID_RID2PASID))
> {
> +			ret = -EINVAL;
> +			break;
> +		}
> +	}
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}

All of the above can be translated to a single status bit in dmar_domain.
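
e.g. something like the below, with 'dirty_tracking' being a hypothetical
new field in dmar_domain that set_dirty_tracking() sets/clears:

/* Sketch: with a domain-wide flag the per-device loop above goes away. */
static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);

	return READ_ONCE(dmar_domain->dirty_tracking) ? 0 : -EINVAL;
}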

> +
> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain
> *domain,
> +					    unsigned long iova, size_t size,
> +					    struct iommu_dirty_bitmap *dirty)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	unsigned long end = iova + size - 1;
> +	unsigned long pgsize;
> +	int ret;
> +
> +	ret = intel_iommu_get_dirty_tracking(domain);
> +	if (ret)
> +		return ret;
> +
> +	do {
> +		struct dma_pte *pte;
> +		int lvl = 0;
> +
> +		pte = pfn_to_dma_pte(dmar_domain, iova >>
> VTD_PAGE_SHIFT, &lvl);

it's probably fine as the starting point but moving forward this could
be further optimized so there is no need to walk from L4->L3->L2->L1
for every pte.

> +		pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
> +		if (!pte || !dma_pte_present(pte)) {
> +			iova += pgsize;
> +			continue;
> +		}
> +
> +		/* It is writable, set the bitmap */
> +		if (dma_sl_pte_test_and_clear_dirty(pte))
> +			iommu_dirty_bitmap_record(dirty, iova, pgsize);
> +		iova += pgsize;
> +	} while (iova < end);
> +
> +	return 0;
> +}
> +
>  const struct iommu_ops intel_iommu_ops = {
>  	.capable		= intel_iommu_capable,
>  	.domain_alloc		= intel_iommu_domain_alloc,
> @@ -5119,6 +5226,8 @@ const struct iommu_ops intel_iommu_ops = {
>  		.iotlb_sync		= intel_iommu_tlb_sync,
>  		.iova_to_phys		= intel_iommu_iova_to_phys,
>  		.free			= intel_iommu_domain_free,
> +		.set_dirty_tracking	= intel_iommu_set_dirty_tracking,
> +		.read_and_clear_dirty   = intel_iommu_read_and_clear_dirty,
>  	}
>  };
> 
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 10fb82ea467d..90c7e018bc5c 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -331,6 +331,11 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask,
> u64 bits)
>  	WRITE_ONCE(*ptr, (old & ~mask) | bits);
>  }
> 
> +static inline u64 pasid_get_bits(u64 *ptr)
> +{
> +	return READ_ONCE(*ptr);
> +}
> +
>  /*
>   * Setup the DID(Domain Identifier) field (Bit 64~79) of scalable mode
>   * PASID entry.
> @@ -389,6 +394,36 @@ static inline void pasid_set_fault_enable(struct
> pasid_entry *pe)
>  	pasid_set_bits(&pe->val[0], 1 << 1, 0);
>  }
> 
> +/*
> + * Enable second level A/D bits by setting the SLADE (Second Level
> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry.
> + */
> +static inline void pasid_set_ssade(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
> +}
> +
> +/*
> + * Disable second level A/D bits by clearing the SLADE (Second Level
> + * Access Dirty Enable) field (Bit 9) of a scalable mode PASID
> + * entry.
> + */
> +static inline void pasid_clear_ssade(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 9, 0);
> +}
> +
> +/*
> + * Checks whether second level A/D bits are enabled, i.e. whether the
> + * SLADE (Second Level Access Dirty Enable) field (Bit 9) of a scalable
> + * mode PASID entry is set.
> + */
> +static inline bool pasid_get_ssade(struct pasid_entry *pe)
> +{
> +	return pasid_get_bits(&pe->val[0]) & (1 << 9);
> +}
> +
>  /*
>   * Setup the SRE(Supervisor Request Enable) field (Bit 128) of a
>   * scalable mode PASID entry.
> @@ -725,6 +760,47 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
>  	return 0;
>  }
> 
> +/*
> + * Set up dirty tracking on a second only translation type.
> + */
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled)
> +{
> +	struct pasid_entry *pte;
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n",
> pasid);
> +		return -ENODEV;
> +	}
> +
> +	if (enabled)
> +		pasid_set_ssade(pte);
> +	else
> +		pasid_clear_ssade(pte);
> +	return 0;
> +}
> +
> +/*
> + * Check whether dirty tracking is enabled on a second-only translation type.
> + */
> +bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
> +					struct dmar_domain *domain,
> +					struct device *dev, u32 pasid)
> +{
> +	struct pasid_entry *pte;
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	if (!pte) {
> +		dev_err(dev, "Failed to get pasid entry of PASID %d\n",
> pasid);
> +		return false;
> +	}
> +
> +	return pasid_get_ssade(pte);
> +}
> +
>  /*
>   * Set up the scalable mode pasid entry for passthrough translation type.
>   */
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index ab4408c824a5..3dab86017228 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -115,6 +115,13 @@ int intel_pasid_setup_first_level(struct intel_iommu
> *iommu,
>  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, u32 pasid);
> +int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain,
> +				     struct device *dev, u32 pasid,
> +				     bool enabled);
> +bool intel_pasid_dirty_tracking_enabled(struct intel_iommu *iommu,
> +					struct dmar_domain *domain,
> +					struct device *dev, u32 pasid);
>  int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
>  				   struct dmar_domain *domain,
>  				   struct device *dev, u32 pasid);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 5cfda90b2cca..1328d1805197 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -47,6 +47,9 @@
>  #define DMA_FL_PTE_DIRTY	BIT_ULL(6)
>  #define DMA_FL_PTE_XD		BIT_ULL(63)
> 
> +#define DMA_SL_PTE_DIRTY_BIT	9
> +#define DMA_SL_PTE_DIRTY	BIT_ULL(DMA_SL_PTE_DIRTY_BIT)
> +
>  #define ADDR_WIDTH_5LEVEL	(57)
>  #define ADDR_WIDTH_4LEVEL	(48)
> 
> @@ -677,6 +680,17 @@ static inline bool dma_pte_present(struct dma_pte
> *pte)
>  	return (pte->val & 3) != 0;
>  }
> 
> +static inline bool dma_sl_pte_dirty(struct dma_pte *pte)
> +{
> +	return (pte->val & DMA_SL_PTE_DIRTY) != 0;
> +}
> +
> +static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte)
> +{
> +	return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,
> +				  (unsigned long *)&pte->val);
> +}
> +
>  static inline bool dma_pte_superpage(struct dma_pte *pte)
>  {
>  	return (pte->val & DMA_PTE_LARGE_PAGE);
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29  5:45   ` Tian, Kevin
@ 2022-04-29 10:27     ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 06:45, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Presented herewith is a series that extends IOMMUFD to have IOMMU
>> hardware support for dirty bit in the IOPTEs.
>>
>> Today, AMD Milan (which been out for a year now) supports it while ARM
>> SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come along.
>> The intended use-case is to support Live Migration with SR-IOV, with
> 
> this should not be restricted to SR-IOV.
> 
True. Should have written PCI Devices as that is orthogonal to SF/S-IOV/SR-IOV.

>> IOMMUs
>> that support it. Yishai Hadas will be soon submiting an RFC that covers the
>> PCI device dirty tracker via vfio.
>>
>> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with a
>> set of a IOVA ranges mapped to some physical memory composing an IO
>> pagetable. This is then attached to a particular device, consequently
>> creating the protection domain to share a common IO page table
>> representing the endporint DMA-addressable guest address space.
>> (Hopefully I am not twisting the terminology here) The resultant object
> 
> Just remove VMM/guest/... since iommufd is not specific to virtualization. 
> 
/me nods

>> is an hw_pagetable object which represents the iommu_domain
>> object that will be directly manipulated. For more background on
>> IOMMUFD have a look at these two series[0][1] on the kernel and qemu
>> consumption respectivally. The IOMMUFD UAPI, kAPI and the iommu core
>> kAPI is then extended to provide:
>>
>>  1) Enabling or disabling dirty tracking on the iommu_domain. Model
>> as the most common case of changing hardware protection domain control
> 
> didn't get what 'most common case' here tries to explain
> 
Most common case because, out of the three IOMMUs analyzed, two of them (Intel, AMD) require
changing per-device context bits rather than page table entries, which is what ARM changes.

>> bits, and ARM specific case of having to enable the per-PTE DBM control
>> bit. The 'real' tracking of whether dirty tracking is enabled or not is
>> stored in the vendor IOMMU, hence no new fields are added to iommufd
>> pagetable structures.
>>
>>  2) Read the I/O PTEs and marshal its dirtyiness into a bitmap. The bitmap
>> thus describe the IOVAs that got written by the device. While performing
>> the marshalling also vendors need to clear the dirty bits from IOPTE and
> 
> s/vendors/iommu drivers/ 
> 
OK, I will avoid the `vendor` term going forward.

>> allow the kAPI caller to batch the much needed IOTLB flush.
>> There's no copy of bitmaps to userspace backed memory, all is zerocopy
>> based. So far this is a test-and-clear kind of interface given that the
>> IOPT walk is going to be expensive. It occured to me to separate
>> the readout of dirty, and the clearing of dirty from IOPTEs.
>> I haven't opted for that one, given that it would mean two lenghty IOPTE
>> walks and felt counter-performant.
> 
> me too. that doesn't feel like a performant way.
> 
>>
>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>> unmap. This case is specific for non-nested vIOMMU case where an
>> erronous guest (or device) DMAing to an address being unmapped at the
>> same time.
> 
> an erroneous attempt like above cannot anticipate which DMAs can
> succeed in that window thus the end behavior is undefined. For an
> undefined behavior nothing will be broken by losing some bits dirtied
> in the window between reading back dirty bits of the range and
> actually calling unmap. From guest p.o.v. all those are black-box
> hardware logic to serve a virtual iotlb invalidation request which just
> cannot be completed in one cycle.
> 
> Hence in reality probably this is not required except to meet vfio
> compat requirement. Just in concept returning dirty bits at unmap
> is more accurate.
> 
> I'm slightly inclined to abandon it in iommufd uAPI.
> 

OK, it seems I am not far off from your thoughts.

I'll see what others think too, and if so I'll remove the unmap_dirty.

Because if vfio-compat doesn't get the iommu hw dirty support, then there would
be no users of unmap_dirty.

>>
>> [See at the end too, on general remarks, specifically the one regarding
>>  probing dirty tracking via a dedicated iommufd cap ioctl]
>>
>> The series is organized as follows:
>>
>> * Patches 1-3: Takes care of the iommu domain operations to be added and
>> extends iommufd io-pagetable to set/clear dirty tracking, as well as
>> reading the dirty bits from the vendor pagetables. The idea is to abstract
>> iommu vendors from any idea of how bitmaps are stored or propagated
>> back to
>> the caller, as well as allowing control/batching over IOTLB flush. So
>> there's a data structure and an helper that only tells the upper layer that
>> an IOVA range got dirty. IOMMUFD carries the logic to pin pages, walking
> 
> why do we need another pinning here? any page mapped in iommu page
> table is supposed to have been pinned already...
> 

The pinning is for the user bitmap data, not the IOVAs. This is mainly to avoid
copying the dirty bitmap data back to userspace. And this happens for every 2M of
bitmap data (i.e. representing 64G of IOVA space, with one page tracking 128M of
IOVA, assuming the worst case of base pages).

I think I can't just use/deref user memory bluntly, and the IOMMU core ought to
work with kernel buffers instead.
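
To put numbers on the coverage above, here is a tiny standalone sketch of the math
(assuming 4K base pages and one bit per base page; the program is only an illustration,
not code from this series):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t page_size = 4096;		/* base page */
	const uint64_t bits_per_page = page_size * 8;	/* 32768 bits */

	/* one pinned bitmap page: each bit covers one base page of IOVA */
	uint64_t iova_per_bitmap_page = bits_per_page * page_size;

	/* a 2M chunk of bitmap is 512 such pages */
	uint64_t iova_per_2m_bitmap =
		((2ULL << 20) / page_size) * iova_per_bitmap_page;

	printf("one bitmap page covers %llu MiB of IOVA\n",
	       (unsigned long long)(iova_per_bitmap_page >> 20));	/* 128 */
	printf("2M of bitmap covers %llu GiB of IOVA\n",
	       (unsigned long long)(iova_per_2m_bitmap >> 30));	/* 64 */
	return 0;
}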

>> the bitmap user memory, and kmap-ing them as needed. IOMMU vendor
>> just has
>> an idea of a 'dirty bitmap state' and recording an IOVA as dirty by the
>> vendor IOMMU implementor.
>>
>> * Patches 4-5: Adds the new unmap domain op that returns whether the
>> IOVA
>> got dirtied. I separated this from the rest of the set, as I am still
>> questioning the need for this API and whether this race needs to be
>> fundamentally be handled. I guess the thinking is that live-migration
>> should be guest foolproof, but how much the race happens in pratice to
>> deem this as a necessary unmap variant. Perhaps maybe it might be enough
>> fetching dirty bits prior to the unmap? Feedback appreciated.
> 
> I think so as aforementioned.
> 
/me nods

>>
>> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
>> We should discuss whether to include the vfio-compat or not. Given how
>> vfio-type1-iommu perpectually dirties any IOVA, and here I am replacing
>> with the IOMMU hw support. I haven't implemented the perpectual dirtying
>> given his lack of usefullness over an IOMMU-backed implementation (or so
>> I think). The selftests, test mainly the principal workflow, still needs
>> to get added more corner cases.
> 
> Or in another way could we keep vfio-compat as type1 does today, i.e.
> restricting iommu dirty tacking only to iommufd native uAPI?
> 
I suppose?

Another option is not exposing the type1 migration capability.
See further below.

>>
>> Note: Given that there's no capability for new APIs, or page sizes or
>> etc, the userspace app using IOMMUFD native API would gather -
>> EOPNOTSUPP
>> when dirty tracking is not supported by the IOMMU hardware.
>>
>> For completeness and most importantly to make sure the new IOMMU core
>> ops
>> capture the hardware blocks, all the IOMMUs that will eventually get IOMMU
>> A/D
>> support were implemented. So the next half of the series presents *proof of
>> concept* implementations for IOMMUs:
>>
>> * Patches 9-11: AMD IOMMU implementation, particularly on those having
>> HDSup support. Tested with a Qemu amd-iommu with HDSUp emulated,
>> and also on a AMD Milan server IOMMU.
>>
>> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but reworked
>> to do the dynamic set/clear dirty tracking, and immplicitly clearing
>> dirty bits on the readout. Given the lack of hardware and difficulty
>> to get this in an emulated SMMUv3 (given the dependency on the PE HTTU
>> and BBML2, IIUC) then this is only compiled tested. Hopefully I am not
>> getting the attribution wrong.
>>
>> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a Qemu
>> based intel-iommu with SSADS/SLADS emulation support.
>>
>> To help testing/prototypization, qemu iommu emulation bits were written
>> to increase coverage of this code and hopefully make this more broadly
>> available for fellow contributors/devs. A separate series is submitted right
>> after this covering the Qemu IOMMUFD extensions for dirty tracking,
>> alongside
>> its x86 iommus emulation A/D bits. Meanwhile it's also on github
>> (https://github.com/jpemartins/qemu/commits/iommufd)
>>
>> Remarks / Observations:
>>
>> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks
> 
> there was discussion adding device capability uAPI somewhere.
> 
Ack. Let me know if there are pointers to that conversation, as I seem to have missed it.

>> what has access in each of the newly added ops. Initially I was thinking to
>> have a HWPT_GET_DIRTY to probe how dirty tracking is supported (rather
>> than
>> bailing out with EOPNOTSUP) as well as an get_dirty_tracking
>> iommu-core API. On the UAPI, perhaps it might be better to have a single API
>> for capabilities in general (similar to KVM)  and at the simplest is a subop
>> where the necessary info is conveyed on a per-subop basis?
> 
> probably this can be reported as a device cap as supporting of dirty bit is
> an immutable property of the iommu serving that device. 

I wasn't quite sure how this mapped into the rest of the potential features to probe
in the iommufd grand scheme of things. I'll get it properly done for the next iteration. In
the kernel, I was wondering if this could be tracked in the iommu_domain, given that virtually
all supporting iommu drivers will need to track dirty-tracking status on a per-domain
basis. But that structure is devoid of any state :/ so I suppose each iommu driver tracks it
in its private structures (which part of me was trying to avoid).
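
To illustrate (purely a hypothetical sketch, not code from this series), the per-driver
tracking could be as small as one bit in the driver-private domain structure, e.g. on
the Intel side:

/*
 * Hypothetical: record dirty-tracking state in the driver-private domain,
 * since struct iommu_domain itself carries no such state.
 */
struct dmar_domain {
	/* ... existing fields ... */
	u8 dirty_tracking:1;	/* toggled by the set_dirty_tracking() op */
};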

> Userspace can
> enable dirty tracking on a hwpt if all attached devices claim the support
> and kernel will does the same verification.
> 
Sorry to be dense, but this is not up to 'devices', given that they take no part in the tracking?
I guess by 'devices' you mean the software idea of it, i.e. the iommu context created for
attaching a given physical device, not the physical device itself.

> btw do we still want to keep vfio type1 behavior as the fallback i.e. mark
> all pinned pages as dirty when iommu dirty support is missing? From uAPI
> naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
> implementation like vfio type1.
> 
Maybe let's not give userspace the illusion that dirty tracking is supported?
I wonder how much of this can be done in userspace
without the iommu pretending to be doing said tracking, if all we are doing is setting
all IOVAs as dirty.

The issue /I think/ with the perpetual dirtiness is that it's not that useful
in practice, and gives a false impression of any tracking happening. It really only looks
useful for testing a vfio-pci vendor driver, and one has to set a @downtime-limit
so gigantic that the VMM doesn't conclude the migration can't
converge given the very high rate of dirty pages.

For the testing in general, my idea was to have iommu emulation to fill that gap.

>> * The UAPI/kAPI could be generalized over the next iteration to also cover
>> Access bit (or Intel's Extended Access bit that tracks non-CPU usage).
>> It wasn't done, as I was not aware of a use-case. I am wondering
>> if the access-bits could be used to do some form of zero page detection
>> (to just send the pages that got touched), although dirty-bits could be
>> used just the same way. Happy to adjust for RFCv2. The algorithms, IOPTE
> 
> I'm not fan of adding support for uncertain usages. 

The suggestion above was really because the logic doesn't change much.

But I guess there's no point in fattening the UAPI if there's no use-case.

> Comparing to this
> I'd give higher priority to large page break-down as w/o it it's hard to
> find real-world deployment on this work. 😊
> 
Yeap. Once I hash out the comments I get here in terms of
direction, that's what I will be focusing on next (unless someone else wants
to take on that adventure).

>> walk and marshalling into bitmaps as well as the necessary IOTLB flush
>> batching are all the same. The focus is on dirty bit given that the
>> dirtyness IOVA feedback is used to select the pages that need to be
>> transfered
>> to the destination while migration is happening.
>> Sidebar: Sadly, there's a lot less clever possible tricks that can be
>> done (compared to the CPU/KVM) without having the PCI device cooperate
>> (like
>> userfaultfd, wrprotect, etc as those would turn into nepharious IOMMU
>> perm faults and devices with DMA target aborts).
>> If folks thing the UAPI/iommu-kAPI should be agnostic to any PTE A/D
>> bits, we can instead have the ioctls be named after
>> HWPT_SET_TRACKING() and add another argument which asks which bits to
>> enabling tracking
>> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
>> Likewise for the read_and_clear() as all PTE bits follow the same logic
>> as dirty. Happy to readjust if folks think it is worthwhile.
>>
>> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that we
>> only care about the first stage of IOMMU pagetables for hypervisors i.e.
>> tracking dirty GPAs (and not caring about dirty GIOVAs).
> 
> Hypervisor uses second-stage while guest manages first-stage in nesting.
> 
/me nods

>>
>> * Dirty bit tracking only, is not enough. Large IO pages tend to be the norm
>> when DMA mapping large ranges of IOVA space, when really the VMM wants
>> the
>> smallest granularity possible to track(i.e. host base pages). A separate bit
>> of work will need to take care demoting IOPTE page sizes at guest-runtime to
>> increase/decrease the dirty tracking granularity, likely under the form of a
>> IOAS demote/promote page-size within a previously mapped IOVA range.
>>
>> Feedback is very much appreciated!
> 
> Thanks for the work!
> 
Thanks for the feedback thus far and in the rest of the patches too!

>>
>> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com/
>> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@intel.com/
>> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-
>> zhukeqian1@huawei.com/
>>
>> 	Joao
>>
>> TODOs:
>> * More selftests for large/small iopte sizes;
>> * Better vIOMMU+VFIO testing (AMD doesn't support it);
>> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
>> * Testing with a live migrateable VF;
>>
>> Jean-Philippe Brucker (1):
>>   iommu/arm-smmu-v3: Add feature detection for HTTU
>>
>> Joao Martins (16):
>>   iommu: Add iommu_domain ops for dirty tracking
>>   iommufd: Dirty tracking for io_pagetable
>>   iommufd: Dirty tracking data support
>>   iommu: Add an unmap API that returns dirtied IOPTEs
>>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>>   iommufd: Add a test for dirty tracking ioctls
>>   iommu/amd: Access/Dirty bit support in IOPTEs
>>   iommu/amd: Add unmap_read_dirty() support
>>   iommu/amd: Print access/dirty bits if supported
>>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>>   iommu/intel: Access/Dirty bit support for SL domains
>>   iommu/intel: Add unmap_read_dirty() support
>>
>> Kunkun Jiang (2):
>>   iommu/arm-smmu-v3: Add feature detection for BBML
>>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
>>
>>  drivers/iommu/amd/amd_iommu.h               |   1 +
>>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>>  drivers/iommu/amd/init.c                    |  12 +-
>>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>>  drivers/iommu/intel/pasid.h                 |   7 +
>>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>>  drivers/iommu/iommu.c                       |  71 +++++-
>>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>>  drivers/iommu/iommufd/main.c                |   9 +
>>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>>  include/linux/intel-iommu.h                 |  30 +++
>>  include/linux/io-pgtable.h                  |  20 ++
>>  include/linux/iommu.h                       |  64 +++++
>>  include/uapi/linux/iommufd.h                |  78 ++++++
>>  tools/testing/selftests/iommu/Makefile      |   1 +
>>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>>  28 files changed, 2047 insertions(+), 75 deletions(-)
>>
>> --
>> 2.17.2
> 

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29  7:54     ` Tian, Kevin
@ 2022-04-29 10:44       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On 4/29/22 08:54, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particulary to start and stop
>> tracking and finally to test and clear the dirty data.
> 
> to be consistent with other context, s/test/read/
> 
/me nods

>>
>> Drivers are expected to dynamically change its hw protection
>> domain bits to toggle the tracking and flush some form of
> 
> 'hw protection domain bits' sounds a bit weird. what about
> just using 'translation structures'?
> 
I'll replace it with that instead.

>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
> 
> s/vendor IOMMUs/iommu drivers/
> 
> btw according to past history in iommu mailing list sounds like
> 'vendor' is not a term welcomed in the kernel, while there are
> many occurrences in this series.
> 
Hmm, I wasn't aware actually.

Will move away from using 'vendor'.

> [...]
>> Although, The ARM SMMUv3 case is a tad different that its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry
>> to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking access bit read-only but with dirty-bit-modifier enabled
>> doesn't trigger an perm io page fault.
>>
>> In pratice this means that changing iommu context isn't enough
>> and in fact mostly useless IIUC (and can be always enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Above doesn't convince me on the necessity of introducing two ops
> here. Even for ARM it can accept a per-domain op and then walk the
> page table to manipulate any modifier for existing mappings. It
> doesn't matter whether it sets one bit in the context entry or multiple
> bits in the page table.
> 
OK

> [...]
>> +
> 
> Miss comment for this function.
> 
ack

>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap
>> *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
> 
> could you elaborate the purpose of dirty->start_offset? Why dirty->iova
> doesn't start at offset 0 of the bitmap?
> 

It is to deal with page-unaligned addresses.

Like if the start of the bitmap -- and hence the bitmap base IOVA for the first bit of the
bitmap -- isn't page-aligned and instead starts at some offset within a given page. Thus start_offset
is there to know which bit in the pinned page dirty::iova corresponds to.
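
A small standalone illustration of that offset math (made-up address and 4K pages,
just to show the idea, not code from the series):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t page_size = 4096;

	/* user-supplied bitmap address, not page aligned (made-up value) */
	uint64_t bitmap_uaddr = 0x7f1234567a10ULL;

	/* pinning works on whole pages, so the pinned page starts here ... */
	uint64_t pinned_page = bitmap_uaddr & ~(page_size - 1);

	/* ... and the bitmap for the base IOVA starts this many bytes in */
	uint64_t start_offset = bitmap_uaddr & (page_size - 1);

	printf("pinned page at 0x%llx, bitmap starts at byte offset %llu (bit %llu)\n",
	       (unsigned long long)pinned_page, (unsigned long long)start_offset,
	       (unsigned long long)start_offset * 8);
	return 0;
}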

>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap(dirty->pages[idx]);
> 
> what about the overhead of kmap/kunmap when it's done for every
> dirtied page (as done in patch 18)?

Isn't it an overhead mainly with highmem? Otherwise it ends up being page_to_virt(...)

But anyway the kmaps should be cached, and torn down when pinning the next user data.

Performance analysis is also something I want to fully hash out (as mentioned in the cover
letter).

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29  8:07     ` Tian, Kevin
@ 2022-04-29 10:48       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:48 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On 4/29/22 09:07, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> +static int __set_dirty_tracking_range_locked(struct iommu_domain
>> *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
OK

>> +					     struct io_pagetable *iopt,
>> +					     bool enable)
>> +{
>> +	const struct iommu_domain_ops *ops = domain->ops;
>> +	struct iommu_iotlb_gather gather;
>> +	struct iopt_area *area;
>> +	int ret = -EOPNOTSUPP;
>> +	unsigned long iova;
>> +	size_t size;
>> +
>> +	iommu_iotlb_gather_init(&gather);
>> +
>> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
>> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving iommu driver to walk the page table
> and the poke the modifier bit for all present PTEs? 

It isn't. Moving towards a single op makes this simpler for the iommu core API.

> As commented in last
> patch this may allow removing the range op completely.
> 
Yes.

>> +		iova = iopt_area_iova(area);
>> +		size = iopt_area_last_iova(area) - iova;
>> +
>> +		if (ops->set_dirty_tracking_range) {
>> +			ret = ops->set_dirty_tracking_range(domain, iova,
>> +							    size, &gather,
>> +							    enable);
>> +			if (ret < 0)
>> +				break;
>> +		}
>> +	}
>> +
>> +	iommu_iotlb_sync(domain, &gather);
>> +
>> +	return ret;
>> +}
>> +
>> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +				    struct io_pagetable *iopt, bool enable)
> 
> similarly rename to __iopt_set_dirty_tracking() and use iopt as the
> leading argument.
> 
/me nods
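
Roughly, once the range op goes away, the iopt-level helper could collapse into something
like the sketch below (assuming a per-domain ops->set_dirty_tracking(domain, enable)
callback; this is only an outline of the direction discussed, not the actual patch):

static int __iopt_set_dirty_tracking(struct io_pagetable *iopt,
				     struct iommu_domain *domain, bool enable)
{
	const struct iommu_domain_ops *ops = domain->ops;

	if (!ops->set_dirty_tracking)
		return -EOPNOTSUPP;

	/*
	 * The driver toggles its own translation structures (context/CD
	 * bits, or DBM on existing PTEs for ARM) and flushes the IOTLB
	 * itself; iommufd no longer walks the areas here.
	 */
	return ops->set_dirty_tracking(domain, enable);
}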

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-29  8:12     ` Tian, Kevin
@ 2022-04-29 10:54       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 10:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 09:12, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
> [...]
>> +
>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +				      struct iommufd_dirty_data *bitmap)
> 
> In a glance this function and all previous helpers doesn't rely on any
> iommufd objects except that the new structures are named as
> iommufd_xxx. 
> 
> I wonder whether moving all of them to the iommu layer would make
> more sense here.
> 
I suppose, instinctively, I was trying to make this tied to iommufd only,
to avoid getting it called in cases we don't expect once it's made a generic
exported kernel facility.

(note: iommufd can be built as a module).

>> +{
>> +	const struct iommu_domain_ops *ops = domain->ops;
>> +	struct iommu_iotlb_gather gather;
>> +	struct iommufd_dirty_iter iter;
>> +	int ret = 0;
>> +
>> +	if (!ops || !ops->read_and_clear_dirty)
>> +		return -EOPNOTSUPP;
>> +
>> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
>> +				__ffs(bitmap->page_size), &gather);
>> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
>> +	if (ret)
>> +		return -ENOMEM;
>> +
>> +	for (; iommufd_dirty_iter_done(&iter);
>> +	     iommufd_dirty_iter_advance(&iter)) {
>> +		ret = iommufd_dirty_iter_get(&iter);
>> +		if (ret)
>> +			break;
>> +
>> +		ret = ops->read_and_clear_dirty(domain,
>> +			iommufd_dirty_iova(&iter),
>> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
>> +
>> +		iommufd_dirty_iter_put(&iter);
>> +
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	iommu_iotlb_sync(domain, &gather);
>> +	iommufd_dirty_iter_free(&iter);
>> +
>> +	return ret;
>> +}
>> +

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29  8:28     ` Tian, Kevin
@ 2022-04-29 11:05       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 09:28, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Similar to .read_and_clear_dirty() use the page table
>> walker helper functions and set DBM|RDONLY bit, thus
>> switching the IOPTE to writeable-clean.
> 
> this should not be one-off if the operation needs to be
> applied to IOPTE. Say a map request comes right after
> set_dirty_tracking() is called. If it's agreed to remove
> the range op then smmu driver should record the tracking
> status internally and then apply the modifier to all the new
> mappings automatically before dirty tracking is disabled.
> Otherwise the same logic needs to be kept in iommufd to
> call set_dirty_tracking_range() explicitly for every new
> iopt_area created within the tracking window.

Gah, I totally missed that. New mappings aren't
carrying over the "DBM is set" state. This needs a new io-pgtable
quirk, applied after dirty tracking is toggled on.

I can adjust, but I am torn about including this in a future
iteration given that I can't really test any of this stuff.
I might drop the driver until I have hardware/emulation I can
use (or maybe others can take over this). It was included
for revising the iommu core ops and checking whether iommufd was
affected by it.

I'll delete the range op, and let the SMMUv3 driver walk its
own IO pgtables.
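
For illustration, the quirk could end up looking roughly like the sketch below in the
io-pgtable-arm prot-to-pte path (the quirk name is hypothetical, and ARM_LPAE_PTE_DBM
stands for the DBM bit handling added in patches 12-17; an outline, not the actual code):

/*
 * Sketch: with a (hypothetical) IO_PGTABLE_QUIRK_ARM_HD quirk set once dirty
 * tracking is enabled, new writeable mappings start out as writeable-clean:
 * DBM set plus AP[2] read-only, so the first DMA write marks the IOPTE dirty
 * instead of faulting.
 */
static arm_lpae_iopte arm_lpae_pte_mkdbm(struct arm_lpae_io_pgtable *data,
					 arm_lpae_iopte pte)
{
	if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_HD)
		pte |= ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY;
	return pte;
}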

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 11:11     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:11 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang

On 2022-04-28 22:09, Joao Martins wrote:
> From: Kunkun Jiang <jiangkunkun@huawei.com>
> 
> This detects BBML feature and if SMMU supports it, transfer BBMLx
> quirk to io-pgtable.
> 
> BBML1 requires still marking PTE nT prior to performing a
> translation table update, while BBML2 requires neither break-before-make
> nor PTE nT bit being set. For dirty tracking it needs to clear
> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"

You can drop this, and the dependencies on BBML elsewhere, until you get 
round to the future large-page-splitting work, since that's the only 
thing this represents. Not much point having the feature flags without 
an actual implementation, or any users.

Robin.

> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> [joaomart: massage commit message with the need to have BBML quirk
>   and add the Quirk io-pgtable flags]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++++++++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  6 ++++++
>   include/linux/io-pgtable.h                  |  3 +++
>   3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 14609ece4e33..4dba53bde2e3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2203,6 +2203,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   		.iommu_dev	= smmu->dev,
>   	};
>   
> +	if (smmu->features & ARM_SMMU_FEAT_BBML1)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
> +	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML2;
> +
>   	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
>   	if (!pgtbl_ops)
>   		return -ENOMEM;
> @@ -3591,6 +3596,20 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>   
>   	/* IDR3 */
>   	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> +	switch (FIELD_GET(IDR3_BBML, reg)) {
> +	case IDR3_BBML0:
> +		break;
> +	case IDR3_BBML1:
> +		smmu->features |= ARM_SMMU_FEAT_BBML1;
> +		break;
> +	case IDR3_BBML2:
> +		smmu->features |= ARM_SMMU_FEAT_BBML2;
> +		break;
> +	default:
> +		dev_err(smmu->dev, "unknown/unsupported BBM behavior level\n");
> +		return -ENXIO;
> +	}
> +
>   	if (FIELD_GET(IDR3_RIL, reg))
>   		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
>   
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 1487a80fdf1b..e15750be1d95 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -54,6 +54,10 @@
>   #define IDR1_SIDSIZE			GENMASK(5, 0)
>   
>   #define ARM_SMMU_IDR3			0xc
> +#define IDR3_BBML			GENMASK(12, 11)
> +#define IDR3_BBML0			0
> +#define IDR3_BBML1			1
> +#define IDR3_BBML2			2
>   #define IDR3_RIL			(1 << 10)
>   
>   #define ARM_SMMU_IDR5			0x14
> @@ -644,6 +648,8 @@ struct arm_smmu_device {
>   #define ARM_SMMU_FEAT_E2H		(1 << 18)
>   #define ARM_SMMU_FEAT_HA		(1 << 19)
>   #define ARM_SMMU_FEAT_HD		(1 << 20)
> +#define ARM_SMMU_FEAT_BBML1		(1 << 21)
> +#define ARM_SMMU_FEAT_BBML2		(1 << 22)
>   	u32				features;
>   
>   #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index c2ebfe037f5d..d7626ca67dbf 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -85,6 +85,9 @@ struct io_pgtable_cfg {
>   	#define IO_PGTABLE_QUIRK_ARM_MTK_EXT	BIT(3)
>   	#define IO_PGTABLE_QUIRK_ARM_TTBR1	BIT(5)
>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
> +	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
> +	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
> +
>   	unsigned long			quirks;
>   	unsigned long			pgsize_bitmap;
>   	unsigned int			ias;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 11:05       ` Joao Martins
@ 2022-04-29 11:19         ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:19 UTC (permalink / raw)
  To: Joao Martins, Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 2022-04-29 12:05, Joao Martins wrote:
> On 4/29/22 09:28, Tian, Kevin wrote:
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Sent: Friday, April 29, 2022 5:09 AM
>>>
>>> Similar to .read_and_clear_dirty() use the page table
>>> walker helper functions and set DBM|RDONLY bit, thus
>>> switching the IOPTE to writeable-clean.
>>
>> this should not be one-off if the operation needs to be
>> applied to IOPTE. Say a map request comes right after
>> set_dirty_tracking() is called. If it's agreed to remove
>> the range op then smmu driver should record the tracking
>> status internally and then apply the modifier to all the new
>> mappings automatically before dirty tracking is disabled.
>> Otherwise the same logic needs to be kept in iommufd to
>> call set_dirty_tracking_range() explicitly for every new
>> iopt_area created within the tracking window.
> 
> Gah, I totally missed that by mistake. New mappings aren't
> carrying over the "DBM is set". This needs a new io-pgtable
> quirk added post dirty-tracking toggling.
> 
> I can adjust, but I am at odds on including this in a future
> iteration given that I can't really test any of this stuff.
> Might drop the driver until I have hardware/emulation I can
> use (or maybe others can take over this). It was included
> for revising the iommu core ops and whether iommufd was
> affected by it.
> 
> I'll delete the range op, and let smmu v3 driver walk its
> own IO pgtables.

TBH I'd be inclined to just enable DBM unconditionally in 
arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
dynamically (especially on a live domain) seems more trouble than it's 
worth.
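
Concretely, that would reduce the driver side to something like this
(assuming the feature flag and quirk names from this series; untested):

	/* arm_smmu_domain_finalise(): always hand DBM to io-pgtable */
	if (smmu->features & ARM_SMMU_FEAT_HD)
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

	/*
	 * No set_dirty_tracking()/set_dirty_tracking_range() left to call;
	 * read_and_clear_dirty() alone collects and re-arms the bits.
	 */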

Robin.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-29  9:03     ` Tian, Kevin
@ 2022-04-29 11:20       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 10:03, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Friday, April 29, 2022 5:10 AM
>>
>> IOMMU advertises Access/Dirty bits if the extended capability
>> DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first
>> stage table, though, has not bit for advertising, unless referenced via
> 
> first-stage is compatible to CPU page table thus a/d bit support is
> implied. 

Ah! That clarifies something the manual wasn't quite so clear about :)
I mean, I understood what you just said from reading the manual, but I
wasn't /really 100% sure/.

> But for dirty tracking I'm I'm fine with only supporting it
> with second-stage as first-stage will be used only for guest in the
> nesting case (though in concept first-stage could also be used for
> IOVA when nesting is disabled there is no plan to do so on Intel
> platforms).
> 
Cool.

>> a scalable-mode PASID Entry. Relevant Intel IOMMU SDM ref for first stage
>> table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second
>> stage table "3.7.2 Accessed and Dirty Flags".
>>
>> To enable it scalable-mode for the second-stage table is required,
>> solimit the use of dirty-bit to scalable-mode and discarding the
>> first stage configured DMAR domains. To use SSADS, we set a bit in
> 
> above is inaccurate. dirty bit is only supported in scalable mode so
> there is no limit per se.
> 
OK.

>> the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
> 
> "To use SSADS, we set bit 9 (SSADE) in the scalable-mode PASID table
> entry"
> 
/me nods

>> doing so, flush all iommu caches. Relevant SDM refs:
>>
>> "3.7.2 Accessed and Dirty Flags"
>> "6.5.3.3 Guidance to Software for Invalidations,
>>  Table 23. Guidance to Software for Invalidations"
>>
>> Dirty bit on the PTE is located in the same location (bit 9). The IOTLB
> 
> I'm not sure what information 'same location' here tries to convey...
> 

The SSADE bit in the PASID table entry *and* the dirty bit in the PTE are
both bit 9. (On AMD, for example, they sit on different bits.)

That's what 'location' meant, not the actual storage of those bits, of course :)
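
To spell it out (macro names invented here, not taken from the patch):

	/* scalable-mode PASID table entry: SSADE is bit 9 */
	#define VTD_PASID_PTE_SSADE	BIT_ULL(9)
	/* second-stage PTE: the dirty flag is also bit 9 once SSADE is set */
	#define VTD_SL_PTE_DIRTY	BIT_ULL(9)
	/* AMD keeps its equivalent flags at different bit positions */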

>> caches some attributes when SSADE is enabled and dirty-ness information,
> 
> be direct that the dirty bit is cached in IOTLB thus any change of that
> bit requires flushing IOTLB
> 
OK, will make it clearer.

>> so we also need to flush IOTLB to make sure IOMMU attempts to set the
>> dirty bit again. Relevant manuals over the hardware translation is
>> chapter 6 with some special mention to:
>>
>> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
>> "6.2.4 IOTLB"
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Shouldn't probably be as aggresive as to flush all; needs
>> checking with hardware (and invalidations guidance) as to understand
>> what exactly needs flush.
> 
> yes, definitely not required to flush all. You can follow table 23
> for software guidance for invalidations.
> 
/me nods

>> ---
>>  drivers/iommu/intel/iommu.c | 109
>> ++++++++++++++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.c |  76 +++++++++++++++++++++++++
>>  drivers/iommu/intel/pasid.h |   7 +++
>>  include/linux/intel-iommu.h |  14 +++++
>>  4 files changed, 206 insertions(+)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index ce33f85c72ab..92af43f27241 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct
>> iommu_domain *domain,
>>  	}
>>  }
>>
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					  bool enable)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	unsigned long flags;
>> +	int ret = -EINVAL;
>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	if (list_empty(&dmar_domain->devices)) {
>> +		spin_unlock_irqrestore(&device_domain_lock, flags);
>> +		return ret;
>> +	}
> 
> or return success here and just don't set any dirty bitmap in
> read_and_clear_dirty()?
> 
Yeap.

> btw I think every iommu driver needs to record the tracking status
> so later if a device which doesn't claim dirty tracking support is
> attached to a domain which already has dirty_tracking enabled
> then the attach request should be rejected. once the capability
> uAPI is introduced.
> 
Good point.
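
Something along these lines in the attach path, once the capability uAPI
exists (dirty_tracking below is a hypothetical per-domain status bit):

	/* refuse a non-capable device on a domain already being tracked */
	if (dmar_domain->dirty_tracking &&
	    (!sm_supported(iommu) || !ecap_slads(iommu->ecap)))
		return -EINVAL;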

>> +
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		if (!info->dev || (info->domain != dmar_domain))
>> +			continue;
> 
> why would there be a device linked under a dmar_domain but its
> internal domain pointer doesn't point to that dmar_domain?
> 
I think I got a little confused when using this list with something else.
Let me fix that.

>> +
>> +		/* Dirty tracking is second-stage level SM only */
>> +		if ((info->domain && domain_use_first_level(info->domain))
>> ||
>> +		    !ecap_slads(info->iommu->ecap) ||
>> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
> 
> sm_supported() already covers the check on intel_iommu_sm.
> 
/me nods, removed it.

>> +			ret = -EOPNOTSUPP;
>> +			continue;
>> +		}
>> +
>> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info-
>>> domain,
>> +						     info->dev,
>> PASID_RID2PASID,
>> +						     enable);
>> +		if (ret)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	/*
>> +	 * We need to flush context TLB and IOTLB with any cached
>> translations
>> +	 * to force the incoming DMA requests for have its IOTLB entries
>> tagged
>> +	 * with A/D bits
>> +	 */
>> +	intel_flush_iotlb_all(domain);
>> +	return ret;
>> +}
>> +
>> +static int intel_iommu_get_dirty_tracking(struct iommu_domain *domain)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		if (!info->dev || (info->domain != dmar_domain))
>> +			continue;
>> +
>> +		/* Dirty tracking is second-stage level SM only */
>> +		if ((info->domain && domain_use_first_level(info->domain))
>> ||
>> +		    !ecap_slads(info->iommu->ecap) ||
>> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
>> +			ret = -EOPNOTSUPP;
>> +			continue;
>> +		}
>> +
>> +		if (!intel_pasid_dirty_tracking_enabled(info->iommu, info-
>>> domain,
>> +						 info->dev, PASID_RID2PASID))
>> {
>> +			ret = -EINVAL;
>> +			break;
>> +		}
>> +	}
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	return ret;
>> +}
> 
> All above can be translated to a single status bit in dmar_domain.
> 
Yes.

I wrestled a bit over making this a domain op, which would then tie into
a tracking bit in the iommu domain (or the driver's representation of it).
That's why you see a get_dirty_tracking() helper here and in the AMD IOMMU
counterpart. But I figured I would tie it into the capability part instead.
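
A rough sketch of the status-bit approach (the dirty_tracking field is a
hypothetical addition to struct dmar_domain):

	static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
						    unsigned long iova, size_t size,
						    struct iommu_dirty_bitmap *dirty)
	{
		struct dmar_domain *dmar_domain = to_dmar_domain(domain);

		/* set/cleared under lock by intel_iommu_set_dirty_tracking() */
		if (!dmar_domain->dirty_tracking)
			return -EINVAL;

		/* ... walk the second-stage tables and fill @dirty ... */
		return 0;
	}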

>> +
>> +static int intel_iommu_read_and_clear_dirty(struct iommu_domain
>> *domain,
>> +					    unsigned long iova, size_t size,
>> +					    struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	unsigned long end = iova + size - 1;
>> +	unsigned long pgsize;
>> +	int ret;
>> +
>> +	ret = intel_iommu_get_dirty_tracking(domain);
>> +	if (ret)
>> +		return ret;
>> +
>> +	do {
>> +		struct dma_pte *pte;
>> +		int lvl = 0;
>> +
>> +		pte = pfn_to_dma_pte(dmar_domain, iova >>
>> VTD_PAGE_SHIFT, &lvl);
> 
> it's probably fine as the starting point but moving forward this could
> be further optimized so there is no need to walk from L4->L3->L2->L1
> for every pte.
> 

Yes. This is actually part of my TODO on Performance (in the cover letter).

Both AMD and Intel could use their own dedicated lookups.
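
For example (helper name hypothetical), the walk could descend once per
leaf table and then scan its PTEs linearly instead of calling
pfn_to_dma_pte() for every IOVA:

	static void slpte_scan_dirty(struct dma_pte *pte, unsigned long iova,
				     unsigned long nr_pages,
				     struct iommu_dirty_bitmap *dirty)
	{
		for (; nr_pages; nr_pages--, pte++, iova += VTD_PAGE_SIZE) {
			/* test-and-clear the SL dirty bit (bit 9); IOTLB flush
			 * is batched by the caller */
			if (test_and_clear_bit(9, (unsigned long *)&pte->val))
				iommu_dirty_bitmap_record(dirty, iova,
							  VTD_PAGE_SIZE);
		}
	}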

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 11:35     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:35 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, Alex Williamson, Will Deacon,
	David Woodhouse

On 2022-04-28 22:09, Joao Martins wrote:
> From: Kunkun Jiang <jiangkunkun@huawei.com>
> 
> As nested mode is not upstreamed now, we just aim to support dirty
> log tracking for stage1 with io-pgtable mapping (means not support
> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
> CD and transfer ARM_HD quirk to io-pgtable.
> 
> We additionally filter out HD|HA if not supportted. The CD.HD bit
> is not particularly useful unless we toggle the DBM bit in the PTE
> entries.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> [joaomart:Convey HD|HA bits over to the context descriptor
>   and update commit message]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>   include/linux/io-pgtable.h                  |  1 +
>   3 files changed, 15 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 1ca72fcca930..5f728f8f20a2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>   		 * this substream's traffic
>   		 */
>   	} else { /* (1) and (2) */
> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
> +		u64 tcr = cd->tcr;
> +
>   		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>   		cdptr[2] = 0;
>   		cdptr[3] = cpu_to_le64(cd->mair);
>   
> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
> +			tcr &= ~CTXDESC_CD_0_TCR_HA;

This is very backwards...

> +
>   		/*
>   		 * STE is live, and the SMMU might read dwords of this CD in any
>   		 * order. Ensure that it observes valid values before reading
> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>   			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>   			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>   			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |

...these should be set in io-pgtable's TCR value *if* io-pgtable is 
using DBM, then propagated through from there like everything else.
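
i.e. roughly (with hypothetical ha/hd fields added to the io-pgtable
arm_lpae_s1_cfg.tcr struct when the HD quirk is accepted):

	/* io-pgtable-arm.c */
	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_HD) {
		cfg->arm_lpae_s1_cfg.tcr.ha = 1;
		cfg->arm_lpae_s1_cfg.tcr.hd = 1;
	}

	/* arm-smmu-v3.c, propagated like the other TCR fields */
	FIELD_PREP(CTXDESC_CD_0_TCR_HA, tcr->ha) |
	FIELD_PREP(CTXDESC_CD_0_TCR_HD, tcr->hd) |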

>   			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>   	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>   
> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   		.iommu_dev	= smmu->dev,
>   	};
>   
> +	if (smmu->features & ARM_SMMU_FEAT_HD)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
least because you don't have any of the relevant business for 
synchronising non-coherent PTEs in your walk functions, but it's also 
implementation-defined whether HTTU even operates on non-cacheable 
pagetables, and frankly you just don't want to go there ;)

Robin.

>   	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>   	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index e15750be1d95..ff32242f2fdb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -292,6 +292,9 @@
>   #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>   
> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
> +
>   #define CTXDESC_CD_0_AA64		(1UL << 41)
>   #define CTXDESC_CD_0_S			(1UL << 44)
>   #define CTXDESC_CD_0_R			(1UL << 45)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index d7626ca67dbf..a11902ae9cf1 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>   	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>   	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>   
>   	unsigned long			quirks;
>   	unsigned long			pgsize_bitmap;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 11:53     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 11:53 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm

On 2022-04-28 22:09, Joao Martins wrote:
> Mostly reuses unmap existing code with the extra addition of
> marshalling into a bitmap of a page size. To tackle the race,
> switch away from a plain store to a cmpxchg() and check whether
> IOVA was dirtied or not once it succeeds.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++
>   drivers/iommu/io-pgtable-arm.c              | 78 +++++++++++++++++----
>   2 files changed, 82 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 5f728f8f20a2..d1fb757056cc 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2499,6 +2499,22 @@ static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long io
>   	return ops->unmap_pages(ops, iova, pgsize, pgcount, gather);
>   }
>   
> +static size_t arm_smmu_unmap_pages_read_dirty(struct iommu_domain *domain,
> +					      unsigned long iova, size_t pgsize,
> +					      size_t pgcount,
> +					      struct iommu_iotlb_gather *gather,
> +					      struct iommu_dirty_bitmap *dirty)
> +{
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
> +
> +	if (!ops)
> +		return 0;
> +
> +	return ops->unmap_pages_read_dirty(ops, iova, pgsize, pgcount,
> +					   gather, dirty);
> +}
> +
>   static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
>   {
>   	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -2938,6 +2954,7 @@ static struct iommu_ops arm_smmu_ops = {
>   		.free			= arm_smmu_domain_free,
>   		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
>   		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
> +		.unmap_pages_read_dirty	= arm_smmu_unmap_pages_read_dirty,
>   	}
>   };
>   
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 361410aa836c..143ee7d73f88 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -259,10 +259,30 @@ static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cf
>   		__arm_lpae_sync_pte(ptep, 1, cfg);
>   }
>   
> +static bool __arm_lpae_clear_dirty_pte(arm_lpae_iopte *ptep,
> +				       struct io_pgtable_cfg *cfg)
> +{
> +	arm_lpae_iopte tmp;
> +	bool dirty = false;
> +
> +	do {
> +		tmp = cmpxchg64(ptep, *ptep, 0);
> +		if ((tmp & ARM_LPAE_PTE_DBM) &&
> +		    !(tmp & ARM_LPAE_PTE_AP_RDONLY))
> +			dirty = true;
> +	} while (tmp);
> +
> +	if (!cfg->coherent_walk)
> +		__arm_lpae_sync_pte(ptep, 1, cfg);

Note that this doesn't do enough, since it's only making the CPU's 
clearing of the PTE visible to the SMMU; the cmpxchg could have happily 
succeeded on a stale cached copy of the writeable-clean PTE regardless 
of what the SMMU might have done in the meantime. If we were to even 
pretend to cope with a non-coherent SMMU writing back to the pagetables, 
I think we'd have to scrap the current DMA API approach and make the CPU 
view of the pagetables non-cacheable as well, but as mentioned, there's 
no guarantee that that would even be useful anyway.

Robin.

> +
> +	return dirty;
> +}
> +
>   static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   			       struct iommu_iotlb_gather *gather,
>   			       unsigned long iova, size_t size, size_t pgcount,
> -			       int lvl, arm_lpae_iopte *ptep);
> +			       int lvl, arm_lpae_iopte *ptep,
> +			       struct iommu_dirty_bitmap *dirty);
>   
>   static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>   				phys_addr_t paddr, arm_lpae_iopte prot,
> @@ -306,8 +326,13 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>   			size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
>   
>   			tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
> +
> +			/*
> +			 * No need for dirty bitmap as arm_lpae_init_pte() is
> +			 * only called from __arm_lpae_map()
> +			 */
>   			if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
> -					     lvl, tblp) != sz) {
> +					     lvl, tblp, NULL) != sz) {
>   				WARN_ON(1);
>   				return -EINVAL;
>   			}
> @@ -564,7 +589,8 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
>   				       struct iommu_iotlb_gather *gather,
>   				       unsigned long iova, size_t size,
>   				       arm_lpae_iopte blk_pte, int lvl,
> -				       arm_lpae_iopte *ptep, size_t pgcount)
> +				       arm_lpae_iopte *ptep, size_t pgcount,
> +				       struct iommu_dirty_bitmap *dirty)
>   {
>   	struct io_pgtable_cfg *cfg = &data->iop.cfg;
>   	arm_lpae_iopte pte, *tablep;
> @@ -617,13 +643,15 @@ static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
>   		return num_entries * size;
>   	}
>   
> -	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl, tablep);
> +	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
> +				lvl, tablep, dirty);
>   }
>   
>   static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   			       struct iommu_iotlb_gather *gather,
>   			       unsigned long iova, size_t size, size_t pgcount,
> -			       int lvl, arm_lpae_iopte *ptep)
> +			       int lvl, arm_lpae_iopte *ptep,
> +			       struct iommu_dirty_bitmap *dirty)
>   {
>   	arm_lpae_iopte pte;
>   	struct io_pgtable *iop = &data->iop;
> @@ -649,7 +677,11 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   			if (WARN_ON(!pte))
>   				break;
>   
> -			__arm_lpae_clear_pte(ptep, &iop->cfg);
> +			if (likely(!dirty))
> +				__arm_lpae_clear_pte(ptep, &iop->cfg);
> +			else if (__arm_lpae_clear_dirty_pte(ptep, &iop->cfg))
> +				iommu_dirty_bitmap_record(dirty, iova, size);
> +
>   
>   			if (!iopte_leaf(pte, lvl, iop->fmt)) {
>   				/* Also flush any partial walks */
> @@ -671,17 +703,20 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   		 * minus the part we want to unmap
>   		 */
>   		return arm_lpae_split_blk_unmap(data, gather, iova, size, pte,
> -						lvl + 1, ptep, pgcount);
> +						lvl + 1, ptep, pgcount, dirty);
>   	}
>   
>   	/* Keep on walkin' */
>   	ptep = iopte_deref(pte, data);
> -	return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
> +	return __arm_lpae_unmap(data, gather, iova, size, pgcount,
> +				lvl + 1, ptep, dirty);
>   }
>   
> -static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
> -				   size_t pgsize, size_t pgcount,
> -				   struct iommu_iotlb_gather *gather)
> +static size_t __arm_lpae_unmap_pages(struct io_pgtable_ops *ops,
> +				     unsigned long iova,
> +				     size_t pgsize, size_t pgcount,
> +				     struct iommu_iotlb_gather *gather,
> +				     struct iommu_dirty_bitmap *dirty)
>   {
>   	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>   	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> @@ -697,13 +732,29 @@ static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iov
>   		return 0;
>   
>   	return __arm_lpae_unmap(data, gather, iova, pgsize, pgcount,
> -				data->start_level, ptep);
> +				data->start_level, ptep, dirty);
> +}
> +
> +static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, NULL);
>   }
>   
>   static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
>   			     size_t size, struct iommu_iotlb_gather *gather)
>   {
> -	return arm_lpae_unmap_pages(ops, iova, size, 1, gather);
> +	return __arm_lpae_unmap_pages(ops, iova, size, 1, gather, NULL);
> +}
> +
> +static size_t arm_lpae_unmap_pages_read_dirty(struct io_pgtable_ops *ops,
> +					      unsigned long iova,
> +					      size_t pgsize, size_t pgcount,
> +					      struct iommu_iotlb_gather *gather,
> +					      struct iommu_dirty_bitmap *dirty)
> +{
> +	return __arm_lpae_unmap_pages(ops, iova, pgsize, pgcount, gather, dirty);
>   }
>   
>   static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
> @@ -969,6 +1020,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>   		.iova_to_phys	= arm_lpae_iova_to_phys,
>   		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
>   		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
> +		.unmap_pages_read_dirty     = arm_lpae_unmap_pages_read_dirty,
>   	};
>   
>   	return data;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-29 11:11     ` Robin Murphy
@ 2022-04-29 11:54       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:54 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 4/29/22 12:11, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>> quirk to io-pgtable.
>>
>> BBML1 requires still marking PTE nT prior to performing a
>> translation table update, while BBML2 requires neither break-before-make
>> nor PTE nT bit being set. For dirty tracking it needs to clear
>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
> 
> You can drop this, and the dependencies on BBML elsewhere, until you get 
> round to the future large-page-splitting work, since that's the only 
> thing this represents. Not much point having the feature flags without 
> an actual implementation, or any users.
> 
OK.

My thinking was that BBML2 *also* meant that we don't need the
break-before-make procedure when switching translation table entries.
From what you say, it seems BBML2 only refers to this in the context of
switching between hugepages and normal pages (?), not in general to all
bits of the PTE (which we would need when switching from writeable-dirty
to writeable-clean with DBM set).

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
@ 2022-04-29 11:54       ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 11:54 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 12:11, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>> quirk to io-pgtable.
>>
>> BBML1 requires still marking PTE nT prior to performing a
>> translation table update, while BBML2 requires neither break-before-make
>> nor PTE nT bit being set. For dirty tracking it needs to clear
>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
> 
> You can drop this, and the dependencies on BBML elsewhere, until you get 
> round to the future large-page-splitting work, since that's the only 
> thing this represents. Not much point having the feature flags without 
> an actual implementation, or any users.
> 
OK.

My thinking was that BBML2 *also* meant that we don't need the
break-before-make procedure when switching translation table entries.
From what you say, it seems BBML2 only refers to this in the context of
switching between hugepages and normal pages (?), not in general to all
bits of the PTE (which we would need when switching from writeable-dirty
to writeable-clean with DBM set).
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29  8:07     ` Tian, Kevin
@ 2022-04-29 11:56       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 11:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm

On Fri, Apr 29, 2022 at 08:07:14AM +0000, Tian, Kevin wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> > Sent: Friday, April 29, 2022 5:09 AM
> > 
> > +static int __set_dirty_tracking_range_locked(struct iommu_domain
> > *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
> > +					     struct io_pagetable *iopt,
> > +					     bool enable)
> > +{
> > +	const struct iommu_domain_ops *ops = domain->ops;
> > +	struct iommu_iotlb_gather gather;
> > +	struct iopt_area *area;
> > +	int ret = -EOPNOTSUPP;
> > +	unsigned long iova;
> > +	size_t size;
> > +
> > +	iommu_iotlb_gather_init(&gather);
> > +
> > +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> > +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving iommu driver to walk the page table
> and the poke the modifier bit for all present PTEs? As commented in last
> patch this may allow removing the range op completely.

Yea, I'm not super keen on the two ops either, especially since they
are so wildly different.

I would expect that set_dirty_tracking turns on tracking for the
entire iommu domain, for all present and future maps.

While set_dirty_tracking_range - I guess it only covers the given range,
so if we make a new map then the new range will be untracked? But that
is racy: we have to map and then call set_dirty_tracking_range.

It seems better for the iommu driver to deal with this, and ARM should
atomically make new maps dirty-tracking-enabled..
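
As a rough sketch of that direction (the dirty_tracking field and the
arm_smmu_update_dbm() helper below are made-up names for illustration,
not from this series): the driver records the tracking state on its
domain and honours it both for existing PTEs and at map time:

static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain, bool enable)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

	/* Remembered so that later map() calls pick the right PTE bits */
	smmu_domain->dirty_tracking = enable;

	/* Hypothetical walk flipping DBM/AP on already-present leaf PTEs */
	return arm_smmu_update_dbm(smmu_domain, enable);
}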

> > +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> > +			    struct iommu_domain *domain, bool enable)
> > +{
> > +	struct iommu_domain *dom;
> > +	unsigned long index;
> > +	int ret = -EOPNOTSUPP;

Returns EOPNOTSUPP if the xarray is empty?

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
@ 2022-04-29 11:56       ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 11:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, kvm, Will Deacon,
	Cornelia Huck, Alex Williamson, iommu, Martins, Joao,
	David Woodhouse, Robin Murphy

On Fri, Apr 29, 2022 at 08:07:14AM +0000, Tian, Kevin wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> > Sent: Friday, April 29, 2022 5:09 AM
> > 
> > +static int __set_dirty_tracking_range_locked(struct iommu_domain
> > *domain,
> 
> suppose anything using iommu_domain as the first argument should
> be put in the iommu layer. Here it's more reasonable to use iopt
> as the first argument or simply merge with the next function.
> 
> > +					     struct io_pagetable *iopt,
> > +					     bool enable)
> > +{
> > +	const struct iommu_domain_ops *ops = domain->ops;
> > +	struct iommu_iotlb_gather gather;
> > +	struct iopt_area *area;
> > +	int ret = -EOPNOTSUPP;
> > +	unsigned long iova;
> > +	size_t size;
> > +
> > +	iommu_iotlb_gather_init(&gather);
> > +
> > +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> > +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> 
> how is this different from leaving iommu driver to walk the page table
> and the poke the modifier bit for all present PTEs? As commented in last
> patch this may allow removing the range op completely.

Yea, I'm not super keen on the two ops either, especially since they
are so wildly different.

I would expect that set_dirty_tracking turns on tracking for the
entire iommu domain, for all present and future maps.

While set_dirty_tracking_range - I guess it only covers the given range,
so if we make a new map then the new range will be untracked? But that
is racy: we have to map and then call set_dirty_tracking_range.

It seems better for the iommu driver to deal with this, and ARM should
atomically make new maps dirty-tracking-enabled..

> > +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> > +			    struct iommu_domain *domain, bool enable)
> > +{
> > +	struct iommu_domain *dom;
> > +	unsigned long index;
> > +	int ret = -EOPNOTSUPP;

Returns EOPNOTSUPP if the xarray is empty?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 11:19         ` Robin Murphy
@ 2022-04-29 12:06           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:06 UTC (permalink / raw)
  To: Robin Murphy, Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, Jason Gunthorpe, kvm,
	Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 12:19, Robin Murphy wrote:
> On 2022-04-29 12:05, Joao Martins wrote:
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
>>
>> I'll delete the range op, and let smmu v3 driver walk its
>> own IO pgtables.
> 
> TBH I'd be inclined to just enable DBM unconditionally in 
> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> dynamically (especially on a live domain) seems more trouble that it's 
> worth.

Hmmm, but then it would strip the userland/VMM of any sort of control
(contrary to what we can do on the CPU/KVM side). e.g. the first time you
do GET_DIRTY_IOVA it would return all IOVAs dirtied since the beginning
of guest time, as opposed to only those dirtied after you enabled
dirty tracking.

We do add the TCR values unconditionally if supported, but not
the actual dirty tracking.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
@ 2022-04-29 12:06           ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:06 UTC (permalink / raw)
  To: Robin Murphy, Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Eric Auger, Liu, Yi L, Alex Williamson,
	Cornelia Huck, kvm, iommu

On 4/29/22 12:19, Robin Murphy wrote:
> On 2022-04-29 12:05, Joao Martins wrote:
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
>>
>> I'll delete the range op, and let smmu v3 driver walk its
>> own IO pgtables.
> 
> TBH I'd be inclined to just enable DBM unconditionally in 
> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> dynamically (especially on a live domain) seems more trouble that it's 
> worth.

Hmmm, but then it would strip the userland/VMM of any sort of control
(contrary to what we can do on the CPU/KVM side). e.g. the first time you
do GET_DIRTY_IOVA it would return all IOVAs dirtied since the beginning
of guest time, as opposed to only those dirtied after you enabled
dirty tracking.

We do add the TCR values unconditionally if supported, but not
the actual dirty tracking.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 12:08     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
> +
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length)
> +{

Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
will want to use this same data structure.

> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;

kmap_local?
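
i.e. something like (sketch; kmap_local_page() is the cheaper, CPU-local
mapping, and kunmap_local() accepts any address within the mapped page):

	kaddr = kmap_local_page(dirty->pages[idx]) + start_offset;
	/* ... set the dirty bits as before ... */
	kunmap_local(kaddr);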

> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
> + * @pgshift: Page granularity of the bitmap
> + * @gather: Range information for a pending IOTLB flush
> + * @start_offset: Offset of the first user page
> + * @pages: User pages representing the bitmap region
> + * @npages: Number of user pages pinned
> + */
> +struct iommu_dirty_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	struct iommu_iotlb_gather *gather;
> +	unsigned long start_offset;
> +	unsigned long npages;
> +	struct page **pages;

In many (all?) cases I would expect this to be called from a process
context; can we just store the __user pointer here, or is the idea
that with modern kernels poking a u64 to userspace is slower than a
kmap?

I'm particularly concerned that this starts to require high-order
allocations with more than 2M of bitmap.. Maybe one direction is
to GUP 2M chunks at a time and walk the __user pointer.
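
A rough sketch of that direction, pinning at most 2M of bitmap at a time
and walking a __user pointer (all the names below are illustrative, not
from the series):

#define BITMAP_CHUNK_SIZE	SZ_2M	/* pin at most 2M of bitmap at once */

struct dirty_bitmap_iter {
	void __user *uptr;	/* start of the user bitmap */
	unsigned long offset;	/* bytes already consumed */
	long npinned;
	struct page *pages[BITMAP_CHUNK_SIZE / PAGE_SIZE];
};

/* Pin the next chunk of the user bitmap; caller unpins when done */
static long dirty_bitmap_iter_pin(struct dirty_bitmap_iter *iter)
{
	unsigned long start = (unsigned long)iter->uptr + iter->offset;

	iter->npinned = pin_user_pages_fast(start, BITMAP_CHUNK_SIZE / PAGE_SIZE,
					    FOLL_WRITE, iter->pages);
	return iter->npinned;
}

static void dirty_bitmap_iter_unpin(struct dirty_bitmap_iter *iter)
{
	unpin_user_pages_dirty_lock(iter->pages, iter->npinned, true);
}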

> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   unsigned long base,
> +					   unsigned long pgshift,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	memset(dirty, 0, sizeof(*dirty));
> +	dirty->iova = base;
> +	dirty->pgshift = pgshift;
> +	dirty->gather = gather;
> +
> +	if (gather)
> +		iommu_iotlb_gather_init(dirty->gather);
> +}

I would expect all the GUPing logic to be here too?

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
@ 2022-04-29 12:08     ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
> +
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length)
> +{

Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
will want to use this same data structure.

> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;

kmap_local?

> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
> + * @pgshift: Page granularity of the bitmap
> + * @gather: Range information for a pending IOTLB flush
> + * @start_offset: Offset of the first user page
> + * @pages: User pages representing the bitmap region
> + * @npages: Number of user pages pinned
> + */
> +struct iommu_dirty_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	struct iommu_iotlb_gather *gather;
> +	unsigned long start_offset;
> +	unsigned long npages;
> +	struct page **pages;

In many (all?) cases I would expect this to be called from a process
context; can we just store the __user pointer here, or is the idea
that with modern kernels poking a u64 to userspace is slower than a
kmap?

I'm particularly concerned that this starts to require high-order
allocations with more than 2M of bitmap.. Maybe one direction is
to GUP 2M chunks at a time and walk the __user pointer.

> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   unsigned long base,
> +					   unsigned long pgshift,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	memset(dirty, 0, sizeof(*dirty));
> +	dirty->iova = base;
> +	dirty->pgshift = pgshift;
> +	dirty->gather = gather;
> +
> +	if (gather)
> +		iommu_iotlb_gather_init(dirty->gather);
> +}

I would expect all the GUPing logic to be here too?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-29 10:54       ` Joao Martins
@ 2022-04-29 12:09         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:09 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
> On 4/29/22 09:12, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Friday, April 29, 2022 5:09 AM
> > [...]
> >> +
> >> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> >> +				      struct iommufd_dirty_data *bitmap)
> > 
> > In a glance this function and all previous helpers doesn't rely on any
> > iommufd objects except that the new structures are named as
> > iommufd_xxx. 
> > 
> > I wonder whether moving all of them to the iommu layer would make
> > more sense here.
> > 
> I suppose, instinctively, I was trying to make this tie to iommufd only,
> to avoid getting it called in cases we don't except when made as a generic
> exported kernel facility.
> 
> (note: iommufd can be built as a module).

Yeah, I think that is a reasonable reason to put iommufd only stuff in
iommufd.ko rather than bloat the static kernel.

You could put it in a new .c file though so there is some logical
modularity?

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
@ 2022-04-29 12:09         ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:09 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Tian, Kevin, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
> On 4/29/22 09:12, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Friday, April 29, 2022 5:09 AM
> > [...]
> >> +
> >> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> >> +				      struct iommufd_dirty_data *bitmap)
> > 
> > In a glance this function and all previous helpers doesn't rely on any
> > iommufd objects except that the new structures are named as
> > iommufd_xxx. 
> > 
> > I wonder whether moving all of them to the iommu layer would make
> > more sense here.
> > 
> I suppose, instinctively, I was trying to make this tie to iommufd only,
> to avoid getting it called in cases we don't except when made as a generic
> exported kernel facility.
> 
> (note: iommufd can be built as a module).

Yeah, I think that is a reasonable reason to put iommufd only stuff in
iommufd.ko rather than bloat the static kernel.

You could put it in a new .c file though so there is some logical
modularity?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-29 11:35     ` Robin Murphy
@ 2022-04-29 12:10       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:10 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 4/29/22 12:35, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> As nested mode is not upstreamed now, we just aim to support dirty
>> log tracking for stage1 with io-pgtable mapping (means not support
>> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
>> CD and transfer ARM_HD quirk to io-pgtable.
>>
>> We additionally filter out HD|HA if not supportted. The CD.HD bit
>> is not particularly useful unless we toggle the DBM bit in the PTE
>> entries.
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> [joaomart:Convey HD|HA bits over to the context descriptor
>>   and update commit message]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>   include/linux/io-pgtable.h                  |  1 +
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 1ca72fcca930..5f728f8f20a2 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>   		 * this substream's traffic
>>   		 */
>>   	} else { /* (1) and (2) */
>> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
>> +		u64 tcr = cd->tcr;
>> +
>>   		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>   		cdptr[2] = 0;
>>   		cdptr[3] = cpu_to_le64(cd->mair);
>>   
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> 
> This is very backwards...
> 
Yes.

>> +
>>   		/*
>>   		 * STE is live, and the SMMU might read dwords of this CD in any
>>   		 * order. Ensure that it observes valid values before reading
>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
> 
> ...these should be set in io-pgtable's TCR value *if* io-pgatble is 
> using DBM, then propagated through from there like everything else.
> 

So the DBM bit supersedes the TCR bit -- that's strange? Say you mark a PTE as
writeable-clean with DBM set but TCR.HD unset .. won't that trigger a
permission fault? I need to re-read that section of the manual, as I didn't
get that impression from it.

>>   			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>   	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>   
>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>   		.iommu_dev	= smmu->dev,
>>   	};
>>   
>> +	if (smmu->features & ARM_SMMU_FEAT_HD)
>> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
> 
> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
> least because you don't have any of the relevant business for 
> synchronising non-coherent PTEs in your walk functions, but it's also 
> implementation-defined whether HTTU even operates on non-cacheable 
> pagetables, and frankly you just don't want to go there ;)
> 
/me nods OK.
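
i.e. presumably the next spin would only hand the quirk to io-pgtable for
coherent SMMUs, along the lines of (sketch using the flag names from this
series):

	if ((smmu->features & ARM_SMMU_FEAT_HD) &&
	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;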

> Robin.
> 
>>   	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>   	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index e15750be1d95..ff32242f2fdb 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -292,6 +292,9 @@
>>   #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>>   
>> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
>> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
>> +
>>   #define CTXDESC_CD_0_AA64		(1UL << 41)
>>   #define CTXDESC_CD_0_S			(1UL << 44)
>>   #define CTXDESC_CD_0_R			(1UL << 45)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index d7626ca67dbf..a11902ae9cf1 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
>> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>>   
>>   	unsigned long			quirks;
>>   	unsigned long			pgsize_bitmap;

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
@ 2022-04-29 12:10       ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 12:10 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 12:35, Robin Murphy wrote:
> On 2022-04-28 22:09, Joao Martins wrote:
>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>
>> As nested mode is not upstreamed now, we just aim to support dirty
>> log tracking for stage1 with io-pgtable mapping (means not support
>> SVA mapping). If HTTU is supported, we enable HA/HD bits in the SMMU
>> CD and transfer ARM_HD quirk to io-pgtable.
>>
>> We additionally filter out HD|HA if not supportted. The CD.HD bit
>> is not particularly useful unless we toggle the DBM bit in the PTE
>> entries.
>>
>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>> [joaomart:Convey HD|HA bits over to the context descriptor
>>   and update commit message]
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>   include/linux/io-pgtable.h                  |  1 +
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 1ca72fcca930..5f728f8f20a2 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>   		 * this substream's traffic
>>   		 */
>>   	} else { /* (1) and (2) */
>> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
>> +		u64 tcr = cd->tcr;
>> +
>>   		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>   		cdptr[2] = 0;
>>   		cdptr[3] = cpu_to_le64(cd->mair);
>>   
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
>> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
>> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> 
> This is very backwards...
> 
Yes.

>> +
>>   		/*
>>   		 * STE is live, and the SMMU might read dwords of this CD in any
>>   		 * order. Ensure that it observes valid values before reading
>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>   			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
> 
> ...these should be set in io-pgtable's TCR value *if* io-pgatble is 
> using DBM, then propagated through from there like everything else.
> 

So the DBM bit supersedes the TCR bit -- that's strange? Say you mark a PTE as
writeable-clean with DBM set but TCR.HD unset .. won't that trigger a
permission fault? I need to re-read that section of the manual, as I didn't
get that impression from it.

>>   			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>   	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>   
>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>   		.iommu_dev	= smmu->dev,
>>   	};
>>   
>> +	if (smmu->features & ARM_SMMU_FEAT_HD)
>> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
> 
> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not 
> least because you don't have any of the relevant business for 
> synchronising non-coherent PTEs in your walk functions, but it's also 
> implementation-defined whether HTTU even operates on non-cacheable 
> pagetables, and frankly you just don't want to go there ;)
> 
/me nods OK.

> Robin.
> 
>>   	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>   	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> index e15750be1d95..ff32242f2fdb 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>> @@ -292,6 +292,9 @@
>>   #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>>   #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>>   
>> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
>> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
>> +
>>   #define CTXDESC_CD_0_AA64		(1UL << 41)
>>   #define CTXDESC_CD_0_S			(1UL << 44)
>>   #define CTXDESC_CD_0_R			(1UL << 45)
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index d7626ca67dbf..a11902ae9cf1 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>>   	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>>   	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
>> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>>   
>>   	unsigned long			quirks;
>>   	unsigned long			pgsize_bitmap;
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 12:14     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:

> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
> +					  unsigned long iova, size_t size,
> +					  struct iommufd_dirty_data *bitmap,
> +					  struct iommufd_dirty_iter *iter)
> +{

This shouldn't be a nofail - that is only for paths that trigger from
destroy/error unwind, which read-dirty never does. The return code
has to be propagated.

It needs some more thought on how to organize this.. only unfill_domains
needs this path, but it is shared with the error unwind paths and
cannot generally fail..

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
@ 2022-04-29 12:14     ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:

> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
> +					  unsigned long iova, size_t size,
> +					  struct iommufd_dirty_data *bitmap,
> +					  struct iommufd_dirty_iter *iter)
> +{

This shouldn't be a nofail - that is only for paths that trigger from
destroy/error unwind, which read-dirty never does. The return code
has to be propagated.

It needs some more thought on how to organize this.. only unfill_domains
needs this path, but it is shared with the error unwind paths and
cannot generally fail..

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 12:19     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
> Add the correspondent APIs for performing VFIO dirty tracking,
> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
> 				     the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
> 				    the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
> tracking is active.
> 
> Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION
> whereas it gets set the domain configured page size the same as
> iopt::iova_alignment and maximum dirty bitmap size same
> as VFIO. Compared to VFIO type1 iommu, the perpectual dirtying is
> not implemented and userspace gets -EOPNOTSUPP which is handled by
> today's userspace.
> 
> Move iommufd_get_pagesizes() definition prior to unmap for
> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
> size against IOPT pagesize.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
>  1 file changed, 209 insertions(+), 12 deletions(-)

I think I would probably not do this patch; it has behavior that is
quite different from the current vfio - i.e. the interaction with
mdevs, and I don't intend to fix that. So, with this patch and an mdev,
vfio_compat will return all-not-dirty but current vfio will
return all-dirty - and that is significant enough to break qemu.

We've made a qemu patch to allow qemu to be happy if dirty tracking is
not supported in the vfio container for migration, which is part of
the v2 enablement series. That seems like the better direction.

I can see why this is useful to test with the current qemu however.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
@ 2022-04-29 12:19     ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
> Add the correspondent APIs for performing VFIO dirty tracking,
> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
> 				     the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
> 				    the area @dirty_bitmap
> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
> tracking is active.
> 
> Advertise the VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION
> whereas it gets set the domain configured page size the same as
> iopt::iova_alignment and maximum dirty bitmap size same
> as VFIO. Compared to VFIO type1 iommu, the perpectual dirtying is
> not implemented and userspace gets -EOPNOTSUPP which is handled by
> today's userspace.
> 
> Move iommufd_get_pagesizes() definition prior to unmap for
> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
> size against IOPT pagesize.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
>  1 file changed, 209 insertions(+), 12 deletions(-)

I think I would probably not do this patch; it has behavior that is
quite different from the current vfio - i.e. the interaction with
mdevs, and I don't intend to fix that. So, with this patch and an mdev,
vfio_compat will return all-not-dirty but current vfio will
return all-dirty - and that is significant enough to break qemu.

We've made a qemu patch to allow qemu to be happy if dirty tracking is
not supported in the vfio container for migration, which is part of
the v2 enablement series. That seems like the better direction.

I can see why this is useful to test with the current qemu however.

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 12:06           ` Joao Martins
@ 2022-04-29 12:23             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:

> > TBH I'd be inclined to just enable DBM unconditionally in 
> > arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> > dynamically (especially on a live domain) seems more trouble that it's 
> > worth.
> 
> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
> to what we can do on the CPU/KVM side). e.g. the first time you do
> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
> of guest time, as opposed to those only after you enabled dirty-tracking.

It just means that on SMMU the start tracking op clears all the dirty
bits.

I also suppose you'd want to install the IOPTEs as dirty to
avoid a performance regression writing out new dirties for cases where
we don't dirty track? And then the start tracking op would switch this
so map creates non-dirty IOPTEs?
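
Sketching that idea (ARM_LPAE_PTE_DBM is assumed here to be the DBM bit
definition the series adds; the dirty_tracking field is illustrative):
with tracking off, map() installs writeable-dirty PTEs so HTTU never has
to update them, and once tracking starts map() switches to
writeable-clean:

	pte |= ARM_LPAE_PTE_DBM;
	if (smmu_domain->dirty_tracking)
		/* writeable-clean: HW clears AP_RDONLY on the first write */
		pte |= ARM_LPAE_PTE_AP_RDONLY;
	/* else: writeable-dirty from the start, nothing for HTTU to write */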

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
@ 2022-04-29 12:23             ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 12:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Tian, Kevin, Yishai Hadas, kvm,
	Robin Murphy, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Will Deacon

On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:

> > TBH I'd be inclined to just enable DBM unconditionally in 
> > arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> > dynamically (especially on a live domain) seems more trouble that it's 
> > worth.
> 
> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
> to what we can do on the CPU/KVM side). e.g. the first time you do
> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
> of guest time, as opposed to those only after you enabled dirty-tracking.

It just means that on SMMU the start tracking op clears all the dirty
bits.

I also suppose you'd want to install the IOPTEs as dirty to
avoid a performance regression writing out new dirties for cases where
we don't dirty track? And then the start tracking op would switch this
so map creates non-dirty IOPTEs?

Jason
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-29 11:54       ` Joao Martins
@ 2022-04-29 12:26         ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 12:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 2022-04-29 12:54, Joao Martins wrote:
> On 4/29/22 12:11, Robin Murphy wrote:
>> On 2022-04-28 22:09, Joao Martins wrote:
>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>
>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>> quirk to io-pgtable.
>>>
>>> BBML1 requires still marking PTE nT prior to performing a
>>> translation table update, while BBML2 requires neither break-before-make
>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>
>> You can drop this, and the dependencies on BBML elsewhere, until you get
>> round to the future large-page-splitting work, since that's the only
>> thing this represents. Not much point having the feature flags without
>> an actual implementation, or any users.
>>
> OK.
> 
> My thinking was that the BBML2 meant *also* that we don't need that break-before-make
> thingie upon switching translation table entries. It seems that from what you
> say, BBML2 then just refers to this but only on the context of switching between
> hugepages/normal pages (?), not in general on all bits of the PTE (which we woud .. upon
> switching from writeable-dirty to writeable-clean with DBM-set).

Yes, BBML is purely about swapping between a block (hugepage) entry and 
a table representing the exact equivalent mapping.

A break-before-make procedure isn't required when just changing 
permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
but as mentioned I think we could probably just not do that anyway.

Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
@ 2022-04-29 12:26         ` Robin Murphy
  0 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 12:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 2022-04-29 12:54, Joao Martins wrote:
> On 4/29/22 12:11, Robin Murphy wrote:
>> On 2022-04-28 22:09, Joao Martins wrote:
>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>
>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>> quirk to io-pgtable.
>>>
>>> BBML1 requires still marking PTE nT prior to performing a
>>> translation table update, while BBML2 requires neither break-before-make
>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>
>> You can drop this, and the dependencies on BBML elsewhere, until you get
>> round to the future large-page-splitting work, since that's the only
>> thing this represents. Not much point having the feature flags without
>> an actual implementation, or any users.
>>
> OK.
> 
> My thinking was that the BBML2 meant *also* that we don't need that break-before-make
> thingie upon switching translation table entries. It seems that from what you
> say, BBML2 then just refers to this but only on the context of switching between
> hugepages/normal pages (?), not in general on all bits of the PTE (which we woud .. upon
> switching from writeable-dirty to writeable-clean with DBM-set).

Yes, BBML is purely about swapping between a block (hugepage) entry and 
a table representing the exact equivalent mapping.

A break-before-make procedure isn't required when just changing 
permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
but as mentioned I think we could probably just not do that anyway.

Robin.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29 10:27     ` Joao Martins
@ 2022-04-29 12:38       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 12:38 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 11:27:58AM +0100, Joao Martins wrote:
> >>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >> unmap. This case is specific for non-nested vIOMMU case where an
> >> erronous guest (or device) DMAing to an address being unmapped at the
> >> same time.
> > 
> > an erroneous attempt like above cannot anticipate which DMAs can
> > succeed in that window thus the end behavior is undefined. For an
> > undefined behavior nothing will be broken by losing some bits dirtied
> > in the window between reading back dirty bits of the range and
> > actually calling unmap. From guest p.o.v. all those are black-box
> > hardware logic to serve a virtual iotlb invalidation request which just
> > cannot be completed in one cycle.
> > 
> > Hence in reality probably this is not required except to meet vfio
> > compat requirement. Just in concept returning dirty bits at unmap
> > is more accurate.
> > 
> > I'm slightly inclined to abandon it in iommufd uAPI.
> 
> OK, it seems I am not far off from your thoughts.
> 
> I'll see what others think too, and if so I'll remove the unmap_dirty.
> 
> Because if vfio-compat doesn't get the iommu hw dirty support, then there would
> be no users of unmap_dirty.

I'm inclined to agree with Kevin.

If the VM does a rogue DMA while unmapping its vIOMMU then it will
already randomly get or lose that DMA. Adding the dirty tracking race
during live migration just further biases that randomness toward
losing.  Since we don't relay protection faults to the guest there is
no guest-observable difference, IMHO.

In any case, I don't think the implementation here for unmap_dirty is
race free?  So, if we are doing all this complexity just to make the
race smaller, I don't see the point.

To make it race free I think you have to write-protect the IOPTE, then
synchronize the IOTLB, read back the dirty state, then unmap and
synchronize the IOTLB again. That has such a high performance cost that
I'm not convinced it is worthwhile - and if it has to be two-step like
this then it would be cleaner to introduce a 'writeprotect and read
dirty' op instead of overloading unmap. We don't need to micro-optimize
away the extra IO page table walk when we are already doing two
invalidations of overhead..
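
In outline the two-step sequence would be something like the below (the
write_protect op is hypothetical and the signatures are approximate; this
is only to show where the two invalidations land):

	/* 1) Revoke write access so no new dirty bits can appear */
	ops->write_protect(domain, iova, size);
	iommu_iotlb_sync(domain, &gather);	/* first invalidation */

	/* 2) Dirty state is now stable: harvest it, then tear down */
	ops->read_and_clear_dirty(domain, iova, size, &dirty);
	iommu_unmap(domain, iova, size);	/* second invalidation */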

> >> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks
> > 
> > there was discussion adding device capability uAPI somewhere.
> > 
> ack let me know if there was snippets to the conversation as I seem to have missed that.

It was just a discussion pending something we actually needed to report.

It would be a very simple ioctl taking in the device ID and filling in a
struct of stuff.
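
e.g. roughly (an entirely hypothetical layout, just to show the shape):

struct iommu_device_caps {
	__u32 size;		/* sizeof(struct iommu_device_caps) */
	__u32 dev_id;		/* device to query */
	__aligned_u64 flags;	/* capability bits, e.g. dirty tracking */
};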
 
> > probably this can be reported as a device cap as supporting of dirty bit is
> > an immutable property of the iommu serving that device. 

It is an easier fit to read it out of the iommu_domain after device
attach though - since we don't need to build new kernel infrastructure
to query it from a device.
 
> > Userspace can
> > enable dirty tracking on a hwpt if all attached devices claim the support
> > and kernel will does the same verification.
> 
> Sorry to be dense but this is not up to 'devices' given they take no
> part in the tracking?  I guess by 'devices' you mean the software
> idea of it i.e. the iommu context created for attaching a said
> physical device, not the physical device itself.

Indeed, an hwpt represents an iommu_domain, and if the iommu_domain has
dirty tracking ops set then that is an inherent property of the domain
and does not suddenly go away when a new device is attached.
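
i.e. the check can simply be whether the domain ops are populated, along
the lines of (sketch; structure and op names approximate):

static bool hwpt_supports_dirty(struct iommufd_hw_pagetable *hwpt)
{
	const struct iommu_domain_ops *ops = hwpt->domain->ops;

	/* A property of the iommu_domain, regardless of attached devices */
	return ops->set_dirty_tracking && ops->read_and_clear_dirty;
}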
 
Jason

* Re: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-29 12:10       ` Joao Martins
@ 2022-04-29 12:46         ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 12:46 UTC (permalink / raw)
  To: Joao Martins
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Jason Gunthorpe, Nicolin Chen,
	Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu, Alex Williamson,
	Cornelia Huck, kvm, Kunkun Jiang, iommu

On 2022-04-29 13:10, Joao Martins wrote:
> On 4/29/22 12:35, Robin Murphy wrote:
>> On 2022-04-28 22:09, Joao Martins wrote:
>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>
>>> As nested mode is not upstreamed yet, we just aim to support dirty
>>> log tracking for stage 1 with io-pgtable mapping (meaning SVA mapping
>>> is not supported). If HTTU is supported, we enable the HA/HD bits in
>>> the SMMU CD and convey the ARM_HD quirk to io-pgtable.
>>>
>>> We additionally filter out HD|HA if not supported. The CD.HD bit
>>> is not particularly useful unless we also toggle the DBM bit in the PTE
>>> entries.
>>>
>>> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
>>> [joaomart:Convey HD|HA bits over to the context descriptor
>>>    and update commit message]
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>>>    include/linux/io-pgtable.h                  |  1 +
>>>    3 files changed, 15 insertions(+)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index 1ca72fcca930..5f728f8f20a2 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>>    		 * this substream's traffic
>>>    		 */
>>>    	} else { /* (1) and (2) */
>>> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
>>> +		u64 tcr = cd->tcr;
>>> +
>>>    		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>>>    		cdptr[2] = 0;
>>>    		cdptr[3] = cpu_to_le64(cd->mair);
>>>    
>>> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
>>> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
>>> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
>>> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
>>
>> This is very backwards...
>>
> Yes.
> 
>>> +
>>>    		/*
>>>    		 * STE is live, and the SMMU might read dwords of this CD in any
>>>    		 * order. Ensure that it observes valid values before reading
>>> @@ -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>>    			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>>>    			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>>>    			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
>>> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
>>
>> ...these should be set in io-pgtable's TCR value *if* io-pgtable is
>> using DBM, then propagated through from there like everything else.
>>
> 
> So the DBM bit supersedes the TCR bit -- that's strange? Say if you mark a PTE as
> writable-clean with DBM set but TCR.HD unset, then won't it trigger a perm-fault?
> I need to re-read that section of the manual, as I didn't get that impression from the above.

No, architecturally, the {TCR,CD}.HD bit is still the "master switch" 
for whether the DBM field in PTEs is interpreted or not, but in terms of 
our abstraction, we only need to care about setting HD if io-pgtable is 
actually going to want to use DBM, so we may as well leave it to 
io-pgtable to tell us canonically. The logical interface here in general 
is that we use the initial io_pgtable_cfg to tell it what it *can* use, 
but then we read back afterwards to see exactly what it has chosen to 
do, and I think HA/HD also fit perfectly into that paradigm.
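
Something like the following shape, as a sketch (assuming the
IO_PGTABLE_QUIRK_ARM_HD quirk from this series; exact placement in the
driver may well differ):

	/* tell io-pgtable what it *can* use */
	if (smmu->features & ARM_SMMU_FEAT_HD &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);

	/* ...and read back what it actually chose to do */
	if (pgtbl_cfg.quirks & IO_PGTABLE_QUIRK_ARM_HD)
		cfg->cd.tcr |= CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD;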

Robin.

>>>    			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>>>    	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
>>>    
>>> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>>>    		.iommu_dev	= smmu->dev,
>>>    	};
>>>    
>>> +	if (smmu->features & ARM_SMMU_FEAT_HD)
>>> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;
>>
>> You need to depend on ARM_SMMU_FEAT_COHERENCY for this as well, not
>> least because you don't have any of the relevant business for
>> synchronising non-coherent PTEs in your walk functions, but it's also
>> implementation-defined whether HTTU even operates on non-cacheable
>> pagetables, and frankly you just don't want to go there ;)
>>
> /me nods OK.
> 
>> Robin.
>>
>>>    	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>>>    		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>>>    	else if (smmu->features & ARM_SMMU_FEAT_BBML2)
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> index e15750be1d95..ff32242f2fdb 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>>> @@ -292,6 +292,9 @@
>>>    #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>>>    #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
>>>    
>>> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
>>> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
>>> +
>>>    #define CTXDESC_CD_0_AA64		(1UL << 41)
>>>    #define CTXDESC_CD_0_S			(1UL << 44)
>>>    #define CTXDESC_CD_0_R			(1UL << 45)
>>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>>> index d7626ca67dbf..a11902ae9cf1 100644
>>> --- a/include/linux/io-pgtable.h
>>> +++ b/include/linux/io-pgtable.h
>>> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>>>    	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>>>    	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>>>    	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
>>> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
>>>    
>>>    	unsigned long			quirks;
>>>    	unsigned long			pgsize_bitmap;

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 13:40     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-29 13:40 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

Hi Joao,

Thanks for doing this.

On 2022/4/29 05:09, Joao Martins wrote:
> Add to the iommu domain operations a set of callbacks to
> perform dirty tracking, particularly to start and stop
> tracking, and finally to test and clear the dirty data.
> 
> Drivers are expected to dynamically change their hw protection
> domain bits to toggle the tracking and flush some form of
> control state structure that stands in the IOVA translation
> path.
> 
> For reading and clearing dirty data, in all IOMMUs a transition
> of any of the PTE access bits (Access, Dirty) implies flushing
> the IOTLB to invalidate any stale data it may hold about whether
> or not the IOMMU should update the said PTEs. The iommu core APIs
> introduce a new structure for storing the dirties; vendor
> IOMMUs implementing .read_and_clear_dirty() just use
> iommu_dirty_bitmap_record() to set the memory storing the dirties.
> The underlying tracking/iteration of the user bitmap memory is instead
> done by iommufd, which takes care of initializing the dirty bitmap
> *prior* to passing it to the IOMMU domain op.
> 
> So far, for the currently/to-be-supported IOMMUs with dirty tracking
> support, this is the case particularly because the tracking is part of
> the first stage tables and part of address translation. Below
> it is described how hardware deals with the hardware protection
> domain control bits, to justify the added iommu core APIs. The
> vendor IOMMU implementations will also explain in more detail
> the dirty bit usage/clearing in the IOPTEs.
> 
> * x86 AMD:
> 
> The same thing applies for AMD, particularly the Device Table,
> followed by flushing the Device IOTLB. On AMD[1],
> section "2.2.1 Updating Shared Tables", e.g.
> 
>> Each table can also have its contents cached by the IOMMU or
> peripheral IOTLBs. Therefore, after
> updating a table entry that can be cached, system software must
> send the IOMMU an appropriate
> invalidate command. Information in the peripheral IOTLBs must
> also be invalidated.
> 
> There's no mention of particular bits that are cached or
> not but fetching a dev entry is part of address translation
> as also depicted, so invalidate the device table to make
> sure the next translations fetch a DTE entry with the HD bits set.
> 
> * x86 Intel (rev3.0+):
> 
> Likewise[2] set the SSADE bit in the scalable-entry second stage table
> to enable Access/Dirty bits in the second stage page table. See manual,
> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
> Considerations"
> 
>> When modifying root-entries, scalable-mode root-entries,
> context-entries, or scalable-mode context
> entries:
>> Software must serially invalidate the context-cache,
> PASID-cache (if applicable), and the IOTLB.  The serialization is
> required since hardware may utilize information from the
> context-caches (e.g., Domain-ID) to tag new entries inserted to
> the PASID-cache and IOTLB for processing in-flight requests.
> Section 6.5 describe the invalidation operations.
> 
> And also "Table 23. Guidance to Software for Invalidations" in
> section "6.5.3.3 Guidance to Software for Invalidations"
> explicitly mentions
> 
>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
> entry with PGTT value of Second-stage or Nested
> 
> * ARM SMMUV3.2:
> 
> SMMUv3.2 needs to toggle the dirty tracking bits in the
> CD (or S2CD) descriptor and to flush/invalidate
> the IOMMU dev IOTLB.
> 
> Reference[0]: SMMU spec, "5.4.1 CD notes",
> 
>> The following CD fields are permitted to be cached as part of a
> translation or TLB entry, and alteration requires
> invalidation of any TLB entry that might have cached these
> fields, in addition to CD structure cache invalidation:
> 
> ...
> HA, HD
> ...
> 
> The ARM SMMUv3 case is a tad different from its x86
> counterparts, though. Rather than changing *only* the IOMMU domain device entry to
> enable dirty tracking (and having a dedicated bit for dirtiness in the IOPTE),
> ARM instead uses a dirty-bit modifier which is separately enabled, and
> changes the *existing* meaning of the access bits (for ro/rw), to the point
> that marking the access bits read-only but with the dirty-bit modifier enabled
> doesn't trigger a permission IO page fault.
> 
> In practice this means that changing the iommu context isn't enough,
> and is in fact mostly useless IIUC (and can always be enabled). Dirtying
> is only really enabled when the DBM pte bit is set (with the
> CD.HD bit as a prereq).
> 
> To capture this h/w construct, an iommu core API is added which enables
> dirty tracking on an IOVA range rather than on a device/context entry.
> iommufd picks one or the other, and the IOMMUFD core will favour the
> device-context op, falling back to the IOVA-range alternative.

Instead of specification words, I'd like to read more about why the
callbacks are needed and how they should be implemented and consumed.
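
For illustration, a driver-side consumer would presumably look roughly
like the sketch below (loose pseudocode: the PTE-walk helpers are
invented, only iommu_dirty_bitmap_record() is from this series):

	static int foo_read_and_clear_dirty(struct iommu_domain *domain,
					    unsigned long iova, size_t size,
					    struct iommu_dirty_bitmap *dirty)
	{
		/* walk the leaf IOPTEs covering [iova, iova + size) */
		for_each_leaf_iopte(domain, iova, size, pte) {	  /* pseudocode */
			if (iopte_test_and_clear_dirty(pte))	  /* pseudocode */
				iommu_dirty_bitmap_record(dirty, iopte_iova(pte),
							  iopte_size(pte));
		}
		return 0;
	}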

> 
> [0] https://developer.arm.com/documentation/ihi0070/latest
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://cdrdv2.intel.com/v1/dl/getContent/671081
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
>   include/linux/io-pgtable.h |  6 +++++
>   include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
>   3 files changed, 86 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0c42ece25854..d18b9ddbcce4 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -15,6 +15,7 @@
>   #include <linux/init.h>
>   #include <linux/export.h>
>   #include <linux/slab.h>
> +#include <linux/highmem.h>
>   #include <linux/errno.h>
>   #include <linux/iommu.h>
>   #include <linux/idr.h>
> @@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>   	return user;
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
> +
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length)
> +{
> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
> +
> +	nbits = max(1UL, length >> dirty->pgshift);
> +	offset = (iova - dirty->iova) >> dirty->pgshift;
> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
> +	start_offset = dirty->start_offset;
> +
> +	while (nbits > 0) {
> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
> +		bitmap_set(kaddr, offset, size);
> +		kunmap(dirty->pages[idx]);
> +		start_offset = offset = 0;
> +		nbits -= size;
> +		idx++;
> +	}
> +
> +	if (dirty->gather)
> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
> +
> +	return nbits;
> +}
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 86af6f0a00a2..82b39925c21f 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -165,6 +165,12 @@ struct io_pgtable_ops {
>   			      struct iommu_iotlb_gather *gather);
>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>   				    unsigned long iova);
> +	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
> +				  unsigned long iova, size_t size,
> +				  bool enabled);
> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
> +				    unsigned long iova, size_t size,
> +				    struct iommu_dirty_bitmap *dirty);
>   };
>   
>   /**
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 6ef2df258673..ca076365d77b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
>   	bool			queued;
>   };
>   
> +/**
> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
> + *
> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
> + * @pgshift: Page granularity of the bitmap
> + * @gather: Range information for a pending IOTLB flush
> + * @start_offset: Offset of the first user page
> + * @pages: User pages representing the bitmap region
> + * @npages: Number of user pages pinned
> + */
> +struct iommu_dirty_bitmap {
> +	unsigned long iova;
> +	unsigned long pgshift;
> +	struct iommu_iotlb_gather *gather;
> +	unsigned long start_offset;
> +	unsigned long npages;

I haven't found where "npages" is used in this patch. It would be better
to add it when it's really used. Sorry if I missed anything.

> +	struct page **pages;
> +};
> +
>   /**
>    * struct iommu_ops - iommu ops and capabilities
>    * @capable: check capability
> @@ -275,6 +294,13 @@ struct iommu_ops {
>    * @enable_nesting: Enable nesting
>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>    * @free: Release the domain after use.
> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
> + * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
> + *                            an iommu domain
> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
> + *                        into a bitmap, with a bit represented as a page.
> + *                        Reads the dirty PTE bits and clears it from IO
> + *                        pagetables.
>    */
>   struct iommu_domain_ops {
>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> @@ -305,6 +331,15 @@ struct iommu_domain_ops {
>   				  unsigned long quirks);
>   
>   	void (*free)(struct iommu_domain *domain);
> +
> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
> +	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
> +					unsigned long iova, size_t size,
> +					struct iommu_iotlb_gather *iotlb_gather,
> +					bool enabled);

It seems that we are adding two callbacks for the same purpose. Which
should the IOMMU drivers choose to support? Is there any functional
difference between the two? How should the caller choose which to use?

> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
> +				    unsigned long iova, size_t size,
> +				    struct iommu_dirty_bitmap *dirty);
>   };
>   
>   /**
> @@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
>   extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>   			      unsigned long iova, int flags);
>   
> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
> +				       unsigned long iova, unsigned long length);
> +
> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
> +					   unsigned long base,
> +					   unsigned long pgshift,
> +					   struct iommu_iotlb_gather *gather)
> +{
> +	memset(dirty, 0, sizeof(*dirty));
> +	dirty->iova = base;
> +	dirty->pgshift = pgshift;
> +	dirty->gather = gather;
> +
> +	if (gather)
> +		iommu_iotlb_gather_init(dirty->gather);
> +}
> +
>   static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
>   {
>   	if (domain->ops->flush_iotlb_all)

Best regards,
baolu

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29 12:08     ` Jason Gunthorpe via iommu
@ 2022-04-29 14:26       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 13:08, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:15PM +0100, Joao Martins wrote:
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
> 
> Lets put iommu_dirty_bitmap in its own patch, the VFIO driver side
> will want to use this same data structure.
> 
OK.

>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
> 
> kmap_local?
> 
/me nods

>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	struct iommu_iotlb_gather *gather;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
>> +	struct page **pages;
> 
> In many (all?) cases I would expect this to be called from a process
> context, can we just store the __user pointer here, or is the idea
> that with modern kernels poking a u64 to userspace is slower than a
> kmap?
> 
I have both options implemented; I'll need to measure it. Code-wise it would be
a lot simpler to just poke at the userspace addresses (that was my first
prototype of this) but I felt that poking at kernel addresses was safer and
avoided assumptions about the context (from the iommu driver). I can bring back
the former alternative if this was the wrong thing to do.

> I'm particularly concerned that this starts to require high-order
> allocations with more than 2M of bitmap. Maybe one direction is
> to GUP 2M chunks at a time and walk the __user pointer.
> 
That's what I am doing here. We GUP 2M of *bitmap* at a time,
which is about 1 page's worth of struct page pointers. That is enough
to read the dirties of 64G of IOVA in the worst-case scenario (i.e. with base pages).
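
(For reference, assuming 4K base pages and one bit per page: 2M of bitmap
is 2^24 bits, covering 2^24 * 4K = 64G of IOVA, while pinning those 2M of
user memory takes 2M / 4K = 512 struct page pointers, i.e. roughly one 4K
page of pointers.)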

>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +					   unsigned long base,
>> +					   unsigned long pgshift,
>> +					   struct iommu_iotlb_gather *gather)
>> +{
>> +	memset(dirty, 0, sizeof(*dirty));
>> +	dirty->iova = base;
>> +	dirty->pgshift = pgshift;
>> +	dirty->gather = gather;
>> +
>> +	if (gather)
>> +		iommu_iotlb_gather_init(dirty->gather);
>> +}
> 
> I would expect all the GUPing logic to be here too?

I had this in the iommufd_dirty_iter logic given that the iommu iteration
logic is in the parent structure that stores iommu_dirty_data.

My thinking with this patch was just to have what the IOMMU driver needs.

Actually, if anything, this helper above ought to be moved to a later patch.

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-29 12:19     ` Jason Gunthorpe via iommu
@ 2022-04-29 14:27       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 13:19, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:21PM +0100, Joao Martins wrote:
>> Add the correspondent APIs for performing VFIO dirty tracking,
>> particularly VFIO_IOMMU_DIRTY_PAGES ioctl subcmds:
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_START: Start dirty tracking and allocates
>> 				     the area @dirty_bitmap
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP: Stop dirty tracking and frees
>> 				    the area @dirty_bitmap
>> * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP: Fetch dirty bitmap while dirty
>> tracking is active.
>>
>> Advertise VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION,
>> where the configured domain page size is set the same as
>> iopt::iova_alignment and the maximum dirty bitmap size is the same
>> as VFIO's. Compared to the VFIO type1 iommu, the perpetual dirtying is
>> not implemented and userspace gets -EOPNOTSUPP, which is handled by
>> today's userspace.
>>
>> Move iommufd_get_pagesizes() definition prior to unmap for
>> iommufd_vfio_unmap_dma() dirty support to validate the user bitmap page
>> size against IOPT pagesize.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/iommu/iommufd/vfio_compat.c | 221 ++++++++++++++++++++++++++--
>>  1 file changed, 209 insertions(+), 12 deletions(-)
> 
> I think I would probably not do this patch, it has behavior that is
> quite different from the current vfio - ie the interaction with the
> mdevs, and I don't intend to fix that. 

I'll drop this, until I hear otherwise.

I wasn't sure what people were leaning towards, and keeping the perpetual-dirty
stuff didn't feel right for a new UAPI either.

> So, with this patch and a mdev
> then vfio_compat will return all-not-dirty but current vfio will
> return all-dirty - and that is significant enough to break qemu.
> 
Ack

> We've made a qemu patch to allow qemu to be happy if dirty tracking is
> not supported in the vfio container for migration, which is part of
> the v2 enablement series. That seems like the better direction.
> 
So in my auditing/testing, the listener callbacks are called but the dirty ioctls
return an error at start, and it bails out early on sync. I suppose migration
won't really work, as no pages get marked dirty and whatnot, but it could
cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
cleaning out the constant error messages you get, and not even attempting
migration by introducing a migration blocker early on ... should it fetch
no migration capability?

> I can see why this is useful to test with the current qemu however.

Yes, it is indeed useful for testing.

I am wondering if we can still emulate that in userspace, given that the expectation
from each GET_BITMAP call is to get all dirties, and likewise for the type1 unmap dirty.
Unless I have missed something obvious.

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29 11:56       ` Jason Gunthorpe via iommu
@ 2022-04-29 14:28         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:28 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 12:56, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 08:07:14AM +0000, Tian, Kevin wrote:
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Sent: Friday, April 29, 2022 5:09 AM
>>>
>>> +static int __set_dirty_tracking_range_locked(struct iommu_domain
>>> *domain,
>>
>> suppose anything using iommu_domain as the first argument should
>> be put in the iommu layer. Here it's more reasonable to use iopt
>> as the first argument or simply merge with the next function.
>>
>>> +					     struct io_pagetable *iopt,
>>> +					     bool enable)
>>> +{
>>> +	const struct iommu_domain_ops *ops = domain->ops;
>>> +	struct iommu_iotlb_gather gather;
>>> +	struct iopt_area *area;
>>> +	int ret = -EOPNOTSUPP;
>>> +	unsigned long iova;
>>> +	size_t size;
>>> +
>>> +	iommu_iotlb_gather_init(&gather);
>>> +
>>> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
>>> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
>>
>> how is this different from leaving the iommu driver to walk the page table
>> and poke the modifier bit for all present PTEs? As commented in the last
>> patch this may allow removing the range op completely.
> 
> Yea, I'm not super keen on the two ops either, especially since they
> are so wildly different.
> 
/me ack

> I would expect that set_dirty_tracking turns on tracking for the
> entire iommu domain, for all present and future maps
> 
Yes.

I didn't do that correctly on ARM, nor on device attach
(for x86, e.g. on hotplug).

> While set_dirty_tracking_range - I guess it only does the range, so if
> we make a new map then the new range will be untracked? But that is
> now racy, we have to map and then call set_dirty_tracking_range
> 
> It seems better for the iommu driver to deal with this and ARM should
> atomically make the new maps dirty tracking..
> 

In the next iteration I'll need to fix the way IOMMUs handle dirty-tracking
probing and tracking in their private intermediate structures.

But yes, I was trying to transfer this to the iommu driver (perhaps in a
convoluted way).

>>> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>> +			    struct iommu_domain *domain, bool enable)
>>> +{
>>> +	struct iommu_domain *dom;
>>> +	unsigned long index;
>>> +	int ret = -EOPNOTSUPP;
> 
> Returns EOPNOTSUPP if the xarray is empty?
> 
Argh no. Maybe -EINVAL is better here.

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-29 12:09         ` Jason Gunthorpe via iommu
@ 2022-04-29 14:33           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 13:09, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 11:54:16AM +0100, Joao Martins wrote:
>> On 4/29/22 09:12, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>> [...]
>>>> +
>>>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>>>> +				      struct iommufd_dirty_data *bitmap)
>>>
>>> In a glance this function and all previous helpers doesn't rely on any
>>> iommufd objects except that the new structures are named as
>>> iommufd_xxx. 
>>>
>>> I wonder whether moving all of them to the iommu layer would make
>>> more sense here.
>>>
>> I suppose, instinctively, I was trying to make this tie to iommufd only,
>> to avoid getting it called in cases we don't except when made as a generic
>> exported kernel facility.
>>
>> (note: iommufd can be built as a module).
> 
> Yeah, I think that is a reasonable reason to put iommufd only stuff in
> iommufd.ko rather than bloat the static kernel.
> 
> You could put it in a new .c file though so there is some logical
> modularity?

I can do that (iommu.c / dirty.c if no better idea comes to mind,
suggestions welcome :)).

Although I should say that there's some dependency on iopt structures and
whatnot, so I have to see if this is a change for the better. I'll respond
here should it be dubious.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML
  2022-04-29 12:26         ` Robin Murphy
@ 2022-04-29 14:34           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Cornelia Huck, iommu, Alex Williamson, Will Deacon,
	David Woodhouse

On 4/29/22 13:26, Robin Murphy wrote:
> On 2022-04-29 12:54, Joao Martins wrote:
>> On 4/29/22 12:11, Robin Murphy wrote:
>>> On 2022-04-28 22:09, Joao Martins wrote:
>>>> From: Kunkun Jiang <jiangkunkun@huawei.com>
>>>>
>>>> This detects BBML feature and if SMMU supports it, transfer BBMLx
>>>> quirk to io-pgtable.
>>>>
>>>> BBML1 requires still marking PTE nT prior to performing a
>>>> translation table update, while BBML2 requires neither break-before-make
>>>> nor PTE nT bit being set. For dirty tracking it needs to clear
>>>> the dirty bit so checking BBML2 tells us the prerequisite. See SMMUv3.2
>>>> manual, section "3.21.1.3 When SMMU_IDR3.BBML == 2 (Level 2)" and
>>>> "3.21.1.2 When SMMU_IDR3.BBML == 1 (Level 1)"
>>>
>>> You can drop this, and the dependencies on BBML elsewhere, until you get
>>> round to the future large-page-splitting work, since that's the only
>>> thing this represents. Not much point having the feature flags without
>>> an actual implementation, or any users.
>>>
>> OK.
>>
>> My thinking was that the BBML2 meant *also* that we don't need that break-before-make
>> thingie upon switching translation table entries. It seems that from what you
>> say, BBML2 then just refers to this but only on the context of switching between
>> hugepages/normal pages (?), not in general on all bits of the PTE (which we woud .. upon
>> switching from writeable-dirty to writeable-clean with DBM-set).
> 
> Yes, BBML is purely about swapping between a block (hugepage) entry and 
> a table representing the exact equivalent mapping.
> 
> A break-before-make procedure isn't required when just changing 
> permissions, and AFAICS it doesn't apply to changing the DBM bit either, 
> but as mentioned I think we could probably just not do that anyway.

Interesting, thanks for the clarification.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29 14:26       ` Joao Martins
@ 2022-04-29 14:35         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 14:35 UTC (permalink / raw)
  To: Joao Martins
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On Fri, Apr 29, 2022 at 03:26:41PM +0100, Joao Martins wrote:

> I had this in the iommufd_dirty_iter logic given that the iommu iteration
> logic is in the parent structure that stores iommu_dirty_data.
> 
> My thinking with this patch was just to have what the IOMMU driver needs.

I would put the whole mechanism in one patch, even though most of the
code will live in iommufd; then it would be clear how it works.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova()
  2022-04-29 12:14     ` Jason Gunthorpe via iommu
@ 2022-04-29 14:36       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On 4/29/22 13:14, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 10:09:19PM +0100, Joao Martins wrote:
> 
>> +static void iommu_unmap_read_dirty_nofail(struct iommu_domain *domain,
>> +					  unsigned long iova, size_t size,
>> +					  struct iommufd_dirty_data *bitmap,
>> +					  struct iommufd_dirty_iter *iter)
>> +{
> 
> This shouldn't be a nofail - that is only for path that trigger from
> destroy/error unwindow, which read dirty never does. The return code
> has to be propogated.
> 
> It needs some more thought how to organize this.. only unfill_domains
> needs this path, but it is shared with the error unwind paths and
> cannot generally fail..

It's part of the reason I split this part out, as it didn't strike me as a natural
extension of the API.
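
The direction would be to keep a nofail wrapper only for the destroy/unwind
paths and have the read-dirty flavour return its error. A minimal sketch of
that split (the nofail helper name is illustrative, adapted from the one
quoted above; the exact plumbing is still to be worked out):

/* Destroy/error-unwind paths: cannot fail, so only warn on the impossible */
static void iommu_unmap_nofail(struct iommu_domain *domain,
			       unsigned long iova, size_t size)
{
	size_t ret;

	ret = iommu_unmap(domain, iova, size);
	WARN_ON(ret != size);
}

/* Read-dirty flavour keeps (and propagates) the return code instead */
static int iommu_unmap_read_dirty(struct iommu_domain *domain,
				  unsigned long iova, size_t size,
				  struct iommufd_dirty_data *bitmap,
				  struct iommufd_dirty_iter *iter);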

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-29 14:27       ` Joao Martins
@ 2022-04-29 14:36         ` Jason Gunthorpe
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-04-29 14:36 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, iommu,
	David Woodhouse, Robin Murphy

On Fri, Apr 29, 2022 at 03:27:00PM +0100, Joao Martins wrote:

> > We've made a qemu patch to allow qemu to be happy if dirty tracking is
> > not supported in the vfio container for migration, which is part of
> > the v2 enablement series. That seems like the better direction.
> > 
> So in my auditing/testing, the listener callbacks are called but the dirty ioctls
> return an error at start, and bails out early on sync. I suppose migration
> won't really work, as no pages aren't set and what not but it could
> cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
> cleaning out the constant error messages you get and not even attempt
> migration by introducing a migration blocker early on ... should it fetch
> no migration capability?

It really just means pre-copy doesn't work and we can skip it, though
I'm not sure exactly what the qemu patch ended up doing.. I think it
will be posted by Monday

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 12:23             ` Jason Gunthorpe via iommu
@ 2022-04-29 14:45               ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 13:23, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
> 
>>> TBH I'd be inclined to just enable DBM unconditionally in 
>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
>>> dynamically (especially on a live domain) seems more trouble that it's 
>>> worth.
>>
>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>> to what we can do on the CPU/KVM side). e.g. the first time you do
>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>> of guest time, as opposed to those only after you enabled dirty-tracking.
> 
> It just means that on SMMU the start tracking op clears all the dirty
> bits.
> 
Hmm, OK. But aren't we just picking a poison here? On ARM it's the difference
between setting the DBM bit and putting the IOPTE as writeable-clean (which
is clearing another bit) versus read-and-clear-when-dirty-tracking-starts, which means
we need to re-walk the pagetables to clear one bit.

It's walking over ranges regardless.

> I also suppose you'd also want to install the IOPTEs as dirty to
> avoid a performance regression writing out new dirties for cases where
> we don't dirty track? And then the start tracking op will switch this
> so map creates non-dirty IOPTEs?

If we end up always enabling DBM + CD.HD, perhaps it makes sense for the IOTLB to cache
the dirty bit until we clear those bits.

But really, the way this series was /trying/ to do it still feels like the least pain,
and that way we have the same expectations from all IOMMUs from the iommufd
perspective too.
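
For reference, the stage-1 encoding in question (writeable-clean vs.
writeable-dirty) comes down to two bits. A standalone sketch, with the bit
positions as I read them from the Arm VMSA for stage-1 descriptors:

#include <stdbool.h>
#include <stdint.h>

#define PTE_AP_RDONLY	(1ULL << 7)	/* AP[2]: 1 = read-only */
#define PTE_DBM		(1ULL << 51)	/* Dirty Bit Modifier */

/* With DBM set, HW clears AP[2] on the first write: writeable-dirty */
static bool pte_is_writeable_dirty(uint64_t pte)
{
	return (pte & PTE_DBM) && !(pte & PTE_AP_RDONLY);
}

/* "Clearing" the dirty state means making the PTE writeable-clean again */
static uint64_t pte_mk_writeable_clean(uint64_t pte)
{
	return pte | PTE_AP_RDONLY;
}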

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  2022-04-29 14:36         ` Jason Gunthorpe
@ 2022-04-29 14:52           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 14:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Eric Auger, Yi Liu,
	Alex Williamson, Cornelia Huck, kvm

On 4/29/22 15:36, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 03:27:00PM +0100, Joao Martins wrote:
> 
>>> We've made a qemu patch to allow qemu to be happy if dirty tracking is
>>> not supported in the vfio container for migration, which is part of
>>> the v2 enablement series. That seems like the better direction.
>>>
>> So in my auditing/testing, the listener callbacks are called but the dirty ioctls
>> return an error at start, and bails out early on sync. I suppose migration
>> won't really work, as no pages aren't set and what not but it could
>> cope with no-dirty-tracking support. So by 'making qemu happy' is this mainly
>> cleaning out the constant error messages you get and not even attempt
>> migration by introducing a migration blocker early on ... should it fetch
>> no migration capability?
> 
> It really just means pre-copy doesn't work and we can skip it, though
> I'm not sure exactly what the qemu patch ended up doing.. I think it
> will be posted by Monday
> 
Ha, or that :D i.e.

Why bother checking if there are dirty pages periodically when we can just do it at the
beginning, and at the end when we pause the guest (and DMA)? Maybe it prevents a whole
bunch of copying in the interim, and this patch of yours might be an improvement.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29 12:38       ` Jason Gunthorpe via iommu
@ 2022-04-29 15:20         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 15:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 13:38, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 11:27:58AM +0100, Joao Martins wrote:
>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>> erronous guest (or device) DMAing to an address being unmapped at the
>>>> same time.
>>>
>>> an erroneous attempt like above cannot anticipate which DMAs can
>>> succeed in that window thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From guest p.o.v. all those are black-box
>>> hardware logic to serve a virtual iotlb invalidation request which just
>>> cannot be completed in one cycle.
>>>
>>> Hence in reality probably this is not required except to meet vfio
>>> compat requirement. Just in concept returning dirty bits at unmap
>>> is more accurate.
>>>
>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>
>> OK, it seems I am not far off from your thoughts.
>>
>> I'll see what others think too, and if so I'll remove the unmap_dirty.
>>
>> Because if vfio-compat doesn't get the iommu hw dirty support, then there would
>> be no users of unmap_dirty.
> 
> I'm inclined to agree with Kevin.
> 
> If the VM does do a rouge DMA while unmapping its vIOMMU then already
> it will randomly get or loose that DMA. Adding the dirty tracking race
> during live migration just further bias's that randomness toward
> loose.  Since we don't relay protection faults to the guest there is
> no guest observable difference, IMHO.
> 
Hmm, we don't /yet/. I don't know if that is going to change at some point.

We do propagate MCEs for example (and AER?). And I suppose with nesting
IO page faults will be propagated. Albeit that is a different thing from the
problem above.

Albeit even if we do, the IO page faults induced by the unmap-and-read-dirty
ought not to be propagated to the guest.

> In any case, I don't think the implementation here for unmap_dirty is
> race free?  So, if we are doing all this complexity just to make the
> race smaller, I don't see the point.
> 
+1

> To make it race free I think you have to write protect the IOPTE then
> synchronize the IOTLB, read back the dirty, then unmap and synchronize
> the IOTLB again. 

That would indeed fully close the race with the IOTLB. But damn, it would
be expensive.

> That has such a high performance cost I'm not
> convinced it is worthwhile - and if it has to be two step like this
> then it would be cleaner to introduce a 'writeprotect and read dirty'
> op instead of overloading unmap. 

I can switch to that kind of primitive, should the group deem it
necessary. But it feels like we are leaning more towards a no.

> We don't need to microoptimize away
> the extra io page table walk when we are already doing two
> invalidations in the overhead..
> 
IIUC, fully closing the race as above might be incompatible with SMMUv3,
given that we need to clear DBM (or CD.HD) to move the IOPTEs
from writeable-clean to read-only, but then the dirty bit loses its
meaning. Oh wait, unless rather than comparing against writeable-clean
we clear DBM and then just check whether the PTE was RO or RW to determine
dirty (provided we discard any IO page faults happening between wrprotect
and read-dirty).
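
Spelling out the two-step variant being discussed, just as a sketch (the
write-protect op is hypothetical and does not exist in this series):

static int iopt_wrprotect_unmap_read_dirty(struct iommu_domain *domain,
					   unsigned long iova, size_t size,
					   struct iommu_dirty_bitmap *dirty)
{
	int ret;

	/* 1) Downgrade the IOPTEs so the device can no longer dirty them */
	ret = domain->ops->write_protect(domain, iova, size); /* hypothetical op */
	if (ret)
		return ret;
	iommu_flush_iotlb_all(domain);

	/* 2) Dirty state is now stable: read it back and record it */
	ret = domain->ops->read_and_clear_dirty(domain, iova, size, dirty);
	if (ret)
		return ret;

	/* 3) Tear down the mapping; iommu_unmap() invalidates once more */
	if (iommu_unmap(domain, iova, size) != size)
		return -EFAULT;

	return 0;
}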

>>>> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks
>>>
>>> there was discussion adding device capability uAPI somewhere.
>>>
>> ack let me know if there was snippets to the conversation as I seem to have missed that.
> 
> It was just discssion pending something we actually needed to report.
> 
> Would be a very simple ioctl taking in the device ID and fulling a
> struct of stuff.
>  
Yeap.

>>> probably this can be reported as a device cap as supporting of dirty bit is
>>> an immutable property of the iommu serving that device. 
> 
> It is an easier fit to read it out of the iommu_domain after device
> attach though - since we don't need to build new kernel infrastructure
> to query it from a device.
>  
That would be more like working on a hwpt_id instead of a device_id for that
previously mentioned ioctl. Something like IOMMUFD_CHECK_EXTENSION,
which receives a capability nr (or additionally a hwpt_id) and returns a struct of
something. That is more future-proof for new kinds of stuff, e.g. fetching the
whole domain hardware capabilities available in the platform (or device, when passed a
hwpt_id), or platform reserved ranges (like the HT hole that AMD systems have, or
the 4G hole in x86). Right now it is all buried in sysfs, or sometimes in sysfs but
specific to the device, even though some of that info is orthogonal to the device.
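
As a strawman for the shape of that ioctl (everything below is illustrative,
not an existing iommufd uAPI):

#include <linux/types.h>

struct iommu_check_extension {
	__u32 size;	/* sizeof(struct iommu_check_extension) */
	__u32 hwpt_id;	/* hw_pagetable (iommu_domain) to query, or 0 */
	__u64 cap;	/* e.g. a hypothetical IOMMUFD_CAP_DIRTY_TRACKING */
	__u64 flags;	/* out: capability-specific data */
};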

>>> Userspace can
>>> enable dirty tracking on a hwpt if all attached devices claim the support
>>> and kernel will does the same verification.
>>
>> Sorry to be dense but this is not up to 'devices' given they take no
>> part in the tracking?  I guess by 'devices' you mean the software
>> idea of it i.e. the iommu context created for attaching a said
>> physical device, not the physical device itself.
> 
> Indeed, an hwpt represents an iommu_domain and if the iommu_domain has
> dirty tracking ops set then that is an inherent propery of the domain
> and does not suddenly go away when a new device is attached.
>  
> Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
  2022-04-29 13:40     ` Baolu Lu
@ 2022-04-29 15:27       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 15:27 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, Jason Gunthorpe,
	kvm, Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse, Robin Murphy

On 4/29/22 14:40, Baolu Lu wrote:
> Hi Joao,
> 
> Thanks for doing this.
> 
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particulary to start and stop
>> tracking and finally to test and clear the dirty data.
>>
>> Drivers are expected to dynamically change its hw protection
>> domain bits to toggle the tracking and flush some form of
>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
>> iommu_dirty_bitmap_record() to set the memory storing dirties.
>> The underlying tracking/iteration of user bitmap memory is instead
>> done by iommufd which takes care of initializing the dirty bitmap
>> *prior* to passing to the IOMMU domain op.
>>
>> So far for currently/to-be-supported IOMMUs with dirty tracking
>> support this particularly because the tracking is part of
>> first stage tables and part of address translation. Below
>> it is mentioned how hardware deal with the hardware protection
>> domain control bits, to justify the added iommu core APIs.
>> vendor IOMMU implementation will also explain in more detail on
>> the dirty bit usage/clearing in the IOPTEs.
>>
>> * x86 AMD:
>>
>> The same thing for AMD particularly the Device Table
>> respectivally, followed by flushing the Device IOTLB. On AMD[1],
>> section "2.2.1 Updating Shared Tables", e.g.
>>
>>> Each table can also have its contents cached by the IOMMU or
>> peripheral IOTLBs. Therefore, after
>> updating a table entry that can be cached, system software must
>> send the IOMMU an appropriate
>> invalidate command. Information in the peripheral IOTLBs must
>> also be invalidated.
>>
>> There's no mention of particular bits that are cached or
>> not but fetching a dev entry is part of address translation
>> as also depicted, so invalidate the device table to make
>> sure the next translations fetch a DTE entry with the HD bits set.
>>
>> * x86 Intel (rev3.0+):
>>
>> Likewise[2] set the SSADE bit in the scalable-entry second stage table
>> to enable Access/Dirty bits in the second stage page table. See manual,
>> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
>> Considerations"
>>
>>> When modifying root-entries, scalable-mode root-entries,
>> context-entries, or scalable-mode context
>> entries:
>>> Software must serially invalidate the context-cache,
>> PASID-cache (if applicable), and the IOTLB.  The serialization is
>> required since hardware may utilize information from the
>> context-caches (e.g., Domain-ID) to tag new entries inserted to
>> the PASID-cache and IOTLB for processing in-flight requests.
>> Section 6.5 describe the invalidation operations.
>>
>> And also the whole chapter "" Table "Table 23.  Guidance to
>> Software for Invalidations" in "6.5.3.3 Guidance to Software for
>> Invalidations" explicitly mentions
>>
>>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
>> entry with PGTT value of Second-stage or Nested
>>
>> * ARM SMMUV3.2:
>>
>> SMMUv3.2 needs to toggle the dirty bit descriptor
>> over the CD (or S2CD) for toggling and flush/invalidate
>> the IOMMU dev IOTLB.
>>
>> Reference[0]: SMMU spec, "5.4.1 CD notes",
>>
>>> The following CD fields are permitted to be cached as part of a
>> translation or TLB entry, and alteration requires
>> invalidation of any TLB entry that might have cached these
>> fields, in addition to CD structure cache invalidation:
>>
>> ...
>> HA, HD
>> ...
>>
>> Although, The ARM SMMUv3 case is a tad different that its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking access bit read-only but with dirty-bit-modifier enabled
>> doesn't trigger an perm io page fault.
>>
>> In pratice this means that changing iommu context isn't enough
>> and in fact mostly useless IIUC (and can be always enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Instead of specification words, I'd like to read more about why the
> callbacks are needed and how should they be implemented and consumed.
> 
OK. I can extend the commit message towards that.

This was roughly my paranoid mind trying to capture all three so dumping
some of the pointers I read (and in the other commits too) is for future
consultation as well.

>>
>> [0] https://developer.arm.com/documentation/ihi0070/latest
>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>> [2] https://cdrdv2.intel.com/v1/dl/getContent/671081
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
>>   include/linux/io-pgtable.h |  6 +++++
>>   include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 86 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 0c42ece25854..d18b9ddbcce4 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -15,6 +15,7 @@
>>   #include <linux/init.h>
>>   #include <linux/export.h>
>>   #include <linux/slab.h>
>> +#include <linux/highmem.h>
>>   #include <linux/errno.h>
>>   #include <linux/iommu.h>
>>   #include <linux/idr.h>
>> @@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>>   	return user;
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap(dirty->pages[idx]);
>> +		start_offset = offset = 0;
>> +		nbits -= size;
>> +		idx++;
>> +	}
>> +
>> +	if (dirty->gather)
>> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
>> +
>> +	return nbits;
>> +}
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 86af6f0a00a2..82b39925c21f 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -165,6 +165,12 @@ struct io_pgtable_ops {
>>   			      struct iommu_iotlb_gather *gather);
>>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>>   				    unsigned long iova);
>> +	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
>> +				  unsigned long iova, size_t size,
>> +				  bool enabled);
>> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 6ef2df258673..ca076365d77b 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
>>   	bool			queued;
>>   };
>>   
>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	struct iommu_iotlb_gather *gather;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
> 
> I haven't found where "npages" is used in this patch. It's better to add
> it when it's really used? Sorry if I missed anything.
> 
Yeap, you're right. This was an oversight when I was moving code around.

But I might introduce all the code that uses/manipulates this structure.
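
Roughly, the intended use on the iommufd side is: pin the pages backing the
user bitmap, initialize the tracker, let the driver record dirty IOVAs, then
batch the IOTLB flush. A sketch (function name illustrative, pinning elided):

static int iopt_read_and_clear_dirty(struct iommu_domain *domain,
				     unsigned long iova, size_t length,
				     unsigned long pgshift,
				     unsigned long first_page_offset,
				     struct page **pinned_pages,
				     unsigned long npinned)
{
	struct iommu_iotlb_gather gather;
	struct iommu_dirty_bitmap dirty;
	int ret;

	iommu_dirty_bitmap_init(&dirty, iova, pgshift, &gather);
	dirty.start_offset = first_page_offset;	/* offset into the first pinned page */
	dirty.pages = pinned_pages;		/* pinned user pages backing the bitmap */
	dirty.npages = npinned;

	ret = domain->ops->read_and_clear_dirty(domain, iova, length, &dirty);

	/* flush whatever the driver gathered, in one go */
	iommu_iotlb_sync(domain, &gather);

	return ret;
}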

>> +	struct page **pages;
>> +};
>> +
>>   /**
>>    * struct iommu_ops - iommu ops and capabilities
>>    * @capable: check capability
>> @@ -275,6 +294,13 @@ struct iommu_ops {
>>    * @enable_nesting: Enable nesting
>>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>    * @free: Release the domain after use.
>> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
>> + * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
>> + *                            an iommu domain
>> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
>> + *                        into a bitmap, with a bit represented as a page.
>> + *                        Reads the dirty PTE bits and clears it from IO
>> + *                        pagetables.
>>    */
>>   struct iommu_domain_ops {
>>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>> @@ -305,6 +331,15 @@ struct iommu_domain_ops {
>>   				  unsigned long quirks);
>>   
>>   	void (*free)(struct iommu_domain *domain);
>> +
>> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
>> +	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
>> +					unsigned long iova, size_t size,
>> +					struct iommu_iotlb_gather *iotlb_gather,
>> +					bool enabled);
> 
> It seems that we are adding two callbacks for the same purpose. How
> should the IOMMU drivers select to support? Any functional different
> between these two? How should the caller select to use?
> 

x86 wouldn't need to care about the second one as it's all on a per-domain
basis. See the last two patches for how I sketched Intel IOMMU support.

Albeit the second callback is going to be removed, based on this morning's discussion.

But originally it was to cover how SMMUv3.2 dirty tracking only really
gets enabled on a PTE basis rather than on the iommu domain. But this
is deferred now to be up to the iommu driver (when it needs to) ... to walk
its pagetables and set DBM (or maybe from the beginning, currently in debate).

>> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> @@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
>>   extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>>   			      unsigned long iova, int flags);
>>   
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length);
>> +
>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +					   unsigned long base,
>> +					   unsigned long pgshift,
>> +					   struct iommu_iotlb_gather *gather)
>> +{
>> +	memset(dirty, 0, sizeof(*dirty));
>> +	dirty->iova = base;
>> +	dirty->pgshift = pgshift;
>> +	dirty->gather = gather;
>> +
>> +	if (gather)
>> +		iommu_iotlb_gather_init(dirty->gather);
>> +}
>> +
>>   static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
>>   {
>>   	if (domain->ops->flush_iotlb_all)
> 
> Best regards,
> baolu

Thanks!

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking
@ 2022-04-29 15:27       ` Joao Martins
  0 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 15:27 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm, iommu

On 4/29/22 14:40, Baolu Lu wrote:
> Hi Joao,
> 
> Thanks for doing this.
> 
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add to iommu domain operations a set of callbacks to
>> perform dirty tracking, particulary to start and stop
>> tracking and finally to test and clear the dirty data.
>>
>> Drivers are expected to dynamically change its hw protection
>> domain bits to toggle the tracking and flush some form of
>> control state structure that stands in the IOVA translation
>> path.
>>
>> For reading and clearing dirty data, in all IOMMUs a transition
>> from any of the PTE access bits (Access, Dirty) implies flushing
>> the IOTLB to invalidate any stale data in the IOTLB as to whether
>> or not the IOMMU should update the said PTEs. The iommu core APIs
>> introduce a new structure for storing the dirties, albeit vendor
>> IOMMUs implementing .read_and_clear_dirty() just use
>> iommu_dirty_bitmap_record() to set the memory storing dirties.
>> The underlying tracking/iteration of user bitmap memory is instead
>> done by iommufd which takes care of initializing the dirty bitmap
>> *prior* to passing to the IOMMU domain op.
>>
>> So far for currently/to-be-supported IOMMUs with dirty tracking
>> support this particularly because the tracking is part of
>> first stage tables and part of address translation. Below
>> it is mentioned how hardware deal with the hardware protection
>> domain control bits, to justify the added iommu core APIs.
>> vendor IOMMU implementation will also explain in more detail on
>> the dirty bit usage/clearing in the IOPTEs.
>>
>> * x86 AMD:
>>
>> The same thing for AMD particularly the Device Table
>> respectivally, followed by flushing the Device IOTLB. On AMD[1],
>> section "2.2.1 Updating Shared Tables", e.g.
>>
>>> Each table can also have its contents cached by the IOMMU or
>> peripheral IOTLBs. Therefore, after
>> updating a table entry that can be cached, system software must
>> send the IOMMU an appropriate
>> invalidate command. Information in the peripheral IOTLBs must
>> also be invalidated.
>>
>> There's no mention of particular bits that are cached or
>> not but fetching a dev entry is part of address translation
>> as also depicted, so invalidate the device table to make
>> sure the next translations fetch a DTE entry with the HD bits set.
>>
>> * x86 Intel (rev3.0+):
>>
>> Likewise[2] set the SSADE bit in the scalable-entry second stage table
>> to enable Access/Dirty bits in the second stage page table. See manual,
>> particularly on "6.2.3.1 Scalable-Mode PASID-Table Entry Programming
>> Considerations"
>>
>>> When modifying root-entries, scalable-mode root-entries,
>> context-entries, or scalable-mode context
>> entries:
>>> Software must serially invalidate the context-cache,
>> PASID-cache (if applicable), and the IOTLB.  The serialization is
>> required since hardware may utilize information from the
>> context-caches (e.g., Domain-ID) to tag new entries inserted to
>> the PASID-cache and IOTLB for processing in-flight requests.
>> Section 6.5 describe the invalidation operations.
>>
>> And also the whole chapter "" Table "Table 23.  Guidance to
>> Software for Invalidations" in "6.5.3.3 Guidance to Software for
>> Invalidations" explicitly mentions
>>
>>> SSADE transition from 0 to 1 in a scalable-mode PASID-table
>> entry with PGTT value of Second-stage or Nested
>>
>> * ARM SMMUV3.2:
>>
>> SMMUv3.2 needs to toggle the dirty bit descriptor
>> over the CD (or S2CD) for toggling and flush/invalidate
>> the IOMMU dev IOTLB.
>>
>> Reference[0]: SMMU spec, "5.4.1 CD notes",
>>
>>> The following CD fields are permitted to be cached as part of a
>> translation or TLB entry, and alteration requires
>> invalidation of any TLB entry that might have cached these
>> fields, in addition to CD structure cache invalidation:
>>
>> ...
>> HA, HD
>> ...
>>
>> Although, The ARM SMMUv3 case is a tad different that its x86
>> counterparts. Rather than changing *only* the IOMMU domain device entry to
>> enable dirty tracking (and having a dedicated bit for dirtyness in IOPTE)
>> ARM instead uses a dirty-bit modifier which is separately enabled, and
>> changes the *existing* meaning of access bits (for ro/rw), to the point
>> that marking access bit read-only but with dirty-bit-modifier enabled
>> doesn't trigger an perm io page fault.
>>
>> In pratice this means that changing iommu context isn't enough
>> and in fact mostly useless IIUC (and can be always enabled). Dirtying
>> is only really enabled when the DBM pte bit is enabled (with the
>> CD.HD bit as a prereq).
>>
>> To capture this h/w construct an iommu core API is added which enables
>> dirty tracking on an IOVA range rather than a device/context entry.
>> iommufd picks one or the other, and IOMMUFD core will favour
>> device-context op followed by IOVA-range alternative.
> 
> Instead of specification words, I'd like to read more about why the
> callbacks are needed and how should they be implemented and consumed.
> 
OK. I can extend the commit message towards that.

This was roughly my paranoid mind trying to capture all three so dumping
some of the pointers I read (and in the other commits too) is for future
consultation as well.

>>
>> [0] https://developer.arm.com/documentation/ihi0070/latest
>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>> [2] https://cdrdv2.intel.com/v1/dl/getContent/671081
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommu.c      | 28 ++++++++++++++++++++
>>   include/linux/io-pgtable.h |  6 +++++
>>   include/linux/iommu.h      | 52 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 86 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 0c42ece25854..d18b9ddbcce4 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -15,6 +15,7 @@
>>   #include <linux/init.h>
>>   #include <linux/export.h>
>>   #include <linux/slab.h>
>> +#include <linux/highmem.h>
>>   #include <linux/errno.h>
>>   #include <linux/iommu.h>
>>   #include <linux/idr.h>
>> @@ -3167,3 +3168,30 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
>>   	return user;
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
>> +
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits, offset, start_offset, idx, size, *kaddr;
>> +
>> +	nbits = max(1UL, length >> dirty->pgshift);
>> +	offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	idx = offset / (PAGE_SIZE * BITS_PER_BYTE);
>> +	offset = offset % (PAGE_SIZE * BITS_PER_BYTE);
>> +	start_offset = dirty->start_offset;
>> +
>> +	while (nbits > 0) {
>> +		kaddr = kmap(dirty->pages[idx]) + start_offset;
>> +		size = min(PAGE_SIZE * BITS_PER_BYTE - offset, nbits);
>> +		bitmap_set(kaddr, offset, size);
>> +		kunmap(dirty->pages[idx]);
>> +		start_offset = offset = 0;
>> +		nbits -= size;
>> +		idx++;
>> +	}
>> +
>> +	if (dirty->gather)
>> +		iommu_iotlb_gather_add_range(dirty->gather, iova, length);
>> +
>> +	return nbits;
>> +}
>> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
>> index 86af6f0a00a2..82b39925c21f 100644
>> --- a/include/linux/io-pgtable.h
>> +++ b/include/linux/io-pgtable.h
>> @@ -165,6 +165,12 @@ struct io_pgtable_ops {
>>   			      struct iommu_iotlb_gather *gather);
>>   	phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
>>   				    unsigned long iova);
>> +	int (*set_dirty_tracking)(struct io_pgtable_ops *ops,
>> +				  unsigned long iova, size_t size,
>> +				  bool enabled);
>> +	int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 6ef2df258673..ca076365d77b 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -189,6 +189,25 @@ struct iommu_iotlb_gather {
>>   	bool			queued;
>>   };
>>   
>> +/**
>> + * struct iommu_dirty_bitmap - Dirty IOVA bitmap state
>> + *
>> + * @iova: IOVA representing the start of the bitmap, the first bit of the bitmap
>> + * @pgshift: Page granularity of the bitmap
>> + * @gather: Range information for a pending IOTLB flush
>> + * @start_offset: Offset of the first user page
>> + * @pages: User pages representing the bitmap region
>> + * @npages: Number of user pages pinned
>> + */
>> +struct iommu_dirty_bitmap {
>> +	unsigned long iova;
>> +	unsigned long pgshift;
>> +	struct iommu_iotlb_gather *gather;
>> +	unsigned long start_offset;
>> +	unsigned long npages;
> 
> I haven't found where "npages" is used in this patch. It's better to add
> it when it's really used? Sorry if I missed anything.
> 
Yeap, you're right. This was an oversight when I was moving code around.

But I might instead introduce it together with all the code that
uses/manipulates this structure.
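
For reference, the expected consumption from a driver's
->read_and_clear_dirty() is roughly the below; the walker is made up purely
for illustration, only the two bitmap helpers are from this patch:

static int example_read_and_clear_dirty(struct iommu_domain *domain,
					unsigned long iova, size_t size,
					struct iommu_dirty_bitmap *dirty)
{
	unsigned long end = iova + size, pgsize = 1UL << dirty->pgshift;

	for (; iova < end; iova += pgsize) {
		/* vendor-specific IOPTE lookup and test-and-clear, elided */
		if (example_iopte_test_and_clear_dirty(domain, iova))
			/* sets bits in dirty->pages and batches the IOTLB flush */
			iommu_dirty_bitmap_record(dirty, iova, pgsize);
	}

	return 0;
}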

>> +	struct page **pages;
>> +};
>> +
>>   /**
>>    * struct iommu_ops - iommu ops and capabilities
>>    * @capable: check capability
>> @@ -275,6 +294,13 @@ struct iommu_ops {
>>    * @enable_nesting: Enable nesting
>>    * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>>    * @free: Release the domain after use.
>> + * @set_dirty_tracking: Enable or Disable dirty tracking on the iommu domain
>> + * @set_dirty_tracking_range: Enable or Disable dirty tracking on a range of
>> + *                            an iommu domain
>> + * @read_and_clear_dirty: Walk IOMMU page tables for dirtied PTEs marshalled
>> + *                        into a bitmap, with a bit represented as a page.
>> + *                        Reads the dirty PTE bits and clears it from IO
>> + *                        pagetables.
>>    */
>>   struct iommu_domain_ops {
>>   	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
>> @@ -305,6 +331,15 @@ struct iommu_domain_ops {
>>   				  unsigned long quirks);
>>   
>>   	void (*free)(struct iommu_domain *domain);
>> +
>> +	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enabled);
>> +	int (*set_dirty_tracking_range)(struct iommu_domain *domain,
>> +					unsigned long iova, size_t size,
>> +					struct iommu_iotlb_gather *iotlb_gather,
>> +					bool enabled);
> 
> It seems that we are adding two callbacks for the same purpose. How
> should the IOMMU drivers select to support? Any functional different
> between these two? How should the caller select to use?
> 

x86 wouldn't need to care about the second one, as it's all on a per-domain
basis. See the last two patches for how I sketched Intel IOMMU support.

Albeit the second callback is going to be removed, based on this morning's
discussion.

But originally it was there to cover how SMMUv3.2 dirty tracking only really
gets enabled on a per-PTE basis rather than on the iommu domain. That is now
deferred to the iommu driver, which (when it needs to) walks its pagetables
and sets DBM (or maybe sets it from the beginning -- currently in debate).
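
For the record, the split between the two callbacks was meant to be consumed
roughly like the below (simplified sketch with a made-up helper name, not
the actual iopt code; it just favours the per-domain op and falls back to
the range op):

static int example_set_dirty_tracking(struct iommu_domain *domain,
				      unsigned long iova, size_t size,
				      struct iommu_iotlb_gather *gather,
				      bool enable)
{
	const struct iommu_domain_ops *ops = domain->ops;

	/* x86 (AMD/Intel): a single domain/context-wide toggle */
	if (ops->set_dirty_tracking)
		return ops->set_dirty_tracking(domain, enable);

	/* SMMUv3.2: DBM is a per-PTE modifier, so walk the IOVA range */
	if (ops->set_dirty_tracking_range)
		return ops->set_dirty_tracking_range(domain, iova, size,
						     gather, enable);

	return -EOPNOTSUPP;
}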

>> +	int (*read_and_clear_dirty)(struct iommu_domain *domain,
>> +				    unsigned long iova, size_t size,
>> +				    struct iommu_dirty_bitmap *dirty);
>>   };
>>   
>>   /**
>> @@ -494,6 +529,23 @@ void iommu_set_dma_strict(void);
>>   extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>>   			      unsigned long iova, int flags);
>>   
>> +unsigned int iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
>> +				       unsigned long iova, unsigned long length);
>> +
>> +static inline void iommu_dirty_bitmap_init(struct iommu_dirty_bitmap *dirty,
>> +					   unsigned long base,
>> +					   unsigned long pgshift,
>> +					   struct iommu_iotlb_gather *gather)
>> +{
>> +	memset(dirty, 0, sizeof(*dirty));
>> +	dirty->iova = base;
>> +	dirty->pgshift = pgshift;
>> +	dirty->gather = gather;
>> +
>> +	if (gather)
>> +		iommu_iotlb_gather_init(dirty->gather);
>> +}
>> +
>>   static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
>>   {
>>   	if (domain->ops->flush_iotlb_all)
> 
> Best regards,
> baolu

Thanks!

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 14:45               ` Joao Martins
@ 2022-04-29 16:11                 ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 16:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
> On 4/29/22 13:23, Jason Gunthorpe wrote:
> > On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
> > 
> >>> TBH I'd be inclined to just enable DBM unconditionally in 
> >>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
> >>> dynamically (especially on a live domain) seems more trouble that it's 
> >>> worth.
> >>
> >> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
> >> to what we can do on the CPU/KVM side). e.g. the first time you do
> >> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
> >> of guest time, as opposed to those only after you enabled dirty-tracking.
> > 
> > It just means that on SMMU the start tracking op clears all the dirty
> > bits.
> > 
> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
> we need to re-walk the pagetables to clear one bit.

Yes, I don't think an iopte walk is avoidable?

> It's walking over ranges regardless.

Also, keep in mind that start should always come up in a zero-dirties state
on all platforms. So all implementations need to do something to wipe the
dirty state, either explicitly during start or by restoring everything to
clean during stop.

A common use model might be to just destroy the iommu_domain without doing
stop, so preferring to clear the io page table at stop might be a better
overall design.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 16:11                 ` Jason Gunthorpe via iommu
@ 2022-04-29 16:40                   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-04-29 16:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On 4/29/22 17:11, Jason Gunthorpe wrote:
> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>
>>>>> TBH I'd be inclined to just enable DBM unconditionally in 
>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it 
>>>>> dynamically (especially on a live domain) seems more trouble that it's 
>>>>> worth.
>>>>
>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>
>>> It just means that on SMMU the start tracking op clears all the dirty
>>> bits.
>>>
>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>> we need to re-walk the pagetables to clear one bit.
> 
> Yes, I don't think a iopte walk is avoidable?
> 
Correct -- exactly why I am still leaning more towards enabling the DBM bit
only at start, versus enabling DBM at domain creation while clearing dirty
at start.

>> It's walking over ranges regardless.
> 
> Also, keep in mind start should always come up in a 0 dirties state on
> all platforms. So all implementations need to do something to wipe the
> dirty state, either explicitly during start or restoring all clean
> during stop.
> 
> A common use model might be to just destroy the iommu_domain without
> doing stop so prefering the clearing io page table at stop might be a
> better overall design.

If we want to ensure that the IOPTE dirty state is immutable before start
and after stop, maybe this behaviour could be a new flag in set-dirty-tracking
(or be implicit, as you suggest). But ... hmm, at the same time, I wonder if
it's better to let userspace fetch the dirties that were there /right after
stopping/ (via GET_DIRTY_IOVA) rather than just discarding them implicitly at
SET_DIRTY_TRACKING(0|1).
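
Going back to the earlier point, the two alternatives I keep weighing for
SMMUv3, in rough pseudo-C on top of the io_pgtable_ops added earlier in the
series (neither is real code; the function names are made up and the caller
of option B just throws the bitmap contents away):

/* A: DBM stays clear; 'start' walks the range and sets DBM in the IOPTEs */
static int smmu_start_tracking_set_dbm(struct io_pgtable_ops *ops,
				       unsigned long iova, size_t size)
{
	return ops->set_dirty_tracking(ops, iova, size, true);
}

/*
 * B: DBM is set at domain finalise time; 'start' only wipes any stale
 * dirty state via a read-and-clear whose result is discarded.
 */
static int smmu_start_tracking_clear(struct io_pgtable_ops *ops,
				     unsigned long iova, size_t size,
				     struct iommu_dirty_bitmap *discard)
{
	return ops->read_and_clear_dirty(ops, iova, size, discard);
}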

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 16:40                   ` Joao Martins
@ 2022-04-29 16:46                     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-04-29 16:46 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, Apr 29, 2022 at 05:40:56PM +0100, Joao Martins wrote:

> > A common use model might be to just destroy the iommu_domain without
> > doing stop so prefering the clearing io page table at stop might be a
> > better overall design.
> 
> If we want to ensure that the IOPTE dirty state is immutable before start
> and after stop maybe this behaviour could be a new flag in the set-dirty-tracking
> (or be implicit as you suggest).  but ... hmm, at the same time, I wonder if
> it's better to let userspace fetch the dirties that were there /right after stopping/
> (via GET_DIRTY_IOVA) rather than just discarding them implicitly at SET_DIRTY_TRACKING(0|1).

It is not immutable, it is just the idea that there are no left-over
false-dirties after start returns.

Combined with the realization that in many cases we don't need to do a
stop, but will just destroy the whole iommu_domain.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 16:40                   ` Joao Martins
@ 2022-04-29 19:20                     ` Robin Murphy
  -1 siblings, 0 replies; 209+ messages in thread
From: Robin Murphy @ 2022-04-29 19:20 UTC (permalink / raw)
  To: Joao Martins, Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Tian, Kevin, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, iommu, Alex Williamson,
	David Woodhouse

On 2022-04-29 17:40, Joao Martins wrote:
> On 4/29/22 17:11, Jason Gunthorpe wrote:
>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>
>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>> worth.
>>>>>
>>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>
>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>> bits.
>>>>
>>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>>> we need to re-walk the pagetables to clear one bit.
>>
>> Yes, I don't think a iopte walk is avoidable?
>>
> Correct -- exactly why I am still more learning towards enable DBM bit only at start
> versus enabling DBM at domain-creation while clearing dirty at start.

I'd say it's largely down to whether you want the bother of 
communicating a dynamic behaviour change into io-pgtable. The big 
advantage of having it just use DBM all the time is that you don't have 
to do that, and the "start tracking" operation is then nothing more than 
a normal "read and clear" operation but ignoring the read result.

At this point I'd much rather opt for simplicity, and leave the fancier 
stuff to revisit later if and when somebody does demonstrate a 
significant overhead from using DBM when not strictly needed.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-29 23:51     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-29 23:51 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	struct iommu_domain *dom;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (!domain) {
> +		down_write(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
> +			if (ret < 0)
> +				break;

Do you need to roll back to the original state before returning failure?
Some domains will already have had dirty bit tracking enabled at that point.
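
e.g. something along these lines (completely untested, just to illustrate
the unwind):

	unsigned long undo;	/* declared at the top of the function */

	xa_for_each(&iopt->domains, index, dom) {
		ret = iommu_set_dirty_tracking(dom, iopt, enable);
		if (ret < 0) {
			/* restore the domains that were already switched */
			xa_for_each(&iopt->domains, undo, dom) {
				if (undo >= index)
					break;
				iommu_set_dirty_tracking(dom, iopt, !enable);
			}
			break;
		}
	}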

> +		}
> +		up_write(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
> +	}
> +
> +	up_write(&iopt->iova_rwsem);
> +	return ret;
> +}

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-30  4:11     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-30  4:11 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> Add an IO pagetable API iopt_read_and_clear_dirty_data() that
> performs the reading of dirty IOPTEs for a given IOVA range and
> then copying back to userspace from each area-internal bitmap.
> 
> Underneath it uses the IOMMU equivalent API which will read the
> dirty bits, as well as atomically clearing the IOPTE dirty bit
> and flushing the IOTLB at the end. The dirty bitmaps pass an
> iotlb_gather to allow batching the dirty-bit updates.
> 
> Most of the complexity, though, is in the handling of the user
> bitmaps to avoid copies back and forth. The bitmap user addresses
> need to be iterated through, pinned and then passing the pages
> into iommu core. The amount of bitmap data passed at a time for a
> read_and_clear_dirty() is 1 page worth of pinned base page
> pointers. That equates to 16M bits, or rather 64G of data that
> can be returned as 'dirtied'. The IOTLB is flushed at the end of
> the whole scanned IOVA range, to defer as much as possible the
> potential DMA performance penalty.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/iommufd/io_pagetable.c    | 169 ++++++++++++++++++++++++
>   drivers/iommu/iommufd/iommufd_private.h |  44 ++++++
>   2 files changed, 213 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index f4609ef369e0..835b5040fce9 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -14,6 +14,7 @@
>   #include <linux/err.h>
>   #include <linux/slab.h>
>   #include <linux/errno.h>
> +#include <uapi/linux/iommufd.h>
>   
>   #include "io_pagetable.h"
>   
> @@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>   	return ret;
>   }
>   
> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
> +			    struct iommufd_dirty_data *bitmap)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +	unsigned long bitmap_len;
> +
> +	bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
> +
> +	import_single_range(WRITE, bitmap->data, bitmap_len,
> +			    &iter->bitmap_iov, &iter->bitmap_iter);
> +	iter->iova = bitmap->iova;
> +
> +	/* Can record up to 64G at a time */
> +	dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
> +
> +	return !dirty->pages ? -ENOMEM : 0;
> +}
> +
> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->pages) {
> +		free_page((unsigned long) dirty->pages);
> +		dirty->pages = NULL;
> +	}
> +}
> +
> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
> +{
> +	return iov_iter_count(&iter->bitmap_iter) > 0;
> +}
> +
> +static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
> +{
> +	unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
> +
> +	left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
> +
> +	return left;
> +}
> +
> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
> +{
> +	unsigned long left = iommufd_dirty_iter_bytes(iter);
> +
> +	return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
> +}
> +
> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
> +{
> +	unsigned long skip = iter->bitmap_iter.iov_offset;
> +
> +	return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
> +}
> +
> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
> +{
> +	iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
> +}
> +
> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->npages)
> +		unpin_user_pages(dirty->pages, dirty->npages);
> +}
> +
> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
> +{
> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
> +	unsigned long npages;
> +	unsigned long ret;
> +	void *addr;
> +
> +	addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
> +	npages = iov_iter_npages(&iter->bitmap_iter,
> +				 PAGE_SIZE / sizeof(struct page *));
> +
> +	ret = pin_user_pages_fast((unsigned long) addr, npages,
> +				  FOLL_WRITE, dirty->pages);
> +	if (ret <= 0)
> +		return -EINVAL;
> +
> +	dirty->npages = ret;
> +	dirty->iova = iommufd_dirty_iova(iter);
> +	dirty->start_offset = offset_in_page(addr);
> +	return 0;
> +}
> +
> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +				      struct iommufd_dirty_data *bitmap)

This looks more like a helper in the iommu core. How about

	iommufd_read_clear_domain_dirty()?

> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iommufd_dirty_iter iter;
> +	int ret = 0;
> +
> +	if (!ops || !ops->read_and_clear_dirty)
> +		return -EOPNOTSUPP;
> +
> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
> +				__ffs(bitmap->page_size), &gather);
> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
> +	if (ret)
> +		return -ENOMEM;
> +
> +	for (; iommufd_dirty_iter_done(&iter);
> +	     iommufd_dirty_iter_advance(&iter)) {
> +		ret = iommufd_dirty_iter_get(&iter);
> +		if (ret)
> +			break;
> +
> +		ret = ops->read_and_clear_dirty(domain,
> +			iommufd_dirty_iova(&iter),
> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
> +
> +		iommufd_dirty_iter_put(&iter);
> +
> +		if (ret)
> +			break;
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +	iommufd_dirty_iter_free(&iter);
> +
> +	return ret;
> +}
> +
> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
> +				   struct iommu_domain *domain,
> +				   struct iommufd_dirty_data *bitmap)
> +{
> +	unsigned long iova, length, iova_end;
> +	struct iommu_domain *dom;
> +	struct iopt_area *area;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	iova = bitmap->iova;
> +	length = bitmap->length - 1;
> +	if (check_add_overflow(iova, length, &iova_end))
> +		return -EOVERFLOW;
> +
> +	down_read(&iopt->iova_rwsem);
> +	area = iopt_find_exact_area(iopt, iova, iova_end);
> +	if (!area) {
> +		up_read(&iopt->iova_rwsem);
> +		return -ENOENT;
> +	}
> +
> +	if (!domain) {
> +		down_read(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_read_and_clear_dirty(dom, bitmap);

Perhaps use @domain directly, hence no need for @dom?

	xa_for_each(&iopt->domains, index, domain) {
		ret = iommu_read_and_clear_dirty(domain, bitmap);

> +			if (ret)
> +				break;
> +		}
> +		up_read(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_read_and_clear_dirty(domain, bitmap);
> +	}
> +
> +	up_read(&iopt->iova_rwsem);
> +	return ret;
> +}
> +
>   struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
>   				  unsigned long *start_byte,
>   				  unsigned long length)
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index d00ef3b785c5..4c12b4a8f1a6 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -8,6 +8,8 @@
>   #include <linux/xarray.h>
>   #include <linux/refcount.h>
>   #include <linux/uaccess.h>
> +#include <linux/iommu.h>
> +#include <linux/uio.h>
>   
>   struct iommu_domain;
>   struct iommu_group;
> @@ -49,8 +51,50 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>   		    unsigned long length);
>   int iopt_unmap_all(struct io_pagetable *iopt);
>   
> +struct iommufd_dirty_data {
> +	unsigned long iova;
> +	unsigned long length;
> +	unsigned long page_size;
> +	unsigned long *data;
> +};

How about adding some comments around this struct? Any alignment
requirement for iova/length? What does @data stand for?

> +
>   int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>   			    struct iommu_domain *domain, bool enable);
> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
> +				   struct iommu_domain *domain,
> +				   struct iommufd_dirty_data *bitmap);
> +
> +struct iommufd_dirty_iter {
> +	struct iommu_dirty_bitmap dirty;
> +	struct iovec bitmap_iov;
> +	struct iov_iter bitmap_iter;
> +	unsigned long iova;
> +};

Same here.

> +
> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter);
> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter);
> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
> +			    struct iommufd_dirty_data *bitmap);
> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter);
> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter);
> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter);
> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter);
> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter);
> +static inline unsigned long dirty_bitmap_bytes(unsigned long nr_pages)
> +{
> +	return (ALIGN(nr_pages, BITS_PER_TYPE(u64)) / BITS_PER_BYTE);
> +}
> +
> +/*
> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> + * further casts to signed integer for unaligned multi-bit operation,
> + * __bitmap_set().
> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> + * system.
> + */
> +#define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
> +#define DIRTY_BITMAP_SIZE_MAX   dirty_bitmap_bytes(DIRTY_BITMAP_PAGES_MAX)
>   
>   int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>   		      unsigned long npages, struct page **out_pages, bool write);

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-30  5:12     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-30  5:12 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> Today, the dirty state is lost and the page wouldn't be migrated to
> destination potentially leading the guest into error.
> 
> Add an unmap API that reads the dirty bit and sets it in the
> user passed bitmap. This unmap iommu API tackles a potentially
> racy update to the dirty bit *when* doing DMA on a iova that is
> being unmapped at the same time.
> 
> The new unmap_read_dirty/unmap_pages_read_dirty does not replace
> the unmap pages, but rather only when explicit called with an dirty
> bitmap data passed in.
> 
> It could be said that the guest is buggy and rather than a special unmap
> path tackling the theoretical race ... it would suffice fetching the
> dirty bits (with GET_DIRTY_IOVA), and then unmap the IOVA.

I am not sure whether this API could solve the race.

size_t iommu_unmap(struct iommu_domain *domain,
                    unsigned long iova, size_t size)
{
         struct iommu_iotlb_gather iotlb_gather;
         size_t ret;

         iommu_iotlb_gather_init(&iotlb_gather);
         ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
         iommu_iotlb_sync(domain, &iotlb_gather);

         return ret;
}

The PTEs are cleared before iotlb invalidation. What if a DMA write
happens after PTE clearing and before the iotlb invalidation with the
PTE happening to be cached?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-28 21:09   ` Joao Martins
@ 2022-04-30  6:12     ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-04-30  6:12 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 2022/4/29 05:09, Joao Martins wrote:
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
>   	}
>   }
>   
> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					  bool enable)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	int ret = -EINVAL;

	if (domain_use_first_level(dmar_domain))
		return -EOPNOTSUPP;

> +
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +	if (list_empty(&dmar_domain->devices)) {
> +		spin_unlock_irqrestore(&device_domain_lock, flags);
> +		return ret;
> +	}

I agreed with Kevin's suggestion in his reply.

> +
> +	list_for_each_entry(info, &dmar_domain->devices, link) {
> +		if (!info->dev || (info->domain != dmar_domain))
> +			continue;

This check is redundant.

> +
> +		/* Dirty tracking is second-stage level SM only */
> +		if ((info->domain && domain_use_first_level(info->domain)) ||
> +		    !ecap_slads(info->iommu->ecap) ||
> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
> +			ret = -EOPNOTSUPP;
> +			continue;

Perhaps break and return -EOPNOTSUPP directly here? We are not able to
support a mixed mode, right?

> +		}
> +
> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
> +						     info->dev, PASID_RID2PASID,
> +						     enable);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	/*
> +	 * We need to flush context TLB and IOTLB with any cached translations
> +	 * to force the incoming DMA requests for have its IOTLB entries tagged
> +	 * with A/D bits
> +	 */
> +	intel_flush_iotlb_all(domain);
> +	return ret;
> +}
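
i.e. folding the comments above together, something along these lines
(untested sketch; note it simply succeeds for an empty device list, which
may or may not match what Kevin suggested):

static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
					  bool enable)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	struct device_domain_info *info;
	unsigned long flags;
	int ret = 0;

	/* Dirty tracking is second-stage level SM only */
	if (!intel_iommu_sm || domain_use_first_level(dmar_domain))
		return -EOPNOTSUPP;

	spin_lock_irqsave(&device_domain_lock, flags);
	list_for_each_entry(info, &dmar_domain->devices, link) {
		if (!sm_supported(info->iommu) ||
		    !ecap_slads(info->iommu->ecap)) {
			ret = -EOPNOTSUPP;
			break;
		}

		ret = intel_pasid_setup_dirty_tracking(info->iommu, dmar_domain,
						       info->dev, PASID_RID2PASID,
						       enable);
		if (ret)
			break;
	}
	spin_unlock_irqrestore(&device_domain_lock, flags);

	/*
	 * Flush any cached translations so that incoming DMA requests get
	 * their IOTLB entries tagged with A/D bits.
	 */
	intel_flush_iotlb_all(domain);
	return ret;
}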

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 19:20                     ` Robin Murphy
@ 2022-05-02 11:52                       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 11:52 UTC (permalink / raw)
  To: Robin Murphy, Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

On 4/29/22 20:20, Robin Murphy wrote:
> On 2022-04-29 17:40, Joao Martins wrote:
>> On 4/29/22 17:11, Jason Gunthorpe wrote:
>>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>>
>>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>>> worth.
>>>>>>
>>>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>>
>>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>>> bits.
>>>>>
>>>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>>>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>>>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>>>> we need to re-walk the pagetables to clear one bit.
>>>
>>> Yes, I don't think a iopte walk is avoidable?
>>>
>> Correct -- exactly why I am still more learning towards enable DBM bit only at start
>> versus enabling DBM at domain-creation while clearing dirty at start.
> 
> I'd say it's largely down to whether you want the bother of 
> communicating a dynamic behaviour change into io-pgtable. The big 
> advantage of having it just use DBM all the time is that you don't have 
> to do that, and the "start tracking" operation is then nothing more than 
> a normal "read and clear" operation but ignoring the read result.
> 
> At this point I'd much rather opt for simplicity, and leave the fancier 
> stuff to revisit later if and when somebody does demonstrate a 
> significant overhead from using DBM when not strictly needed.

OK -- I did get the code simplicity part[*]. Albeit my concern is that last
point: if there's anything fundamentally affecting DMA performance then
any SMMU user would see it even if they don't care at all about DBM (i.e. regular
baremetal/non-vm iommu usage).

[*] It was how I had this initially PoC-ed. And really all IOMMU drivers dirty tracking
could be simplified to be always-enabled, and start/stop is essentially flushing/clearing
dirties. Albeit I like that this is only really used (by hardware) when needed and any
other DMA user isn't affected.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-05-02 11:52                       ` Joao Martins
@ 2022-05-02 11:57                         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 11:57 UTC (permalink / raw)
  To: Robin Murphy, Jason Gunthorpe
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

[my MUA made the message a tad crooked with the quotations]

On 5/2/22 12:52, Joao Martins wrote:
> On 4/29/22 20:20, Robin Murphy wrote:
>> On 2022-04-29 17:40, Joao Martins wrote:
>>> On 4/29/22 17:11, Jason Gunthorpe wrote:
>>>> On Fri, Apr 29, 2022 at 03:45:23PM +0100, Joao Martins wrote:
>>>>> On 4/29/22 13:23, Jason Gunthorpe wrote:
>>>>>> On Fri, Apr 29, 2022 at 01:06:06PM +0100, Joao Martins wrote:
>>>>>>
>>>>>>>> TBH I'd be inclined to just enable DBM unconditionally in
>>>>>>>> arm_smmu_domain_finalise() if the SMMU supports it. Trying to toggle it
>>>>>>>> dynamically (especially on a live domain) seems more trouble that it's
>>>>>>>> worth.
>>>>>>>
>>>>>>> Hmmm, but then it would strip userland/VMM from any sort of control (contrary
>>>>>>> to what we can do on the CPU/KVM side). e.g. the first time you do
>>>>>>> GET_DIRTY_IOVA it would return all dirtied IOVAs since the beginning
>>>>>>> of guest time, as opposed to those only after you enabled dirty-tracking.
>>>>>>
>>>>>> It just means that on SMMU the start tracking op clears all the dirty
>>>>>> bits.
>>>>>>
>>>>> Hmm, OK. But aren't really picking a poison here? On ARM it's the difference
>>>>> from switching the setting the DBM bit and put the IOPTE as writeable-clean (which
>>>>> is clearing another bit) versus read-and-clear-when-dirty-track-start which means
>>>>> we need to re-walk the pagetables to clear one bit.
>>>>
>>>> Yes, I don't think a iopte walk is avoidable?
>>>>
>>> Correct -- exactly why I am still more learning towards enable DBM bit only at start
>>> versus enabling DBM at domain-creation while clearing dirty at start.
>>
>> I'd say it's largely down to whether you want the bother of 
>> communicating a dynamic behaviour change into io-pgtable. The big 
>> advantage of having it just use DBM all the time is that you don't have 
>> to do that, and the "start tracking" operation is then nothing more than 
>> a normal "read and clear" operation but ignoring the read result.
>>
>> At this point I'd much rather opt for simplicity, and leave the fancier 
>> stuff to revisit later if and when somebody does demonstrate a 
>> significant overhead from using DBM when not strictly needed.
> OK -- I did get the code simplicity part[*]. Albeit my concern is that last
> point: if there's anything fundamentally affecting DMA performance then
> any SMMU user would see it even if they don't care at all about DBM (i.e. regular
> baremetal/non-vm iommu usage).
> 

I can switch the SMMUv3 one to the always-enabled DBM bit.

> [*] It was how I had this initially PoC-ed. And really all IOMMU drivers dirty tracking
> could be simplified to be always-enabled, and start/stop is essentially flushing/clearing
> dirties. Albeit I like that this is only really used (by hardware) when needed and any
> other DMA user isn't affected.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-29 23:51     ` Baolu Lu
@ 2022-05-02 11:57       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 11:57 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm, iommu

On 4/30/22 00:51, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>> +			    struct iommu_domain *domain, bool enable)
>> +{
>> +	struct iommu_domain *dom;
>> +	unsigned long index;
>> +	int ret = -EOPNOTSUPP;
>> +
>> +	down_write(&iopt->iova_rwsem);
>> +	if (!domain) {
>> +		down_write(&iopt->domains_rwsem);
>> +		xa_for_each(&iopt->domains, index, dom) {
>> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
>> +			if (ret < 0)
>> +				break;
> 
> Do you need to roll back to the original state before return failure?
> Partial domains have already had dirty bit tracking enabled.
> 
Yeap, will fix the unwinding for the next iteration.

>> +		}
>> +		up_write(&iopt->domains_rwsem);
>> +	} else {
>> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
>> +	}
>> +
>> +	up_write(&iopt->iova_rwsem);
>> +	return ret;
>> +}
> 
> Best regards,
> baolu
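
For the unwinding, something along these lines is what I have in mind -- a
rough, untested sketch (the helper name is only for illustration; in practice
the loop stays inside iopt_set_dirty_tracking(), with the caller still holding
iopt->domains_rwsem):

static int iopt_set_dirty_tracking_all(struct io_pagetable *iopt, bool enable)
{
	struct iommu_domain *dom;
	unsigned long index, failed;
	int ret = 0;

	xa_for_each(&iopt->domains, index, dom) {
		ret = iommu_set_dirty_tracking(dom, iopt, enable);
		if (ret < 0) {
			failed = index;
			goto out_revert;
		}
	}
	return 0;

out_revert:
	/* Roll back the domains that were already switched */
	xa_for_each(&iopt->domains, index, dom) {
		if (index >= failed)
			break;
		iommu_set_dirty_tracking(dom, iopt, !enable);
	}
	return ret;
}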


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 03/19] iommufd: Dirty tracking data support
  2022-04-30  4:11     ` Baolu Lu
@ 2022-05-02 12:06       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 12:06 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm, iommu

On 4/30/22 05:11, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> Add an IO pagetable API iopt_read_and_clear_dirty_data() that
>> performs the reading of dirty IOPTEs for a given IOVA range and
>> then copying back to userspace from each area-internal bitmap.
>>
>> Underneath it uses the IOMMU equivalent API which will read the
>> dirty bits, as well as atomically clearing the IOPTE dirty bit
>> and flushing the IOTLB at the end. The dirty bitmaps pass an
>> iotlb_gather to allow batching the dirty-bit updates.
>>
>> Most of the complexity, though, is in the handling of the user
>> bitmaps to avoid copies back and forth. The bitmap user addresses
>> need to be iterated through, pinned and then passing the pages
>> into iommu core. The amount of bitmap data passed at a time for a
>> read_and_clear_dirty() is 1 page worth of pinned base page
>> pointers. That equates to 16M bits, or rather 64G of data that
>> can be returned as 'dirtied'. The flush the IOTLB at the end of
>> the whole scanned IOVA range, to defer as much as possible the
>> potential DMA performance penalty.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/iommufd/io_pagetable.c    | 169 ++++++++++++++++++++++++
>>   drivers/iommu/iommufd/iommufd_private.h |  44 ++++++
>>   2 files changed, 213 insertions(+)
>>
>> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
>> index f4609ef369e0..835b5040fce9 100644
>> --- a/drivers/iommu/iommufd/io_pagetable.c
>> +++ b/drivers/iommu/iommufd/io_pagetable.c
>> @@ -14,6 +14,7 @@
>>   #include <linux/err.h>
>>   #include <linux/slab.h>
>>   #include <linux/errno.h>
>> +#include <uapi/linux/iommufd.h>
>>   
>>   #include "io_pagetable.h"
>>   
>> @@ -347,6 +348,174 @@ int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>   	return ret;
>>   }
>>   
>> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
>> +			    struct iommufd_dirty_data *bitmap)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +	unsigned long bitmap_len;
>> +
>> +	bitmap_len = dirty_bitmap_bytes(bitmap->length >> dirty->pgshift);
>> +
>> +	import_single_range(WRITE, bitmap->data, bitmap_len,
>> +			    &iter->bitmap_iov, &iter->bitmap_iter);
>> +	iter->iova = bitmap->iova;
>> +
>> +	/* Can record up to 64G at a time */
>> +	dirty->pages = (struct page **) __get_free_page(GFP_KERNEL);
>> +
>> +	return !dirty->pages ? -ENOMEM : 0;
>> +}
>> +
>> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->pages) {
>> +		free_page((unsigned long) dirty->pages);
>> +		dirty->pages = NULL;
>> +	}
>> +}
>> +
>> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter)
>> +{
>> +	return iov_iter_count(&iter->bitmap_iter) > 0;
>> +}
>> +
>> +static inline unsigned long iommufd_dirty_iter_bytes(struct iommufd_dirty_iter *iter)
>> +{
>> +	unsigned long left = iter->bitmap_iter.count - iter->bitmap_iter.iov_offset;
>> +
>> +	left = min_t(unsigned long, left, (iter->dirty.npages << PAGE_SHIFT));
>> +
>> +	return left;
>> +}
>> +
>> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter)
>> +{
>> +	unsigned long left = iommufd_dirty_iter_bytes(iter);
>> +
>> +	return ((BITS_PER_BYTE * left) << iter->dirty.pgshift);
>> +}
>> +
>> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter)
>> +{
>> +	unsigned long skip = iter->bitmap_iter.iov_offset;
>> +
>> +	return iter->iova + ((BITS_PER_BYTE * skip) << iter->dirty.pgshift);
>> +}
>> +
>> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter)
>> +{
>> +	iov_iter_advance(&iter->bitmap_iter, iommufd_dirty_iter_bytes(iter));
>> +}
>> +
>> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->npages)
>> +		unpin_user_pages(dirty->pages, dirty->npages);
>> +}
>> +
>> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter)
>> +{
>> +	struct iommu_dirty_bitmap *dirty = &iter->dirty;
>> +	unsigned long npages;
>> +	unsigned long ret;
>> +	void *addr;
>> +
>> +	addr = iter->bitmap_iov.iov_base + iter->bitmap_iter.iov_offset;
>> +	npages = iov_iter_npages(&iter->bitmap_iter,
>> +				 PAGE_SIZE / sizeof(struct page *));
>> +
>> +	ret = pin_user_pages_fast((unsigned long) addr, npages,
>> +				  FOLL_WRITE, dirty->pages);
>> +	if (ret <= 0)
>> +		return -EINVAL;
>> +
>> +	dirty->npages = ret;
>> +	dirty->iova = iommufd_dirty_iova(iter);
>> +	dirty->start_offset = offset_in_page(addr);
>> +	return 0;
>> +}
>> +
>> +static int iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +				      struct iommufd_dirty_data *bitmap)
> 
> This looks more like a helper in the iommu core. How about
> 
> 	iommufd_read_clear_domain_dirty()?
> 
Heh, I guess that's more accurate naming indeed. I can switch to that.

>> +{
>> +	const struct iommu_domain_ops *ops = domain->ops;
>> +	struct iommu_iotlb_gather gather;
>> +	struct iommufd_dirty_iter iter;
>> +	int ret = 0;
>> +
>> +	if (!ops || !ops->read_and_clear_dirty)
>> +		return -EOPNOTSUPP;
>> +
>> +	iommu_dirty_bitmap_init(&iter.dirty, bitmap->iova,
>> +				__ffs(bitmap->page_size), &gather);
>> +	ret = iommufd_dirty_iter_init(&iter, bitmap);
>> +	if (ret)
>> +		return -ENOMEM;
>> +
>> +	for (; iommufd_dirty_iter_done(&iter);
>> +	     iommufd_dirty_iter_advance(&iter)) {
>> +		ret = iommufd_dirty_iter_get(&iter);
>> +		if (ret)
>> +			break;
>> +
>> +		ret = ops->read_and_clear_dirty(domain,
>> +			iommufd_dirty_iova(&iter),
>> +			iommufd_dirty_iova_length(&iter), &iter.dirty);
>> +
>> +		iommufd_dirty_iter_put(&iter);
>> +
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	iommu_iotlb_sync(domain, &gather);
>> +	iommufd_dirty_iter_free(&iter);
>> +
>> +	return ret;
>> +}
>> +
>> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
>> +				   struct iommu_domain *domain,
>> +				   struct iommufd_dirty_data *bitmap)
>> +{
>> +	unsigned long iova, length, iova_end;
>> +	struct iommu_domain *dom;
>> +	struct iopt_area *area;
>> +	unsigned long index;
>> +	int ret = -EOPNOTSUPP;
>> +
>> +	iova = bitmap->iova;
>> +	length = bitmap->length - 1;
>> +	if (check_add_overflow(iova, length, &iova_end))
>> +		return -EOVERFLOW;
>> +
>> +	down_read(&iopt->iova_rwsem);
>> +	area = iopt_find_exact_area(iopt, iova, iova_end);
>> +	if (!area) {
>> +		up_read(&iopt->iova_rwsem);
>> +		return -ENOENT;
>> +	}
>> +
>> +	if (!domain) {
>> +		down_read(&iopt->domains_rwsem);
>> +		xa_for_each(&iopt->domains, index, dom) {
>> +			ret = iommu_read_and_clear_dirty(dom, bitmap);
> 
> Perhaps use @domain directly, hence no need the @dom?
> 
> 	xa_for_each(&iopt->domains, index, domain) {
> 		ret = iommu_read_and_clear_dirty(domain, bitmap);
> 
Yeap.

>> +			if (ret)
>> +				break;
>> +		}
>> +		up_read(&iopt->domains_rwsem);
>> +	} else {
>> +		ret = iommu_read_and_clear_dirty(domain, bitmap);
>> +	}
>> +
>> +	up_read(&iopt->iova_rwsem);
>> +	return ret;
>> +}
>> +
>>   struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
>>   				  unsigned long *start_byte,
>>   				  unsigned long length)
>> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
>> index d00ef3b785c5..4c12b4a8f1a6 100644
>> --- a/drivers/iommu/iommufd/iommufd_private.h
>> +++ b/drivers/iommu/iommufd/iommufd_private.h
>> @@ -8,6 +8,8 @@
>>   #include <linux/xarray.h>
>>   #include <linux/refcount.h>
>>   #include <linux/uaccess.h>
>> +#include <linux/iommu.h>
>> +#include <linux/uio.h>
>>   
>>   struct iommu_domain;
>>   struct iommu_group;
>> @@ -49,8 +51,50 @@ int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
>>   		    unsigned long length);
>>   int iopt_unmap_all(struct io_pagetable *iopt);
>>   
>> +struct iommufd_dirty_data {
>> +	unsigned long iova;
>> +	unsigned long length;
>> +	unsigned long page_size;
>> +	unsigned long *data;
>> +};
> 
> How about adding some comments around this struct? Any alingment
> requirement for iova/length? What does the @data stand for?
> 
I'll add them.

Albeit this structure eventually gets moved to the iommu core later in
the series when we add the UAPI, and there it has some comments documenting it.

I don't cover the alignment though, but it has the same restrictions
as IOAS map/unmap (iopt_alignment essentially), which is the smallest page size
supported by the IOMMU hw.
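
Something like this is what I have in mind for the comments (illustrative
wording only; the field names are the ones already in the patch):

struct iommufd_dirty_data {
	unsigned long iova;       /* Base IOVA of the range to read dirty bits for */
	unsigned long length;     /* Length of the IOVA range, in bytes */
	unsigned long page_size;  /* Granularity, in bytes, that each bitmap bit represents */
	unsigned long *data;      /* Userspace address of the bitmap to be filled */
};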

>> +
>>   int iopt_set_dirty_tracking(struct io_pagetable *iopt,
>>   			    struct iommu_domain *domain, bool enable);
>> +int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
>> +				   struct iommu_domain *domain,
>> +				   struct iommufd_dirty_data *bitmap);
>> +
>> +struct iommufd_dirty_iter {
>> +	struct iommu_dirty_bitmap dirty;
>> +	struct iovec bitmap_iov;
>> +	struct iov_iter bitmap_iter;
>> +	unsigned long iova;
>> +};
> 
> Same here.
> 
Yes, this one deserves some comments.

Most of it is state for gup/pup and for iterating over the bitmap user addresses,
so that iommu_dirty_bitmap_record() only needs to work with KVAs.
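
Roughly (again, illustrative wording only):

struct iommufd_dirty_iter {
	struct iommu_dirty_bitmap dirty; /* pinned bitmap pages + per-chunk IOVA state for the driver */
	struct iovec bitmap_iov;         /* single-range iovec covering the user bitmap */
	struct iov_iter bitmap_iter;     /* iterator tracking progress through that bitmap */
	unsigned long iova;              /* IOVA corresponding to the start of the bitmap */
};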

>> +
>> +void iommufd_dirty_iter_put(struct iommufd_dirty_iter *iter);
>> +int iommufd_dirty_iter_get(struct iommufd_dirty_iter *iter);
>> +int iommufd_dirty_iter_init(struct iommufd_dirty_iter *iter,
>> +			    struct iommufd_dirty_data *bitmap);
>> +void iommufd_dirty_iter_free(struct iommufd_dirty_iter *iter);
>> +bool iommufd_dirty_iter_done(struct iommufd_dirty_iter *iter);
>> +void iommufd_dirty_iter_advance(struct iommufd_dirty_iter *iter);
>> +unsigned long iommufd_dirty_iova_length(struct iommufd_dirty_iter *iter);
>> +unsigned long iommufd_dirty_iova(struct iommufd_dirty_iter *iter);
>> +static inline unsigned long dirty_bitmap_bytes(unsigned long nr_pages)
>> +{
>> +	return (ALIGN(nr_pages, BITS_PER_TYPE(u64)) / BITS_PER_BYTE);
>> +}
>> +
>> +/*
>> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
>> + * further casts to signed integer for unaligned multi-bit operation,
>> + * __bitmap_set().
>> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>> + * system.
>> + */
>> +#define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
>> +#define DIRTY_BITMAP_SIZE_MAX   dirty_bitmap_bytes(DIRTY_BITMAP_PAGES_MAX)
>>   
>>   int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>>   		      unsigned long npages, struct page **out_pages, bool write);
> 
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs
  2022-04-30  5:12     ` Baolu Lu
@ 2022-05-02 12:22       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 12:22 UTC (permalink / raw)
  To: Baolu Lu, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 4/30/22 06:12, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> Today, the dirty state is lost and the page wouldn't be migrated to
>> destination potentially leading the guest into error.
>>
>> Add an unmap API that reads the dirty bit and sets it in the
>> user passed bitmap. This unmap iommu API tackles a potentially
>> racy update to the dirty bit *when* doing DMA on a iova that is
>> being unmapped at the same time.
>>
>> The new unmap_read_dirty/unmap_pages_read_dirty does not replace
>> the unmap pages, but rather only when explicit called with an dirty
>> bitmap data passed in.
>>
>> It could be said that the guest is buggy and rather than a special unmap
>> path tackling the theoretical race ... it would suffice fetching the
>> dirty bits (with GET_DIRTY_IOVA), and then unmap the IOVA.
> 
> I am not sure whether this API could solve the race.
> 

Yeah, it doesn't fully solve the race as DMA can still potentially
occur until the IOMMU needs to re-walk the page tables (i.e. after the IOTLB flush).


> size_t iommu_unmap(struct iommu_domain *domain,
>                     unsigned long iova, size_t size)
> {
>          struct iommu_iotlb_gather iotlb_gather;
>          size_t ret;
> 
>          iommu_iotlb_gather_init(&iotlb_gather);
>          ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
>          iommu_iotlb_sync(domain, &iotlb_gather);
> 
>          return ret;
> }
> 
> The PTEs are cleared before iotlb invalidation. What if a DMA write
> happens after PTE clearing and before the iotlb invalidation with the
> PTE happening to be cached?


Yeap. Jason/Robin also reiterated similarly.

To fully handle this we need to force the PTEs read-only, and check the dirty bit
afterwards. So perhaps if we want to go to the extent of fully stopping DMA -- which none
of the unmap APIs ever guarantee -- we need more of a write-protect API that optionally
fetches the dirties. And then the unmap remains as is (prior to this series).

Now whether this race is worth solving isn't clear (bearing in mind that solving it will add
a lot of overhead), and git/mailing-list archaeology doesn't answer whether this
was ever useful in practice :(
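
Conceptually, the race-free variant would be something like the below. This is
only a sketch: iopt_write_protect() is a hypothetical helper that doesn't exist
in the series, and the extra IOTLB flush plus locking are elided.

static int iopt_unmap_read_dirty_safe(struct io_pagetable *iopt,
				      struct iommu_domain *domain,
				      struct iommufd_dirty_data *bitmap)
{
	int ret;

	/* 1) Make the range read-only so no new dirty bits can be set */
	ret = iopt_write_protect(iopt, domain, bitmap->iova, bitmap->length);
	if (ret)
		return ret;

	/* 2) With writes blocked (and the IOTLB flushed), snapshot the dirty bits */
	ret = iopt_read_and_clear_dirty_data(iopt, domain, bitmap);
	if (ret)
		return ret;

	/* 3) Tear the mappings down with a plain unmap */
	return iopt_unmap_iova(iopt, bitmap->iova, bitmap->length);
}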

	Joao

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains
  2022-04-30  6:12     ` Baolu Lu
@ 2022-05-02 12:24       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-02 12:24 UTC (permalink / raw)
  To: Baolu Lu, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

On 4/30/22 07:12, Baolu Lu wrote:
> On 2022/4/29 05:09, Joao Martins wrote:
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5089,6 +5089,113 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
>>   	}
>>   }
>>   
>> +static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					  bool enable)
>> +{
>> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +	struct device_domain_info *info;
>> +	unsigned long flags;
>> +	int ret = -EINVAL;
> 
> 	if (domain_use_first_level(dmar_domain))
> 		return -EOPNOTSUPP;
> 
Will add.

>> +
>> +	spin_lock_irqsave(&device_domain_lock, flags);
>> +	if (list_empty(&dmar_domain->devices)) {
>> +		spin_unlock_irqrestore(&device_domain_lock, flags);
>> +		return ret;
>> +	}
> 
> I agreed with Kevin's suggestion in his reply.
> 
/me nods

>> +
>> +	list_for_each_entry(info, &dmar_domain->devices, link) {
>> +		if (!info->dev || (info->domain != dmar_domain))
>> +			continue;
> 
> This check is redundant.
> 

I'll drop it.

>> +
>> +		/* Dirty tracking is second-stage level SM only */
>> +		if ((info->domain && domain_use_first_level(info->domain)) ||
>> +		    !ecap_slads(info->iommu->ecap) ||
>> +		    !sm_supported(info->iommu) || !intel_iommu_sm) {
>> +			ret = -EOPNOTSUPP;
>> +			continue;
> 
> Perhaps break and return -EOPNOTSUPP directly here? We are not able to
> support a mixed mode, right?
> 
Correct, I should return early here.

>> +		}
>> +
>> +		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
>> +						     info->dev, PASID_RID2PASID,
>> +						     enable);
>> +		if (ret)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +
>> +	/*
>> +	 * We need to flush context TLB and IOTLB with any cached translations
>> +	 * to force the incoming DMA requests for have its IOTLB entries tagged
>> +	 * with A/D bits
>> +	 */
>> +	intel_flush_iotlb_all(domain);
>> +	return ret;
>> +}
> 
> Best regards,
> baolu
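
Folding the above in (and leaving the empty-devices handling aside until I pick
up Kevin's suggestion), the helper would end up roughly like this -- untested
sketch:

static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
					  bool enable)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
	struct device_domain_info *info;
	unsigned long flags;
	int ret = 0;

	/* Dirty tracking is second-stage level SM only */
	if (domain_use_first_level(dmar_domain))
		return -EOPNOTSUPP;

	spin_lock_irqsave(&device_domain_lock, flags);
	list_for_each_entry(info, &dmar_domain->devices, link) {
		if (!intel_iommu_sm || !sm_supported(info->iommu) ||
		    !ecap_slads(info->iommu->ecap)) {
			ret = -EOPNOTSUPP;
			break;
		}

		ret = intel_pasid_setup_dirty_tracking(info->iommu, info->domain,
						       info->dev, PASID_RID2PASID,
						       enable);
		if (ret)
			break;
	}
	spin_unlock_irqrestore(&device_domain_lock, flags);

	/*
	 * Flush any cached translations so that incoming DMA requests get
	 * their IOTLB entries tagged with A/D bits.
	 */
	intel_flush_iotlb_all(domain);
	return ret;
}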

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29  5:45   ` Tian, Kevin
@ 2022-05-02 18:11     ` Alex Williamson
  -1 siblings, 0 replies; 209+ messages in thread
From: Alex Williamson @ 2022-05-02 18:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Cornelia Huck, kvm

On Fri, 29 Apr 2022 05:45:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> > From: Joao Martins <joao.m.martins@oracle.com>
> >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > unmap. This case is specific for non-nested vIOMMU case where an
> > erronous guest (or device) DMAing to an address being unmapped at the
> > same time.  
> 
> an erroneous attempt like above cannot anticipate which DMAs can
> succeed in that window thus the end behavior is undefined. For an
> undefined behavior nothing will be broken by losing some bits dirtied
> in the window between reading back dirty bits of the range and
> actually calling unmap. From guest p.o.v. all those are black-box
> hardware logic to serve a virtual iotlb invalidation request which just
> cannot be completed in one cycle.
> 
> Hence in reality probably this is not required except to meet vfio
> compat requirement. Just in concept returning dirty bits at unmap
> is more accurate.
> 
> I'm slightly inclined to abandon it in iommufd uAPI.

Sorry, I'm not following why an unmap with returned dirty bitmap
operation is specific to a vIOMMU case, or in fact indicative of some
sort of erroneous, racy behavior of guest or device.  We need the
flexibility to support memory hot-unplug operations during migration,
but even in the vIOMMU case, isn't it fair for the VMM to ask whether a
device dirtied the range being unmapped?  This was implemented as a
single operation specifically to avoid races where ongoing access may be
available after retrieving a snapshot of the bitmap.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-02 18:11     ` Alex Williamson
@ 2022-05-02 18:52       ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-02 18:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Martins, Joao, iommu, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Cornelia Huck, kvm

On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> On Fri, 29 Apr 2022 05:45:20 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > From: Joao Martins <joao.m.martins@oracle.com>
> > >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > > unmap. This case is specific for non-nested vIOMMU case where an
> > > erronous guest (or device) DMAing to an address being unmapped at the
> > > same time.  
> > 
> > an erroneous attempt like above cannot anticipate which DMAs can
> > succeed in that window thus the end behavior is undefined. For an
> > undefined behavior nothing will be broken by losing some bits dirtied
> > in the window between reading back dirty bits of the range and
> > actually calling unmap. From guest p.o.v. all those are black-box
> > hardware logic to serve a virtual iotlb invalidation request which just
> > cannot be completed in one cycle.
> > 
> > Hence in reality probably this is not required except to meet vfio
> > compat requirement. Just in concept returning dirty bits at unmap
> > is more accurate.
> > 
> > I'm slightly inclined to abandon it in iommufd uAPI.
> 
> Sorry, I'm not following why an unmap with returned dirty bitmap
> operation is specific to a vIOMMU case, or in fact indicative of some
> sort of erroneous, racy behavior of guest or device.

It is being compared against the alternative which is to explicitly
query dirty then do a normal unmap as two system calls and permit a
race.

The only case with any difference is if the guest is racing DMA with
the unmap - in which case it is already indeterminate for the guest if
the DMA will be completed or not. 

e.g. in the vIOMMU case, if the guest races DMA with unmap then we are
already fine with throwing away that DMA because that is how the race
resolves during non-migration situations, so resolving it as throwing
away the DMA during migration is OK too.

> We need the flexibility to support memory hot-unplug operations
> during migration,

I would have thought that hot-unplug during migration would simply
discard all the data - how does it use the dirty bitmap?

> This was implemented as a single operation specifically to avoid
> races where ongoing access may be available after retrieving a
> snapshot of the bitmap.  Thanks,

The issue is the cost.

On a real iommu eliminating the race is expensive as we have to write
protect the pages before query dirty, which seems to be an extra IOTLB
flush.

It is not clear if paying this cost to become atomic is actually
something any use case needs.

So, I suggest we think about a 3rd op 'write protect and clear
dirties' that will be followed by a normal unmap - the extra op will
have the extra overhead and userspace can decide if it wants to pay or
not vs the non-atomic read dirties operation. And let's have a use case
where this must be atomic before we implement it.

The downside is we lose a little bit of efficiency by unbundling
these steps, the upside is that it doesn't require quite as many
special iommu_domain/etc paths.

(Also Joao, you should probably have a read and do not clear dirty
operation with the idea that the next operation will be unmap - then
maybe we can avoid IOTLB flushing..)
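
Roughly, the split could look like this - a sketch only, with made-up
names rather than the actual kAPI:

#include <stdbool.h>
#include <stddef.h>

struct iommu_domain;            /* from <linux/iommu.h> */
struct iommu_dirty_bitmap;      /* stand-in for whatever carries the bits */

struct iommu_dirty_ops_sketch {
        /* turn dirty tracking on/off for the whole domain */
        int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);
        /* walk the IOPTEs in [iova, iova + size) and record dirty bits;
         * a hypothetical NO_CLEAR flag would skip clearing them so that a
         * following unmap can batch a single IOTLB flush for both steps */
        int (*read_dirty)(struct iommu_domain *domain, unsigned long iova,
                          size_t size, unsigned long flags,
                          struct iommu_dirty_bitmap *dirty);
        /* the 3rd op: write-protect the range and collect dirties, to be
         * followed by a normal unmap when userspace wants atomicity */
        int (*wrprotect_and_read_dirty)(struct iommu_domain *domain,
                                        unsigned long iova, size_t size,
                                        struct iommu_dirty_bitmap *dirty);
};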

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-02 18:52       ` Jason Gunthorpe via iommu
@ 2022-05-03 10:48         ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-03 10:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

On 5/2/22 19:52, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>> On Fri, 29 Apr 2022 05:45:20 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>> erronous guest (or device) DMAing to an address being unmapped at the
>>>> same time.  
>>>
>>> an erroneous attempt like above cannot anticipate which DMAs can
>>> succeed in that window thus the end behavior is undefined. For an
>>> undefined behavior nothing will be broken by losing some bits dirtied
>>> in the window between reading back dirty bits of the range and
>>> actually calling unmap. From guest p.o.v. all those are black-box
>>> hardware logic to serve a virtual iotlb invalidation request which just
>>> cannot be completed in one cycle.
>>>
>>> Hence in reality probably this is not required except to meet vfio
>>> compat requirement. Just in concept returning dirty bits at unmap
>>> is more accurate.
>>>
>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>
>> Sorry, I'm not following why an unmap with returned dirty bitmap
>> operation is specific to a vIOMMU case, or in fact indicative of some
>> sort of erroneous, racy behavior of guest or device.
> 
> It is being compared against the alternative which is to explicitly
> query dirty then do a normal unmap as two system calls and permit a
> race.
> 
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest if
> the DMA will be completed or not. 
> 
> eg on the vIOMMU case if the guest races DMA with unmap then we are
> already fine with throwing away that DMA because that is how the race
> resolves during non-migration situations, so resovling it as throwing
> away the DMA during migration is OK too.
> 

Exactly.

Even the current unmap (ignoring dirties) isn't race-free, and DMA could still be
happening between clearing the PTE and the IOTLB flush.

The code in this series *attempted* to tackle races against hw IOMMU updates
to the A/D bits at the same time we are clearing the IOPTEs. But it didn't fully
address the race with DMA.

The current code (IIUC) just assumes it is dirty if it is pinned and DMA mapped,
so maybe it avoided some of these fundamental questions...

So really the comparison is whether we care about fixing the race *during unmap* --
for a range the device shouldn't be DMA-ing to in the first place -- such that we
need to go out of our way to block DMA writes, then fetch dirties and then unmap.
Or whether we can fetch dirties and then unmap as two separate operations.

>> We need the flexibility to support memory hot-unplug operations
>> during migration,
> 
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
> 

hmmm I don't follow either -- why would we care about hot-unplugged
memory being dirty? Unless Alex is thinking that the guest would take the
initiative in hot-unplugging+hot-plugging and expect the same data to
be there, pmem style...?

>> This was implemented as a single operation specifically to avoid
>> races where ongoing access may be available after retrieving a
>> snapshot of the bitmap.  Thanks,
> 
> The issue is the cost.
> 
> On a real iommu elminating the race is expensive as we have to write
> protect the pages before query dirty, which seems to be an extra IOTLB
> flush.
> 

... and that is only the DMA performance part affecting the endpoint
device. In software, there's also the extra overhead of walking the IOMMU
pagetables twice. So it's like unmap being 2x more expensive.


> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
> 
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra oveheard and userspace can decide if it wants to pay or
> not vs the non-atomic read dirties operation. And lets have a use case
> where this must be atomic before we implement it..
> 

Definitely, I am happy to implement it if there's a use-case. But
I am not sure there's one right now, aside from theory? Have we
seen issues that would otherwise require this?

> The downside is we loose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
> 
> (Also Joao, you should probably have a read and do not clear dirty
> operation with the idea that the next operation will be unmap - then
> maybe we can avoid IOTLB flushing..)

Yes, that's a great idea. I am thinking of adding a regular @flags field to
the GET_DIRTY_IOVA and iommu domain op argument counterpart.

Albeit, from the iommu kAPI side, at the end of the day this primitive is an IO
pagetable walker helper which checks/manipulates some of the IOPTE
special bits and marshals their state into a bitmap. Extra ::flags values could
select other access bits, avoid clearing said bits, or more, should we want to
make it more future-proof for extensions.
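
For the sake of discussion, the shape of it could be something like the
below - purely illustrative, not the final uAPI, and the field names are
made up:

#include <linux/types.h>

struct iommu_get_dirty_iova_sketch {
        __u32 size;              /* sizeof(struct ...), for extensibility */
        __u32 hwpt_id;           /* the hw_pagetable to walk */
        __aligned_u64 flags;     /* e.g. a hypothetical *_NO_CLEAR bit that
                                  * leaves the IOPTE dirty bits untouched */
        __aligned_u64 iova;      /* start of the range to scan */
        __aligned_u64 length;    /* length of the range */
        __aligned_u64 page_size; /* granularity of each bit in the bitmap */
        __aligned_u64 data;      /* user pointer to the dirty bitmap */
};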

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-29 11:05       ` Joao Martins
@ 2022-05-05  7:25         ` Shameerali Kolothum Thodi via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-05-05  7:25 UTC (permalink / raw)
  To: Joao Martins, Tian, Kevin
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm, iommu, jiangkunkun



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 29 April 2022 12:05
> To: Tian, Kevin <kevin.tian@intel.com>
> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen
> <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Eric Auger
> <eric.auger@redhat.com>; Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> kvm@vger.kernel.org; iommu@lists.linux-foundation.org
> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> On 4/29/22 09:28, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Friday, April 29, 2022 5:09 AM
> >>
> >> Similar to .read_and_clear_dirty() use the page table
> >> walker helper functions and set DBM|RDONLY bit, thus
> >> switching the IOPTE to writeable-clean.
> >
> > this should not be one-off if the operation needs to be
> > applied to IOPTE. Say a map request comes right after
> > set_dirty_tracking() is called. If it's agreed to remove
> > the range op then smmu driver should record the tracking
> > status internally and then apply the modifier to all the new
> > mappings automatically before dirty tracking is disabled.
> > Otherwise the same logic needs to be kept in iommufd to
> > call set_dirty_tracking_range() explicitly for every new
> > iopt_area created within the tracking window.
> 
> Gah, I totally missed that by mistake. New mappings aren't
> carrying over the "DBM is set". This needs a new io-pgtable
> quirk added post dirty-tracking toggling.
> 
> I can adjust, but I am at odds on including this in a future
> iteration given that I can't really test any of this stuff.
> Might drop the driver until I have hardware/emulation I can
> use (or maybe others can take over this). It was included
> for revising the iommu core ops and whether iommufd was
> affected by it.

[+Kunkun Jiang]. I think he is now looking into this and might have
a test setup to verify this.

Thanks,
Shameer



^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-04-29 12:38       ` Jason Gunthorpe via iommu
@ 2022-05-05  7:40         ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-05  7:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Martins, Joao
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 29, 2022 8:39 PM
> 
> > >> * There's no capabilities API in IOMMUFD, and in this RFC each vendor
> tracks
> > >
> > > there was discussion adding device capability uAPI somewhere.
> > >
> > ack let me know if there was snippets to the conversation as I seem to have
> missed that.
> 
> It was just discussion pending something we actually needed to report.
> 
> Would be a very simple ioctl taking in the device ID and filling a
> struct of stuff.
> 
> > > probably this can be reported as a device cap as supporting of dirty bit is
> > > an immutable property of the iommu serving that device.
> 
> It is an easier fit to read it out of the iommu_domain after device
> attach though - since we don't need to build new kernel infrastructure
> to query it from a device.
> 
> > > Userspace can
> > > enable dirty tracking on a hwpt if all attached devices claim the support
> > > and the kernel will do the same verification.
> >
> > Sorry to be dense but this is not up to 'devices' given they take no
> > part in the tracking?  I guess by 'devices' you mean the software
> > idea of it i.e. the iommu context created for attaching a said
> > physical device, not the physical device itself.
> 
> Indeed, an hwpt represents an iommu_domain and if the iommu_domain has
> dirty tracking ops set then that is an inherent property of the domain
> and does not suddenly go away when a new device is attached.
> 

Conceptually this is an IOMMU property rather than a domain property.
The two are equivalent only if the iommu driver registers dirty
tracking ops only when all IOMMUs in the platform support the
capability, i.e. managing this IOMMU property in a global way.

But the global way conflicts with the ongoing direction of making
iommu capabilities truly per-IOMMU (though I'm not sure whether
heterogeneity would exist for dirty tracking). Following that trend,
a domain property is not inherent, as it is meaningless if no device is
attached at all.

From this angle IMHO it's more reasonable to report this IOMMU
property to userspace via a device capability. If all devices attached
to a hwpt claim IOMMU dirty tracking capability, the user can call
set_dirty_tracking() on the hwpt object. Once dirty tracking is
enabled on a hwpt, further attaching a device which doesn't claim
this capability is simply rejected.
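
i.e. something along these lines (pseudo-code sketch; all the names below
are stand-ins, not real kAPI):

/* stand-in types just to illustrate the checks */
struct dev_caps_sketch   { unsigned int iommu_dirty_tracking : 1; };
struct hwpt_state_sketch { unsigned int dirty_tracking_on : 1; };

static int hwpt_attach_check_sketch(const struct hwpt_state_sketch *hwpt,
                                    const struct dev_caps_sketch *dev)
{
        /* userspace may only enable tracking if every attached device's
         * IOMMU claims it; once enabled, attaching a non-capable device
         * is rejected */
        if (hwpt->dirty_tracking_on && !dev->iommu_dirty_tracking)
                return -1;      /* stand-in for -EINVAL */
        return 0;
}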

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-02 18:52       ` Jason Gunthorpe via iommu
@ 2022-05-05  7:42         ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-05  7:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Martins, Joao, iommu, Joerg Roedel, Suravee Suthikulpanit,
	Will Deacon, Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, May 3, 2022 2:53 AM
> 
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> > On Fri, 29 Apr 2022 05:45:20 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > From: Joao Martins <joao.m.martins@oracle.com>
> > > >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > > > unmap. This case is specific for non-nested vIOMMU case where an
> > > > erronous guest (or device) DMAing to an address being unmapped at
> the
> > > > same time.
> > >
> > > an erroneous attempt like above cannot anticipate which DMAs can
> > > succeed in that window thus the end behavior is undefined. For an
> > > undefined behavior nothing will be broken by losing some bits dirtied
> > > in the window between reading back dirty bits of the range and
> > > actually calling unmap. From guest p.o.v. all those are black-box
> > > hardware logic to serve a virtual iotlb invalidation request which just
> > > cannot be completed in one cycle.
> > >
> > > Hence in reality probably this is not required except to meet vfio
> > > compat requirement. Just in concept returning dirty bits at unmap
> > > is more accurate.
> > >
> > > I'm slightly inclined to abandon it in iommufd uAPI.
> >
> > Sorry, I'm not following why an unmap with returned dirty bitmap
> > operation is specific to a vIOMMU case, or in fact indicative of some
> > sort of erroneous, racy behavior of guest or device.
> 
> It is being compared against the alternative which is to explicitly
> query dirty then do a normal unmap as two system calls and permit a
> race.
> 
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest if
> the DMA will be completed or not.
> 
> eg on the vIOMMU case if the guest races DMA with unmap then we are
> already fine with throwing away that DMA because that is how the race
> resolves during non-migration situations, so resovling it as throwing
> away the DMA during migration is OK too.
> 
> > We need the flexibility to support memory hot-unplug operations
> > during migration,
> 
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
> 
> > This was implemented as a single operation specifically to avoid
> > races where ongoing access may be available after retrieving a
> > snapshot of the bitmap.  Thanks,
> 
> The issue is the cost.
> 
> On a real iommu elminating the race is expensive as we have to write
> protect the pages before query dirty, which seems to be an extra IOTLB
> flush.
> 
> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
> 
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra oveheard and userspace can decide if it wants to pay or
> not vs the non-atomic read dirties operation. And lets have a use case
> where this must be atomic before we implement it..

and write-protection also relies on the support of I/O page fault...

> 
> The downside is we loose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
> 
> (Also Joao, you should probably have a read and do not clear dirty
> operation with the idea that the next operation will be unmap - then
> maybe we can avoid IOTLB flushing..)
> 
> Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-05-05  7:25         ` Shameerali Kolothum Thodi via iommu
@ 2022-05-05  9:52           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-05  9:52 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm, iommu, jiangkunkun,
	Tian, Kevin

On 5/5/22 08:25, Shameerali Kolothum Thodi wrote:
>> -----Original Message-----
>> From: Joao Martins [mailto:joao.m.martins@oracle.com]
>> Sent: 29 April 2022 12:05
>> To: Tian, Kevin <kevin.tian@intel.com>
>> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
>> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
>> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
>> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
>> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen
>> <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Eric Auger
>> <eric.auger@redhat.com>; Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
>> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
>> kvm@vger.kernel.org; iommu@lists.linux-foundation.org
>> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
>> set_dirty_tracking_range() support
>>
>> On 4/29/22 09:28, Tian, Kevin wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Friday, April 29, 2022 5:09 AM
>>>>
>>>> Similar to .read_and_clear_dirty() use the page table
>>>> walker helper functions and set DBM|RDONLY bit, thus
>>>> switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
> 
> [+Kunkun Jiang]. I think he is now looking into this and might have
> a test setup to verify this.

I'll keep him CC'ed on the next iterations. Thanks!

FWIW, this should change a bit on the next iteration (it gets simpler)
by always enabling DBM from the start. SMMUv3 ::set_dirty_tracking() becomes
a simpler function that tests quirks (i.e. DBM set) and whatnot, and calls
read_and_clear_dirty() without a bitmap argument to clear dirties.
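
In rough terms something like the below - a sketch only; the op name comes
from this series but the signature and types here are assumed, not the
actual driver code:

#include <stddef.h>     /* NULL */

/* illustrative stand-in for the io-pgtable dirty walker */
struct io_pgtable_dirty_sketch {
        /* bitmap == NULL would mean "clear the dirty bits, don't report" */
        int (*read_and_clear_dirty)(void *cookie, unsigned long iova,
                                    unsigned long size, void *bitmap);
};

static int smmu_set_dirty_tracking_sketch(struct io_pgtable_dirty_sketch *pt,
                                          void *cookie, unsigned long iova,
                                          unsigned long size)
{
        /* with DBM always enabled, "start tracking" reduces to starting
         * from a clean slate: clear stale dirties without marshalling
         * them into a bitmap */
        return pt->read_and_clear_dirty(cookie, iova, size, NULL);
}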

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05  7:42         ` Tian, Kevin
@ 2022-05-05 10:06           ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-05 10:06 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

On 5/5/22 08:42, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Tuesday, May 3, 2022 2:53 AM
>>
>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>> On Fri, 29 Apr 2022 05:45:20 +0000
>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>>> erronous guest (or device) DMAing to an address being unmapped at
>> the
>>>>> same time.
>>>>
>>>> an erroneous attempt like above cannot anticipate which DMAs can
>>>> succeed in that window thus the end behavior is undefined. For an
>>>> undefined behavior nothing will be broken by losing some bits dirtied
>>>> in the window between reading back dirty bits of the range and
>>>> actually calling unmap. From guest p.o.v. all those are black-box
>>>> hardware logic to serve a virtual iotlb invalidation request which just
>>>> cannot be completed in one cycle.
>>>>
>>>> Hence in reality probably this is not required except to meet vfio
>>>> compat requirement. Just in concept returning dirty bits at unmap
>>>> is more accurate.
>>>>
>>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>>
>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>> sort of erroneous, racy behavior of guest or device.
>>
>> It is being compared against the alternative which is to explicitly
>> query dirty then do a normal unmap as two system calls and permit a
>> race.
>>
>> The only case with any difference is if the guest is racing DMA with
>> the unmap - in which case it is already indeterminate for the guest if
>> the DMA will be completed or not.
>>
>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>> already fine with throwing away that DMA because that is how the race
>> resolves during non-migration situations, so resovling it as throwing
>> away the DMA during migration is OK too.
>>
>>> We need the flexibility to support memory hot-unplug operations
>>> during migration,
>>
>> I would have thought that hotplug during migration would simply
>> discard all the data - how does it use the dirty bitmap?
>>
>>> This was implemented as a single operation specifically to avoid
>>> races where ongoing access may be available after retrieving a
>>> snapshot of the bitmap.  Thanks,
>>
>> The issue is the cost.
>>
>> On a real iommu elminating the race is expensive as we have to write
>> protect the pages before query dirty, which seems to be an extra IOTLB
>> flush.
>>
>> It is not clear if paying this cost to become atomic is actually
>> something any use case needs.
>>
>> So, I suggest we think about a 3rd op 'write protect and clear
>> dirties' that will be followed by a normal unmap - the extra op will
>> have the extra oveheard and userspace can decide if it wants to pay or
>> not vs the non-atomic read dirties operation. And lets have a use case
>> where this must be atomic before we implement it..
> 
> and write-protection also relies on the support of I/O page fault...
> 
/I think/ all IOMMUs in this series have supported permission/unrecoverable
I/O page faults for a long time, IIUC.

The earlier suggestion was just to discard the I/O page fault after
write-protection happens. FWIW, some IOMMUs also support suppressing
the event notification (like AMD).

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 10:06           ` Joao Martins
@ 2022-05-05 11:03             ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-05 11:03 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, May 5, 2022 6:07 PM
> 
> On 5/5/22 08:42, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Tuesday, May 3, 2022 2:53 AM
> >>
> >> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> >>> On Fri, 29 Apr 2022 05:45:20 +0000
> >>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >>>>> unmap. This case is specific for non-nested vIOMMU case where an
> >>>>> erronous guest (or device) DMAing to an address being unmapped at
> >> the
> >>>>> same time.
> >>>>
> >>>> an erroneous attempt like above cannot anticipate which DMAs can
> >>>> succeed in that window thus the end behavior is undefined. For an
> >>>> undefined behavior nothing will be broken by losing some bits dirtied
> >>>> in the window between reading back dirty bits of the range and
> >>>> actually calling unmap. From guest p.o.v. all those are black-box
> >>>> hardware logic to serve a virtual iotlb invalidation request which just
> >>>> cannot be completed in one cycle.
> >>>>
> >>>> Hence in reality probably this is not required except to meet vfio
> >>>> compat requirement. Just in concept returning dirty bits at unmap
> >>>> is more accurate.
> >>>>
> >>>> I'm slightly inclined to abandon it in iommufd uAPI.
> >>>
> >>> Sorry, I'm not following why an unmap with returned dirty bitmap
> >>> operation is specific to a vIOMMU case, or in fact indicative of some
> >>> sort of erroneous, racy behavior of guest or device.
> >>
> >> It is being compared against the alternative which is to explicitly
> >> query dirty then do a normal unmap as two system calls and permit a
> >> race.
> >>
> >> The only case with any difference is if the guest is racing DMA with
> >> the unmap - in which case it is already indeterminate for the guest if
> >> the DMA will be completed or not.
> >>
> >> eg on the vIOMMU case if the guest races DMA with unmap then we are
> >> already fine with throwing away that DMA because that is how the race
> >> resolves during non-migration situations, so resovling it as throwing
> >> away the DMA during migration is OK too.
> >>
> >>> We need the flexibility to support memory hot-unplug operations
> >>> during migration,
> >>
> >> I would have thought that hotplug during migration would simply
> >> discard all the data - how does it use the dirty bitmap?
> >>
> >>> This was implemented as a single operation specifically to avoid
> >>> races where ongoing access may be available after retrieving a
> >>> snapshot of the bitmap.  Thanks,
> >>
> >> The issue is the cost.
> >>
> >> On a real iommu elminating the race is expensive as we have to write
> >> protect the pages before query dirty, which seems to be an extra IOTLB
> >> flush.
> >>
> >> It is not clear if paying this cost to become atomic is actually
> >> something any use case needs.
> >>
> >> So, I suggest we think about a 3rd op 'write protect and clear
> >> dirties' that will be followed by a normal unmap - the extra op will
> >> have the extra oveheard and userspace can decide if it wants to pay or
> >> not vs the non-atomic read dirties operation. And lets have a use case
> >> where this must be atomic before we implement it..
> >
> > and write-protection also relies on the support of I/O page fault...
> >
> /I think/ all IOMMUs in this series already support permission/unrecoverable
> I/O page faults for a long time IIUC.
> 
> The earlier suggestion was just to discard the I/O page fault after
> write-protection happens. fwiw, some IOMMUs also support suppressing
> the event notification (like AMD).

iiuc the purpose of 'write-protection' here is to capture in-flight dirty pages
in the said race window until the unmap and IOTLB invalidation are completed.
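
As a rough sketch of that window, the non-atomic sequence being compared
looks like the below (the ioctl names are placeholders for illustration,
not the uAPI proposed in this series):

    /* snapshot (and clear) the dirty bits for the range */
    ioctl(iommufd, IOMMUFD_GET_DIRTY_BITMAP /* placeholder */, &range);

    /*
     * race window: the device can still write into 'range' here, and
     * those writes land in IOPTE dirty bits that the snapshot above has
     * already consumed.
     */

    /* tear down the mapping; DMA reliably stops only after the IOTLB flush */
    ioctl(iommufd, IOMMUFD_UNMAP /* placeholder */, &range);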

*unrecoverable* faults are not expected to be used in a feature path,
as the occurrence of such faults may lead to a severe reaction in iommu
drivers, e.g. completely blocking DMA from the device causing such faults.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 11:03             ` Tian, Kevin
@ 2022-05-05 11:50               ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-05 11:50 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

On 5/5/22 12:03, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Sent: Thursday, May 5, 2022 6:07 PM
>>
>> On 5/5/22 08:42, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, May 3, 2022 2:53 AM
>>>>
>>>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>>>> On Fri, 29 Apr 2022 05:45:20 +0000
>>>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>>>>>> unmap. This case is specific for non-nested vIOMMU case where an
>>>>>>> erronous guest (or device) DMAing to an address being unmapped at
>>>> the
>>>>>>> same time.
>>>>>>
>>>>>> an erroneous attempt like above cannot anticipate which DMAs can
>>>>>> succeed in that window thus the end behavior is undefined. For an
>>>>>> undefined behavior nothing will be broken by losing some bits dirtied
>>>>>> in the window between reading back dirty bits of the range and
>>>>>> actually calling unmap. From guest p.o.v. all those are black-box
>>>>>> hardware logic to serve a virtual iotlb invalidation request which just
>>>>>> cannot be completed in one cycle.
>>>>>>
>>>>>> Hence in reality probably this is not required except to meet vfio
>>>>>> compat requirement. Just in concept returning dirty bits at unmap
>>>>>> is more accurate.
>>>>>>
>>>>>> I'm slightly inclined to abandon it in iommufd uAPI.
>>>>>
>>>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>>>> sort of erroneous, racy behavior of guest or device.
>>>>
>>>> It is being compared against the alternative which is to explicitly
>>>> query dirty then do a normal unmap as two system calls and permit a
>>>> race.
>>>>
>>>> The only case with any difference is if the guest is racing DMA with
>>>> the unmap - in which case it is already indeterminate for the guest if
>>>> the DMA will be completed or not.
>>>>
>>>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>>>> already fine with throwing away that DMA because that is how the race
>>>> resolves during non-migration situations, so resovling it as throwing
>>>> away the DMA during migration is OK too.
>>>>
>>>>> We need the flexibility to support memory hot-unplug operations
>>>>> during migration,
>>>>
>>>> I would have thought that hotplug during migration would simply
>>>> discard all the data - how does it use the dirty bitmap?
>>>>
>>>>> This was implemented as a single operation specifically to avoid
>>>>> races where ongoing access may be available after retrieving a
>>>>> snapshot of the bitmap.  Thanks,
>>>>
>>>> The issue is the cost.
>>>>
>>>> On a real iommu elminating the race is expensive as we have to write
>>>> protect the pages before query dirty, which seems to be an extra IOTLB
>>>> flush.
>>>>
>>>> It is not clear if paying this cost to become atomic is actually
>>>> something any use case needs.
>>>>
>>>> So, I suggest we think about a 3rd op 'write protect and clear
>>>> dirties' that will be followed by a normal unmap - the extra op will
>>>> have the extra oveheard and userspace can decide if it wants to pay or
>>>> not vs the non-atomic read dirties operation. And lets have a use case
>>>> where this must be atomic before we implement it..
>>>
>>> and write-protection also relies on the support of I/O page fault...
>>>
>> /I think/ all IOMMUs in this series already support permission/unrecoverable
>> I/O page faults for a long time IIUC.
>>
>> The earlier suggestion was just to discard the I/O page fault after
>> write-protection happens. fwiw, some IOMMUs also support suppressing
>> the event notification (like AMD).
> 
> iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> in the said race window until unmap and iotlb is invalidated is completed.
> 
But then we depend on PRS being there on the device, because without it, DMA to
a read-only IOVA is aborted at the target prior to the page fault, thus the page
is not going to be dirty anyway.

> *unrecoverable* faults are not expected to be used in a feature path
> as occurrence of such faults may lead to severe reaction in iommu
> drivers e.g. completely block DMA from the device causing such faults.

Unless I totally misunderstood ... the latter is actually what we were suggesting
here /in the context of unmapping a GIOVA/(*).

The wrprotect() was there to ensure we get an atomic dirty state of the IOVA range
afterwards, by blocking DMA (as opposed to sort of mediating DMA). The I/O page fault is
not supposed to happen unless there's rogue DMA AIUI.
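
A minimal sketch of that atomic flavour, assuming a hypothetical
'write-protect and clear dirties' op as suggested above (none of these
names are real uAPI):

    /*
     * 1) hypothetical op: write-protect the range and read back + clear the
     *    dirty bits accumulated so far.  After its IOTLB flush the device
     *    can no longer dirty these pages behind the bitmap's back.
     */
    ioctl(iommufd, IOMMUFD_WRPROTECT_READ_DIRTY /* hypothetical */, &args);

    /*
     * 2) normal unmap: any DMA still hitting the range takes a permission
     *    fault instead of silently dirtying memory.
     */
    ioctl(iommufd, IOMMUFD_UNMAP /* hypothetical */, &args);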

TBH, the same could be said for a normal DMA unmap, as that does not make any sort
of guarantee of stopping DMA until the IOTLB flush happens.

(*) Although I am not saying the use-case of wrprotect() and mediating dirty pages you
mention isn't useful. I guess it is, in a world where we want to support post-copy
migration with VFs, which would require some form of PRI (via the PF?) of the migratable
VF. I was just trying to point out that this is in the context of unmapping an IOVA.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 11:03             ` Tian, Kevin
@ 2022-05-05 13:55               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-05 13:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, Alex Williamson, iommu, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Cornelia Huck, kvm

On Thu, May 05, 2022 at 11:03:18AM +0000, Tian, Kevin wrote:

> iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> in the said race window until unmap and iotlb is invalidated is completed.

No, the purpose is to perform "unmap" without destroying the dirty bit
in the process.
 
If an IOMMU architecture has a way to render the page unmapped and
flush back the dirty bit rather than destroy it, then it doesn't require
a write protect pass.
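
In kAPI terms that would be something along the lines of the sketch below;
the signature is invented here for illustration and is not the op proposed
in this series:

    /*
     * Unmap an IOVA range and, on IOMMUs that can do it, hand back the
     * dirty state of the IOPTEs being torn down instead of losing it,
     * so no prior write-protect pass is needed.
     */
    size_t iommu_unmap_read_dirty(struct iommu_domain *domain,
                                  unsigned long iova, size_t size,
                                  unsigned long *dirty_bitmap /* invented */);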

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05  7:40         ` Tian, Kevin
@ 2022-05-05 14:07           ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-05 14:07 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
 
> In concept this is an iommu property instead of a domain property.

Not really, domains shouldn't be changing behaviors once they are
created. If a domain supports dirty tracking and I attach a new device
then it still must support dirty tracking.

I suppose we may need something here, because we need to control when
domains are re-used if they don't have the right properties, in case
the system iommus are discontiguous somehow.

ie iommufd should be able to assert that dirty tracking is desired and
an existing non-dirty tracking capable domain will not be
automatically re-used.

We don't really have the right infrastructure to do this currently.

> From this angle IMHO it's more reasonable to report this IOMMU
> property to userspace via a device capability. If all devices attached
> to a hwpt claim IOMMU dirty tracking capability, the user can call
> set_dirty_tracking() on the hwpt object. 

Inherent domain properties need to be immutable or, at least one-way,
like enforced coherent, or it just all stops making any kind of sense.

> Once dirty tracking is enabled on a hwpt, further attaching a device
> which doesn't claim this capability is simply rejected.

It would be OK to do as enforced coherent does and flip a domain
permanently into dirty-tracking enabled, or to specify a flag at domain
creation time.
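
A hedged sketch of the two shapes this could take, with names invented
purely for illustration:

    /* a) decide at creation time; the property is then immutable */
    hwpt = iommufd_hwpt_alloc(ioas, IOMMUFD_HWPT_DIRTY_TRACKING /* invented flag */);

    /*
     * b) one-way flip, like enforced coherency: once flipped on, the domain
     *    never loses the property, so a later attach of a device whose IOMMU
     *    cannot track dirty bits is rejected.
     */
    iommufd_hwpt_enforce_dirty_tracking(hwpt /* invented helper */);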

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 11:50               ` Joao Martins
@ 2022-05-06  3:14                 ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-06  3:14 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe, Alex Williamson
  Cc: iommu, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L, Cornelia Huck,
	kvm

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Thursday, May 5, 2022 7:51 PM
> 
> On 5/5/22 12:03, Tian, Kevin wrote:
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Thursday, May 5, 2022 6:07 PM
> >>
> >> On 5/5/22 08:42, Tian, Kevin wrote:
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Tuesday, May 3, 2022 2:53 AM
> >>>>
> >>>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> >>>>> On Fri, 29 Apr 2022 05:45:20 +0000
> >>>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>>>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >>>>>>> unmap. This case is specific for non-nested vIOMMU case where an
> >>>>>>> erronous guest (or device) DMAing to an address being unmapped
> at
> >>>> the
> >>>>>>> same time.
> >>>>>>
> >>>>>> an erroneous attempt like above cannot anticipate which DMAs can
> >>>>>> succeed in that window thus the end behavior is undefined. For an
> >>>>>> undefined behavior nothing will be broken by losing some bits dirtied
> >>>>>> in the window between reading back dirty bits of the range and
> >>>>>> actually calling unmap. From guest p.o.v. all those are black-box
> >>>>>> hardware logic to serve a virtual iotlb invalidation request which just
> >>>>>> cannot be completed in one cycle.
> >>>>>>
> >>>>>> Hence in reality probably this is not required except to meet vfio
> >>>>>> compat requirement. Just in concept returning dirty bits at unmap
> >>>>>> is more accurate.
> >>>>>>
> >>>>>> I'm slightly inclined to abandon it in iommufd uAPI.
> >>>>>
> >>>>> Sorry, I'm not following why an unmap with returned dirty bitmap
> >>>>> operation is specific to a vIOMMU case, or in fact indicative of some
> >>>>> sort of erroneous, racy behavior of guest or device.
> >>>>
> >>>> It is being compared against the alternative which is to explicitly
> >>>> query dirty then do a normal unmap as two system calls and permit a
> >>>> race.
> >>>>
> >>>> The only case with any difference is if the guest is racing DMA with
> >>>> the unmap - in which case it is already indeterminate for the guest if
> >>>> the DMA will be completed or not.
> >>>>
> >>>> eg on the vIOMMU case if the guest races DMA with unmap then we
> are
> >>>> already fine with throwing away that DMA because that is how the race
> >>>> resolves during non-migration situations, so resovling it as throwing
> >>>> away the DMA during migration is OK too.
> >>>>
> >>>>> We need the flexibility to support memory hot-unplug operations
> >>>>> during migration,
> >>>>
> >>>> I would have thought that hotplug during migration would simply
> >>>> discard all the data - how does it use the dirty bitmap?
> >>>>
> >>>>> This was implemented as a single operation specifically to avoid
> >>>>> races where ongoing access may be available after retrieving a
> >>>>> snapshot of the bitmap.  Thanks,
> >>>>
> >>>> The issue is the cost.
> >>>>
> >>>> On a real iommu elminating the race is expensive as we have to write
> >>>> protect the pages before query dirty, which seems to be an extra IOTLB
> >>>> flush.
> >>>>
> >>>> It is not clear if paying this cost to become atomic is actually
> >>>> something any use case needs.
> >>>>
> >>>> So, I suggest we think about a 3rd op 'write protect and clear
> >>>> dirties' that will be followed by a normal unmap - the extra op will
> >>>> have the extra oveheard and userspace can decide if it wants to pay or
> >>>> not vs the non-atomic read dirties operation. And lets have a use case
> >>>> where this must be atomic before we implement it..
> >>>
> >>> and write-protection also relies on the support of I/O page fault...
> >>>
> >> /I think/ all IOMMUs in this series already support
> permission/unrecoverable
> >> I/O page faults for a long time IIUC.
> >>
> >> The earlier suggestion was just to discard the I/O page fault after
> >> write-protection happens. fwiw, some IOMMUs also support suppressing
> >> the event notification (like AMD).
> >
> > iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> > in the said race window until unmap and iotlb is invalidated is completed.
> >
> But then we depend on PRS being there on the device, because without it,
> DMA is
> aborted on the target on a read-only IOVA prior to the page fault, thus the
> page
> is not going to be dirty anyways.
> 
> > *unrecoverable* faults are not expected to be used in a feature path
> > as occurrence of such faults may lead to severe reaction in iommu
> > drivers e.g. completely block DMA from the device causing such faults.
> 
> Unless I totally misunderstood ... the later is actually what we were
> suggesting
> here /in the context of unmaping an GIOVA/(*).
> 
> The wrprotect() was there to ensure we get an atomic dirty state of the IOVA
> range
> afterwards, by blocking DMA (as opposed to sort of mediating DMA). The I/O
> page fault is
> not supposed to happen unless there's rogue DMA AIUI.

You are right. It's me misunderstanding the proposal here. 😊

> 
> TBH, the same could be said for normal DMA unmap as that does not make
> any sort of
> guarantees of stopping DMA until the IOTLB flush happens.
> 
> (*) Although I am not saying the use-case of wrprotect() and mediating dirty
> pages you say
> isn't useful. I guess it is in a world where we want support post-copy
> migration with VFs,
> which would require some form of PRI (via the PF?) of the migratable VF. I
> was just trying
> to differentiate that this in the context of unmapping an IOVA.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 13:55               ` Jason Gunthorpe via iommu
@ 2022-05-06  3:17                 ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-06  3:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Alex Williamson, iommu, Joerg Roedel,
	Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Cornelia Huck, kvm

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 5, 2022 9:55 PM
> 
> On Thu, May 05, 2022 at 11:03:18AM +0000, Tian, Kevin wrote:
> 
> > iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> > in the said race window until unmap and iotlb is invalidated is completed.
> 
> No, the purpose is to perform "unmap" without destroying the dirty bit
> in the process.
> 
> If an IOMMU architecture has a way to render the page unmaped and
> flush back the dirty bit/not destroy then it doesn't require a write
> protect pass.
> 

Yes, I see the point now. As you said, let's consider it only when
there is a real use case requiring such atomicity.

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-05 14:07           ` Jason Gunthorpe via iommu
@ 2022-05-06  3:51             ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-06  3:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 5, 2022 10:08 PM
> 
> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> 
> > In concept this is an iommu property instead of a domain property.
> 
> Not really, domains shouldn't be changing behaviors once they are
> created. If a domain supports dirty tracking and I attach a new device
> then it still must support dirty tracking.

That sort of suggests that userspace should specify whether a domain
supports dirty tracking when it's created. But how does userspace
know that it should create the domain in this way in the first place?
Live migration is triggered on demand and it may not happen in the
lifetime of a VM.

And if the user always creates the domain to allow dirty tracking by default,
how does it know that a failed attach is due to missing dirty tracking support
in the IOMMU, so that it can create another domain which disables dirty
tracking and retry the attach?

In any case IMHO having a device capability still sounds appealing even
in the above model, so userspace can create the domain with the right
property based on a potential list of devices to be attached. Once the
domain is created, further attached devices must be compatible with the
domain property.
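
As a sketch of that model from userspace's point of view (every helper,
flag and ioctl name below is illustrative only, not an existing interface):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* decide the hwpt property up front from a per-device capability */
    static bool all_devices_support_dirty(int iommufd, const uint32_t *dev_ids,
                                          size_t ndevs)
    {
            for (size_t i = 0; i < ndevs; i++) {
                    uint64_t caps = 0;

                    /* invented per-device capability query */
                    if (device_get_iommu_caps(iommufd, dev_ids[i], &caps) ||
                        !(caps & IOMMUFD_CAP_DIRTY_TRACKING))
                            return false;
            }
            return true;
    }

    /* the hwpt is then created with (or without) the dirty-tracking property,
     * and a later attach of an incompatible device is simply rejected */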

> 
> I suppose we may need something here because we need to control when
> domains are re-used if they don't have the right properties in case
> the system iommu's are discontiguous somehow.
> 
> ie iommufd should be able to assert that dirty tracking is desired and
> an existing non-dirty tracking capable domain will not be
> automatically re-used.
> 
> We don't really have the right infrastructure to do this currently.
> 
> > From this angle IMHO it's more reasonable to report this IOMMU
> > property to userspace via a device capability. If all devices attached
> > to a hwpt claim IOMMU dirty tracking capability, the user can call
> > set_dirty_tracking() on the hwpt object.
> 
> Inherent domain properties need to be immutable or, at least one-way,
> like enforced coherent, or it just all stops making any kind of sense.
> 
> > Once dirty tracking is enabled on a hwpt, further attaching a device
> > which doesn't claim this capability is simply rejected.
> 
> It would be OK to do as enforced coherent does as flip a domain
> permanently into dirty-tracking enabled, or specify a flag at domain
> creation time.
> 

Either way I think a device capability is useful for the user to decide
whether to do the one-way flip or to specify a flag at domain
creation.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-06  3:51             ` Tian, Kevin
@ 2022-05-06 11:46               ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-05-06 11:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, May 5, 2022 10:08 PM
> > 
> > On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> > 
> > > In concept this is an iommu property instead of a domain property.
> > 
> > Not really, domains shouldn't be changing behaviors once they are
> > created. If a domain supports dirty tracking and I attach a new device
> > then it still must support dirty tracking.
> 
> That sort of suggests that userspace should specify whether a domain
> supports dirty tracking when it's created. But how does userspace
> know that it should create the domain in this way in the first place? 
> live migration is triggered on demand and it may not happen in the
> lifetime of a VM.

The best you could do is to look at the devices being plugged in at VM
startup, and if they all support live migration then request dirty
tracking, otherwise don't.

However, it costs nothing to have dirty tracking as long as all iommus
support it in the system - which seems to be the normal case today.

We should just always turn it on at this point. 

> and if the user always creates domain to allow dirty tracking by default,
> how does it know a failed attach is due to missing dirty tracking support
> by the IOMMU and then creates another domain which disables dirty
> tracking and retry-attach again?

The automatic logic is complicated for sure; if you had a device flag
it would have to figure it out that way.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-06 11:46               ` Jason Gunthorpe via iommu
@ 2022-05-10  1:38                 ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-10  1:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 6, 2022 7:46 PM
> 
> On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, May 5, 2022 10:08 PM
> > >
> > > On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> > >
> > > > In concept this is an iommu property instead of a domain property.
> > >
> > > Not really, domains shouldn't be changing behaviors once they are
> > > created. If a domain supports dirty tracking and I attach a new device
> > > then it still must support dirty tracking.
> >
> > That sort of suggests that userspace should specify whether a domain
> > supports dirty tracking when it's created. But how does userspace
> > know that it should create the domain in this way in the first place?
> > live migration is triggered on demand and it may not happen in the
> > lifetime of a VM.
> 
> The best you could do is to look at the devices being plugged in at VM
> startup, and if they all support live migration then request dirty
> tracking, otherwise don't.

Yes, this is how a device capability can help.

> 
> However, tt costs nothing to have dirty tracking as long as all iommus
> support it in the system - which seems to be the normal case today.
> 
> We should just always turn it on at this point.

Then we still need a way to report "all iommus support it in the system"
to userspace, since many old systems don't support it at all. If we all
agree that a device capability flag would be helpful on this front (like
you also said below), we can probably start building the initial skeleton
with that in mind?

> 
> > and if the user always creates domain to allow dirty tracking by default,
> > how does it know a failed attach is due to missing dirty tracking support
> > by the IOMMU and then creates another domain which disables dirty
> > tracking and retry-attach again?
> 
> The automatic logic is complicated for sure, if you had a device flag
> it would have to figure it out that way
> 

Yes. That is the model in my mind.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10  1:38                 ` Tian, Kevin
@ 2022-05-10 11:50                   ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-10 11:50 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

On 5/10/22 02:38, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, May 6, 2022 7:46 PM
>>
>> On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Thursday, May 5, 2022 10:08 PM
>>>>
>>>> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
>>>>
>>>>> In concept this is an iommu property instead of a domain property.
>>>>
>>>> Not really, domains shouldn't be changing behaviors once they are
>>>> created. If a domain supports dirty tracking and I attach a new device
>>>> then it still must support dirty tracking.
>>>
>>> That sort of suggests that userspace should specify whether a domain
>>> supports dirty tracking when it's created. But how does userspace
>>> know that it should create the domain in this way in the first place?
>>> live migration is triggered on demand and it may not happen in the
>>> lifetime of a VM.
>>
>> The best you could do is to look at the devices being plugged in at VM
>> startup, and if they all support live migration then request dirty
>> tracking, otherwise don't.
> 
> Yes, this is how a device capability can help.
> 
>>
>> However, tt costs nothing to have dirty tracking as long as all iommus
>> support it in the system - which seems to be the normal case today.
>>
>> We should just always turn it on at this point.
> 
> Then still need a way to report " all iommus support it in the system"
> to userspace since many old systems don't support it at all. If we all
> agree that a device capability flag would be helpful on this front (like
> you also said below), probably can start building the initial skeleton
> with that in mind?
> 

This would capture device-specific and maybe iommu-instance features, but
there's a slightly odd semantic here. Nothing really depends on the device
to support any of this; it is rather the IOMMU instance that sits below the
device, which is independent of the device's own capabilities. PRI, on the
other hand, would be a perfect fit for a device capability (?), but conveying
dirty tracking over a device capability would be a convenience rather than an
exact hw representation.

Thinking out loud if we go with a device/iommu capability [to see if this matches
what people have in mind or not]: we would add a dirty-tracking feature bit via the
existing kAPI for iommu device features (e.g. IOMMU_DEV_FEAT_AD), and on iommufd we
would maybe add an IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl which would take a u64
dev_id as input (from the returned vfio-pci BIND_IOMMUFD @out_dev_id) and a u64
features as an output bitmap of synthetic feature bits, with IOMMUFD_FEATURE_AD being
the only one we query for now (and IOMMUFD_FEATURE_{SVA,IOPF} as potential future
candidates). Qemu would then, at start of day, check whether /all devices/ support it;
it would still do the blind set tracking, but bail out preemptively if any device's
IOMMU doesn't support dirty tracking. I don't think we have any case today where we
have to deal with different IOMMU instances that have different features.

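For concreteness, this is roughly how I picture that ioctl (the names, layout and
bit assignments below are all made up at this point, just a sketch):

struct iommufd_dev_get_iommu_features {
	__u32 size;
	__u32 flags;
	__u64 dev_id;		/* in: @out_dev_id from vfio-pci BIND_IOMMUFD */
	__u64 features;		/* out: bitmap of synthetic feature bits */
};

#define IOMMUFD_FEATURE_AD	(1ULL << 0)	/* IOMMU access/dirty tracking */
/* potential future bits: IOMMUFD_FEATURE_SVA, IOMMUFD_FEATURE_IOPF, ... */
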
Either that, or as discussed in the beginning, perhaps add an iommufd (or iommufd hwpt)
ioctl call (e.g. IOMMUFD_CMD_CAP) taking an input value (e.g. subop IOMMU_FEATURES) which
would give us a structure of things (e.g. for the IOMMU_FEATURES subop, the featureset
bitmap common to all iommu instances). This would answer the 'all iommus support it in
the system' question. Albeit the device one might have more concrete longevity if there
are further plans aside from dirty tracking.

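Again just to illustrate (every name here is invented for the sake of discussion):

struct iommufd_cap {
	__u32 size;
	__u32 subop;		/* in: e.g. IOMMU_FEATURES */
	__u64 val;		/* out: for IOMMU_FEATURES, the featureset bitmap
				 * common to all iommu instances in the system */
};
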
>>
>>> and if the user always creates domain to allow dirty tracking by default,
>>> how does it know a failed attach is due to missing dirty tracking support
>>> by the IOMMU and then creates another domain which disables dirty
>>> tracking and retry-attach again?
>>
>> The automatic logic is complicated for sure, if you had a device flag
>> it would have to figure it out that way
>>
> 
> Yes. That is the model in my mind.
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10  1:38                 ` Tian, Kevin
@ 2022-05-10 13:46                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe via iommu @ 2022-05-10 13:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jean-Philippe Brucker, Yishai Hadas, kvm, Will Deacon,
	Cornelia Huck, iommu, Alex Williamson, Martins, Joao,
	David Woodhouse, Robin Murphy

On Tue, May 10, 2022 at 01:38:26AM +0000, Tian, Kevin wrote:

> > However, it costs nothing to have dirty tracking as long as all iommus
> > support it in the system - which seems to be the normal case today.
> > 
> > We should just always turn it on at this point.
> 
> Then still need a way to report " all iommus support it in the system"
> to userspace since many old systems don't support it at all. 

Userspace can query the iommu_domain directly, or 'try and fail' to
turn on tracking.

A device capability flag is useless without a control knob to request
a domain is created with tracking, and we don't have that, or a reason
to add that.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10 13:46                   ` Jason Gunthorpe
@ 2022-05-11  1:10                     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-11  1:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Martins, Joao, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, May 10, 2022 9:47 PM
> 
> On Tue, May 10, 2022 at 01:38:26AM +0000, Tian, Kevin wrote:
> 
> > > However, it costs nothing to have dirty tracking as long as all iommus
> > > support it in the system - which seems to be the normal case today.
> > >
> > > We should just always turn it on at this point.
> >
> > Then still need a way to report " all iommus support it in the system"
> > to userspace since many old systems don't support it at all.
> 
> Userspace can query the iommu_domain directly, or 'try and fail' to
> turn on tracking.
> 
> A device capability flag is useless without a control knob to request
> a domain is created with tracking, and we don't have that, or a reason
> to add that.
> 

I'm getting confused by your last comment. A capability flag has to be
accompanied by a control knob, which iiuc is what you advocated
in the earlier discussion, i.e. specifying the tracking property when creating
the domain. In this case the flag assists userspace in deciding
whether to set the property.

Not sure whether we argued past each other, but here is another
attempt.

In general I saw three options here:

a) 'try and fail' when creating the domain. It succeeds only when
all iommus support tracking;

b) capability reported on the iommu domain. The capability is reported true
only when all iommus support tracking. This allows the domain property
to be set after the domain is created. But there is not much gain in doing
so compared to a).

c) capability reported on the device. Future compatible for heterogeneous
platforms. The domain property is specified at domain creation and domains
can have different properties based on the tracking capability of the attached
devices.

I'm inclined towards c) as it is more aligned with Robin's cleanup effort on
iommu_capable() and iommu_present() in the iommu layer, which
moves away from a global manner to a per-device style. Along with 
that direction I guess we want to discourage adding more APIs that
assume the 'all iommus support a certain capability' thing?

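To make the difference concrete, a very rough userspace-side sketch (pseudo-code;
the ioctl and flag names are either made up or taken from Joao's earlier mail,
purely for illustration):

	/* a) 'try and fail': attempt to turn tracking on and cope with failure */
	ret = ioctl(hwpt_fd, IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING, &set);
	if (ret < 0)
		migration_supported = false;	/* no dirty tracking available */

	/* c) query the capability per device first, and request tracking at
	 * domain creation only when every device's IOMMU reports support */
	ret = ioctl(iommufd, IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES, &feat);
	if (!ret && (feat.features & IOMMUFD_FEATURE_AD))
		alloc.flags |= IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
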
Thanks
Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-10 11:50                   ` Joao Martins
@ 2022-05-11  1:17                     ` Tian, Kevin
  -1 siblings, 0 replies; 209+ messages in thread
From: Tian, Kevin @ 2022-05-11  1:17 UTC (permalink / raw)
  To: Martins, Joao, Jason Gunthorpe
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Tuesday, May 10, 2022 7:51 PM
> 
> On 5/10/22 02:38, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Friday, May 6, 2022 7:46 PM
> >>
> >> On Fri, May 06, 2022 at 03:51:40AM +0000, Tian, Kevin wrote:
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Thursday, May 5, 2022 10:08 PM
> >>>>
> >>>> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
> >>>>
> >>>>> In concept this is an iommu property instead of a domain property.
> >>>>
> >>>> Not really, domains shouldn't be changing behaviors once they are
> >>>> created. If a domain supports dirty tracking and I attach a new device
> >>>> then it still must support dirty tracking.
> >>>
> >>> That sort of suggests that userspace should specify whether a domain
> >>> supports dirty tracking when it's created. But how does userspace
> >>> know that it should create the domain in this way in the first place?
> >>> live migration is triggered on demand and it may not happen in the
> >>> lifetime of a VM.
> >>
> >> The best you could do is to look at the devices being plugged in at VM
> >> startup, and if they all support live migration then request dirty
> >> tracking, otherwise don't.
> >
> > Yes, this is how a device capability can help.
> >
> >>
> >> However, it costs nothing to have dirty tracking as long as all iommus
> >> support it in the system - which seems to be the normal case today.
> >>
> >> We should just always turn it on at this point.
> >
> > Then still need a way to report " all iommus support it in the system"
> > to userspace since many old systems don't support it at all. If we all
> > agree that a device capability flag would be helpful on this front (like
> > you also said below), probably can start building the initial skeleton
> > with that in mind?
> >
> 
> This would capture device-specific and maybe iommu-instance features, but
> there's some tiny bit odd semantic here. There's nothing that
> depends on the device to support any of this, but rather the IOMMU instance
> that sits
> below the device which is independent of device-own capabilities e.g. PRI on
> the other
> hand would be a perfect fit for a device capability (?), but dirty tracking
> conveying over a device capability would be a convenience rather than an
> exact
> hw representation.

It is sort of getting a certain iommu capability for a given device, which is
the direction the iommu kAPI is moving toward.

> 
> Thinking out loud if we are going as a device/iommu capability [to see if this
> matches
> what people have or not in mind]: we would add dirty-tracking feature bit via
> the existent
> kAPI for iommu device features (e.g. IOMMU_DEV_FEAT_AD) and on
> iommufd we would maybe add
> an IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl which would have an
> u64 dev_id as input (from
> the returned vfio-pci BIND_IOMMUFD @out_dev_id) and u64 features as an
> output bitmap of
> synthetic feature bits, having IOMMUFD_FEATURE_AD the only one we query
> (and
> IOMMUFD_FEATURE_{SVA,IOPF} as potentially future candidates). Qemu
> would then at start of
> day would check if /all devices/ support it and it would then still do the blind
> set
> tracking, but bail out preemptively if any of device-iommu don't support
> dirty-tracking. I
> don't think we have any case today for having to deal with different IOMMU
> instances that
> have different features.

This heterogeneity already exists today. On Intel platforms not all IOMMUs
support force snooping. I believe ARM has a similar situation, which is why
Robin is refactoring the bus-oriented iommu_capable() etc. to be device-oriented.

I'm not aware of such heterogeneity for dirty tracking in particular today. But
who knows whether it won't happen in the future? I just feel that aligning the iommufd
uAPI with the iommu kAPI for capability reporting might be more future proof here.

> 
> Either that or as discussed in the beginning perhaps add an iommufd (or
> iommufd hwpt one)
> ioctl  call (e.g.IOMMUFD_CMD_CAP) via a input value (e.g. subop
> IOMMU_FEATURES) which
> would gives us a structure of things (e.g. for the IOMMU_FEATURES subop
> the common
> featureset bitmap in all iommu instances). This would give the 'all iommus
> support it in
> the system'. Albeit the device one might have more concrete longevity if
> there's further
> plans aside from dirty tracking.
> 
> >>
> >>> and if the user always creates domain to allow dirty tracking by default,
> >>> how does it know a failed attach is due to missing dirty tracking support
> >>> by the IOMMU and then creates another domain which disables dirty
> >>> tracking and retry-attach again?
> >>
> >> The automatic logic is complicated for sure, if you had a device flag
> >> it would have to figure it out that way
> >>
> >
> > Yes. That is the model in my mind.
> >
> > Thanks
> > Kevin

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-04-28 21:09   ` Joao Martins
@ 2022-05-31 11:34     ` Suravee Suthikulpanit
  -1 siblings, 0 replies; 209+ messages in thread
From: Suravee Suthikulpanit via iommu @ 2022-05-31 11:34 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Jean-Philippe Brucker, Kevin Tian, Yishai Hadas, kvm,
	Will Deacon, Cornelia Huck, Alex Williamson, Jason Gunthorpe,
	David Woodhouse, Robin Murphy

Joao,

On 4/29/22 4:09 AM, Joao Martins wrote:
> .....
> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
> +					bool enable)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct iommu_dev_data *dev_data;
> +	bool dom_flush = false;
> +
> +	if (!amd_iommu_had_support)
> +		return -EOPNOTSUPP;
> +
> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {

Since we iterate through the device list for the domain, we would need to
call spin_lock_irqsave(&pdomain->lock, flags) here.

> +		struct amd_iommu *iommu;
> +		u64 pte_root;
> +
> +		iommu = amd_iommu_rlookup_table[dev_data->devid];
> +		pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
> +
> +		/* No change? */
> +		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
> +			continue;
> +
> +		pte_root = (enable ?
> +			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
> +
> +		/* Flush device DTE */
> +		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
> +		device_flush_dte(dev_data);
> +		dom_flush = true;
> +	}
> +
> +	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
> +	if (dom_flush) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&pdomain->lock, flags);
> +		amd_iommu_domain_flush_tlb_pde(pdomain);
> +		amd_iommu_domain_flush_complete(pdomain);
> +		spin_unlock_irqrestore(&pdomain->lock, flags);
> +	}

And call spin_unlock_irqrestore(&pdomain->lock, flags); here.
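
In other words, something along these lines (untested, just to illustrate the
suggested lock placement; the inner locking around the flush would then go away):

	unsigned long flags;

	spin_lock_irqsave(&pdomain->lock, flags);

	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		/* ... DTE_FLAG_HAD update and device_flush_dte() as above ... */
	}

	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
	if (dom_flush) {
		amd_iommu_domain_flush_tlb_pde(pdomain);
		amd_iommu_domain_flush_complete(pdomain);
	}

	spin_unlock_irqrestore(&pdomain->lock, flags);

	return 0;
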
> +
> +	return 0;
> +}
> +
> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct iommu_dev_data *dev_data;
> +	u64 dte;
> +

Also call spin_lock_irqsave(&pdomain->lock, flags) here

> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> +		dte = amd_iommu_dev_table[dev_data->devid].data[0];
> +		if (!(dte & DTE_FLAG_HAD))
> +			return false;
> +	}
> +

And call spin_unlock_irqrestore(&pdomain->lock, flags) here.
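
With the lock held, the early return inside the loop would need to change as
well; a rough sketch:

	struct iommu_dev_data *dev_data;
	unsigned long flags;
	bool ret = true;

	spin_lock_irqsave(&pdomain->lock, flags);

	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		if (!(amd_iommu_dev_table[dev_data->devid].data[0] & DTE_FLAG_HAD)) {
			ret = false;
			break;
		}
	}

	spin_unlock_irqrestore(&pdomain->lock, flags);

	return ret;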

> +	return true;
> +}
> +
> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
> +					  unsigned long iova, size_t size,
> +					  struct iommu_dirty_bitmap *dirty)
> +{
> +	struct protection_domain *pdomain = to_pdomain(domain);
> +	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
> +
> +	if (!amd_iommu_get_dirty_tracking(domain))
> +		return -EOPNOTSUPP;
> +
> +	if (!ops || !ops->read_and_clear_dirty)
> +		return -ENODEV;

We should move this check before the amd_iommu_get_dirty_tracking().

Best Regards,
Suravee

> +
> +	return ops->read_and_clear_dirty(ops, iova, size, dirty);
> +}
> +
> +
>   static void amd_iommu_get_resv_regions(struct device *dev,
>   				       struct list_head *head)
>   {
> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
>   		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>   		.iotlb_sync	= amd_iommu_iotlb_sync,
>   		.free		= amd_iommu_domain_free,
> +		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
> +		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>   	}
>   };
>   

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-05-31 11:34     ` Suravee Suthikulpanit
@ 2022-05-31 12:15       ` Baolu Lu
  -1 siblings, 0 replies; 209+ messages in thread
From: Baolu Lu @ 2022-05-31 12:15 UTC (permalink / raw)
  To: Suravee Suthikulpanit, Joao Martins, iommu
  Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Jason Gunthorpe, Nicolin Chen, Yishai Hadas,
	Kevin Tian, Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck,
	kvm

Hi Suravee,

On 2022/5/31 19:34, Suravee Suthikulpanit wrote:
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> .....
>> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +                    bool enable)
>> +{
>> +    struct protection_domain *pdomain = to_pdomain(domain);
>> +    struct iommu_dev_data *dev_data;
>> +    bool dom_flush = false;
>> +
>> +    if (!amd_iommu_had_support)
>> +        return -EOPNOTSUPP;
>> +
>> +    list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> 
> Since we iterate through device list for the domain, we would need to
> call spin_lock_irqsave(&pdomain->lock, flags) here.

Not related, just out of curiosity: does it really need to disable interrupts
while holding this lock? Is there any case where this list would be traversed
in interrupt context? Perhaps I missed something?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
  2022-04-28 21:09   ` Joao Martins
@ 2022-05-31 12:39     ` Suravee Suthikulpanit via iommu
  -1 siblings, 0 replies; 209+ messages in thread
From: Suravee Suthikulpanit @ 2022-05-31 12:39 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jean-Philippe Brucker,
	Keqian Zhu, Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm



On 4/29/22 4:09 AM, Joao Martins wrote:
> AMD implementation of unmap_read_dirty() is pretty simple as
> mostly reuses unmap code with the extra addition of marshalling
> the dirty bit into the bitmap as it walks the to-be-unmapped
> IOPTE.
> 
> Extra care is taken though, to switch over to cmpxchg as opposed
> to a non-serialized store to the PTE and testing the dirty bit
> only set until cmpxchg succeeds to set to 0.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/iommu/amd/io_pgtable.c | 44 +++++++++++++++++++++++++++++-----
>   drivers/iommu/amd/iommu.c      | 22 +++++++++++++++++
>   2 files changed, 60 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> index 8325ef193093..1868c3b58e6d 100644
> --- a/drivers/iommu/amd/io_pgtable.c
> +++ b/drivers/iommu/amd/io_pgtable.c
> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist)
>   	free_sub_pt(pt, mode, freelist);
>   }
>   
> +static bool free_pte_dirty(u64 *pte, u64 pteval)

Nitpick: since we free and clear the dirty bit, should we change
the function name to free_clear_pte_dirty()?

> +{
> +	bool dirty = false;
> +
> +	while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))

We should use 0ULL instead of 0.

> +		dirty = true;
> +
> +	return dirty;
> +}
> +

Actually, what do you think about enhancing the current free_clear_pte()
to also handle the dirty check as well?

>   /*
>    * Generic mapping functions. It maps a physical address into a DMA
>    * address space. It allocates the page table pages if necessary.
> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova,
>   	return ret;
>   }
>   
> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
> -				      unsigned long iova,
> -				      size_t size,
> -				      struct iommu_iotlb_gather *gather)
> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
> +					   unsigned long iova,
> +					   size_t size,
> +					   struct iommu_iotlb_gather *gather,
> +					   struct iommu_dirty_bitmap *dirty)
>   {
>   	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>   	unsigned long long unmapped;
> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>   	while (unmapped < size) {
>   		pte = fetch_pte(pgtable, iova, &unmap_size);
>   		if (pte) {
> -			int i, count;
> +			unsigned long i, count;
> +			bool pte_dirty = false;
>   
>   			count = PAGE_SIZE_PTE_COUNT(unmap_size);
>   			for (i = 0; i < count; i++)
> -				pte[i] = 0ULL;
> +				pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
> +

Actually, what if we change the existing free_clear_pte() to free_and_clear_dirty_pte(),
and incorporate the logic for reading the dirty bit there?

> ...
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 0a86392b2367..a8fcb6e9a684 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
>   	return r;
>   }
>   
> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
> +					 unsigned long iova, size_t page_size,
> +					 struct iommu_iotlb_gather *gather,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct protection_domain *domain = to_pdomain(dom);
> +	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
> +	size_t r;
> +
> +	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
> +	    (domain->iop.mode == PAGE_MODE_NONE))
> +		return 0;
> +
> +	r = (ops->unmap_read_dirty) ?
> +		ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
> +
> +	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
> +
> +	return r;
> +}
> +

Instead of creating a new function, what if we enhance the current amd_iommu_unmap()
to also handle the read-dirty part as well (e.g. via a common __amd_iommu_unmap_read_dirty()),
so that both amd_iommu_unmap() and amd_iommu_unmap_read_dirty() can call
__amd_iommu_unmap_read_dirty()?
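
I.e. roughly (a sketch only; it assumes amd_iommu_unmap()'s existing body is the
dirty == NULL case of the same logic):

static size_t __amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
					   unsigned long iova, size_t page_size,
					   struct iommu_iotlb_gather *gather,
					   struct iommu_dirty_bitmap *dirty)
{
	struct protection_domain *domain = to_pdomain(dom);
	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
	size_t r;

	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
	    (domain->iop.mode == PAGE_MODE_NONE))
		return 0;

	if (dirty)
		r = ops->unmap_read_dirty ?
			ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
	else
		r = ops->unmap ? ops->unmap(ops, iova, page_size, gather) : 0;

	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);

	return r;
}

static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
			      size_t page_size, struct iommu_iotlb_gather *gather)
{
	return __amd_iommu_unmap_read_dirty(dom, iova, page_size, gather, NULL);
}

static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
					 unsigned long iova, size_t page_size,
					 struct iommu_iotlb_gather *gather,
					 struct iommu_dirty_bitmap *dirty)
{
	return __amd_iommu_unmap_read_dirty(dom, iova, page_size, gather, dirty);
}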

Best Regards,
Suravee

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
  2022-05-31 11:34     ` Suravee Suthikulpanit
@ 2022-05-31 15:22       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-31 15:22 UTC (permalink / raw)
  To: Suravee Suthikulpanit, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jean-Philippe Brucker,
	Keqian Zhu, Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

On 5/31/22 12:34, Suravee Suthikulpanit wrote:
> Joao,
> 
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> .....
>> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> +					bool enable)
>> +{
>> +	struct protection_domain *pdomain = to_pdomain(domain);
>> +	struct iommu_dev_data *dev_data;
>> +	bool dom_flush = false;
>> +
>> +	if (!amd_iommu_had_support)
>> +		return -EOPNOTSUPP;
>> +
>> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> 
> Since we iterate through device list for the domain, we would need to
> call spin_lock_irqsave(&pdomain->lock, flags) here.
> 
Ugh, yes. Will fix.

>> +		struct amd_iommu *iommu;
>> +		u64 pte_root;
>> +
>> +		iommu = amd_iommu_rlookup_table[dev_data->devid];
>> +		pte_root = amd_iommu_dev_table[dev_data->devid].data[0];
>> +
>> +		/* No change? */
>> +		if (!(enable ^ !!(pte_root & DTE_FLAG_HAD)))
>> +			continue;
>> +
>> +		pte_root = (enable ?
>> +			pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD);
>> +
>> +		/* Flush device DTE */
>> +		amd_iommu_dev_table[dev_data->devid].data[0] = pte_root;
>> +		device_flush_dte(dev_data);
>> +		dom_flush = true;
>> +	}
>> +
>> +	/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */
>> +	if (dom_flush) {
>> +		unsigned long flags;
>> +
>> +		spin_lock_irqsave(&pdomain->lock, flags);
>> +		amd_iommu_domain_flush_tlb_pde(pdomain);
>> +		amd_iommu_domain_flush_complete(pdomain);
>> +		spin_unlock_irqrestore(&pdomain->lock, flags);
>> +	}
> 
> And call spin_unlock_irqrestore(&pdomain->lock, flags); here.

ack

Additionally, something that I am thinking of for v2 is to have a
@had bool field in iommu_dev_data. That would align better with the
rest of the amd iommu code rather than introducing this pattern of
using the hardware location of the PTE roots. Let me know if you disagree.

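Roughly (a sketch only; the field name is up for bikeshedding):

	/* in struct iommu_dev_data */
	bool had;	/* hardware access/dirty (DTE_FLAG_HAD) currently enabled */

	/* ... and amd_iommu_get_dirty_tracking() would then simply do: */
	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
		if (!dev_data->had)
			return false;
	}
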
>> +
>> +	return 0;
>> +}
>> +
>> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain)
>> +{
>> +	struct protection_domain *pdomain = to_pdomain(domain);
>> +	struct iommu_dev_data *dev_data;
>> +	u64 dte;
>> +
> 
> Also call spin_lock_irqsave(&pdomain->lock, flags) here
> 
ack
>> +	list_for_each_entry(dev_data, &pdomain->dev_list, list) {
>> +		dte = amd_iommu_dev_table[dev_data->devid].data[0];
>> +		if (!(dte & DTE_FLAG_HAD))
>> +			return false;
>> +	}
>> +
> 
> And call spin_unlock_irqsave(&pdomain->lock, flags) here
> 
ack

Same comment as above applies, and the @dte checking would then be
replaced by simply checking this new field.

>> +	return true;
>> +}
>> +
>> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain,
>> +					  unsigned long iova, size_t size,
>> +					  struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct protection_domain *pdomain = to_pdomain(domain);
>> +	struct io_pgtable_ops *ops = &pdomain->iop.iop.ops;
>> +
>> +	if (!amd_iommu_get_dirty_tracking(domain))
>> +		return -EOPNOTSUPP;
>> +
>> +	if (!ops || !ops->read_and_clear_dirty)
>> +		return -ENODEV;
> 
> We move this check before the amd_iommu_get_dirty_tracking().
> 

Yeap, better fail earlier.

> Best Regards,
> Suravee
> 
>> +
>> +	return ops->read_and_clear_dirty(ops, iova, size, dirty);
>> +}
>> +
>> +
>>   static void amd_iommu_get_resv_regions(struct device *dev,
>>   				       struct list_head *head)
>>   {
>> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = {
>>   		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>>   		.iotlb_sync	= amd_iommu_iotlb_sync,
>>   		.free		= amd_iommu_domain_free,
>> +		.set_dirty_tracking = amd_iommu_set_dirty_tracking,
>> +		.read_and_clear_dirty = amd_iommu_read_and_clear_dirty,
>>   	}
>>   };
>>   

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
  2022-05-31 12:39     ` Suravee Suthikulpanit via iommu
@ 2022-05-31 15:51       ` Joao Martins
  -1 siblings, 0 replies; 209+ messages in thread
From: Joao Martins @ 2022-05-31 15:51 UTC (permalink / raw)
  To: Suravee Suthikulpanit, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jean-Philippe Brucker,
	Keqian Zhu, Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm

On 5/31/22 13:39, Suravee Suthikulpanit wrote:
> On 4/29/22 4:09 AM, Joao Martins wrote:
>> AMD implementation of unmap_read_dirty() is pretty simple as
>> mostly reuses unmap code with the extra addition of marshalling
>> the dirty bit into the bitmap as it walks the to-be-unmapped
>> IOPTE.
>>
>> Extra care is taken though, to switch over to cmpxchg as opposed
>> to a non-serialized store to the PTE and testing the dirty bit
>> only set until cmpxchg succeeds to set to 0.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   drivers/iommu/amd/io_pgtable.c | 44 +++++++++++++++++++++++++++++-----
>>   drivers/iommu/amd/iommu.c      | 22 +++++++++++++++++
>>   2 files changed, 60 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
>> index 8325ef193093..1868c3b58e6d 100644
>> --- a/drivers/iommu/amd/io_pgtable.c
>> +++ b/drivers/iommu/amd/io_pgtable.c
>> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist)
>>   	free_sub_pt(pt, mode, freelist);
>>   }
>>   
>> +static bool free_pte_dirty(u64 *pte, u64 pteval)
> 
> Nitpick: Since we free and clearing the dirty bit, should we change
> the function name to free_clear_pte_dirty()?
> 

We free and *read* the dirty bit. It just so happens that we clear the dirty
bit and every other one in the process. Just to make it clear that I am not
clearing the dirty bit explicitly (like read_and_clear_dirty() does).

>> +{
>> +	bool dirty = false;
>> +
>> +	while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0)))
> 
> We should use 0ULL instead of 0.
>

ack.

>> +		dirty = true;
>> +
>> +	return dirty;
>> +}
>> +
> 
> Actually, what do you think if we enhance the current free_clear_pte()
> to also handle the check dirty as well?
> 
See further below, about dropping this patch.

>>   /*
>>    * Generic mapping functions. It maps a physical address into a DMA
>>    * address space. It allocates the page table pages if necessary.
>> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova,
>>   	return ret;
>>   }
>>   
>> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> -				      unsigned long iova,
>> -				      size_t size,
>> -				      struct iommu_iotlb_gather *gather)
>> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>> +					   unsigned long iova,
>> +					   size_t size,
>> +					   struct iommu_iotlb_gather *gather,
>> +					   struct iommu_dirty_bitmap *dirty)
>>   {
>>   	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>   	unsigned long long unmapped;
>> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops,
>>   	while (unmapped < size) {
>>   		pte = fetch_pte(pgtable, iova, &unmap_size);
>>   		if (pte) {
>> -			int i, count;
>> +			unsigned long i, count;
>> +			bool pte_dirty = false;
>>   
>>   			count = PAGE_SIZE_PTE_COUNT(unmap_size);
>>   			for (i = 0; i < count; i++)
>> -				pte[i] = 0ULL;
>> +				pte_dirty |= free_pte_dirty(&pte[i], pte[i]);
>> +
> 
> Actually, what if we change the existing free_clear_pte() to free_and_clear_dirty_pte(),
> and incorporate the logic for
> 
Likewise, but otherwise it would be a good idea.

>> ...
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 0a86392b2367..a8fcb6e9a684 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
>>   	return r;
>>   }
>>   
>> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom,
>> +					 unsigned long iova, size_t page_size,
>> +					 struct iommu_iotlb_gather *gather,
>> +					 struct iommu_dirty_bitmap *dirty)
>> +{
>> +	struct protection_domain *domain = to_pdomain(dom);
>> +	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
>> +	size_t r;
>> +
>> +	if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
>> +	    (domain->iop.mode == PAGE_MODE_NONE))
>> +		return 0;
>> +
>> +	r = (ops->unmap_read_dirty) ?
>> +		ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0;
>> +
>> +	amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
>> +
>> +	return r;
>> +}
>> +
> 
> Instead of creating a new function, what if we enhance the current amd_iommu_unmap()
> to also handle read dirty part as well (e.g. __amd_iommu_unmap_read_dirty()), and
> then both amd_iommu_unmap() and amd_iommu_unmap_read_dirty() can call
> the __amd_iommu_unmap_read_dirty()?

Yes, if we were to keep this one.

I am actually dropping this patch (and the whole unmap_read_dirty additions).
The unmap_read_dirty() will be replaced either by having userspace do get_dirty_iova()
before the unmap(), or by keeping the uAPI in iommufd but implementing it as a read_dirty()
followed by a regular unmap, without the special IOMMU unmap path. See this thread that
starts here:

	https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/

But essentially, the proposed unmap_read_dirty primitive isn't fully race free, as it only
tackles races against the IOMMU updating the IOPTE. DMA could still be happening between the
time I clear the PTE and when I do the IOMMU TLB flush (think vIOMMU use cases). Eliminating
the race fully is expensive, requiring an extra TLB flush + IOPT walk in addition to the unmap
one (we would essentially double the cost). The thinking is that an alternative new primitive
would instead write-protect the IOVA (thus blocking DMA), flush the IOTLB, and only then read
out the dirty bits, with the unmap itself being a regular unmap. For now I won't be adding
this, as it is not clear that any use case really needs it.
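
A rough sketch of what that alternative sequence could look like (all helper names here are
made up purely for illustration; none of this exists in the series):

	/* 1) write-protect the range so the device can no longer mark it dirty */
	iopt_wrprotect_iova(iopt, iova, length);		/* hypothetical */
	/* 2) flush the IOTLB so in-flight translations are dropped */
	iommu_flush_iotlb_all(domain);
	/* 3) dirty bits are now stable; read them out into the user bitmap */
	iopt_read_and_clear_dirty(iopt, iova, length, bitmap);	/* hypothetical */
	/* 4) the unmap itself is just the regular unmap path */
	iopt_unmap_iova(iopt, iova, length);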

	Joao

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-05-11  1:10                     ` Tian, Kevin
  (?)
@ 2022-07-12 18:34                     ` Joao Martins
  2022-07-21 14:24                       ` Jason Gunthorpe
  -1 siblings, 1 reply; 209+ messages in thread
From: Joao Martins @ 2022-07-12 18:34 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, Keqian Zhu, Shameerali Kolothum Thodi,
	David Woodhouse, Lu Baolu, Nicolin Chen, Yishai Hadas,
	Eric Auger, Liu, Yi L, Alex Williamson, Cornelia Huck, kvm,
	iommu

On 5/11/22 02:10, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Tuesday, May 10, 2022 9:47 PM
>>
>> On Tue, May 10, 2022 at 01:38:26AM +0000, Tian, Kevin wrote:
>>
>>>> However, tt costs nothing to have dirty tracking as long as all iommus
>>>> support it in the system - which seems to be the normal case today.
>>>>
>>>> We should just always turn it on at this point.
>>>
>>> Then still need a way to report " all iommus support it in the system"
>>> to userspace since many old systems don't support it at all.
>>
>> Userspace can query the iommu_domain directly, or 'try and fail' to
>> turn on tracking.
>>
>> A device capability flag is useless without a control knob to request
>> a domain is created with tracking, and we don't have that, or a reason
>> to add that.
>>
> 
> I'm getting confused on your last comment. A capability flag has to
> accompany with a control knob which iiuc is what you advocated
> in earlier discussion i.e. specifying the tracking property when creating
> the domain. In this case the flag assists the userspace in deciding
> whether to set the property.
> 
> Not sure whether we argued pass each other but here is another attempt.
> 
> In general I saw three options here:
> 
> a) 'try and fail' when creating the domain. It succeeds only when
> all iommus support tracking;
> 
> b) capability reported on iommu domain. The capability is reported true
> only when all iommus support tracking. This allows domain property
> to be set after domain is created. But there is no much gain of doing
> so when comparing to a).
> 
> c) capability reported on device. future compatible for heterogenous
> platform. domain property is specified at domain creation and domains
> can have different properties based on tracking capability of attached
> devices.
> 
> I'm inclined to c) as it is more aligned to Robin's cleanup effort on
> iommu_capable() and iommu_present() in the iommu layer which
> moves away from global manner to per-device style. Along with 
> that direction I guess we want to discourage adding more APIs
> assuming 'all iommus supporting certain capability' thing?
> 

Not sure where we left off on this one, so hopefully this is just for my own
clarification on what we see as the path forward.

I have a slight inclination towards option b), because VMMs with IOMMU
dirty tracking only care about what an IOMMU domain (its set of devices) can do.
Migration shouldn't even be attempted if one of the devices in the IOMMU
domain doesn't support it. That said, it seems we will need something like c) for
other use cases that depend on PCIe endpoint support (like PRS).

a) is what we have in the RFC, and it has the same semantics as b), except that b) adds
an explicit query API rather than an implicit failure when one of the
devices in the iommu_domain doesn't support it.

Here's an interface sketch for b) and c).

Kevin seems to be inclined into c); how about you Jason?

For b):

+
+/**
+ * enum iommufd_dirty_status_flags - Flags for dirty tracking status
+ */
+enum iommufd_dirty_status_flags {
+       IOMMU_DIRTY_TRACKING_DISABLED = 0,
+       IOMMU_DIRTY_TRACKING_ENABLED = 1 << 0,
+       IOMMU_DIRTY_TRACKING_SUPPORTED = 1 << 1,
+       IOMMU_DIRTY_TRACKING_UNSUPPORTED = 1 << 2,
+};
+
+/**
+ * struct iommu_hwpt_get_dirty - ioctl(IOMMU_HWPT_GET_DIRTY)
+ * @size: sizeof(struct iommu_hwpt_get_dirty)
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ * @out_status: status of dirty tracking support (see iommu_dirty_status_flags)
+ *
+ * Get dirty tracking status on an HW pagetable.
+ */
+struct iommu_hwpt_get_dirty {
+       __u32 size;
+       __u32 hwpt_id;
+       __u16 out_status;
+       __u16 __reserved;
+};
+#define IOMMU_HWPT_GET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY)

The IOMMU implementation reports whether dirty tracking is enabled/disabled and
supported/unsupported for the set of devices in the iommu domain. After dirty tracking is
enabled we are supposed to fail device attach for any IOMMU that doesn't support it; that
is supposed to happen anyway, regardless of which approach we pick.
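
From the VMM side the check would then be roughly (a minimal sketch against the layout
proposed above; error handling kept to a minimum):

static bool hwpt_dirty_supported(int iommufd, __u32 hwpt_id)
{
	struct iommu_hwpt_get_dirty get = {
		.size = sizeof(get),
		.hwpt_id = hwpt_id,
	};

	if (ioctl(iommufd, IOMMU_HWPT_GET_DIRTY, &get))
		return false;

	return get.out_status & IOMMU_DIRTY_TRACKING_SUPPORTED;
}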

For c):

+
+/**
+ * enum iommufd_device_caps
+ * @IOMMU_CAP_DIRTY_TRACKING: IOMMU device support for dirty tracking
+ */
+enum iommufd_device_caps {
+       IOMMUFD_CAP_DIRTY_TRACKING = 1 << 0,
+};
+
+/*
+ * struct iommu_device_caps - ioctl(IOMMU_DEVICE_GET_CAPS)
+ * @size: sizeof(struct iommu_device_caps)
+ * @dev_id: the device to query
+ * @caps: IOMMU capabilities of the device
+ */
+struct iommu_device_caps {
+       __u32 size;
+       __u32 dev_id;
+       __aligned_u64 caps;
+};
+#define IOMMU_DEVICE_GET_CAPS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DEVICE_GET_CAPS)

Returns a hardware-agnostic view of the IOMMU 'capabilities' of the device. @dev_id
is supposed to be an iommufd_device object id. The VMM is supposed to store the dev_ids,
iterate over them, and check every one for dirty tracking support prior to set_dirty.
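
For c) that iteration would be something like (again just a sketch against the layout above):

static bool devices_support_dirty(int iommufd, __u32 *dev_ids, unsigned int ndevs)
{
	unsigned int i;

	for (i = 0; i < ndevs; i++) {
		struct iommu_device_caps caps = {
			.size = sizeof(caps),
			.dev_id = dev_ids[i],
		};

		if (ioctl(iommufd, IOMMU_DEVICE_GET_CAPS, &caps) ||
		    !(caps.caps & IOMMUFD_CAP_DIRTY_TRACKING))
			return false;
	}

	return true;
}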

^ permalink raw reply	[flat|nested] 209+ messages in thread

* Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
  2022-07-12 18:34                     ` Joao Martins
@ 2022-07-21 14:24                       ` Jason Gunthorpe
  0 siblings, 0 replies; 209+ messages in thread
From: Jason Gunthorpe @ 2022-07-21 14:24 UTC (permalink / raw)
  To: Joao Martins
  Cc: Tian, Kevin, Joerg Roedel, Suravee Suthikulpanit, Will Deacon,
	Robin Murphy, Jean-Philippe Brucker, Keqian Zhu,
	Shameerali Kolothum Thodi, David Woodhouse, Lu Baolu,
	Nicolin Chen, Yishai Hadas, Eric Auger, Liu, Yi L,
	Alex Williamson, Cornelia Huck, kvm, iommu

On Tue, Jul 12, 2022 at 07:34:48PM +0100, Joao Martins wrote:

> > In general I saw three options here:
> > 
> > a) 'try and fail' when creating the domain. It succeeds only when
> > all iommus support tracking;
> > 
> > b) capability reported on iommu domain. The capability is reported true
> > only when all iommus support tracking. This allows domain property
> > to be set after domain is created. But there is no much gain of doing
> > so when comparing to a).
> > 
> > c) capability reported on device. future compatible for heterogenous
> > platform. domain property is specified at domain creation and domains
> > can have different properties based on tracking capability of attached
> > devices.
> > 
> > I'm inclined to c) as it is more aligned to Robin's cleanup effort on
> > iommu_capable() and iommu_present() in the iommu layer which
> > moves away from global manner to per-device style. Along with 
> > that direction I guess we want to discourage adding more APIs
> > assuming 'all iommus supporting certain capability' thing?
> > 
> 
> Not sure where we are left off on this one, so hopefully just for my own
> clarification on what we see is the path forward.

I prefer we stick to the APIs we know we already need.

We need an API to create an iommu_domain for a device with a bunch of
parameters. "I want dirty tracking" is a very reasonable parameter to
put here. This can support "try and fail" if we want.
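
Purely as an illustration of that "try and fail" flow (the ioctl, struct and flag names
below are hypothetical; this command doesn't exist in this series yet):

	struct iommu_hwpt_alloc alloc = {		/* hypothetical ioctl/struct */
		.size = sizeof(alloc),
		.dev_id = dev_id,
		.flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING,	/* hypothetical flag */
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc)) {
		/* no dirty tracking available; retry without it and skip migration */
		alloc.flags = 0;
		if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
			return -1;
	}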

We certainly need "c"; somehow the userspace needs to know what inputs
the create domain call will accept - minimally it needs to know which
IOMMU driver is under the device so it knows how to ask for a user
space page table. This could also report other parameters that are
interesting, like "device could support dirty tracking".

Having the domain set to dirty tracking also means that it will refuse
to attach to any device that doesn't have dirty tracking support (eg
if there is non-uniformity among the iommus) - this composes well with
the EMEDIUMTYPE work.

So, the only change I would make to your proposal is to move this to the
create domain command, which we should really pull out of one of the HW
branches and lock down RSN.

Jason

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-05-05  9:52           ` Joao Martins
  (?)
@ 2022-08-29  9:59           ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29  9:59 UTC (permalink / raw)
  To: Joao Martins
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Eric Auger, Liu,
	Yi L, Alex Williamson, Cornelia Huck, kvm, iommu, jiangkunkun,
	Tian, Kevin



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 05 May 2022 10:53
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>; David
> Woodhouse <dwmw2@infradead.org>; Lu Baolu <baolu.lu@linux.intel.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
> Yishai Hadas <yishaih@nvidia.com>; Eric Auger <eric.auger@redhat.com>;
> Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> kvm@vger.kernel.org; iommu@lists.linux-foundation.org; jiangkunkun
> <jiangkunkun@huawei.com>; Tian, Kevin <kevin.tian@intel.com>
> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> On 5/5/22 08:25, Shameerali Kolothum Thodi wrote:
> >> -----Original Message-----
> >> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> >> Sent: 29 April 2022 12:05
> >> To: Tian, Kevin <kevin.tian@intel.com>
> >> Cc: Joerg Roedel <joro@8bytes.org>; Suravee Suthikulpanit
> >> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> >> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> >> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> >> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> >> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> >> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen
> >> <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Eric Auger
> >> <eric.auger@redhat.com>; Liu, Yi L <yi.l.liu@intel.com>; Alex Williamson
> >> <alex.williamson@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> >> kvm@vger.kernel.org; iommu@lists.linux-foundation.org
> >> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> >> set_dirty_tracking_range() support
> >>
> >> On 4/29/22 09:28, Tian, Kevin wrote:
> >>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>> Sent: Friday, April 29, 2022 5:09 AM
> >>>>
> >>>> Similar to .read_and_clear_dirty() use the page table
> >>>> walker helper functions and set DBM|RDONLY bit, thus
> >>>> switching the IOPTE to writeable-clean.
> >>>
> >>> this should not be one-off if the operation needs to be
> >>> applied to IOPTE. Say a map request comes right after
> >>> set_dirty_tracking() is called. If it's agreed to remove
> >>> the range op then smmu driver should record the tracking
> >>> status internally and then apply the modifier to all the new
> >>> mappings automatically before dirty tracking is disabled.
> >>> Otherwise the same logic needs to be kept in iommufd to
> >>> call set_dirty_tracking_range() explicitly for every new
> >>> iopt_area created within the tracking window.
> >>
> >> Gah, I totally missed that by mistake. New mappings aren't
> >> carrying over the "DBM is set". This needs a new io-pgtable
> >> quirk added post dirty-tracking toggling.
> >>
> >> I can adjust, but I am at odds on including this in a future
> >> iteration given that I can't really test any of this stuff.
> >> Might drop the driver until I have hardware/emulation I can
> >> use (or maybe others can take over this). It was included
> >> for revising the iommu core ops and whether iommufd was
> >> affected by it.
> >
> > [+Kunkun Jiang]. I think he is now looking into this and might have
> > a test setup to verify this.
> 
> I'll keep him CC'ed next iterations. Thanks!
> 
> FWIW, this should change a bit on the next iteration (simpler)
> by always enabling DBM from the start. SMMUv3 ::set_dirty_tracking()
> becomes a simpler function that tests quirks (i.e. DBM set) and what not,
> and calls read_and_clear_dirty() without a bitmap argument to clear dirties.

Hi Joao,

Hope we will soon have a revised spin of this series. In the meantime, I tried to
hack the QEMU vSMMUv3 to emulate the support required to test this, as
access to hardware is very limited. I managed to put together a just-enough setup
to cover the ARM side of this series. Based on the test coverage I had
and going through the code, please see my comments on a few of the
patches in this series. Hope it will be helpful when you attempt a re-spin.

Thanks,
Shameer
 

^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  2022-04-28 21:09   ` Joao Martins
  (?)
@ 2022-08-29  9:59   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29  9:59 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm,
	jiangkunkun



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:09
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; jiangkunkun
> <jiangkunkun@huawei.com>
> Subject: [PATCH RFC 14/19] iommu/arm-smmu-v3: Add
> read_and_clear_dirty() support
> 
> .read_and_clear_dirty() IOMMU domain op takes care of
> reading the dirty bits (i.e. PTE has both DBM and AP[2] set)
> and marshalling into a bitmap of a given page size.
> 
> While reading the dirty bits we also clear the PTE AP[2]
> bit to mark it as writable-clean.
> 
> Structure it in a way that the IOPTE walker is generic,
> and so we pass a function pointer over what to do on a per-PTE
> basis. This is useful for a followup patch where we supply an
> io-pgtable op to enable DBM when starting/stopping dirty tracking.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Co-developed-by: Kunkun Jiang <jiangkunkun@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  27 ++++++
>  drivers/iommu/io-pgtable-arm.c              | 102
> +++++++++++++++++++-
>  2 files changed, 128 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 4dba53bde2e3..232057d20197 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2743,6 +2743,32 @@ static int arm_smmu_enable_nesting(struct
> iommu_domain *domain)
>  	return ret;
>  }
> 
> +static int arm_smmu_read_and_clear_dirty(struct iommu_domain
> *domain,
> +					 unsigned long iova, size_t size,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct arm_smmu_domain *smmu_domain =
> to_smmu_domain(domain);
> +	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	int ret;
> +
> +	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
> +	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
> +		return -ENODEV;
> +
> +	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
> +		return -EINVAL;
> +
> +	if (!ops || !ops->read_and_clear_dirty) {
> +		pr_err_once("io-pgtable don't support dirty tracking\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = ops->read_and_clear_dirty(ops, iova, size, dirty);
> +
> +	return ret;
> +}
> +
>  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args
> *args)
>  {
>  	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2871,6 +2897,7 @@ static struct iommu_ops arm_smmu_ops = {
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
>  		.enable_nesting		= arm_smmu_enable_nesting,
>  		.free			= arm_smmu_domain_free,
> +		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
>  	}
>  };
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c
> b/drivers/iommu/io-pgtable-arm.c
> index 94ff319ae8ac..3c99028d315a 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -75,6 +75,7 @@
> 
>  #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
>  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
> +#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
>  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
>  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
>  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> @@ -84,7 +85,7 @@
> 
>  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
>  /* Ignore the contiguous bit for block splitting */
> -#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> +#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)13) << 51)
>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK
> |	\
>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
>  /* Software bit for solving coherency races */
> @@ -93,6 +94,9 @@
>  /* Stage-1 PTE */
>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
>  #define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
> +#define ARM_LPAE_PTE_AP_RDONLY_BIT	7
> +#define ARM_LPAE_PTE_AP_WRITABLE	(ARM_LPAE_PTE_AP_RDONLY | \
> +					 ARM_LPAE_PTE_DBM)
>  #define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
>  #define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
> 
> @@ -737,6 +741,101 @@ static phys_addr_t arm_lpae_iova_to_phys(struct
> io_pgtable_ops *ops,
>  	return iopte_to_paddr(pte, data) | iova;
>  }
> 
> +static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
> +					   arm_lpae_iopte *ptep, void *opaque)
> +{
> +	struct iommu_dirty_bitmap *dirty = opaque;
> +	arm_lpae_iopte pte;
> +
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if (pte & ARM_LPAE_PTE_AP_WRITABLE)
> +		return 0;

We might have set ARM_LPAE_PTE_DBM already. So does the above need to be:
if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE) ?
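
i.e. the states the walker has to tell apart would roughly be (a sketch, assuming the driver
sets DBM on tracked mappings and the usual DBM behaviour where hardware clears AP[2] on a
write; the check itself is the suggested one above):

	/*
	 * DBM=1, AP[2]=1 -> writable-clean, nothing to record
	 * DBM=1, AP[2]=0 -> HW cleared AP[2] on a write, i.e. dirty
	 * DBM=0          -> not tracked, caught by the DBM check below
	 */
	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE)
		return 0;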

Thanks,
Shameer

> +
> +	if (!(pte & ARM_LPAE_PTE_DBM))
> +		return 0;
> +
> +	iommu_dirty_bitmap_record(dirty, iova, size);
> +	set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
> +	return 0;
> +}
> +
> +static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
> +				 unsigned long iova, size_t size,
> +				 int lvl, arm_lpae_iopte *ptep,
> +				 int (*fn)(unsigned long iova, size_t size,
> +					   arm_lpae_iopte *pte, void *opaque),
> +				 void *opaque)
> +{
> +	arm_lpae_iopte pte;
> +	struct io_pgtable *iop = &data->iop;
> +	size_t base, next_size;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(!fn))
> +		return -EINVAL;
> +
> +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> +		return -EINVAL;
> +
> +	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
> +		if (iopte_leaf(pte, lvl, iop->fmt))
> +			return fn(iova, size, ptep, opaque);
> +
> +		/* Current level is table, traverse next level */
> +		next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
> +		ptep = iopte_deref(pte, data);
> +		for (base = 0; base < size; base += next_size) {
> +			ret = __arm_lpae_iopte_walk(data, iova + base,
> +						    next_size, lvl + 1, ptep,
> +						    fn, opaque);
> +			if (ret)
> +				return ret;
> +		}
> +		return 0;
> +	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
> +		return fn(iova, size, ptep, opaque);
> +	}
> +
> +	/* Keep on walkin */
> +	ptep = iopte_deref(pte, data);
> +	return __arm_lpae_iopte_walk(data, iova, size, lvl + 1, ptep,
> +				     fn, opaque);
> +}
> +
> +static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
> +					 unsigned long iova, size_t size,
> +					 struct iommu_dirty_bitmap *dirty)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return -EINVAL;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext))
> +		return -EINVAL;
> +
> +	if (data->iop.fmt != ARM_64_LPAE_S1 &&
> +	    data->iop.fmt != ARM_32_LPAE_S1)
> +		return -EINVAL;
> +
> +	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
> +				     __arm_lpae_read_and_clear_dirty, dirty);
> +}
> +
>  static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>  {
>  	unsigned long granule, page_sizes;
> @@ -817,6 +916,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>  		.unmap		= arm_lpae_unmap,
>  		.unmap_pages	= arm_lpae_unmap_pages,
>  		.iova_to_phys	= arm_lpae_iova_to_phys,
> +		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
>  	};
> 
>  	return data;
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  2022-04-28 21:09   ` Joao Martins
  (?)
  (?)
@ 2022-08-29 10:00   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29 10:00 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:09
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org
> Subject: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> Similar to .read_and_clear_dirty() use the page table
> walker helper functions and set DBM|RDONLY bit, thus
> switching the IOPTE to writeable-clean.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 ++++++++++++
>  drivers/iommu/io-pgtable-arm.c              | 52
> +++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 232057d20197..1ca72fcca930 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2769,6 +2769,34 @@ static int
> arm_smmu_read_and_clear_dirty(struct iommu_domain *domain,
>  	return ret;
>  }
> 
> +static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
> +				       unsigned long iova, size_t size,
> +				       struct iommu_iotlb_gather *iotlb_gather,
> +				       bool enabled)
> +{
> +	struct arm_smmu_domain *smmu_domain =
> to_smmu_domain(domain);
> +	struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	int ret;
> +
> +	if (!(smmu->features & ARM_SMMU_FEAT_HD) ||
> +	    !(smmu->features & ARM_SMMU_FEAT_BBML2))
> +		return -ENODEV;
> +
> +	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
> +		return -EINVAL;
> +
> +	if (!ops || !ops->set_dirty_tracking) {
> +		pr_err_once("io-pgtable don't support dirty tracking\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = ops->set_dirty_tracking(ops, iova, size, enabled);
> +	iommu_iotlb_gather_add_range(iotlb_gather, iova, size);
> +
> +	return ret;
> +}
> +
>  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args
> *args)
>  {
>  	return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -2898,6 +2926,7 @@ static struct iommu_ops arm_smmu_ops = {
>  		.enable_nesting		= arm_smmu_enable_nesting,
>  		.free			= arm_smmu_domain_free,
>  		.read_and_clear_dirty	= arm_smmu_read_and_clear_dirty,
> +		.set_dirty_tracking_range = arm_smmu_set_dirty_tracking,
>  	}
>  };
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c
> b/drivers/iommu/io-pgtable-arm.c
> index 3c99028d315a..361410aa836c 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -76,6 +76,7 @@
>  #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
>  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
>  #define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
> +#define ARM_LPAE_PTE_DBM_BIT		51
>  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
>  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
>  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> @@ -836,6 +837,56 @@ static int arm_lpae_read_and_clear_dirty(struct
> io_pgtable_ops *ops,
>  				     __arm_lpae_read_and_clear_dirty, dirty);
>  }
> 
> +static int __arm_lpae_set_dirty_modifier(unsigned long iova, size_t size,
> +					 arm_lpae_iopte *ptep, void *opaque)
> +{
> +	bool enabled = *((bool *) opaque);
> +	arm_lpae_iopte pte;
> +
> +	pte = READ_ONCE(*ptep);
> +	if (WARN_ON(!pte))
> +		return -EINVAL;
> +
> +	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) ==
> ARM_LPAE_PTE_AP_RDONLY)
> +		return -EINVAL;
> +
> +	if (!(enabled ^ !(pte & ARM_LPAE_PTE_DBM)))
> +		return 0;

Does the above need to be a double negation?

if (!(enabled ^ !!(pte & ARM_LPAE_PTE_DBM)))

Thanks,
Shameer

> +
> +	pte = enabled ? pte | (ARM_LPAE_PTE_DBM |
> ARM_LPAE_PTE_AP_RDONLY) :
> +		pte & ~(ARM_LPAE_PTE_DBM | ARM_LPAE_PTE_AP_RDONLY);
> +
> +	WRITE_ONCE(*ptep, pte);
> +	return 0;
> +}
> +
> +
> +static int arm_lpae_set_dirty_tracking(struct io_pgtable_ops *ops,
> +				       unsigned long iova, size_t size,
> +				       bool enabled)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	if (WARN_ON(!size || (size & cfg->pgsize_bitmap) != size))
> +		return -EINVAL;
> +
> +	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> +		iaext = ~iaext;
> +	if (WARN_ON(iaext))
> +		return -EINVAL;
> +
> +	if (data->iop.fmt != ARM_64_LPAE_S1 &&
> +	    data->iop.fmt != ARM_32_LPAE_S1)
> +		return -EINVAL;
> +
> +	return __arm_lpae_iopte_walk(data, iova, size, lvl, ptep,
> +				     __arm_lpae_set_dirty_modifier, &enabled);
> +}
> +
>  static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
>  {
>  	unsigned long granule, page_sizes;
> @@ -917,6 +968,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>  		.unmap_pages	= arm_lpae_unmap_pages,
>  		.iova_to_phys	= arm_lpae_iova_to_phys,
>  		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
> +		.set_dirty_tracking   = arm_lpae_set_dirty_tracking,
>  	};
> 
>  	return data;
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
  2022-04-28 21:09   ` Joao Martins
  (?)
  (?)
@ 2022-08-29 10:00   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29 10:00 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm,
	jiangkunkun



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:10
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; jiangkunkun
> <jiangkunkun@huawei.com>
> Subject: [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1
> with io-pgtable mapping
> 
> From: Kunkun Jiang <jiangkunkun@huawei.com>
> 
> As nested mode is not upstreamed now, we just aim to support dirty log
> tracking for stage1 with io-pgtable mapping (meaning SVA mappings are not
> supported). If HTTU is supported, we enable HA/HD bits in the SMMU CD and
> transfer ARM_HD quirk to io-pgtable.
> 
> We additionally filter out HD|HA if not supported. The CD.HD bit is not
> particularly useful unless we toggle the DBM bit in the PTE entries.
> 
> Co-developed-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
> [joaomart: Convey HD|HA bits over to the context descriptor and update commit message]
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  include/linux/io-pgtable.h                  |  1 +
>  3 files changed, 15 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 1ca72fcca930..5f728f8f20a2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1077,10 +1077,18 @@ int arm_smmu_write_ctx_desc(struct
> arm_smmu_domain *smmu_domain, int ssid,
>  		 * this substream's traffic
>  		 */
>  	} else { /* (1) and (2) */
> +		struct arm_smmu_device *smmu = smmu_domain->smmu;
> +		u64 tcr = cd->tcr;
> +
>  		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
>  		cdptr[2] = 0;
>  		cdptr[3] = cpu_to_le64(cd->mair);
> 
> +		if (!(smmu->features & ARM_SMMU_FEAT_HD))
> +			tcr &= ~CTXDESC_CD_0_TCR_HD;
> +		if (!(smmu->features & ARM_SMMU_FEAT_HA))
> +			tcr &= ~CTXDESC_CD_0_TCR_HA;
> +
>  		/*
>  		 * STE is live, and the SMMU might read dwords of this CD in any
>  		 * order. Ensure that it observes valid values before reading @@
> -2100,6 +2108,7 @@ static int arm_smmu_domain_finalise_s1(struct
> arm_smmu_domain *smmu_domain,
>  			  FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, tcr->orgn) |
>  			  FIELD_PREP(CTXDESC_CD_0_TCR_SH0, tcr->sh) |
>  			  FIELD_PREP(CTXDESC_CD_0_TCR_IPS, tcr->ips) |
> +			  CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
>  			  CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
>  	cfg->cd.mair	= pgtbl_cfg->arm_lpae_s1_cfg.mair;
> 
> @@ -2203,6 +2212,8 @@ static int arm_smmu_domain_finalise(struct
> iommu_domain *domain,
>  		.iommu_dev	= smmu->dev,
>  	};
> 
> +	if (smmu->features & ARM_SMMU_FEAT_HD)
> +		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_HD;

Setting these quirk bits requires updating the check in arm_64_lpae_alloc_pgtable_s1()
in drivers/iommu/io-pgtable-arm.c
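
i.e. roughly something like the below in arm_64_lpae_alloc_pgtable_s1() (sketch only; the
exact set of quirks accepted there, including the new BBML ones, may differ):

	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
			    IO_PGTABLE_QUIRK_NON_STRICT |
			    IO_PGTABLE_QUIRK_ARM_TTBR1 |
			    IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
			    IO_PGTABLE_QUIRK_ARM_HD))
		return NULL;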

Thanks,
Shameer

>  	if (smmu->features & ARM_SMMU_FEAT_BBML1)
>  		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_BBML1;
>  	else if (smmu->features & ARM_SMMU_FEAT_BBML2) diff --git
> a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index e15750be1d95..ff32242f2fdb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -292,6 +292,9 @@
>  #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
>  #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
> 
> +#define CTXDESC_CD_0_TCR_HA            (1UL << 43)
> +#define CTXDESC_CD_0_TCR_HD            (1UL << 42)
> +
>  #define CTXDESC_CD_0_AA64		(1UL << 41)
>  #define CTXDESC_CD_0_S			(1UL << 44)
>  #define CTXDESC_CD_0_R			(1UL << 45)
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h index
> d7626ca67dbf..a11902ae9cf1 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>  	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA	BIT(6)
>  	#define IO_PGTABLE_QUIRK_ARM_BBML1      BIT(7)
>  	#define IO_PGTABLE_QUIRK_ARM_BBML2      BIT(8)
> +	#define IO_PGTABLE_QUIRK_ARM_HD         BIT(9)
> 
>  	unsigned long			quirks;
>  	unsigned long			pgsize_bitmap;
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

* RE: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
  2022-04-28 21:09   ` Joao Martins
                     ` (2 preceding siblings ...)
  (?)
@ 2022-08-29 10:01   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 209+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-08-29 10:01 UTC (permalink / raw)
  To: Joao Martins, iommu
  Cc: Joerg Roedel, Suravee Suthikulpanit, Will Deacon, Robin Murphy,
	Jean-Philippe Brucker, zhukeqian, David Woodhouse, Lu Baolu,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian,
	Eric Auger, Yi Liu, Alex Williamson, Cornelia Huck, kvm



> -----Original Message-----
> From: Joao Martins [mailto:joao.m.martins@oracle.com]
> Sent: 28 April 2022 22:09
> To: iommu@lists.linux-foundation.org
> Cc: Joao Martins <joao.m.martins@oracle.com>; Joerg Roedel
> <joro@8bytes.org>; Suravee Suthikulpanit
> <suravee.suthikulpanit@amd.com>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; zhukeqian <zhukeqian1@huawei.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> David Woodhouse <dwmw2@infradead.org>; Lu Baolu
> <baolu.lu@linux.intel.com>; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>; Kevin Tian
> <kevin.tian@intel.com>; Eric Auger <eric.auger@redhat.com>; Yi Liu
> <yi.l.liu@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org
> Subject: [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable
> 
> Add an io_pagetable kernel API to toggle dirty tracking:
> 
> * iopt_set_dirty_tracking(iopt, [domain], state)
> 
> It receives either NULL (which means all domains) or an
> iommu_domain. The intended caller of this is via the hw_pagetable
> object that is created on device attach, which passes an
> iommu_domain. For now, the all-domains is left for vfio-compat.
> 
> The hw protection domain dirty control is favored over the IOVA-range
> alternative. For the latter, it iterates over all IOVA areas and calls
> iommu domain op to enable/disable for the range.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/iommu/iommufd/io_pagetable.c    | 71
> +++++++++++++++++++++++++
>  drivers/iommu/iommufd/iommufd_private.h |  3 ++
>  2 files changed, 74 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/io_pagetable.c
> b/drivers/iommu/iommufd/io_pagetable.c
> index f9f3b06946bf..f4609ef369e0 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -276,6 +276,77 @@ int iopt_map_user_pages(struct io_pagetable *iopt,
> unsigned long *iova,
>  	return 0;
>  }
> 
> +static int __set_dirty_tracking_range_locked(struct iommu_domain
> *domain,
> +					     struct io_pagetable *iopt,
> +					     bool enable)
> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	struct iommu_iotlb_gather gather;
> +	struct iopt_area *area;
> +	int ret = -EOPNOTSUPP;
> +	unsigned long iova;
> +	size_t size;
> +
> +	iommu_iotlb_gather_init(&gather);
> +
> +	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
> +	     area = iopt_area_iter_next(area, 0, ULONG_MAX)) {
> +		iova = iopt_area_iova(area);
> +		size = iopt_area_last_iova(area) - iova;

   size = iopt_area_last_iova(area) - iova + 1;  ?

Thanks,
Shameer
> +
> +		if (ops->set_dirty_tracking_range) {
> +			ret = ops->set_dirty_tracking_range(domain, iova,
> +							    size, &gather,
> +							    enable);
> +			if (ret < 0)
> +				break;
> +		}
> +	}
> +
> +	iommu_iotlb_sync(domain, &gather);
> +
> +	return ret;
> +}
> +
> +static int iommu_set_dirty_tracking(struct iommu_domain *domain,
> +				    struct io_pagetable *iopt, bool enable)
> +{
> +	const struct iommu_domain_ops *ops = domain->ops;
> +	int ret = -EOPNOTSUPP;
> +
> +	if (ops->set_dirty_tracking)
> +		ret = ops->set_dirty_tracking(domain, enable);
> +	else if (ops->set_dirty_tracking_range)
> +		ret = __set_dirty_tracking_range_locked(domain, iopt,
> +							enable);
> +
> +	return ret;
> +}
> +
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable)
> +{
> +	struct iommu_domain *dom;
> +	unsigned long index;
> +	int ret = -EOPNOTSUPP;
> +
> +	down_write(&iopt->iova_rwsem);
> +	if (!domain) {
> +		down_write(&iopt->domains_rwsem);
> +		xa_for_each(&iopt->domains, index, dom) {
> +			ret = iommu_set_dirty_tracking(dom, iopt, enable);
> +			if (ret < 0)
> +				break;
> +		}
> +		up_write(&iopt->domains_rwsem);
> +	} else {
> +		ret = iommu_set_dirty_tracking(domain, iopt, enable);
> +	}
> +
> +	up_write(&iopt->iova_rwsem);
> +	return ret;
> +}
> +
>  struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long
> iova,
>  				  unsigned long *start_byte,
>  				  unsigned long length)
> diff --git a/drivers/iommu/iommufd/iommufd_private.h
> b/drivers/iommu/iommufd/iommufd_private.h
> index f55654278ac4..d00ef3b785c5 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -49,6 +49,9 @@ int iopt_unmap_iova(struct io_pagetable *iopt,
> unsigned long iova,
>  		    unsigned long length);
>  int iopt_unmap_all(struct io_pagetable *iopt);
> 
> +int iopt_set_dirty_tracking(struct io_pagetable *iopt,
> +			    struct iommu_domain *domain, bool enable);
> +
>  int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
>  		      unsigned long npages, struct page **out_pages, bool write);
>  void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
> --
> 2.17.2


^ permalink raw reply	[flat|nested] 209+ messages in thread

end of thread, other threads:[~2022-08-29 10:02 UTC | newest]

Thread overview: 209+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-28 21:09 [PATCH RFC 00/19] IOMMUFD Dirty Tracking Joao Martins
2022-04-28 21:09 ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 01/19] iommu: Add iommu_domain ops for dirty tracking Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  7:54   ` Tian, Kevin
2022-04-29  7:54     ` Tian, Kevin
2022-04-29 10:44     ` Joao Martins
2022-04-29 10:44       ` Joao Martins
2022-04-29 12:08   ` Jason Gunthorpe
2022-04-29 12:08     ` Jason Gunthorpe via iommu
2022-04-29 14:26     ` Joao Martins
2022-04-29 14:26       ` Joao Martins
2022-04-29 14:35       ` Jason Gunthorpe
2022-04-29 14:35         ` Jason Gunthorpe via iommu
2022-04-29 13:40   ` Baolu Lu
2022-04-29 13:40     ` Baolu Lu
2022-04-29 15:27     ` Joao Martins
2022-04-29 15:27       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 02/19] iommufd: Dirty tracking for io_pagetable Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  8:07   ` Tian, Kevin
2022-04-29  8:07     ` Tian, Kevin
2022-04-29 10:48     ` Joao Martins
2022-04-29 10:48       ` Joao Martins
2022-04-29 11:56     ` Jason Gunthorpe
2022-04-29 11:56       ` Jason Gunthorpe via iommu
2022-04-29 14:28       ` Joao Martins
2022-04-29 14:28         ` Joao Martins
2022-04-29 23:51   ` Baolu Lu
2022-04-29 23:51     ` Baolu Lu
2022-05-02 11:57     ` Joao Martins
2022-05-02 11:57       ` Joao Martins
2022-08-29 10:01   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 03/19] iommufd: Dirty tracking data support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  8:12   ` Tian, Kevin
2022-04-29  8:12     ` Tian, Kevin
2022-04-29 10:54     ` Joao Martins
2022-04-29 10:54       ` Joao Martins
2022-04-29 12:09       ` Jason Gunthorpe
2022-04-29 12:09         ` Jason Gunthorpe via iommu
2022-04-29 14:33         ` Joao Martins
2022-04-29 14:33           ` Joao Martins
2022-04-30  4:11   ` Baolu Lu
2022-04-30  4:11     ` Baolu Lu
2022-05-02 12:06     ` Joao Martins
2022-05-02 12:06       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 04/19] iommu: Add an unmap API that returns dirtied IOPTEs Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-30  5:12   ` Baolu Lu
2022-04-30  5:12     ` Baolu Lu
2022-05-02 12:22     ` Joao Martins
2022-05-02 12:22       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 05/19] iommufd: Add a dirty bitmap to iopt_unmap_iova() Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 12:14   ` Jason Gunthorpe
2022-04-29 12:14     ` Jason Gunthorpe via iommu
2022-04-29 14:36     ` Joao Martins
2022-04-29 14:36       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 06/19] iommufd: Dirty tracking IOCTLs for the hw_pagetable Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 07/19] iommufd/vfio-compat: Dirty tracking IOCTLs compatibility Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 12:19   ` Jason Gunthorpe
2022-04-29 12:19     ` Jason Gunthorpe via iommu
2022-04-29 14:27     ` Joao Martins
2022-04-29 14:27       ` Joao Martins
2022-04-29 14:36       ` Jason Gunthorpe via iommu
2022-04-29 14:36         ` Jason Gunthorpe
2022-04-29 14:52         ` Joao Martins
2022-04-29 14:52           ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 08/19] iommufd: Add a test for dirty tracking ioctls Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-05-31 11:34   ` Suravee Suthikulpanit via iommu
2022-05-31 11:34     ` Suravee Suthikulpanit
2022-05-31 12:15     ` Baolu Lu
2022-05-31 12:15       ` Baolu Lu
2022-05-31 15:22     ` Joao Martins
2022-05-31 15:22       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-05-31 12:39   ` Suravee Suthikulpanit
2022-05-31 12:39     ` Suravee Suthikulpanit via iommu
2022-05-31 15:51     ` Joao Martins
2022-05-31 15:51       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 11/19] iommu/amd: Print access/dirty bits if supported Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 12/19] iommu/arm-smmu-v3: Add feature detection for HTTU Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 13/19] iommu/arm-smmu-v3: Add feature detection for BBML Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 11:11   ` Robin Murphy
2022-04-29 11:11     ` Robin Murphy
2022-04-29 11:54     ` Joao Martins
2022-04-29 11:54       ` Joao Martins
2022-04-29 12:26       ` Robin Murphy
2022-04-29 12:26         ` Robin Murphy
2022-04-29 14:34         ` Joao Martins
2022-04-29 14:34           ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 14/19] iommu/arm-smmu-v3: Add read_and_clear_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-08-29  9:59   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  8:28   ` Tian, Kevin
2022-04-29  8:28     ` Tian, Kevin
2022-04-29 11:05     ` Joao Martins
2022-04-29 11:05       ` Joao Martins
2022-04-29 11:19       ` Robin Murphy
2022-04-29 11:19         ` Robin Murphy
2022-04-29 12:06         ` Joao Martins
2022-04-29 12:06           ` Joao Martins
2022-04-29 12:23           ` Jason Gunthorpe
2022-04-29 12:23             ` Jason Gunthorpe via iommu
2022-04-29 14:45             ` Joao Martins
2022-04-29 14:45               ` Joao Martins
2022-04-29 16:11               ` Jason Gunthorpe
2022-04-29 16:11                 ` Jason Gunthorpe via iommu
2022-04-29 16:40                 ` Joao Martins
2022-04-29 16:40                   ` Joao Martins
2022-04-29 16:46                   ` Jason Gunthorpe
2022-04-29 16:46                     ` Jason Gunthorpe via iommu
2022-04-29 19:20                   ` Robin Murphy
2022-04-29 19:20                     ` Robin Murphy
2022-05-02 11:52                     ` Joao Martins
2022-05-02 11:52                       ` Joao Martins
2022-05-02 11:57                       ` Joao Martins
2022-05-02 11:57                         ` Joao Martins
2022-05-05  7:25       ` Shameerali Kolothum Thodi
2022-05-05  7:25         ` Shameerali Kolothum Thodi via iommu
2022-05-05  9:52         ` Joao Martins
2022-05-05  9:52           ` Joao Martins
2022-08-29  9:59           ` Shameerali Kolothum Thodi
2022-08-29 10:00   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 16/19] iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 11:35   ` Robin Murphy
2022-04-29 11:35     ` Robin Murphy
2022-04-29 12:10     ` Joao Martins
2022-04-29 12:10       ` Joao Martins
2022-04-29 12:46       ` Robin Murphy
2022-04-29 12:46         ` Robin Murphy
2022-08-29 10:00   ` Shameerali Kolothum Thodi
2022-04-28 21:09 ` [PATCH RFC 17/19] iommu/arm-smmu-v3: Add unmap_read_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29 11:53   ` Robin Murphy
2022-04-29 11:53     ` Robin Murphy
2022-04-28 21:09 ` [PATCH RFC 18/19] iommu/intel: Access/Dirty bit support for SL domains Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  9:03   ` Tian, Kevin
2022-04-29  9:03     ` Tian, Kevin
2022-04-29 11:20     ` Joao Martins
2022-04-29 11:20       ` Joao Martins
2022-04-30  6:12   ` Baolu Lu
2022-04-30  6:12     ` Baolu Lu
2022-05-02 12:24     ` Joao Martins
2022-05-02 12:24       ` Joao Martins
2022-04-28 21:09 ` [PATCH RFC 19/19] iommu/intel: Add unmap_read_dirty() support Joao Martins
2022-04-28 21:09   ` Joao Martins
2022-04-29  5:45 ` [PATCH RFC 00/19] IOMMUFD Dirty Tracking Tian, Kevin
2022-04-29  5:45   ` Tian, Kevin
2022-04-29 10:27   ` Joao Martins
2022-04-29 10:27     ` Joao Martins
2022-04-29 12:38     ` Jason Gunthorpe
2022-04-29 12:38       ` Jason Gunthorpe via iommu
2022-04-29 15:20       ` Joao Martins
2022-04-29 15:20         ` Joao Martins
2022-05-05  7:40       ` Tian, Kevin
2022-05-05  7:40         ` Tian, Kevin
2022-05-05 14:07         ` Jason Gunthorpe
2022-05-05 14:07           ` Jason Gunthorpe via iommu
2022-05-06  3:51           ` Tian, Kevin
2022-05-06  3:51             ` Tian, Kevin
2022-05-06 11:46             ` Jason Gunthorpe
2022-05-06 11:46               ` Jason Gunthorpe via iommu
2022-05-10  1:38               ` Tian, Kevin
2022-05-10  1:38                 ` Tian, Kevin
2022-05-10 11:50                 ` Joao Martins
2022-05-10 11:50                   ` Joao Martins
2022-05-11  1:17                   ` Tian, Kevin
2022-05-11  1:17                     ` Tian, Kevin
2022-05-10 13:46                 ` Jason Gunthorpe via iommu
2022-05-10 13:46                   ` Jason Gunthorpe
2022-05-11  1:10                   ` Tian, Kevin
2022-05-11  1:10                     ` Tian, Kevin
2022-07-12 18:34                     ` Joao Martins
2022-07-21 14:24                       ` Jason Gunthorpe
2022-05-02 18:11   ` Alex Williamson
2022-05-02 18:11     ` Alex Williamson
2022-05-02 18:52     ` Jason Gunthorpe
2022-05-02 18:52       ` Jason Gunthorpe via iommu
2022-05-03 10:48       ` Joao Martins
2022-05-03 10:48         ` Joao Martins
2022-05-05  7:42       ` Tian, Kevin
2022-05-05  7:42         ` Tian, Kevin
2022-05-05 10:06         ` Joao Martins
2022-05-05 10:06           ` Joao Martins
2022-05-05 11:03           ` Tian, Kevin
2022-05-05 11:03             ` Tian, Kevin
2022-05-05 11:50             ` Joao Martins
2022-05-05 11:50               ` Joao Martins
2022-05-06  3:14               ` Tian, Kevin
2022-05-06  3:14                 ` Tian, Kevin
2022-05-05 13:55             ` Jason Gunthorpe
2022-05-05 13:55               ` Jason Gunthorpe via iommu
2022-05-06  3:17               ` Tian, Kevin
2022-05-06  3:17                 ` Tian, Kevin
