* [PATCH RFC 00/10] hw/vfio, x86/iommu: IOMMUFD Dirty Tracking
@ 2022-04-28 21:13 ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

This series expands the IOMMUFD series from Yi and Eric to support
IOMMU Dirty Tracking. It adds dirty tracking to the emulated x86
IOMMUs, as well as IOMMUFD support that exercises said emulation
(or the H/W).

It is organized into:

* Patches 1 - 4: x86 IOMMU emulation that performs dirty tracking,
useful for a nested guest exercising the IOMMUFD kernel counterpart
for coverage and development. The caching of the PASID PTE root
flags and the AMD DTE is not exactly the cleanest; I am still
wondering about a better way that avoids looking up the context/DTE
prior to the IOTLB lookup.

* Patches 5 - 9: IOMMUFD backend support for dirty tracking;
sadly this wasn't tested with VF Live Migration yet, but the last
two patches add a way to measure the dirty rate via HMP and QMP
(usage sketched right below).
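
For illustration, this presumably surfaces through QEMU's existing
dirty-rate commands (command names taken from current QEMU rather
than from this series, so treat the exact spelling as an assumption):

    (qemu) calc_dirty_rate 1
    (qemu) info dirty_rate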

The IOMMUFD kernel patches go into detail on what each of the ioctls
does, albeit the workflow is relatively simple (sketched in code
right after this list):

 1) Toggle dirty tracking via HWPT_SET_DIRTY, getting -ENOTSUPP
    otherwise, which voids any attempt at walking the IO pagetables

 2) Read-and-clear of Dirty IOVAs

 3) Unmap vIOMMU-backed IOVAs and fetch their dirty state right before unmapping.
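
A rough userspace sketch of the above, purely for illustration: the
ioctl names follow the patch titles in this series, but the actual
uAPI structures live in the imported iommufd.h (patch 5), so the
argument layouts below are an assumption, not the real interface:

    /* 1) toggle dirty tracking on the hw pagetable (hwpt) */
    ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &set_dirty_args);

    /* 2) read-and-clear dirty IOVAs into a user-supplied bitmap */
    ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_args);

    /* 3) unmap and fetch the dirty state right before the unmap */
    ioctl(iommufd, IOMMU_IOAS_UNMAP_DIRTY, &unmap_dirty_args);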

The series is still a WIP. The intent is to exercise the kernel
side[0] beyond the selftests that already provide some coverage.

Comments, feedback, as always appreciated.

	Joao

[0] https://lore.kernel.org/linux-iommu/20220428210933.3583-1-joao.m.martins@oracle.com/

Joao Martins (10):
  amd-iommu: Cache PTE/DTE info in IOTLB
  amd-iommu: Access/Dirty bit support
  intel-iommu: Cache PASID entry flags
  intel_iommu: Second Stage Access Dirty bit support
  linux-headers: import iommufd.h hwpt extensions
  vfio/iommufd: Add HWPT_SET_DIRTY support
  vfio/iommufd: Add HWPT_GET_DIRTY_IOVA support
  vfio/iommufd: Add IOAS_UNMAP_DIRTY support
  migration/dirtyrate: Expand dirty_bitmap to be tracked separately for
    devices
  hw/vfio: Add nr of dirty pages to tracepoints

 accel/kvm/kvm-all.c            |  12 +++
 hmp-commands.hx                |   5 +-
 hw/i386/amd_iommu.c            |  76 +++++++++++++-
 hw/i386/amd_iommu.h            |  11 +-
 hw/i386/intel_iommu.c          | 103 +++++++++++++++++--
 hw/i386/intel_iommu_internal.h |   4 +
 hw/i386/trace-events           |   4 +
 hw/iommufd/iommufd.c           |  63 ++++++++++++
 hw/iommufd/trace-events        |   3 +
 hw/vfio/container.c            |  20 +++-
 hw/vfio/iommufd.c              | 179 ++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events           |   3 +-
 include/exec/memory.h          |  10 +-
 include/hw/i386/intel_iommu.h  |   1 +
 include/hw/iommufd/iommufd.h   |   6 ++
 linux-headers/linux/iommufd.h  |  78 ++++++++++++++
 migration/dirtyrate.c          |  59 ++++++++---
 migration/dirtyrate.h          |   1 +
 qapi/migration.json            |  15 +++
 softmmu/memory.c               |   5 +
 20 files changed, 618 insertions(+), 40 deletions(-)

-- 
2.17.2


* [PATCH RFC 01/10] amd-iommu: Cache PTE/DTE info in IOTLB
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On a successful translation, cache the PTE and DTE
flags set at the time of the translation, i.e. the first 12 bits,
as well as the PTE storage. These bits contain, for example, the
read, write, dirty and access flags. In theory the DTE lookup takes
precedence in the translation path, but in the interest of
performance extend AMDVIIOTLBEntry to include that information.

This is preparatory for AMD HDSup/HASup, which requires updating
the A/D bits of the PTE (even after its insertion in the IOTLB)
based on whether the HAD bits (0x3 or 0x1) were set on the Device
Table Entry.
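
For orientation, the cached flags are the low, non-address bits of
DTE[63:0]; an illustrative decode (names and positions paraphrased
from the AMD IOMMU spec, Table 7 -- they are not macros from this
patch):

    #define DTE_V        (1ULL << 0)  /* valid */
    #define DTE_TV       (1ULL << 1)  /* translation info valid */
    #define DTE_HAD_MASK (3ULL << 7)  /* host access/dirty enable */
    #define DTE_MODE     (7ULL << 9)  /* paging mode (level) */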

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/amd_iommu.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index ea8eaeb330b6..25b5c3be70ea 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -72,6 +72,9 @@ typedef struct AMDVIIOTLBEntry {
     uint64_t perms;             /* access permissions  */
     uint64_t translated_addr;   /* translated address  */
     uint64_t page_mask;         /* physical page size  */
+    uint16_t dte_flags;         /* device table entry flags */
+    uint64_t pte;               /* pte entry */
+    uint64_t pte_addr;          /* pte entry iova */
 } AMDVIIOTLBEntry;
 
 /* configure MMIO registers at startup/reset */
@@ -340,7 +343,8 @@ static void amdvi_iotlb_remove_page(AMDVIState *s, hwaddr addr,
 
 static void amdvi_update_iotlb(AMDVIState *s, uint16_t devid,
                                uint64_t gpa, IOMMUTLBEntry to_cache,
-                               uint16_t domid)
+                               uint16_t domid, uint16_t dte_flags,
+                               uint64_t pte, uint64_t pte_addr)
 {
     AMDVIIOTLBEntry *entry = g_new(AMDVIIOTLBEntry, 1);
     uint64_t *key = g_new(uint64_t, 1);
@@ -359,6 +363,9 @@ static void amdvi_update_iotlb(AMDVIState *s, uint16_t devid,
         entry->perms = to_cache.perm;
         entry->translated_addr = to_cache.translated_addr;
         entry->page_mask = to_cache.addr_mask;
+        entry->dte_flags = dte_flags;
+        entry->pte = pte;
+        entry->pte_addr = pte_addr;
         *key = gfn | ((uint64_t)(devid) << AMDVI_DEVID_SHIFT);
         g_hash_table_replace(s->iotlb, key, entry);
     }
@@ -896,7 +903,8 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
 
 static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
                             IOMMUTLBEntry *ret, unsigned perms,
-                            hwaddr addr)
+                            hwaddr addr, uint64_t *iotlb_pte,
+                            uint64_t *iotlb_pte_addr)
 {
     unsigned level, present, pte_perms, oldlevel;
     uint64_t pte = dte[0], pte_addr, page_mask;
@@ -945,6 +953,8 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
         ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
         ret->addr_mask = ~page_mask;
         ret->perm = amdvi_get_perms(pte);
+        *iotlb_pte = pte;
+        *iotlb_pte_addr = addr;
         return;
     }
 no_remap:
@@ -952,6 +962,8 @@ no_remap:
     ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
     ret->addr_mask = ~AMDVI_PAGE_MASK_4K;
     ret->perm = amdvi_get_perms(pte);
+    *iotlb_pte = pte;
+    *iotlb_pte_addr = addr;
 }
 
 static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
@@ -960,7 +972,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
     AMDVIState *s = as->iommu_state;
     uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
     AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
-    uint64_t entry[4];
+    uint64_t entry[4], pte, pte_addr;
 
     if (iotlb_entry) {
         trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
@@ -982,10 +994,12 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
     }
 
     amdvi_page_walk(as, entry, ret,
-                    is_write ? AMDVI_PERM_WRITE : AMDVI_PERM_READ, addr);
+                    is_write ? AMDVI_PERM_WRITE : AMDVI_PERM_READ, addr,
+                    &pte, &pte_addr);
 
     amdvi_update_iotlb(s, devid, addr, *ret,
-                       entry[1] & AMDVI_DEV_DOMID_ID_MASK);
+                       entry[1] & AMDVI_DEV_DOMID_ID_MASK,
+                       entry[0] & ~AMDVI_DEV_PT_ROOT_MASK, pte, pte_addr);
     return;
 
 out:
-- 
2.17.2


* [PATCH RFC 02/10] amd-iommu: Access/Dirty bit support
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

The IOMMU advertises Access/Dirty bits if the extended feature
register reports it. Relevant AMD IOMMU SDM ref[0]:
"1.3.8 Enhanced Support for Access and Dirty Bits"

To enable it we set the HAD field in DTE bits 7 and 8 to enable
access, or access+dirty. With that, the IOMMU starts
marking the D and A flags on every Memory Request or ATS
translation request.  Relevant AMD IOMMU SDM ref [0],
"Table 7. Device Table Entry (DTE) Field Definitions",
particularly the entry "HAD". The cached DTE information
is then used in amdvi_had_update() both when we do an IO
page walk and when we find an IOTLB entry for it.
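
Enabling amounts to programming that two-bit HAD field; a sketch of
the guest-side encoding, using the macros this patch adds below:

    dte[0] |= AMDVI_DEV_HAEN;   /* HAD=01b: update the Access bit only */
    dte[0] |= AMDVI_DEV_HADEN;  /* HAD=11b: update Access and Dirty */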

Actually toggling it on and off is relatively simple, as it amounts
to setting those 2 bits in the DTE and flushing the device's DTE
cache. The information is then cleared and set again on the next
cached device context or IOVA.

Worthwhile sections from AMD IOMMU SDM:

"2.2.3.1 Host Access Support"
"2.2.3.2 Host Dirty Support"

For details on how the IOMMU hardware updates the dirty bit, and
what it expects from its subsequent clearing by the CPU, see:

"2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables"
"2.2.7.5 Clearing Accessed and Dirty Bits"

This is useful to help prototyping of IOMMU dirty tracking,
particularly on the IOMMUFD and VFIO sides.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/amd_iommu.c  | 52 ++++++++++++++++++++++++++++++++++++++++++++
 hw/i386/amd_iommu.h  | 11 ++++++++--
 hw/i386/trace-events |  2 ++
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 25b5c3be70ea..7f48a2601579 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -24,6 +24,7 @@
 #include "hw/i386/pc.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/pci_bus.h"
+#include "hw/qdev-properties.h"
 #include "migration/vmstate.h"
 #include "amd_iommu.h"
 #include "qapi/error.h"
@@ -901,6 +902,48 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
     return pte;
 }
 
+static inline int amdvi_set_pte_entry(AMDVIState *s, uint64_t pte_addr,
+                                      uint16_t devid, uint64_t pte)
+{
+    if (dma_memory_write(&address_space_memory, pte_addr, &pte, sizeof(pte),
+                         MEMTXATTRS_UNSPECIFIED)) {
+        trace_amdvi_get_pte_hwerror(pte_addr);
+        amdvi_log_pagetab_error(s, devid, pte_addr, 0);
+        return -EINVAL;
+    }
+    return 0;
+}
+
+/*
+ * Checks if A/D bits need to be updated.
+ * It can only be called when PTE permissions have been
+ * validated against the transaction-requested ones.
+ */
+static bool amdvi_had_update(AMDVIAddressSpace *as, uint64_t dte,
+                             uint64_t *pte, unsigned perms)
+{
+    bool is_write = perms & AMDVI_PERM_WRITE;
+    bool dirty, access;
+
+    dirty = access = false;
+
+    if (is_write && (dte & AMDVI_DEV_HADEN) &&
+        !(*pte & AMDVI_DEV_PERM_DIRTY)) {
+        *pte |= AMDVI_DEV_PERM_DIRTY;
+        trace_amdvi_hd_update(*pte);
+        dirty = true;
+    }
+
+    if ((!is_write || dirty) && (dte & AMDVI_DEV_HAEN) &&
+        !(*pte & AMDVI_DEV_PERM_ACCESS)) {
+        *pte |= AMDVI_DEV_PERM_ACCESS;
+        trace_amdvi_ha_update(*pte);
+        access = true;
+    }
+
+    return dirty || access;
+}
+
 static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
                             IOMMUTLBEntry *ret, unsigned perms,
                             hwaddr addr, uint64_t *iotlb_pte,
@@ -948,6 +991,11 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
             page_mask = pte_get_page_mask(oldlevel);
         }
 
+        if (amdvi_had_update(as, dte[0], &pte, perms)) {
+            amdvi_set_pte_entry(as->iommu_state, pte_addr, as->devfn,
+                                cpu_to_le64(pte));
+        }
+
         /* get access permissions from pte */
         ret->iova = addr & page_mask;
         ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
@@ -977,6 +1025,10 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
     if (iotlb_entry) {
         trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
                 PCI_FUNC(devid), addr, iotlb_entry->translated_addr);
+        if (amdvi_had_update(as, iotlb_entry->dte_flags,
+                             &iotlb_entry->pte, iotlb_entry->perms))
+            amdvi_set_pte_entry(as->iommu_state, iotlb_entry->pte_addr,
+                                as->devfn, cpu_to_le64(iotlb_entry->pte));
         ret->iova = addr & ~iotlb_entry->page_mask;
         ret->translated_addr = iotlb_entry->translated_addr;
         ret->addr_mask = iotlb_entry->page_mask;
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 79d38a3e4184..b794596aa07d 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -135,6 +135,10 @@
 #define AMDVI_DEV_PERM_SHIFT              61
 #define AMDVI_DEV_PERM_READ               (1ULL << 61)
 #define AMDVI_DEV_PERM_WRITE              (1ULL << 62)
+#define AMDVI_DEV_PERM_ACCESS             (1ULL << 5)
+#define AMDVI_DEV_PERM_DIRTY              (1ULL << 6)
+#define AMDVI_DEV_HADEN                   (3ULL << 7)
+#define AMDVI_DEV_HAEN                    (1ULL << 7)
 
 /* Device table entry bits 64:127 */
 #define AMDVI_DEV_DOMID_ID_MASK          ((1ULL << 16) - 1)
@@ -159,9 +163,11 @@
 #define AMDVI_FEATURE_GA                  (1ULL << 7) /* guest VAPIC support */
 #define AMDVI_FEATURE_HE                  (1ULL << 8) /* hardware error regs */
 #define AMDVI_FEATURE_PC                  (1ULL << 9) /* Perf counters       */
+#define AMDVI_FEATURE_HD                  (1ULL << 52) /* Host Dirty support */
+#define AMDVI_FEATURE_HA                  (1ULL << 49) /* Host Access        */
 
 /* reserved DTE bits */
-#define AMDVI_DTE_LOWER_QUAD_RESERVED  0x80300000000000fc
+#define AMDVI_DTE_LOWER_QUAD_RESERVED  0x803000000000006c
 #define AMDVI_DTE_MIDDLE_QUAD_RESERVED 0x0000000000000100
 #define AMDVI_DTE_UPPER_QUAD_RESERVED  0x08f0000000000000
 
@@ -176,7 +182,8 @@
 /* extended feature support */
 #define AMDVI_EXT_FEATURES (AMDVI_FEATURE_PREFETCH | AMDVI_FEATURE_PPR | \
         AMDVI_FEATURE_IA | AMDVI_FEATURE_GT | AMDVI_FEATURE_HE | \
-        AMDVI_GATS_MODE | AMDVI_HATS_MODE | AMDVI_FEATURE_GA)
+        AMDVI_GATS_MODE | AMDVI_HATS_MODE | AMDVI_FEATURE_GA | \
+        AMDVI_FEATURE_HD | AMDVI_FEATURE_HA)
 
 /* capabilities header */
 #define AMDVI_CAPAB_FEATURES (AMDVI_CAPAB_FLAT_EXT | \
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index e49814dd642d..eb5f075873cd 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -107,6 +107,8 @@ amdvi_ir_intctl(uint8_t val) "int_ctl 0x%"PRIx8
 amdvi_ir_target_abort(const char *str) "%s"
 amdvi_ir_delivery_mode(const char *str) "%s"
 amdvi_ir_irte_ga_val(uint64_t hi, uint64_t lo) "hi 0x%"PRIx64" lo 0x%"PRIx64
+amdvi_ha_update(uint64_t pte) "pte 0x%"PRIx64
+amdvi_hd_update(uint64_t pte) "pte 0x%"PRIx64
 
 # vmport.c
 vmport_register(unsigned char command, void *func, void *opaque) "command: 0x%02x func: %p opaque: %p"
-- 
2.17.2


* [PATCH RFC 03/10] intel-iommu: Cache PASID entry flags
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On a successful translation, cache the PASID Table entry
flags set in the context at the time, i.e. the first 12 bits.
These bits contain, for example, the read, write, dirty and
access flags.

This is preparatory for SSADS, which requires updating the A/D
bits on a translation based on whether SSADS was enabled in the
given scalable-mode PASID Table entry.
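
As an aside, VTD_SM_PASID_ENTRY_SLPTPTR is defined as (~0xfffULL),
so the masking below keeps exactly those low 12 flag bits:

    /* pe.val[0] & ~VTD_SM_PASID_ENTRY_SLPTPTR == pe.val[0] & 0xfffULL */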

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c         | 18 ++++++++++++++++--
 include/hw/i386/intel_iommu.h |  1 +
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c64aa81a83fc..752940fa4c0e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -314,7 +314,7 @@ out:
 /* Must be with IOMMU lock held */
 static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
                              uint16_t domain_id, hwaddr addr, uint64_t slpte,
-                             uint8_t access_flags, uint32_t level)
+                             uint8_t access_flags, uint32_t level, uint16_t pe)
 {
     VTDIOTLBEntry *entry = g_malloc(sizeof(*entry));
     uint64_t *key = g_malloc(sizeof(*key));
@@ -331,6 +331,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     entry->slpte = slpte;
     entry->access_flags = access_flags;
     entry->mask = vtd_slpt_level_page_mask(level);
+    entry->sm_pe_flags = pe;
     *key = vtd_get_iotlb_key(gfn, source_id, level);
     g_hash_table_replace(s->iotlb, key, entry);
 }
@@ -965,6 +966,19 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     return vtd_ce_get_slpt_base(ce);
 }
 
+static uint64_t vtd_sm_pasid_entry_flags(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (!s->root_scalable) {
+        return 0;
+    }
+
+    vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+    return pe.val[0] & (~VTD_SM_PASID_ENTRY_SLPTPTR);
+}
+
 /*
  * Rsvd field masks for spte:
  *     vtd_spte_rsvd 4k pages
@@ -1789,7 +1803,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     page_mask = vtd_slpt_level_page_mask(level);
     access_flags = IOMMU_ACCESS_FLAG(reads, writes);
     vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
-                     access_flags, level);
+                     access_flags, level, vtd_sm_pasid_entry_flags(s, &ce));
 out:
     vtd_iommu_unlock(s);
     entry->iova = addr & page_mask;
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3b5ac869db6e..11446012a94c 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -123,6 +123,7 @@ struct VTDIOTLBEntry {
     uint64_t slpte;
     uint64_t mask;
     uint8_t access_flags;
+    uint16_t sm_pe_flags;
 };
 
 /* VT-d Source-ID Qualifier types */
-- 
2.17.2


* [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

The IOMMU advertises Access/Dirty bits if the extended capability
DMAR register reports it (ECAP, mnemonic ECAP.SSADS, albeit it used
to be known as SLADS before). The first stage table, though, has no
bit for advertising Access/Dirty, unless referenced via a
scalable-mode PASID Entry. Relevant Intel IOMMU SDM refs: for the
first stage table, "3.6.2 Accessed, Extended Accessed, and Dirty
Flags"; for the second stage table, "3.7.2 Accessed and Dirty
Flags".

To enable it we depend on scalable mode for the second-stage table,
so we limit use of the dirty bit to scalable mode. To use SSADS, we
set bit 9 (SSADE) in the scalable-mode PASID Table entry. When we do
so we require flushing the context/pasid-table caches and IOTLB,
much like AMD. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"
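
A sketch of that enabling step (VTD_SM_PASID_ENTRY_SLADE is added
below as bit 9 of the scalable-mode PASID table entry):

    pe.val[0] |= VTD_SM_PASID_ENTRY_SLADE;  /* bit 9: SSADE */
    /* ...then flush context/pasid-table caches and the IOTLB */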

To read out what's dirtied, the same thing as in past
implementations is done. Dirty bit support is located in the same
location (bit 9). The IOTLB caches some attributes when SSADE is
enabled, along with the dirtiness information, so we also need to
flush the IOTLB to make sure the IOMMU attempts to set the dirty
bit again. The relevant manual coverage of the hardware translation
is chapter 6, with special mention of:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

The first 12 bits of the PTE are already cached via the SLPTE
pointer, similar to how it is done in amd-iommu. Also use the
previously added PASID entry cache in order to fetch whether the
Dirty bit was enabled or not in the SM second-stage table.

This is useful for covering and prototyping IOMMU support for access/dirty
bits.
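
For intuition, a sketch of the read-and-clear step this enables
(illustrative only: VTD_SL_D and the slpte helpers are the ones
added in this patch, but the actual scan is driven from the IOMMUFD
kernel side rather than by this code):

    /* read-and-clear the dirty state of one second-stage leaf PTE */
    uint64_t slpte = vtd_get_slpte(base_addr, index);
    if (slpte & VTD_SL_D) {
        vtd_set_slpte(base_addr, index, slpte & ~VTD_SL_D);
        /* flush the IOTLB so hardware re-sets VTD_SL_D on next write */
    }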

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c          | 85 ++++++++++++++++++++++++++++++----
 hw/i386/intel_iommu_internal.h |  4 ++
 hw/i386/trace-events           |  2 +
 3 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 752940fa4c0e..e946f793a968 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -651,6 +651,21 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
     return slpte;
 }
 
+/* Set the content of a slpte located in @base_addr[@index] */
+static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
+                              uint64_t slpte)
+{
+
+    if (dma_memory_write(&address_space_memory,
+                         base_addr + index * sizeof(slpte), &slpte,
+                         sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
+        slpte = (uint64_t)-1;
+        return slpte;
+    }
+
+    return vtd_get_slpte(base_addr, index);
+}
+
 /* Given an iova and the level of paging structure, return the offset
  * of current level.
  */
@@ -720,6 +735,11 @@ static inline bool vtd_pe_present(VTDPASIDEntry *pe)
     return pe->val[0] & VTD_PASID_ENTRY_P;
 }
 
+static inline bool vtd_pe_slad_enabled(VTDPASIDEntry *pe)
+{
+    return pe->val[0] & VTD_SM_PASID_ENTRY_SLADE;
+}
+
 static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
                                           uint32_t pasid,
                                           dma_addr_t addr,
@@ -1026,6 +1046,33 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
     return NULL;
 }
 
+static inline bool vtd_ssad_update(IntelIOMMUState *s, uint16_t pe_flags,
+                                   uint64_t *slpte, bool is_write,
+                                   bool reads, bool writes)
+{
+    bool dirty, access;
+
+    if (!(pe_flags & VTD_SM_PASID_ENTRY_SLADE)) {
+        return false;
+    }
+
+    dirty = access = false;
+
+    if (is_write && writes && !(*slpte & VTD_SL_D)) {
+        *slpte |= VTD_SL_D;
+        trace_vtd_dirty_update(*slpte);
+        dirty = true;
+    }
+
+    if (((!is_write && reads) || dirty) && !(*slpte & VTD_SL_A)) {
+        *slpte |= VTD_SL_A;
+        trace_vtd_access_update(*slpte);
+        access = true;
+    }
+
+    return dirty || access;
+}
+
 /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
@@ -1039,6 +1086,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
     uint32_t offset;
     uint64_t slpte;
     uint64_t access_right_check;
+    uint16_t pe_flags;
 
     if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
@@ -1054,14 +1102,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
-            error_report_once("%s: detected read error on DMAR slpte "
-                              "(iova=0x%" PRIx64 ")", __func__, iova);
-            if (level == vtd_get_iova_level(s, ce)) {
-                /* Invalid programming of context-entry */
-                return -VTD_FR_CONTEXT_ENTRY_INV;
-            } else {
-                return -VTD_FR_PAGING_ENTRY_INV;
-            }
+            goto inv_slpte;
         }
         *reads = (*reads) && (slpte & VTD_SL_R);
         *writes = (*writes) && (slpte & VTD_SL_W);
@@ -1081,6 +1122,14 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         }
 
         if (vtd_is_last_slpte(slpte, level)) {
+            pe_flags = vtd_sm_pasid_entry_flags(s, ce);
+            if (vtd_ssad_update(s, pe_flags, &slpte, is_write,
+                                *reads, *writes)) {
+                slpte = vtd_set_slpte(addr, offset, slpte);
+                if (slpte == (uint64_t)-1) {
+                    goto inv_slpte;
+                }
+            }
             *slptep = slpte;
             *slpte_level = level;
             return 0;
@@ -1088,6 +1137,16 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         addr = vtd_get_slpte_addr(slpte, aw_bits);
         level--;
     }
+
+inv_slpte:
+    error_report_once("%s: detected read error on DMAR slpte "
+                      "(iova=0x%" PRIx64 ")", __func__, iova);
+    if (level == vtd_get_iova_level(s, ce)) {
+        /* Invalid programming of context-entry */
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    } else {
+        return -VTD_FR_PAGING_ENTRY_INV;
+    }
 }
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEvent *event, void *private);
@@ -1742,6 +1801,13 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         slpte = iotlb_entry->slpte;
         access_flags = iotlb_entry->access_flags;
         page_mask = iotlb_entry->mask;
+        if (vtd_ssad_update(s, iotlb_entry->sm_pe_flags, &slpte, is_write,
+                            access_flags & IOMMU_RO, access_flags & IOMMU_WO)) {
+            uint32_t offset;
+
+            offset = vtd_iova_level_offset(addr, vtd_get_iova_level(s, &ce));
+            vtd_set_slpte(addr, offset, slpte);
+        }
         goto out;
     }
 
@@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
     if (s->scalable_mode) {
-        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
+                   VTD_ECAP_SLADS;
     }
 
     if (s->snoop_control) {
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 1ff13b40f9bb..c00f6e7b4a72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,6 +192,7 @@
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_SMTS               (1ULL << 43)
+#define VTD_ECAP_SLADS              (1ULL << 45)
 #define VTD_ECAP_SLTS               (1ULL << 46)
 
 /* CAP_REG */
@@ -492,6 +493,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_SLADE       (1ULL << 9)
 
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
@@ -515,5 +517,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SL_PT_BASE_ADDR_MASK(aw) (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK(aw))
 #define VTD_SL_IGN_COM              0xbff0000000000000ULL
 #define VTD_SL_TM                   (1ULL << 62)
+#define VTD_SL_A                    (1ULL << 8)
+#define VTD_SL_D                    (1ULL << 9)
 
 #endif
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index eb5f075873cd..e4122ee8a999 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -66,6 +66,8 @@ vtd_frr_new(int index, uint64_t hi, uint64_t lo) "index %d high 0x%"PRIx64" low
 vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
+vtd_access_update(uint64_t slpte) "slpte 0x%"PRIx64
+vtd_dirty_update(uint64_t slpte) "slpte 0x%"PRIx64
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
@ 2022-04-28 21:13   ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: John G . Johnson, kvm, Michael S. Tsirkin, Jason Wang, Peter Xu,
	Joao Martins, Eric Blake, Yi Liu, Juan Quintela,
	David Hildenbrand, Markus Armbruster, Nicolin Chen,
	Jason Gunthorpe, Kevin Tian, Richard Henderson,
	Dr. David Alan Gilbert, Eric Auger, Alex Williamson,
	Thanos Makatos, Eduardo Habkost, Yishai Hadas, Cornelia Huck,
	Philippe Mathieu-Daudé,
	Paolo Bonzini

IOMMU advertises Access/Dirty bits if the extended capability
DMAR register reports it (ECAP, mnemonic ECAP.SSADS albeit it used
to be known as SLADS before). The first stage table, though, has no bit for
advertising Access/Dirty, unless referenced via a scalable-mode PASID Entry.
Relevant Intel IOMMU SDM ref for first stage table "3.6.2 Accessed, Extended
Accessed, and Dirty Flags" and second stage table "3.7.2 Accessed and Dirty
Flags".

To enable it we depend on scalable-mode for the second-stage table,
so we limit use of dirty-bit to scalable-mode To use SSADS, we set a bit in
the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
we do so we require flushing the context/pasid-table caches and IOTLB much
like AMD. Relevant SDM refs:

"3.7.2 Accessed and Dirty Flags"
"6.5.3.3 Guidance to Software for Invalidations,
 Table 23. Guidance to Software for Invalidations"

To read out what's dirtied, same thing as past implementations is done.
Dirty bit support is located in the same location (bit 9). The IOTLB
caches some attributes when SSADE is enabled and dirty-ness information,
so we also need to flush IOTLB to make sure IOMMU attempts to set the
dirty bit again. Relevant manuals over the hardware translation is
chapter 6 with some special mention to:

"6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
"6.2.4 IOTLB"

The first 12bits of the PTE are already cached via the SLPTE pointer,
similar to how it is added in amd-iommu. Use also the previously
added PASID entry cache in order to fetch whether Dirty bit was
enabled or not in the SM second stage table.

This is useful for covering and prototyping IOMMU support for access/dirty
bits.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/intel_iommu.c          | 85 ++++++++++++++++++++++++++++++----
 hw/i386/intel_iommu_internal.h |  4 ++
 hw/i386/trace-events           |  2 +
 3 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 752940fa4c0e..e946f793a968 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -651,6 +651,21 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
     return slpte;
 }
 
+/* Get the content of a spte located in @base_addr[@index] */
+static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
+                              uint64_t slpte)
+{
+
+    if (dma_memory_write(&address_space_memory,
+                         base_addr + index * sizeof(slpte), &slpte,
+                         sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
+        slpte = (uint64_t)-1;
+        return slpte;
+    }
+
+    return vtd_get_slpte(base_addr, index);
+}
+
 /* Given an iova and the level of paging structure, return the offset
  * of current level.
  */
@@ -720,6 +735,11 @@ static inline bool vtd_pe_present(VTDPASIDEntry *pe)
     return pe->val[0] & VTD_PASID_ENTRY_P;
 }
 
+static inline bool vtd_pe_slad_enabled(VTDPASIDEntry *pe)
+{
+    return pe->val[0] & VTD_SM_PASID_ENTRY_SLADE;
+}
+
 static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
                                           uint32_t pasid,
                                           dma_addr_t addr,
@@ -1026,6 +1046,33 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
     return NULL;
 }
 
+static inline bool vtd_ssad_update(IntelIOMMUState *s, uint16_t pe_flags,
+                                   uint64_t *slpte, bool is_write,
+                                   bool reads, bool writes)
+{
+    bool dirty, access = reads;
+
+    if (!(pe_flags & VTD_SM_PASID_ENTRY_SLADE)) {
+        return false;
+    }
+
+    dirty = access = false;
+
+    if (is_write && writes && !(*slpte & VTD_SL_D)) {
+        *slpte |= VTD_SL_D;
+        trace_vtd_dirty_update(*slpte);
+        dirty = true;
+    }
+
+    if (((!is_write && reads) || dirty) && !(*slpte & VTD_SL_A)) {
+        *slpte |= VTD_SL_A;
+        trace_vtd_access_update(*slpte);
+        access = true;
+    }
+
+    return dirty || access;
+}
+
 /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
@@ -1039,6 +1086,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
     uint32_t offset;
     uint64_t slpte;
     uint64_t access_right_check;
+    uint16_t pe_flags;
 
     if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
@@ -1054,14 +1102,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
-            error_report_once("%s: detected read error on DMAR slpte "
-                              "(iova=0x%" PRIx64 ")", __func__, iova);
-            if (level == vtd_get_iova_level(s, ce)) {
-                /* Invalid programming of context-entry */
-                return -VTD_FR_CONTEXT_ENTRY_INV;
-            } else {
-                return -VTD_FR_PAGING_ENTRY_INV;
-            }
+            goto inv_slpte;
         }
         *reads = (*reads) && (slpte & VTD_SL_R);
         *writes = (*writes) && (slpte & VTD_SL_W);
@@ -1081,6 +1122,14 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         }
 
         if (vtd_is_last_slpte(slpte, level)) {
+            pe_flags = vtd_sm_pasid_entry_flags(s, ce);
+            if (vtd_ssad_update(s, pe_flags, &slpte, is_write,
+                                *reads, *writes)) {
+                slpte = vtd_set_slpte(addr, offset, slpte);
+                if (slpte == (uint64_t)-1) {
+                    goto inv_slpte;
+                }
+            }
             *slptep = slpte;
             *slpte_level = level;
             return 0;
@@ -1088,6 +1137,16 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
         addr = vtd_get_slpte_addr(slpte, aw_bits);
         level--;
     }
+
+inv_slpte:
+    error_report_once("%s: detected read error on DMAR slpte "
+                      "(iova=0x%" PRIx64 ")", __func__, iova);
+    if (level == vtd_get_iova_level(s, ce)) {
+        /* Invalid programming of context-entry */
+        return -VTD_FR_CONTEXT_ENTRY_INV;
+    } else {
+        return -VTD_FR_PAGING_ENTRY_INV;
+    }
 }
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEvent *event, void *private);
@@ -1742,6 +1801,13 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         slpte = iotlb_entry->slpte;
         access_flags = iotlb_entry->access_flags;
         page_mask = iotlb_entry->mask;
+        if (vtd_ssad_update(s, iotlb_entry->sm_pe_flags, &slpte, is_write,
+                            access_flags & IOMMU_RO, access_flags & IOMMU_WO)) {
+            uint32_t offset;
+
+            offset = vtd_iova_level_offset(addr, vtd_get_iova_level(s, &ce));
+            vtd_set_slpte(addr, offset, slpte);
+        }
         goto out;
     }
 
@@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
     if (s->scalable_mode) {
-        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
+                   VTD_ECAP_SLADS;
     }
 
     if (s->snoop_control) {
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 1ff13b40f9bb..c00f6e7b4a72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,6 +192,7 @@
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_SMTS               (1ULL << 43)
+#define VTD_ECAP_SLADS              (1ULL << 45)
 #define VTD_ECAP_SLTS               (1ULL << 46)
 
 /* CAP_REG */
@@ -492,6 +493,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_SLADE       (1ULL << 9)
 
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
@@ -515,5 +517,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SL_PT_BASE_ADDR_MASK(aw) (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK(aw))
 #define VTD_SL_IGN_COM              0xbff0000000000000ULL
 #define VTD_SL_TM                   (1ULL << 62)
+#define VTD_SL_A                    (1ULL << 8)
+#define VTD_SL_D                    (1ULL << 9)
 
 #endif
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index eb5f075873cd..e4122ee8a999 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -66,6 +66,8 @@ vtd_frr_new(int index, uint64_t hi, uint64_t lo) "index %d high 0x%"PRIx64" low
 vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
+vtd_access_update(uint64_t slpte) "slpte 0x%"PRIx64
+vtd_dirty_update(uint64_t slpte) "slpte 0x%"PRIx64
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 05/10] linux-headers: import iommufd.h hwpt extensions
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

Generated with `scripts/update-linux-headers`
from github.com:jpemartins/linux:iommufd
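
As an illustration of the imported bitmap layout, a minimal sketch of
a consumer-side test (this helper is not part of the header; its name
is made up here):

  static bool iommufd_dirty_data_test(struct iommufd_dirty_data *bitmap,
                                      __u64 iova)
  {
      /* index of the page_size-sized unit, relative to the bitmap base */
      __u64 bit = (iova - bitmap->iova) / bitmap->page_size;

      /* each __u64 word in @data covers 64 such units */
      return bitmap->data[bit / 64] & (1ULL << (bit % 64));
  }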

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 linux-headers/linux/iommufd.h | 78 +++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
index 6c3cd9e259e2..e3c981f81e43 100644
--- a/linux-headers/linux/iommufd.h
+++ b/linux-headers/linux/iommufd.h
@@ -43,6 +43,9 @@ enum {
 	IOMMUFD_CMD_IOAS_COPY,
 	IOMMUFD_CMD_IOAS_UNMAP,
 	IOMMUFD_CMD_VFIO_IOAS,
+	IOMMUFD_CMD_HWPT_SET_DIRTY,
+	IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA,
+	IOMMUFD_CMD_IOAS_UNMAP_DIRTY,
 };
 
 /**
@@ -220,4 +223,79 @@ struct iommu_vfio_ioas {
 	__u16 __reserved;
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+
+/**
+ * enum iommufd_set_dirty_flags - Flags for controlling dirty tracking
+ * @IOMMU_DIRTY_TRACKING_DISABLED: Disables dirty tracking
+ * @IOMMU_DIRTY_TRACKING_ENABLED: Enables dirty tracking
+ */
+enum iommufd_set_dirty_flags {
+	IOMMU_DIRTY_TRACKING_DISABLED = 0,
+	IOMMU_DIRTY_TRACKING_ENABLED = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_set_dirty - ioctl(IOMMU_HWPT_SET_DIRTY)
+ * @size: sizeof(struct iommu_hwpt_set_dirty)
+ * @flags: Flags to control dirty tracking status.
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain.
+ *
+ * Toggle dirty tracking on an HW pagetable.
+ */
+struct iommu_hwpt_set_dirty {
+	__u32 size;
+	__u32 flags;
+	__u32 hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_SET_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_SET_DIRTY)
+
+/**
+ * struct iommufd_dirty_data - Dirty IOVA tracking bitmap
+ * @iova: base IOVA of the bitmap
+ * @length: IOVA size
+ * @page_size: page size granularity of each bit in the bitmap
+ * @data: bitmap where the dirty bits are set. Each bit in the bitmap
+ * represents a page_size unit of IOVA, starting from the base @iova.
+ * Checking whether a given IOVA is dirty:
+ *
+ *  data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))
+ */
+struct iommufd_dirty_data {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 *data;
+};
+
+/**
+ * struct iommu_hwpt_get_dirty_iova - ioctl(IOMMU_HWPT_GET_DIRTY_IOVA)
+ * @size: sizeof(struct iommu_hwpt_get_dirty_iova)
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain
+ * @bitmap: Bitmap of the range of IOVA to read out
+ */
+struct iommu_hwpt_get_dirty_iova {
+	__u32 size;
+	__u32 hwpt_id;
+	struct iommufd_dirty_data bitmap;
+};
+#define IOMMU_HWPT_GET_DIRTY_IOVA _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_GET_DIRTY_IOVA)
+
+/**
+ * struct iommu_ioas_unmap_dirty - ioctl(IOMMU_IOAS_UNMAP_DIRTY)
+ * @size: sizeof(struct iommu_ioas_unmap_dirty)
+ * @ioas_id: IOAS ID to unmap the mapping of
+ * @bitmap: Dirty bitmap of the range of IOVA to unmap
+ *
+ * Unmap an IOVA range and return a bitmap of the dirty bits.
+ * The iova/length must exactly match a range
+ * used with IOMMU_IOAS_PAGETABLE_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap_dirty {
+	__u32 size;
+	__u32 ioas_id;
+	struct iommufd_dirty_data bitmap;
+};
+#define IOMMU_IOAS_UNMAP_DIRTY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP_DIRTY)
+
 #endif
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 06/10] vfio/iommufd: Add HWPT_SET_DIRTY support
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, arg) is the UAPI
that enables or disables dirty page tracking. We set it
on the whole list of iommu domains we are tracking, and
set ::dirty_pages_supported accordingly, which is later
consulted when reading out the dirty bits from the hw pagetables.
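
For illustration, a minimal sketch of the raw UAPI call that
iommufd_set_dirty_tracking() below wraps (error handling elided;
iommufd and hwpt_id are assumed to be already set up):

  struct iommu_hwpt_set_dirty set_dirty = {
      .size = sizeof(set_dirty),
      .flags = IOMMU_DIRTY_TRACKING_ENABLED,
      .hwpt_id = hwpt_id,
  };

  if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &set_dirty)) {
      /* tracking could not be enabled on this hw pagetable */
  }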

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/iommufd/iommufd.c         | 18 ++++++++++++
 hw/iommufd/trace-events      |  1 +
 hw/vfio/iommufd.c            | 57 ++++++++++++++++++++++++++++++++++++
 include/hw/iommufd/iommufd.h |  1 +
 4 files changed, 77 insertions(+)

diff --git a/hw/iommufd/iommufd.c b/hw/iommufd/iommufd.c
index 4e8179d612a5..e5aff5deaf14 100644
--- a/hw/iommufd/iommufd.c
+++ b/hw/iommufd/iommufd.c
@@ -201,6 +201,24 @@ int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
     return !ret ? 0 : -errno;
 }
 
+int iommufd_set_dirty_tracking(int iommufd, uint32_t hwpt_id, bool start)
+{
+    int ret;
+    struct iommu_hwpt_set_dirty set_dirty = {
+            .size = sizeof(set_dirty),
+            .hwpt_id = hwpt_id,
+            .flags = !start ? IOMMU_DIRTY_TRACKING_DISABLED :
+                        IOMMU_DIRTY_TRACKING_ENABLED,
+    };
+
+    ret = ioctl(iommufd, IOMMU_HWPT_SET_DIRTY, &set_dirty);
+    trace_iommufd_set_dirty(iommufd, hwpt_id, start, ret);
+    if (ret) {
+        error_report("IOMMU_HWPT_SET_DIRTY failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
 static void iommufd_register_types(void)
 {
     qemu_mutex_init(&iommufd_lock);
diff --git a/hw/iommufd/trace-events b/hw/iommufd/trace-events
index 615d80cdf42c..d3c2b5a0ab95 100644
--- a/hw/iommufd/trace-events
+++ b/hw/iommufd/trace-events
@@ -9,3 +9,4 @@ iommufd_put_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
 iommufd_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
 iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)"
+iommufd_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%d enable=%d (%d)"
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8ff5988b0773..8146407feedd 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -33,6 +33,7 @@
 #include "hw/qdev-core.h"
 #include "sysemu/reset.h"
 #include "qemu/cutils.h"
+#include "migration/migration.h"
 
 static bool iommufd_check_extension(VFIOContainer *bcontainer,
                                     VFIOContainerFeature feat)
@@ -82,6 +83,25 @@ static int iommufd_unmap(VFIOContainer *bcontainer,
                              container->ioas_id, iova, size);
 }
 
+static void iommufd_set_dirty_page_tracking(VFIOContainer *bcontainer,
+                                            bool start)
+{
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer, obj);
+    int ret;
+    VFIOIOASHwpt *hwpt;
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        ret = iommufd_set_dirty_tracking(container->iommufd,
+                                         hwpt->hwpt_id, start);
+        if (ret) {
+            return;
+        }
+    }
+
+    bcontainer->dirty_pages_supported = start;
+}
+
 static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
 {
     long int vfio_id = -1, ret = -ENOTTY;
@@ -304,6 +324,40 @@ static int vfio_device_reset(VFIODevice *vbasedev)
     return 0;
 }
 
+static bool vfio_iommufd_devices_all_dirty_tracking(VFIOContainer *bcontainer)
+{
+    MigrationState *ms = migrate_get_current();
+    VFIOIOMMUFDContainer *container;
+    VFIODevice *vbasedev;
+    VFIOIOASHwpt *hwpt;
+
+    if (bcontainer->dirty_pages_supported) {
+        return true;
+    }
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev, &hwpt->device_list, hwpt_next) {
+            VFIOMigration *migration = vbasedev->migration;
+
+            if (!migration) {
+                return false;
+            }
+
+            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
+                && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 static int vfio_iommufd_container_reset(VFIOContainer *bcontainer)
 {
     VFIOIOMMUFDContainer *container;
@@ -446,6 +500,7 @@ static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
      */
 
     vfio_as_add_container(space, bcontainer);
+    bcontainer->dirty_pages_supported = true;
     bcontainer->initialized = true;
 
 out:
@@ -554,6 +609,8 @@ static void vfio_iommufd_class_init(ObjectClass *klass,
     vccs->attach_device = iommufd_attach_device;
     vccs->detach_device = iommufd_detach_device;
     vccs->reset = vfio_iommufd_container_reset;
+    vccs->devices_all_dirty_tracking = vfio_iommufd_devices_all_dirty_tracking;
+    vccs->set_dirty_page_tracking = iommufd_set_dirty_page_tracking;
 }
 
 static const TypeInfo vfio_iommufd_info = {
diff --git a/include/hw/iommufd/iommufd.h b/include/hw/iommufd/iommufd.h
index 59835cddcacf..61fd83771099 100644
--- a/include/hw/iommufd/iommufd.h
+++ b/include/hw/iommufd/iommufd.h
@@ -33,5 +33,6 @@ int iommufd_map_dma(int iommufd, uint32_t ioas, hwaddr iova,
                     ram_addr_t size, void *vaddr, bool readonly);
 int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
                      hwaddr iova, ram_addr_t size, bool readonly);
+int iommufd_set_dirty_tracking(int iommufd, uint32_t hwpt_id, bool start);
 bool iommufd_supported(void);
 #endif /* HW_IOMMUFD_IOMMUFD_H */
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 07/10] vfio/iommufd: Add HWPT_GET_DIRTY_IOVA support
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, arg) is the UAPI
that fetches the bitmap reporting which pages were dirtied in
an IOVA range.

A single bitmap is allocated and used across all the hw_pagetables
sharing an IOAS, and it is then used in log_sync() to set the QEMU
global bitmaps.

There's no point in even attempting to fetch these bitmaps if the
iommu tracker failed to start in a previous call to HWPT_SET_DIRTY.
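
As a sketch, the read-out path added below boils down to the
following (error handling and tracing elided; one bit per host page,
rounded up to whole 64-bit words):

  uint64_t page_size = qemu_real_host_page_size;
  uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / page_size;
  uint64_t bitmap_bytes = ROUND_UP(pages, 64) / 8;
  uint64_t *data = g_malloc0(bitmap_bytes);

  iommufd_get_dirty_iova(iommufd, hwpt_id, iova, size, page_size, data);
  cpu_physical_memory_set_dirty_lebitmap((unsigned long *)data,
                                         ram_addr, pages);
  g_free(data);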

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/iommufd/iommufd.c         | 24 ++++++++++++++++++++
 hw/iommufd/trace-events      |  1 +
 hw/vfio/iommufd.c            | 44 ++++++++++++++++++++++++++++++++++++
 include/hw/iommufd/iommufd.h |  2 ++
 4 files changed, 71 insertions(+)

diff --git a/hw/iommufd/iommufd.c b/hw/iommufd/iommufd.c
index e5aff5deaf14..bc870b5e9b2f 100644
--- a/hw/iommufd/iommufd.c
+++ b/hw/iommufd/iommufd.c
@@ -219,6 +219,30 @@ int iommufd_set_dirty_tracking(int iommufd, uint32_t hwpt_id, bool start)
     return !ret ? 0 : -errno;
 }
 
+int iommufd_get_dirty_iova(int iommufd, uint32_t hwpt_id, uint64_t iova,
+                           ram_addr_t size, uint64_t page_size, uint64_t *data)
+{
+    int ret;
+    struct iommu_hwpt_get_dirty_iova get_dirty_iova = {
+        .size = sizeof(get_dirty_iova),
+        .hwpt_id = hwpt_id,
+        .bitmap = {
+            .iova = iova, .length = size,
+            .page_size = page_size, .data = (__u64 *)data,
+        },
+    };
+
+    ret = ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_IOVA, &get_dirty_iova);
+    trace_iommufd_get_dirty_iova(iommufd, hwpt_id, iova, size, page_size, ret);
+    if (ret) {
+        error_report("IOMMU_HWPT_GET_DIRTY_IOVA (iova: 0x%"PRIx64
+                     " size: 0x%"PRIx64") failed: %s", iova,
+                     size, strerror(errno));
+    }
+
+    return !ret ? 0 : -errno;
+}
+
 static void iommufd_register_types(void)
 {
     qemu_mutex_init(&iommufd_lock);
diff --git a/hw/iommufd/trace-events b/hw/iommufd/trace-events
index d3c2b5a0ab95..9fe2cc60c6fe 100644
--- a/hw/iommufd/trace-events
+++ b/hw/iommufd/trace-events
@@ -10,3 +10,4 @@ iommufd_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int
 iommufd_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
 iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)"
 iommufd_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%d enable=%d (%d)"
+iommufd_get_dirty_iova(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%d iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8146407feedd..6c12239a40ab 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -33,6 +33,7 @@
 #include "hw/qdev-core.h"
 #include "sysemu/reset.h"
 #include "qemu/cutils.h"
+#include "exec/ram_addr.h"
 #include "migration/migration.h"
 
 static bool iommufd_check_extension(VFIOContainer *bcontainer,
@@ -102,6 +103,48 @@ static void iommufd_set_dirty_page_tracking(VFIOContainer *bcontainer,
     bcontainer->dirty_pages_supported = start;
 }
 
+static int iommufd_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
+                                    uint64_t size, ram_addr_t ram_addr)
+{
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer, obj);
+    int ret = 0;
+    VFIOIOASHwpt *hwpt;
+    unsigned long *data, page_size, bitmap_size, pages;
+
+    if (!bcontainer->dirty_pages_supported) {
+        return 0;
+    }
+
+    page_size = qemu_real_host_page_size;
+    pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
+    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    data = g_try_malloc0(bitmap_size);
+    if (!data) {
+        ret = -ENOMEM;
+        goto err_out;
+    }
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        ret = iommufd_get_dirty_iova(container->iommufd, hwpt->hwpt_id,
+                                     iova, size, page_size, data);
+        if (ret) {
+            goto err_out;
+        }
+    }
+
+    cpu_physical_memory_set_dirty_lebitmap(data, ram_addr, pages);
+
+    trace_vfio_get_dirty_bitmap(container->iommufd, iova, size, bitmap_size,
+                                ram_addr);
+
+err_out:
+    g_free(data);
+
+    return ret;
+}
+
 static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
 {
     long int vfio_id = -1, ret = -ENOTTY;
@@ -611,6 +654,7 @@ static void vfio_iommufd_class_init(ObjectClass *klass,
     vccs->reset = vfio_iommufd_container_reset;
     vccs->devices_all_dirty_tracking = vfio_iommufd_devices_all_dirty_tracking;
     vccs->set_dirty_page_tracking = iommufd_set_dirty_page_tracking;
+    vccs->get_dirty_bitmap = iommufd_get_dirty_bitmap;
 }
 
 static const TypeInfo vfio_iommufd_info = {
diff --git a/include/hw/iommufd/iommufd.h b/include/hw/iommufd/iommufd.h
index 61fd83771099..9b467e57723b 100644
--- a/include/hw/iommufd/iommufd.h
+++ b/include/hw/iommufd/iommufd.h
@@ -34,5 +34,7 @@ int iommufd_map_dma(int iommufd, uint32_t ioas, hwaddr iova,
 int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
                      hwaddr iova, ram_addr_t size, bool readonly);
 int iommufd_set_dirty_tracking(int iommufd, uint32_t hwpt_id, bool start);
+int iommufd_get_dirty_iova(int iommufd, uint32_t hwpt_id, uint64_t iova,
+                           ram_addr_t size, uint64_t page_size, uint64_t *data);
 bool iommufd_supported(void);
 #endif /* HW_IOMMUFD_IOMMUFD_H */
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 08/10] vfio/iommufd: Add IOAS_UNMAP_DIRTY support
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

The ioctl(iommufd, IOAS_UNMAP_DIRTY) performs an unmap
of an IOVA range and returns whether or not it was dirty.

The kernel atomically clears the IOPTE while reporting
whether the old IOPTE was dirty. In theory this is needed
for the vIOMMU case, to handle a potentially erroneous guest
PCI device performing DMA on an IOVA that is simultaneously
being IOMMU-unmapped, so that the dirty page can still be
transferred to the destination.
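
For illustration, the raw UAPI usage this wraps (a sketch; error
handling elided, and ioas_id/iova/size plus a pre-sized bitmap are
assumed to be set up as in iommufd_unmap_bitmap() below):

  struct iommu_ioas_unmap_dirty unmap = {
      .size = sizeof(unmap),
      .ioas_id = ioas_id,
      .bitmap = {
          .iova = iova,
          .length = size,
          .page_size = page_size,
          .data = data,
      },
  };

  /* atomically unmaps [iova, iova + size) and reports the dirty bits */
  ioctl(iommufd, IOMMU_IOAS_UNMAP_DIRTY, &unmap);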

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/iommufd/iommufd.c         | 21 +++++++++++
 hw/iommufd/trace-events      |  1 +
 hw/vfio/iommufd.c            | 72 +++++++++++++++++++++++++++++++++++-
 include/hw/iommufd/iommufd.h |  3 ++
 4 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/hw/iommufd/iommufd.c b/hw/iommufd/iommufd.c
index bc870b5e9b2f..0f7d9f22ae52 100644
--- a/hw/iommufd/iommufd.c
+++ b/hw/iommufd/iommufd.c
@@ -243,6 +243,27 @@ int iommufd_get_dirty_iova(int iommufd, uint32_t hwpt_id, uint64_t iova,
     return !ret ? 0 : -errno;
 }
 
+int iommufd_unmap_dma_dirty(int iommufd, uint32_t ioas, hwaddr iova,
+                            ram_addr_t size, uint64_t page_size, uint64_t *data)
+{
+    int ret;
+    struct iommu_ioas_unmap_dirty unmap = {
+        .size = sizeof(unmap),
+        .ioas_id = ioas,
+        .bitmap = {
+            .iova = iova, .length = size,
+            .page_size = page_size, .data = (__u64 *)data,
+        },
+    };
+
+    ret = ioctl(iommufd, IOMMU_IOAS_UNMAP_DIRTY, &unmap);
+    trace_iommufd_unmap_dma_dirty(iommufd, ioas, iova, size, page_size, ret);
+    if (ret) {
+        error_report("IOMMU_IOAS_UNMAP_DIRTY failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
 static void iommufd_register_types(void)
 {
     qemu_mutex_init(&iommufd_lock);
diff --git a/hw/iommufd/trace-events b/hw/iommufd/trace-events
index 9fe2cc60c6fe..3e99290a9a77 100644
--- a/hw/iommufd/trace-events
+++ b/hw/iommufd/trace-events
@@ -11,3 +11,4 @@ iommufd_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *
 iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)"
 iommufd_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%d enable=%d (%d)"
 iommufd_get_dirty_iova(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%d iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
+iommufd_unmap_dma_dirty(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 6c12239a40ab..d75ecbf2ae52 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -36,6 +36,8 @@
 #include "exec/ram_addr.h"
 #include "migration/migration.h"
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *container);
+
 static bool iommufd_check_extension(VFIOContainer *bcontainer,
                                     VFIOContainerFeature feat)
 {
@@ -72,6 +74,36 @@ static int iommufd_copy(VFIOContainer *src, VFIOContainer *dst,
                             container_dst->ioas_id, iova, size, readonly);
 }
 
+static int iommufd_unmap_bitmap(int iommufd, int ioas_id, hwaddr iova,
+                                ram_addr_t size, ram_addr_t translated)
+{
+    unsigned long *data, pgsize, bitmap_size, pages;
+    int ret;
+
+    pgsize = qemu_real_host_page_size;
+    pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
+    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    data = g_try_malloc0(bitmap_size);
+    if (!data) {
+        ret = -ENOMEM;
+        goto err_out;
+    }
+
+    ret = iommufd_unmap_dma_dirty(iommufd, ioas_id, iova, size, pgsize, data);
+    if (ret) {
+        goto err_out;
+    }
+
+    cpu_physical_memory_set_dirty_lebitmap(data, translated, pages);
+
+    trace_vfio_get_dirty_bitmap(iommufd, iova, size, bitmap_size, translated);
+
+err_out:
+    g_free(data);
+    return ret;
+}
+
 static int iommufd_unmap(VFIOContainer *bcontainer,
                          hwaddr iova, ram_addr_t size,
                          IOMMUTLBEntry *iotlb)
@@ -79,7 +111,13 @@ static int iommufd_unmap(VFIOContainer *bcontainer,
     VFIOIOMMUFDContainer *container = container_of(bcontainer,
                                                    VFIOIOMMUFDContainer, obj);
 
-    /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
+    if (iotlb && bcontainer->dirty_pages_supported &&
+        vfio_devices_all_running_and_saving(bcontainer)) {
+        return iommufd_unmap_bitmap(container->iommufd,
+                                    container->ioas_id, iova, size,
+                                    iotlb->translated_addr);
+    }
+
     return iommufd_unmap_dma(container->iommufd,
                              container->ioas_id, iova, size);
 }
@@ -367,6 +405,38 @@ static int vfio_device_reset(VFIODevice *vbasedev)
     return 0;
 }
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *bcontainer)
+{
+    MigrationState *ms = migrate_get_current();
+    VFIOIOMMUFDContainer *container;
+    VFIODevice *vbasedev;
+    VFIOIOASHwpt *hwpt;
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev, &hwpt->device_list, hwpt_next) {
+            VFIOMigration *migration = vbasedev->migration;
+
+            if (!migration) {
+                return false;
+            }
+
+            if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 static bool vfio_iommufd_devices_all_dirty_tracking(VFIOContainer *bcontainer)
 {
     MigrationState *ms = migrate_get_current();
diff --git a/include/hw/iommufd/iommufd.h b/include/hw/iommufd/iommufd.h
index 9b467e57723b..2c58b95d619c 100644
--- a/include/hw/iommufd/iommufd.h
+++ b/include/hw/iommufd/iommufd.h
@@ -36,5 +36,8 @@ int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
 int iommufd_set_dirty_tracking(int iommufd, uint32_t hwpt_id, bool start);
 int iommufd_get_dirty_iova(int iommufd, uint32_t hwpt_id, uint64_t iova,
                            ram_addr_t size, uint64_t page_size, uint64_t *data);
+int iommufd_unmap_dma_dirty(int iommufd, uint32_t ioas, hwaddr iova,
+                            ram_addr_t size, uint64_t page_size,
+                            uint64_t *data);
 bool iommufd_supported(void);
 #endif /* HW_IOMMUFD_IOMMUFD_H */
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 09/10] migration/dirtyrate: Expand dirty_bitmap to be tracked separately for devices
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

Expand the dirty rate measurer, accessible via HMP calc_dirty_rate
or QMP 'calc-dirty-rate', to receive a @scope argument. The scope
restricts dirty tracking to devices only, neither enabling nor
using the KVM (CPU) dirty tracker. The default stays as is,
i.e. dirty-ring / dirty-bitmap from KVM.

This is useful to test and exercise the IOMMU dirty tracker, and to
observe how much a given device is dirtying memory; see the usage
sketch below.
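
A usage sketch of the new flag (assuming the existing "info
dirty_rate" HMP command for querying the result):

  (qemu) calc_dirty_rate -d 10
  (qemu) info dirty_rate

where -d selects the device-only scope and 10 is the measurement
period in seconds. The QMP counterpart takes the new @scope argument
added to 'calc-dirty-rate' in qapi/migration.json.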

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 accel/kvm/kvm-all.c   | 12 +++++++++
 hmp-commands.hx       |  5 ++--
 hw/vfio/container.c   |  8 ++++++
 hw/vfio/iommufd.c     |  4 +++
 include/exec/memory.h | 10 +++++++-
 migration/dirtyrate.c | 59 +++++++++++++++++++++++++++++++++----------
 migration/dirtyrate.h |  1 +
 qapi/migration.json   | 15 +++++++++++
 softmmu/memory.c      |  5 ++++
 9 files changed, 102 insertions(+), 17 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5f1377ca048c..b4bbe0d20f6e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1517,6 +1517,10 @@ static void kvm_log_sync(MemoryListener *listener,
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
 
+    if (memory_global_dirty_devices()) {
+        return;
+    }
+
     kvm_slots_lock();
     kvm_physical_sync_dirty_bitmap(kml, section);
     kvm_slots_unlock();
@@ -1529,6 +1533,10 @@ static void kvm_log_sync_global(MemoryListener *l)
     KVMSlot *mem;
     int i;
 
+    if (memory_global_dirty_devices()) {
+        return;
+    }
+
     /* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
     kvm_dirty_ring_flush();
 
@@ -1558,6 +1566,10 @@ static void kvm_log_clear(MemoryListener *listener,
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
     int r;
 
+    if (memory_global_dirty_devices()) {
+        return;
+    }
+
     r = kvm_physical_log_clear(kml, section);
     if (r < 0) {
         error_report_once("%s: kvm log clear failed: mr=%s "
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 8476277aa9c9..28122d268ea3 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1739,10 +1739,11 @@ ERST
 
     {
         .name       = "calc_dirty_rate",
-        .args_type  = "dirty_ring:-r,dirty_bitmap:-b,second:l,sample_pages_per_GB:l?",
-        .params     = "[-r] [-b] second [sample_pages_per_GB]",
+        .args_type  = "dirty_devices:-d,dirty_ring:-r,dirty_bitmap:-b,second:l,sample_pages_per_GB:l?",
+        .params     = "[-d] [-r] [-b] second [sample_pages_per_GB]",
         .help       = "start a round of guest dirty rate measurement (using -r to"
                       "\n\t\t\t specify dirty ring as the method of calculation and"
+                      "\n\t\t\t -d to specify devices as the only scope and"
                       "\n\t\t\t -b to specify dirty bitmap as method of calculation)",
         .cmd        = hmp_calc_dirty_rate,
     },
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 6bc1b8763f75..fff8319c0036 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -84,6 +84,10 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer)
     VFIODevice *vbasedev;
     MigrationState *ms = migrate_get_current();
 
+    if (bcontainer->dirty_pages_supported) {
+        return true;
+    }
+
     if (!migration_is_setup_or_active(ms->state)) {
         return false;
     }
@@ -311,6 +315,10 @@ static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
     uint64_t pages;
     int ret;
 
+    if (!memory_global_dirty_devices()) {
+        return 0;
+    }
+
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index d75ecbf2ae52..4686cc713aac 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -150,6 +150,10 @@ static int iommufd_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
     VFIOIOASHwpt *hwpt;
     unsigned long *data, page_size, bitmap_size, pages;
 
+    if (!memory_global_dirty_devices()) {
+        return 0;
+    }
+
     if (!bcontainer->dirty_pages_supported) {
         return 0;
     }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4d5997e6bbae..59c1d8bcc495 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -69,7 +69,10 @@ static inline void fuzz_dma_read_cb(size_t addr,
 /* Dirty tracking enabled because measuring dirty rate */
 #define GLOBAL_DIRTY_DIRTY_RATE (1U << 1)
 
-#define GLOBAL_DIRTY_MASK  (0x3)
+/* Dirty tracking enabled because measuring devices dirty rate */
+#define GLOBAL_DIRTY_DIRTY_RATE_DEVICES (1U << 2)
+
+#define GLOBAL_DIRTY_MASK  (0x7)
 
 extern unsigned int global_dirty_tracking;
 
@@ -2433,6 +2436,11 @@ void memory_global_dirty_log_start(unsigned int flags);
  */
 void memory_global_dirty_log_stop(unsigned int flags);
 
+/**
+ * memory_global_dirty_devices: check if the scope is just devices
+ */
+bool memory_global_dirty_devices(void);
+
 void mtree_info(bool flatview, bool dispatch_tree, bool owner, bool disabled);
 
 /**
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index aace12a78764..8c00cb6a3702 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -45,6 +45,8 @@ static int CalculatingState = DIRTY_RATE_STATUS_UNSTARTED;
 static struct DirtyRateStat DirtyStat;
 static DirtyRateMeasureMode dirtyrate_mode =
                 DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING;
+static DirtyRateScope dirtyrate_scope =
+                DIRTY_RATE_SCOPE_ALL;
 
 static int64_t set_sample_page_period(int64_t msec, int64_t initial_time)
 {
@@ -99,6 +101,7 @@ static struct DirtyRateInfo *query_dirty_rate_info(void)
     info->calc_time = DirtyStat.calc_time;
     info->sample_pages = DirtyStat.sample_pages;
     info->mode = dirtyrate_mode;
+    info->scope = dirtyrate_scope;
 
     if (qatomic_read(&CalculatingState) == DIRTY_RATE_STATUS_MEASURED) {
         info->has_dirty_rate = true;
@@ -406,32 +409,44 @@ static inline void record_dirtypages(DirtyPageRecord *dirty_pages,
     }
 }
 
-static void dirtyrate_global_dirty_log_start(void)
+static void dirtyrate_global_dirty_log_start(DirtyRateScope scope)
 {
+    unsigned int flags = GLOBAL_DIRTY_DIRTY_RATE;
+
+    if (scope == DIRTY_RATE_SCOPE_DIRTY_DEVICES) {
+        flags |= GLOBAL_DIRTY_DIRTY_RATE_DEVICES;
+    }
+
     qemu_mutex_lock_iothread();
-    memory_global_dirty_log_start(GLOBAL_DIRTY_DIRTY_RATE);
+    memory_global_dirty_log_start(flags);
     qemu_mutex_unlock_iothread();
 }
 
-static void dirtyrate_global_dirty_log_stop(void)
+static void dirtyrate_global_dirty_log_stop(DirtyRateScope scope)
 {
+    unsigned int flags = GLOBAL_DIRTY_DIRTY_RATE;
+
+    if (scope == DIRTY_RATE_SCOPE_DIRTY_DEVICES) {
+        flags |= GLOBAL_DIRTY_DIRTY_RATE_DEVICES;
+    }
+
     qemu_mutex_lock_iothread();
     memory_global_dirty_log_sync();
-    memory_global_dirty_log_stop(GLOBAL_DIRTY_DIRTY_RATE);
+    memory_global_dirty_log_stop(flags);
     qemu_mutex_unlock_iothread();
 }
 
 static int64_t do_calculate_dirtyrate_vcpu(DirtyPageRecord dirty_pages)
 {
-    uint64_t memory_size_MB;
+    uint64_t memory_size_KB;
     int64_t time_s;
     uint64_t increased_dirty_pages =
         dirty_pages.end_pages - dirty_pages.start_pages;
 
-    memory_size_MB = (increased_dirty_pages * TARGET_PAGE_SIZE) >> 20;
+    memory_size_KB = (increased_dirty_pages * TARGET_PAGE_SIZE) >> 10;
     time_s = DirtyStat.calc_time;
 
-    return memory_size_MB / time_s;
+    return memory_size_KB / time_s;
 }
 
 static inline void record_dirtypages_bitmap(DirtyPageRecord *dirty_pages,
@@ -466,9 +481,14 @@ static void calculate_dirtyrate_dirty_bitmap(struct DirtyRateConfig config)
     int64_t msec = 0;
     int64_t start_time;
     DirtyPageRecord dirty_pages;
+    unsigned int flags = GLOBAL_DIRTY_DIRTY_RATE;
+
+    if (config.scope == DIRTY_RATE_SCOPE_DIRTY_DEVICES) {
+        flags |= GLOBAL_DIRTY_DIRTY_RATE_DEVICES;
+    }
 
     qemu_mutex_lock_iothread();
-    memory_global_dirty_log_start(GLOBAL_DIRTY_DIRTY_RATE);
+    memory_global_dirty_log_start(flags);
 
     /*
      * 1'round of log sync may return all 1 bits with
@@ -500,7 +520,7 @@ static void calculate_dirtyrate_dirty_bitmap(struct DirtyRateConfig config)
      * 1. fetch dirty bitmap from kvm
      * 2. stop dirty tracking
      */
-    dirtyrate_global_dirty_log_stop();
+    dirtyrate_global_dirty_log_stop(config.scope);
 
     record_dirtypages_bitmap(&dirty_pages, false);
 
@@ -527,7 +547,7 @@ static void calculate_dirtyrate_dirty_ring(struct DirtyRateConfig config)
     DirtyStat.dirty_ring.nvcpu = nvcpu;
     DirtyStat.dirty_ring.rates = malloc(sizeof(DirtyRateVcpu) * nvcpu);
 
-    dirtyrate_global_dirty_log_start();
+    dirtyrate_global_dirty_log_start(config.scope);
 
     CPU_FOREACH(cpu) {
         record_dirtypages(dirty_pages, cpu, true);
@@ -540,7 +560,7 @@ static void calculate_dirtyrate_dirty_ring(struct DirtyRateConfig config)
     msec = set_sample_page_period(msec, start_time);
     DirtyStat.calc_time = msec / 1000;
 
-    dirtyrate_global_dirty_log_stop();
+    dirtyrate_global_dirty_log_stop(config.scope);
 
     CPU_FOREACH(cpu) {
         record_dirtypages(dirty_pages, cpu, false);
@@ -631,6 +651,8 @@ void *get_dirtyrate_thread(void *arg)
 void qmp_calc_dirty_rate(int64_t calc_time,
                          bool has_sample_pages,
                          int64_t sample_pages,
+                         bool has_scope,
+                         DirtyRateScope scope,
                          bool has_mode,
                          DirtyRateMeasureMode mode,
                          Error **errp)
@@ -701,6 +723,7 @@ void qmp_calc_dirty_rate(int64_t calc_time,
     config.sample_period_seconds = calc_time;
     config.sample_pages_per_gigabytes = sample_pages;
     config.mode = mode;
+    config.scope = scope;
 
     cleanup_dirtyrate_stat(config);
 
@@ -709,6 +732,7 @@ void qmp_calc_dirty_rate(int64_t calc_time,
      * been used in last calculation
      **/
     dirtyrate_mode = mode;
+    dirtyrate_scope = scope;
 
     start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
     init_dirtyrate_stat(start_time, config);
@@ -736,9 +760,11 @@ void hmp_info_dirty_rate(Monitor *mon, const QDict *qdict)
                    info->calc_time);
     monitor_printf(mon, "Mode: %s\n",
                    DirtyRateMeasureMode_str(info->mode));
+    monitor_printf(mon, "Scope: %s\n",
+                   DirtyRateScope_str(info->scope));
     monitor_printf(mon, "Dirty rate: ");
     if (info->has_dirty_rate) {
-        monitor_printf(mon, "%"PRIi64" (MB/s)\n", info->dirty_rate);
+        monitor_printf(mon, "%"PRIi64" (KB/s)\n", info->dirty_rate);
         if (info->has_vcpu_dirty_rate) {
             DirtyRateVcpuList *rate, *head = info->vcpu_dirty_rate;
             for (rate = head; rate != NULL; rate = rate->next) {
@@ -762,7 +788,9 @@ void hmp_calc_dirty_rate(Monitor *mon, const QDict *qdict)
     bool has_sample_pages = (sample_pages != -1);
     bool dirty_ring = qdict_get_try_bool(qdict, "dirty_ring", false);
     bool dirty_bitmap = qdict_get_try_bool(qdict, "dirty_bitmap", false);
+    bool dirty_devices = qdict_get_try_bool(qdict, "dirty_devices", false);
     DirtyRateMeasureMode mode = DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING;
+    DirtyRateScope scope = DIRTY_RATE_SCOPE_ALL;
     Error *err = NULL;
 
     if (!sec) {
@@ -781,9 +809,12 @@ void hmp_calc_dirty_rate(Monitor *mon, const QDict *qdict)
     } else if (dirty_ring) {
         mode = DIRTY_RATE_MEASURE_MODE_DIRTY_RING;
     }
+    if (dirty_devices) {
+        scope = DIRTY_RATE_SCOPE_DIRTY_DEVICES;
+    }
 
-    qmp_calc_dirty_rate(sec, has_sample_pages, sample_pages, true,
-                        mode, &err);
+    qmp_calc_dirty_rate(sec, has_sample_pages, sample_pages,
+                        true, scope, true, mode, &err);
     if (err) {
         hmp_handle_error(mon, err);
         return;
diff --git a/migration/dirtyrate.h b/migration/dirtyrate.h
index 69d4c5b8655f..4061edf9f4de 100644
--- a/migration/dirtyrate.h
+++ b/migration/dirtyrate.h
@@ -44,6 +44,7 @@ struct DirtyRateConfig {
     uint64_t sample_pages_per_gigabytes; /* sample pages per GB */
     int64_t sample_period_seconds; /* time duration between two sampling */
     DirtyRateMeasureMode mode; /* mode of dirtyrate measurement */
+    DirtyRateScope scope; /* scope of dirtyrate measurement */
 };
 
 /*
diff --git a/qapi/migration.json b/qapi/migration.json
index 27d7b281581d..082830c6e771 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1793,6 +1793,19 @@
 { 'enum': 'DirtyRateMeasureMode',
   'data': ['page-sampling', 'dirty-ring', 'dirty-bitmap'] }
 
+##
+# @DirtyRateScope:
+#
+# An enumeration of the scope of dirty rate measurement.
+#
+# @dirty-devices: calculate dirty rate for devices only.
+#
+# Since: 7.1
+#
+##
+{ 'enum': 'DirtyRateScope',
+  'data': ['all', 'dirty-devices'] }
+
 ##
 # @DirtyRateInfo:
 #
@@ -1827,6 +1840,7 @@
            'calc-time': 'int64',
            'sample-pages': 'uint64',
            'mode': 'DirtyRateMeasureMode',
+           'scope': 'DirtyRateScope',
            '*vcpu-dirty-rate': [ 'DirtyRateVcpu' ] } }
 
 ##
@@ -1851,6 +1865,7 @@
 ##
 { 'command': 'calc-dirty-rate', 'data': {'calc-time': 'int64',
                                          '*sample-pages': 'int',
+                                         '*scope': 'DirtyRateScope',
                                          '*mode': 'DirtyRateMeasureMode'} }
 
 ##
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bfa5d5178c5b..120c41f3b303 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2826,6 +2826,11 @@ void memory_global_dirty_log_start(unsigned int flags)
     }
 }
 
+bool memory_global_dirty_devices(void)
+{
+    return !!(global_dirty_tracking & GLOBAL_DIRTY_DIRTY_RATE_DEVICES);
+}
+
 static void memory_global_dirty_log_do_stop(unsigned int flags)
 {
     assert(flags && !(flags & (~GLOBAL_DIRTY_MASK)));
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC 10/10] hw/vfio: Add nr of dirty pages to tracepoints
  2022-04-28 21:13 ` Joao Martins
@ 2022-04-28 21:13   ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-28 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Joao Martins, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

Print the number of dirty pages after calling
cpu_physical_memory_set_dirty_lebitmap() in the vfio_get_dirty_bitmap
tracepoint. Additionally, print the number of dirty pages for the
unmap case under a new tracepoint called vfio_set_dirty_pages.
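
With both tracepoints enabled (e.g. -trace 'vfio_*dirty*'), the output
looks along these lines (addresses and counts are made up, purely to
illustrate the format):

  vfio_get_dirty_bitmap container fd=24, iova=0x0 size= 0x80000000 bitmap_size=0x10000 start=0x0 dirty=1024
  vfio_set_dirty_pages container fd=24, iova=0xfef00000 size=0x100000 nr_pages=16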

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/container.c  | 12 +++++++++---
 hw/vfio/iommufd.c    | 10 ++++++----
 hw/vfio/trace-events |  3 ++-
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index fff8319c0036..b17d3499d9a1 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -146,6 +146,7 @@ static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
     uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
+    uint64_t dirty;
     int ret;
 
     unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
@@ -181,8 +182,11 @@ static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
 
     ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
     if (!ret) {
+        dirty = total_dirty_pages;
         cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
                 iotlb->translated_addr, pages);
+        trace_vfio_set_dirty_pages(container->fd, iova, size,
+                                   total_dirty_pages - dirty);
     } else {
         error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
     }
@@ -312,7 +316,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
                                                   VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
-    uint64_t pages;
+    uint64_t pages, dirty;
     int ret;
 
     if (!memory_global_dirty_devices()) {
@@ -351,11 +355,13 @@ static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
         goto err_out;
     }
 
+
+    dirty = total_dirty_pages;
     cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
                                             ram_addr, pages);
-
     trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
-                                range->bitmap.size, ram_addr);
+                                range->bitmap.size, ram_addr,
+                                total_dirty_pages - dirty);
 err_out:
     g_free(range->bitmap.data);
     g_free(dbitmap);
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 4686cc713aac..461030bb7135 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -77,7 +77,7 @@ static int iommufd_copy(VFIOContainer *src, VFIOContainer *dst,
 static int iommufd_unmap_bitmap(int iommufd, int ioas_id, hwaddr iova,
                                 ram_addr_t size, ram_addr_t translated)
 {
-    unsigned long *data, pgsize, bitmap_size, pages;
+    unsigned long *data, pgsize, bitmap_size, pages, dirty;
     int ret;
 
     pgsize = qemu_real_host_page_size;
@@ -95,9 +95,10 @@ static int iommufd_unmap_bitmap(int iommufd, int ioas_id, hwaddr iova,
         goto err_out;
     }
 
+    dirty = total_dirty_pages;
     cpu_physical_memory_set_dirty_lebitmap(data, translated, pages);
 
-    trace_vfio_get_dirty_bitmap(iommufd, iova, size, bitmap_size, translated);
+    trace_vfio_set_dirty_pages(iommufd, iova, size, total_dirty_pages - dirty);
 
 err_out:
     g_free(data);
@@ -148,7 +149,7 @@ static int iommufd_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
                                                    VFIOIOMMUFDContainer, obj);
     int ret;
     VFIOIOASHwpt *hwpt;
-    unsigned long *data, page_size, bitmap_size, pages;
+    unsigned long *data, page_size, bitmap_size, pages, dirty;
 
     if (!memory_global_dirty_devices()) {
         return 0;
@@ -176,10 +177,11 @@ static int iommufd_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
         }
     }
 
+    dirty = total_dirty_pages;
     cpu_physical_memory_set_dirty_lebitmap(data, ram_addr, pages);
 
     trace_vfio_get_dirty_bitmap(container->iommufd, iova, size, bitmap_size,
-                                ram_addr);
+                                ram_addr, total_dirty_pages - dirty);
 
 err_out:
     g_free(data);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 51f04b0b80b8..c8d6348469aa 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -163,7 +163,8 @@ vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
 vfio_load_cleanup(const char *name) " (%s)"
-vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty=%"PRIu64
+vfio_set_dirty_pages(int fd, uint64_t iova, uint64_t size, uint64_t nr_pages) "container fd=%d, iova=0x%"PRIx64" size=0x%"PRIx64" nr_pages=%"PRIu64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 #iommufd.c
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-04-28 21:13   ` Joao Martins
@ 2022-04-29  2:26     ` Jason Wang
  -1 siblings, 0 replies; 37+ messages in thread
From: Jason Wang @ 2022-04-29  2:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> The IOMMU advertises Access/Dirty bits if the extended capability
> DMAR register reports it (ECAP, mnemonic ECAP.SSADS, albeit it used
> to be known as SLADS before). The first stage table, though, has no bit for
> advertising Access/Dirty, unless referenced via a scalable-mode PASID Entry.
> Relevant Intel IOMMU SDM ref for first stage table "3.6.2 Accessed, Extended
> Accessed, and Dirty Flags" and second stage table "3.7.2 Accessed and Dirty
> Flags".
>
> To enable it we depend on scalable-mode for the second-stage table,
> so we limit use of the dirty bit to scalable-mode. To use SSADS, we
> set bit 9 (SSADE) in the scalable-mode PASID table entry. When we do
> so we require flushing the context/pasid-table caches and IOTLB, much
> like AMD. Relevant SDM refs:
>
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>  Table 23. Guidance to Software for Invalidations"
>
> To read out what's dirtied, the same approach as past implementations
> is used. Dirty bit support is located in the same location (bit 9).
> The IOTLB caches some attributes and dirtiness information when SSADE
> is enabled, so we also need to flush the IOTLB to make sure the IOMMU
> attempts to set the dirty bit again. The relevant manual coverage of
> the hardware translation is chapter 6, with special mention of:
>
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
>
> The first 12 bits of the PTE are already cached via the SLPTE pointer,
> similar to how it is added in amd-iommu. Also use the previously
> added PASID entry cache in order to fetch whether the Dirty bit was
> enabled or not in the SM second stage table.
>
> This is useful for covering and prototyping IOMMU support for access/dirty
> bits.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/intel_iommu.c          | 85 ++++++++++++++++++++++++++++++----
>  hw/i386/intel_iommu_internal.h |  4 ++
>  hw/i386/trace-events           |  2 +
>  3 files changed, 82 insertions(+), 9 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 752940fa4c0e..e946f793a968 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -651,6 +651,21 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
>      return slpte;
>  }
>
> +/* Set the content of a slpte located in @base_addr[@index] */
> +static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
> +                              uint64_t slpte)
> +{
> +
> +    if (dma_memory_write(&address_space_memory,
> +                         base_addr + index * sizeof(slpte), &slpte,
> +                         sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
> +        slpte = (uint64_t)-1;
> +        return slpte;
> +    }
> +
> +    return vtd_get_slpte(base_addr, index);
> +}
> +
>  /* Given an iova and the level of paging structure, return the offset
>   * of current level.
>   */
> @@ -720,6 +735,11 @@ static inline bool vtd_pe_present(VTDPASIDEntry *pe)
>      return pe->val[0] & VTD_PASID_ENTRY_P;
>  }
>
> +static inline bool vtd_pe_slad_enabled(VTDPASIDEntry *pe)
> +{
> +    return pe->val[0] & VTD_SM_PASID_ENTRY_SLADE;
> +}
> +
>  static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>                                            uint32_t pasid,
>                                            dma_addr_t addr,
> @@ -1026,6 +1046,33 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
>      return NULL;
>  }
>
> +static inline bool vtd_ssad_update(IntelIOMMUState *s, uint16_t pe_flags,
> +                                   uint64_t *slpte, bool is_write,
> +                                   bool reads, bool writes)
> +{
> +    bool dirty, access = reads;
> +
> +    if (!(pe_flags & VTD_SM_PASID_ENTRY_SLADE)) {
> +        return false;
> +    }
> +
> +    dirty = access = false;
> +
> +    if (is_write && writes && !(*slpte & VTD_SL_D)) {
> +        *slpte |= VTD_SL_D;
> +        trace_vtd_dirty_update(*slpte);
> +        dirty = true;
> +    }
> +
> +    if (((!is_write && reads) || dirty) && !(*slpte & VTD_SL_A)) {
> +        *slpte |= VTD_SL_A;
> +        trace_vtd_access_update(*slpte);
> +        access = true;
> +    }
> +
> +    return dirty || access;
> +}
> +
>  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> @@ -1039,6 +1086,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>      uint32_t offset;
>      uint64_t slpte;
>      uint64_t access_right_check;
> +    uint16_t pe_flags;
>
>      if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
>          error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
> @@ -1054,14 +1102,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>          slpte = vtd_get_slpte(addr, offset);
>
>          if (slpte == (uint64_t)-1) {
> -            error_report_once("%s: detected read error on DMAR slpte "
> -                              "(iova=0x%" PRIx64 ")", __func__, iova);
> -            if (level == vtd_get_iova_level(s, ce)) {
> -                /* Invalid programming of context-entry */
> -                return -VTD_FR_CONTEXT_ENTRY_INV;
> -            } else {
> -                return -VTD_FR_PAGING_ENTRY_INV;
> -            }
> +            goto inv_slpte;
>          }
>          *reads = (*reads) && (slpte & VTD_SL_R);
>          *writes = (*writes) && (slpte & VTD_SL_W);
> @@ -1081,6 +1122,14 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>          }
>
>          if (vtd_is_last_slpte(slpte, level)) {
> +            pe_flags = vtd_sm_pasid_entry_flags(s, ce);
> +            if (vtd_ssad_update(s, pe_flags, &slpte, is_write,
> +                                *reads, *writes)) {
> +                slpte = vtd_set_slpte(addr, offset, slpte);
> +                if (slpte == (uint64_t)-1) {
> +                    goto inv_slpte;
> +                }
> +            }
>              *slptep = slpte;
>              *slpte_level = level;
>              return 0;
> @@ -1088,6 +1137,16 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>          addr = vtd_get_slpte_addr(slpte, aw_bits);
>          level--;
>      }
> +
> +inv_slpte:
> +    error_report_once("%s: detected read error on DMAR slpte "
> +                      "(iova=0x%" PRIx64 ")", __func__, iova);
> +    if (level == vtd_get_iova_level(s, ce)) {
> +        /* Invalid programming of context-entry */
> +        return -VTD_FR_CONTEXT_ENTRY_INV;
> +    } else {
> +        return -VTD_FR_PAGING_ENTRY_INV;
> +    }
>  }
>
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEvent *event, void *private);
> @@ -1742,6 +1801,13 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          slpte = iotlb_entry->slpte;
>          access_flags = iotlb_entry->access_flags;
>          page_mask = iotlb_entry->mask;
> +        if (vtd_ssad_update(s, iotlb_entry->sm_pe_flags, &slpte, is_write,
> +                            access_flags & IOMMU_RO, access_flags & IOMMU_WO)) {
> +                uint32_t offset;
> +
> +                offset = vtd_iova_level_offset(addr, vtd_get_iova_level(s, &ce));
> +                vtd_set_slpte(addr, offset, slpte);
> +        }
>          goto out;
>      }
>
> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
>
>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
>      if (s->scalable_mode) {
> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
> +                   VTD_ECAP_SLADS;
>      }

We probably need a dedicated command line parameter and make it compat
for pre 7.1 machines.

Otherwise we may break migration.
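
E.g. something along these lines (an untested sketch; the property and
field names are just placeholders):

    DEFINE_PROP_BOOL("x-ssads", IntelIOMMUState, ssads, true),

and in vtd_init():

    if (s->scalable_mode) {
        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
        if (s->ssads) {
            s->ecap |= VTD_ECAP_SLADS;
        }
    }

plus a hw_compat entry turning it off for older machine types:

    { "intel-iommu", "x-ssads", "off" },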

Thanks

>
>      if (s->snoop_control) {
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 1ff13b40f9bb..c00f6e7b4a72 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -192,6 +192,7 @@
>  #define VTD_ECAP_MHMV               (15ULL << 20)
>  #define VTD_ECAP_SRS                (1ULL << 31)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> +#define VTD_ECAP_SLADS              (1ULL << 45)
>  #define VTD_ECAP_SLTS               (1ULL << 46)
>
>  /* CAP_REG */
> @@ -492,6 +493,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>
>  #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>  #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_SLADE       (1ULL << 9)
>
>  /* Second Level Page Translation Pointer*/
>  #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
> @@ -515,5 +517,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_SL_PT_BASE_ADDR_MASK(aw) (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK(aw))
>  #define VTD_SL_IGN_COM              0xbff0000000000000ULL
>  #define VTD_SL_TM                   (1ULL << 62)
> +#define VTD_SL_A                    (1ULL << 8)
> +#define VTD_SL_D                    (1ULL << 9)
>
>  #endif
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index eb5f075873cd..e4122ee8a999 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -66,6 +66,8 @@ vtd_frr_new(int index, uint64_t hi, uint64_t lo) "index %d high 0x%"PRIx64" low
>  vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
>  vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
>  vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
> +vtd_access_update(uint64_t slpte) "slpte 0x%"PRIx64
> +vtd_dirty_update(uint64_t slpte) "slpte 0x%"PRIx64
>
>  # amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> --
> 2.17.2
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
@ 2022-04-29  2:26     ` Jason Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Wang @ 2022-04-29  2:26 UTC (permalink / raw)
  To: Joao Martins
  Cc: John G . Johnson, kvm, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Eric Blake, Yi Liu, Juan Quintela, David Hildenbrand,
	Markus Armbruster, Nicolin Chen, Jason Gunthorpe, Kevin Tian,
	Richard Henderson, Dr. David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini, Eduardo Habkost, Yishai Hadas,
	Cornelia Huck, Philippe Mathieu-Daudé,
	Thanos Makatos

On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> IOMMU advertises Access/Dirty bits if the extended capability
> DMAR register reports it (ECAP, mnemonic ECAP.SSADS albeit it used
> to be known as SLADS before). The first stage table, though, has no bit for
> advertising Access/Dirty, unless referenced via a scalable-mode PASID Entry.
> Relevant Intel IOMMU SDM ref for first stage table "3.6.2 Accessed, Extended
> Accessed, and Dirty Flags" and second stage table "3.7.2 Accessed and Dirty
> Flags".
>
> To enable it we depend on scalable-mode for the second-stage table,
> so we limit use of dirty-bit to scalable-mode To use SSADS, we set a bit in
> the scalable-mode PASID Table entry, by setting bit 9 (SSADE). When
> we do so we require flushing the context/pasid-table caches and IOTLB much
> like AMD. Relevant SDM refs:
>
> "3.7.2 Accessed and Dirty Flags"
> "6.5.3.3 Guidance to Software for Invalidations,
>  Table 23. Guidance to Software for Invalidations"
>
> To read out what's dirtied, same thing as past implementations is done.
> Dirty bit support is located in the same location (bit 9). The IOTLB
> caches some attributes when SSADE is enabled and dirty-ness information,
> so we also need to flush IOTLB to make sure IOMMU attempts to set the
> dirty bit again. Relevant manuals over the hardware translation is
> chapter 6 with some special mention to:
>
> "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations"
> "6.2.4 IOTLB"
>
> The first 12bits of the PTE are already cached via the SLPTE pointer,
> similar to how it is added in amd-iommu. Use also the previously
> added PASID entry cache in order to fetch whether Dirty bit was
> enabled or not in the SM second stage table.
>
> This is useful for covering and prototyping IOMMU support for access/dirty
> bits.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/intel_iommu.c          | 85 ++++++++++++++++++++++++++++++----
>  hw/i386/intel_iommu_internal.h |  4 ++
>  hw/i386/trace-events           |  2 +
>  3 files changed, 82 insertions(+), 9 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 752940fa4c0e..e946f793a968 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -651,6 +651,21 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
>      return slpte;
>  }
>
> +/* Get the content of a spte located in @base_addr[@index] */
> +static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
> +                              uint64_t slpte)
> +{
> +
> +    if (dma_memory_write(&address_space_memory,
> +                         base_addr + index * sizeof(slpte), &slpte,
> +                         sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
> +        slpte = (uint64_t)-1;
> +        return slpte;
> +    }
> +
> +    return vtd_get_slpte(base_addr, index);
> +}
> +
>  /* Given an iova and the level of paging structure, return the offset
>   * of current level.
>   */
> @@ -720,6 +735,11 @@ static inline bool vtd_pe_present(VTDPASIDEntry *pe)
>      return pe->val[0] & VTD_PASID_ENTRY_P;
>  }
>
> +static inline bool vtd_pe_slad_enabled(VTDPASIDEntry *pe)
> +{
> +    return pe->val[0] & VTD_SM_PASID_ENTRY_SLADE;
> +}
> +
>  static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>                                            uint32_t pasid,
>                                            dma_addr_t addr,
> @@ -1026,6 +1046,33 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
>      return NULL;
>  }
>
> +static inline bool vtd_ssad_update(IntelIOMMUState *s, uint16_t pe_flags,
> +                                   uint64_t *slpte, bool is_write,
> +                                   bool reads, bool writes)
> +{
> +    bool dirty, access = reads;
> +
> +    if (!(pe_flags & VTD_SM_PASID_ENTRY_SLADE)) {
> +        return false;
> +    }
> +
> +    dirty = access = false;
> +
> +    if (is_write && writes && !(*slpte & VTD_SL_D)) {
> +        *slpte |= VTD_SL_D;
> +        trace_vtd_dirty_update(*slpte);
> +        dirty = true;
> +    }
> +
> +    if (((!is_write && reads) || dirty) && !(*slpte & VTD_SL_A)) {
> +        *slpte |= VTD_SL_A;
> +        trace_vtd_access_update(*slpte);
> +        access = true;
> +    }
> +
> +    return dirty || access;
> +}
> +
>  /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> @@ -1039,6 +1086,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>      uint32_t offset;
>      uint64_t slpte;
>      uint64_t access_right_check;
> +    uint16_t pe_flags;
>
>      if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
>          error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
> @@ -1054,14 +1102,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>          slpte = vtd_get_slpte(addr, offset);
>
>          if (slpte == (uint64_t)-1) {
> -            error_report_once("%s: detected read error on DMAR slpte "
> -                              "(iova=0x%" PRIx64 ")", __func__, iova);
> -            if (level == vtd_get_iova_level(s, ce)) {
> -                /* Invalid programming of context-entry */
> -                return -VTD_FR_CONTEXT_ENTRY_INV;
> -            } else {
> -                return -VTD_FR_PAGING_ENTRY_INV;
> -            }
> +            goto inv_slpte;
>          }
>          *reads = (*reads) && (slpte & VTD_SL_R);
>          *writes = (*writes) && (slpte & VTD_SL_W);
> @@ -1081,6 +1122,14 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>          }
>
>          if (vtd_is_last_slpte(slpte, level)) {
> +            pe_flags = vtd_sm_pasid_entry_flags(s, ce);
> +            if (vtd_ssad_update(s, pe_flags, &slpte, is_write,
> +                                *reads, *writes)) {
> +                slpte = vtd_set_slpte(addr, offset, slpte);
> +                if (slpte == (uint64_t)-1) {
> +                    goto inv_slpte;
> +                }
> +            }
>              *slptep = slpte;
>              *slpte_level = level;
>              return 0;
> @@ -1088,6 +1137,16 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>          addr = vtd_get_slpte_addr(slpte, aw_bits);
>          level--;
>      }
> +
> +inv_slpte:
> +    error_report_once("%s: detected read error on DMAR slpte "
> +                      "(iova=0x%" PRIx64 ")", __func__, iova);
> +    if (level == vtd_get_iova_level(s, ce)) {
> +        /* Invalid programming of context-entry */
> +        return -VTD_FR_CONTEXT_ENTRY_INV;
> +    } else {
> +        return -VTD_FR_PAGING_ENTRY_INV;
> +    }
>  }
>
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEvent *event, void *private);
> @@ -1742,6 +1801,13 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          slpte = iotlb_entry->slpte;
>          access_flags = iotlb_entry->access_flags;
>          page_mask = iotlb_entry->mask;
> +        if (vtd_ssad_update(s, iotlb_entry->sm_pe_flags, &slpte, is_write,
> +                            access_flags & IOMMU_RO, access_flags & IOMMU_WO)) {
> +            uint32_t offset;
> +
> +            offset = vtd_iova_level_offset(addr, vtd_get_iova_level(s, &ce));
> +            vtd_set_slpte(addr, offset, slpte);
> +        }
>          goto out;
>      }
>
> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
>
>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
>      if (s->scalable_mode) {
> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
> +                   VTD_ECAP_SLADS;
>      }

We probably need a dedicated command line parameter and make it compat
for pre 7.1 machines.

Otherwise we may break migration.
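
Something like the following sketch, e.g. (assuming the knob becomes an
"x-ssads" property on intel-iommu that eventually defaults to on, and
that a hw_compat_7_0 table exists by then):

    /* hw/core/machine.c: keep the bit hidden for pre-7.1 machine types */
    GlobalProperty hw_compat_7_0[] = {
        ...
        { "intel-iommu", "x-ssads", "off" },
    };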

Thanks

>
>      if (s->snoop_control) {
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 1ff13b40f9bb..c00f6e7b4a72 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -192,6 +192,7 @@
>  #define VTD_ECAP_MHMV               (15ULL << 20)
>  #define VTD_ECAP_SRS                (1ULL << 31)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> +#define VTD_ECAP_SLADS              (1ULL << 45)
>  #define VTD_ECAP_SLTS               (1ULL << 46)
>
>  /* CAP_REG */
> @@ -492,6 +493,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>
>  #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>  #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_SLADE       (1ULL << 9)
>
>  /* Second Level Page Translation Pointer*/
>  #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
> @@ -515,5 +517,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_SL_PT_BASE_ADDR_MASK(aw) (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK(aw))
>  #define VTD_SL_IGN_COM              0xbff0000000000000ULL
>  #define VTD_SL_TM                   (1ULL << 62)
> +#define VTD_SL_A                    (1ULL << 8)
> +#define VTD_SL_D                    (1ULL << 9)
>
>  #endif
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index eb5f075873cd..e4122ee8a999 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -66,6 +66,8 @@ vtd_frr_new(int index, uint64_t hi, uint64_t lo) "index %d high 0x%"PRIx64" low
>  vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
>  vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
>  vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
> +vtd_access_update(uint64_t slpte) "slpte 0x%"PRIx64
> +vtd_dirty_update(uint64_t slpte) "slpte 0x%"PRIx64
>
>  # amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
> --
> 2.17.2
>




* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-04-29  2:26     ` Jason Wang
@ 2022-04-29  9:12       ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-04-29  9:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On 4/29/22 03:26, Jason Wang wrote:
> On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
>>
>>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
>>      if (s->scalable_mode) {
>> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
>> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
>> +                   VTD_ECAP_SLADS;
>>      }
> 
> We probably need a dedicated command line parameter and make it compat
> for pre 7.1 machines.
> 
> Otherwise we may break migration.

I can gate it behind an 'x-ssads' option (default disabled). Which reminds me that I
should probably rename it to the most recent mnemonic (SLADS no longer exists in manuals).

If we all want it enabled by default, I can add a separate patch to do so.
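
Roughly what I have in mind, as a sketch (the 'ssads' field and the
'x-ssads' property name are tentative, and VTD_ECAP_SSADS below stands
in for the renamed VTD_ECAP_SLADS):

    static Property vtd_properties[] = {
        ...
        DEFINE_PROP_BOOL("x-ssads", IntelIOMMUState, ssads, false),
        DEFINE_PROP_END_OF_LIST(),
    };

and then in vtd_init():

    if (s->scalable_mode) {
        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
        if (s->ssads) {
            s->ecap |= VTD_ECAP_SSADS;
        }
    }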

	Joao


* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-04-29  9:12       ` Joao Martins
@ 2022-04-29 18:21         ` Peter Xu
  -1 siblings, 0 replies; 37+ messages in thread
From: Peter Xu @ 2022-04-29 18:21 UTC (permalink / raw)
  To: Joao Martins, Jason Wang
  Cc: Jason Wang, qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On Fri, Apr 29, 2022 at 10:12:01AM +0100, Joao Martins wrote:
> On 4/29/22 03:26, Jason Wang wrote:
> > On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
> >>
> >>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> >>      if (s->scalable_mode) {
> >> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> >> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
> >> +                   VTD_ECAP_SLADS;
> >>      }
> > 
> > We probably need a dedicated command line parameter and make it compat
> > for pre 7.1 machines.
> > 
> > Otherwise we may break migration.
> 
> I can gate it behind an 'x-ssads' option (default disabled). Which reminds me that I
> should probably rename it to the most recent mnemonic (SLADS no longer exists in manuals).
> 
> If we all want it enabled by default, I can add a separate patch to do so.

The new option sounds good.

Jason, per our previous discussion, shall we not worry about the
compatibility issues per machine-type until the whole feature reaches a
mostly-complete stage?

There seem to be a bunch of sub-features for scalable mode and it's a
large project as a whole.  I'm worried that trying to maintain
compatibility for all the small sub-features could be an unnecessary
burden on the code base.

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 09/10] migration/dirtyrate: Expand dirty_bitmap to be tracked separately for devices
  2022-04-28 21:13   ` Joao Martins
@ 2022-05-02 12:54     ` Markus Armbruster
  -1 siblings, 0 replies; 37+ messages in thread
From: Markus Armbruster @ 2022-05-02 12:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Jason Gunthorpe,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu, Eric Auger,
	Thanos Makatos, John G . Johnson, kvm

Joao Martins <joao.m.martins@oracle.com> writes:

> Expand dirtyrate measurer that is accessible via HMP calc_dirty_rate
> or QMP 'calc-dirty-rate' to receive a @scope argument. The scope
> then restricts the dirty tracking to be done at devices only,
> while neither enabling nor using the KVM (CPU) dirty tracker.
> The default stays as is i.e. dirty-ring / dirty-bitmap from KVM.
>
> This is useful to test, exercise the IOMMU dirty tracker and observe
> how much a given device is dirtying memory.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index 27d7b281581d..082830c6e771 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -1793,6 +1793,19 @@
>  { 'enum': 'DirtyRateMeasureMode',
>    'data': ['page-sampling', 'dirty-ring', 'dirty-bitmap'] }
>  
> +##
> +# @DirtyRateScope:
> +#
> +# An enumeration of scope of measuring dirtyrate.

"dirtyrate" is not a word.

> +#
> +# @dirty-devices: calculate dirtyrate by devices only.

Please document @all, too.

> +#
> +# Since: 6.2
> +#
> +##
> +{ 'enum': 'DirtyRateScope',
> +  'data': ['all', 'dirty-devices'] }
> +
>  ##
>  # @DirtyRateInfo:
>  #
> @@ -1827,6 +1840,7 @@
>             'calc-time': 'int64',
>             'sample-pages': 'uint64',
>             'mode': 'DirtyRateMeasureMode',
> +           'scope': 'DirtyRateScope',

Please document new member @scope.

>             '*vcpu-dirty-rate': [ 'DirtyRateVcpu' ] } }
>  
>  ##
> @@ -1851,6 +1865,7 @@
>  ##
>  { 'command': 'calc-dirty-rate', 'data': {'calc-time': 'int64',
>                                           '*sample-pages': 'int',
> +                                         '*scope': 'DirtyRateScope',
>                                           '*mode': 'DirtyRateMeasureMode'} }
>  
>  ##

[...]



* Re: [PATCH RFC 09/10] migration/dirtyrate: Expand dirty_bitmap to be tracked separately for devices
  2022-05-02 12:54     ` Markus Armbruster
@ 2022-05-02 14:35       ` Joao Martins
  -1 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-05-02 14:35 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Jason Wang, Alex Williamson,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Jason Gunthorpe,
	Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu, Eric Auger,
	Thanos Makatos, John G . Johnson, kvm

On 5/2/22 13:54, Markus Armbruster wrote:
> Joao Martins <joao.m.martins@oracle.com> writes:
> 
>> Expand dirtyrate measurer that is accessible via HMP calc_dirty_rate
>> or QMP 'calc-dirty-rate' to receive a @scope argument. The scope
>> then restricts the dirty tracking to be done at devices only,
>> while neither enabling nor using the KVM (CPU) dirty tracker.
>> The default stays as is i.e. dirty-ring / dirty-bitmap from KVM.
>>
>> This is useful to test, exercise the IOMMU dirty tracker and observe
>> how much a given device is dirtying memory.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> [...]
> 
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 27d7b281581d..082830c6e771 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -1793,6 +1793,19 @@
>>  { 'enum': 'DirtyRateMeasureMode',
>>    'data': ['page-sampling', 'dirty-ring', 'dirty-bitmap'] }
>>  
>> +##
>> +# @DirtyRateScope:
>> +#
>> +# An enumeration of scope of measuring dirtyrate.
> 
> "dirtyrate" is not a word.
> 
Indeed. I will be more verbose rather than using 'dirtyrate'.

>> +#
>> +# @dirty-devices: calculate dirtyrate by devices only.
> 
> Please document @all, too.
> 
OK. I probably should have used 'vcpu' and 'devices',
rather than 'all' and 'dirty-devices'.

>> +#
>> +# Since: 6.2
>> +#

This should be 7.1.

>> +##
>> +{ 'enum': 'DirtyRateScope',
>> +  'data': ['all', 'dirty-devices'] }
>> +
>>  ##
>>  # @DirtyRateInfo:
>>  #
>> @@ -1827,6 +1840,7 @@
>>             'calc-time': 'int64',
>>             'sample-pages': 'uint64',
>>             'mode': 'DirtyRateMeasureMode',
>> +           'scope': 'DirtyRateScope',
> 
> Please document new member @scope.
> 
OK.
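
Putting all of the above together, maybe something along these lines
(member names still tentative, as said):

    ##
    # @DirtyRateScope:
    #
    # An enumeration of the scope over which the dirty page rate is
    # measured.
    #
    # @vcpu: measure memory dirtied by vCPUs only (the default).
    #
    # @devices: measure memory dirtied by devices only.
    #
    # Since: 7.1
    ##
    { 'enum': 'DirtyRateScope',
      'data': ['vcpu', 'devices'] }

Which would then be exercised via e.g.:

    -> { "execute": "calc-dirty-rate",
         "arguments": { "calc-time": 1, "scope": "devices" } }
    <- { "return": {} }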


* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-04-29 18:21         ` Peter Xu
  (?)
@ 2022-05-03 11:54         ` Joao Martins
  2022-05-05  7:41           ` Jason Wang
  -1 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-05-03 11:54 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On 4/29/22 19:21, Peter Xu wrote:
> On Fri, Apr 29, 2022 at 10:12:01AM +0100, Joao Martins wrote:
>> On 4/29/22 03:26, Jason Wang wrote:
>>> On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
>>>>
>>>>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
>>>>      if (s->scalable_mode) {
>>>> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
>>>> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
>>>> +                   VTD_ECAP_SLADS;
>>>>      }
>>>
>>> We probably need a dedicated command line parameter and make it compat
>>> for pre 7.1 machines.
>>>
>>> Otherwise we may break migration.
>>
>> I can gate it behind an 'x-ssads' option (default disabled). Which reminds me that I
>> should probably rename it to the most recent mnemonic (SLADS no longer exists in manuals).
>>
>> If we all want it enabled by default, I can add a separate patch to do so.
> 
> The new option sounds good.
> 

OK, I'll fix it then for the next iteration.

Also, perhaps I might take the emulated iommu patches out of the iommufd stuff into a
separate series. There might be a place for them in the realm of testing/prototyping.

> Jason, per our previous discussion, shall we not worry about the
> compatibility issues per machine-type until the whole feature reaches a
> mostly-complete stage?
> 
> There seem to be a bunch of sub-features for scalable mode and it's a
> large project as a whole.  I'm worried that trying to maintain
> compatibility for all the small sub-features could be an unnecessary
> burden on the code base.

Perhaps the best way to see how close we are to the spec is to check what we support in
intel-iommu in terms of VT-d revision versus how many buckets we fill in. I think
SLADS/SSADS came in with revision 3.0, IIRC.

I can take the compat stuff out if it's too early for that -- but I take it
these are questions for Jason.


* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-04-28 21:13   ` Joao Martins
  (?)
  (?)
@ 2022-05-04 20:11   ` Peter Xu
  2022-05-05  9:54     ` Joao Martins
  -1 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-05-04 20:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

Hi, Joao,

On Thu, Apr 28, 2022 at 10:13:45PM +0100, Joao Martins wrote:
> +/* Set the content of an slpte located in @base_addr[@index] */
> +static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
> +                              uint64_t slpte)
> +{
> +
> +    if (dma_memory_write(&address_space_memory,
> +                         base_addr + index * sizeof(slpte), &slpte,
> +                         sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
> +        slpte = (uint64_t)-1;
> +        return slpte;
> +    }
> +
> +    return vtd_get_slpte(base_addr, index);
> +}

Could I ask why, when the write succeeded, we need to read the slpte again?

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-05-03 11:54         ` Joao Martins
@ 2022-05-05  7:41           ` Jason Wang
  2022-05-05  9:57             ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2022-05-05  7:41 UTC (permalink / raw)
  To: Joao Martins
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On Wed, May 4, 2022 at 4:47 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 4/29/22 19:21, Peter Xu wrote:
> > On Fri, Apr 29, 2022 at 10:12:01AM +0100, Joao Martins wrote:
> >> On 4/29/22 03:26, Jason Wang wrote:
> >>> On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
> >>>>
> >>>>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> >>>>      if (s->scalable_mode) {
> >>>> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> >>>> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
> >>>> +                   VTD_ECAP_SLADS;
> >>>>      }
> >>>
> >>> We probably need a dedicated command line parameter and make it compat
> >>> for pre 7.1 machines.
> >>>
> >>> Otherwise we may break migration.
> >>
> >> I can gate it behind an 'x-ssads' option (default disabled). Which reminds me that I
> >> should probably rename it to the most recent mnemonic (SLADS no longer exists in manuals).
> >>
> >> If we all want it enabled by default, I can add a separate patch to do so.
> >
> > The new option sounds good.
> >
>
> OK, I'll fix it then for the next iteration.
>
> Also, perhaps I might take the emulated iommu patches out of the iommufd stuff into a
> separate series. There might be a place for them in the realm of testing/prototyping.

That would be better.

>
> > Jason, per our previous discussion, shall we not worry about the
> > compatibility issues per machine-type until the whole feature reaches a
> > mostly-complete stage?
> >
> > There seem to be a bunch of sub-features for scalable mode and it's a
> > large project as a whole.  I'm worried that trying to maintain
> > compatibility for all the small sub-features could be an unnecessary
> > burden on the code base.

My understanding is that, if it's not too hard, it's better for each
sub-feature to try its best to preserve compatibility. In this case,
having a dedicated option might help for debugging as well.
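
E.g. being able to simply flip it while testing (assuming the tentative
"x-ssads" name):

    -device intel-iommu,intremap=on,x-scalable-mode=on,x-ssads=on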

> Perhaps the best way to see how close we are to the spec is to check what we support in
> intel-iommu in terms of VT-d revision versus how many buckets we fill in. I think
> SLADS/SSADS came in with revision 3.0, IIRC.
>
> I can take the compat stuff out if it's too early for that -- but I take it
> these are questions for Jason.
>

There's probably no need for the compat stuff; having a dedicated
option and making it disabled by default should be fine.

Thanks



* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-05-04 20:11   ` Peter Xu
@ 2022-05-05  9:54     ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-05-05  9:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Jason Wang, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On 5/4/22 21:11, Peter Xu wrote:
> Hi, Joao,
> 
> On Thu, Apr 28, 2022 at 10:13:45PM +0100, Joao Martins wrote:
>> +/* Set the content of an slpte located in @base_addr[@index] */
>> +static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
>> +                              uint64_t slpte)
>> +{
>> +
>> +    if (dma_memory_write(&address_space_memory,
>> +                         base_addr + index * sizeof(slpte), &slpte,
>> +                         sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
>> +        slpte = (uint64_t)-1;
>> +        return slpte;
>> +    }
>> +
>> +    return vtd_get_slpte(base_addr, index);
>> +}
> 
> Could I ask why, when the write succeeded, we need to read the slpte again?

We don't; I should delete this.

Perhaps I was obsessing over making sure that what I set is what I get in
the end, and this is a remnant of that.
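
I.e. just return what was written, which keeps the callers'
(uint64_t)-1 check working -- something like:

    /* Set the content of an slpte located in @base_addr[@index] */
    static uint64_t vtd_set_slpte(dma_addr_t base_addr, uint32_t index,
                                  uint64_t slpte)
    {
        if (dma_memory_write(&address_space_memory,
                             base_addr + index * sizeof(slpte), &slpte,
                             sizeof(slpte), MEMTXATTRS_UNSPECIFIED)) {
            return (uint64_t)-1;
        }

        return slpte;
    }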


* Re: [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support
  2022-05-05  7:41           ` Jason Wang
@ 2022-05-05  9:57             ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-05-05  9:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Dr. David Alan Gilbert,
	Richard Henderson, Eduardo Habkost, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Juan Quintela, Eric Blake, Markus Armbruster,
	Jason Gunthorpe, Nicolin Chen, Yishai Hadas, Kevin Tian, Yi Liu,
	Eric Auger, Thanos Makatos, John G . Johnson, kvm

On 5/5/22 08:41, Jason Wang wrote:
> On Wed, May 4, 2022 at 4:47 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 4/29/22 19:21, Peter Xu wrote:
>>> On Fri, Apr 29, 2022 at 10:12:01AM +0100, Joao Martins wrote:
>>>> On 4/29/22 03:26, Jason Wang wrote:
>>>>> On Fri, Apr 29, 2022 at 5:14 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> @@ -3693,7 +3759,8 @@ static void vtd_init(IntelIOMMUState *s)
>>>>>>
>>>>>>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
>>>>>>      if (s->scalable_mode) {
>>>>>> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
>>>>>> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS |
>>>>>> +                   VTD_ECAP_SLADS;
>>>>>>      }
>>>>>
>>>>> We probably need a dedicated command line parameter and make it compat
>>>>> for pre 7.1 machines.
>>>>>
>>>>> Otherwise we may break migration.
>>>>
>>>> I can gate over an 'x-ssads' option (default disabled). Which reminds me that I probably
>>>> should rename to the most recent mnemonic (as SLADS no longer exists in manuals).
>>>>
>>>> If we all want by default enabled I can add a separate patch to do so.
>>>
>>> The new option sounds good.
>>>
>>
>> OK, I'll fix it then for the next iteration.
>>
>> Also, perhaps I might take the emulated iommu patches out of the iommufd stuff into a
>> separate series. There might be a place for them in the realm of testing/prototyping.
> 
> That would be better.
> 
OK, I'll do that then.

>> Perhaps best to see how close we are to spec is to check what we support in intel-iommu
>> in terms of VT-d revision versus how many buckets we fill in. I think SLADS/SSADS was in
>> 3.0 IIRC.
>>
>> I can take the compat stuff out if it's too early for that -- But I take it
>> these are questions for Jason.
>>
> 
> There's probably no need for the compat stuff; having a dedicated
> option and making it disabled by default should be fine.

/me nods


Thread overview: 37+ messages

2022-04-28 21:13 [PATCH RFC 00/10] hw/vfio, x86/iommu: IOMMUFD Dirty Tracking Joao Martins
2022-04-28 21:13 ` [PATCH RFC 01/10] amd-iommu: Cache PTE/DTE info in IOTLB Joao Martins
2022-04-28 21:13 ` [PATCH RFC 02/10] amd-iommu: Access/Dirty bit support Joao Martins
2022-04-28 21:13 ` [PATCH RFC 03/10] intel-iommu: Cache PASID entry flags Joao Martins
2022-04-28 21:13 ` [PATCH RFC 04/10] intel_iommu: Second Stage Access Dirty bit support Joao Martins
2022-04-29  2:26   ` Jason Wang
2022-04-29  9:12     ` Joao Martins
2022-04-29 18:21       ` Peter Xu
2022-05-03 11:54         ` Joao Martins
2022-05-05  7:41           ` Jason Wang
2022-05-05  9:57             ` Joao Martins
2022-05-04 20:11   ` Peter Xu
2022-05-05  9:54     ` Joao Martins
2022-04-28 21:13 ` [PATCH RFC 05/10] linux-headers: import iommufd.h hwpt extensions Joao Martins
2022-04-28 21:13 ` [PATCH RFC 06/10] vfio/iommufd: Add HWPT_SET_DIRTY support Joao Martins
2022-04-28 21:13 ` [PATCH RFC 07/10] vfio/iommufd: Add HWPT_GET_DIRTY_IOVA support Joao Martins
2022-04-28 21:13 ` [PATCH RFC 08/10] vfio/iommufd: Add IOAS_UNMAP_DIRTY support Joao Martins
2022-04-28 21:13 ` [PATCH RFC 09/10] migration/dirtyrate: Expand dirty_bitmap to be tracked separately for devices Joao Martins
2022-05-02 12:54   ` Markus Armbruster
2022-05-02 14:35     ` Joao Martins
2022-04-28 21:13 ` [PATCH RFC 10/10] hw/vfio: Add nr of dirty pages to tracepoints Joao Martins
