* [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
@ 2023-02-22 17:48 Avihai Horon
  2023-02-22 17:48 ` [PATCH v2 01/20] migration: Pass threshold_size to .state_pending_{estimate, exact}() Avihai Horon via
                   ` (21 more replies)
  0 siblings, 22 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Hello,

This series is based on the previous one that added the basic VFIO
migration protocol v2 implementation [1].

The series starts by adding pre-copy support for VFIO migration protocol
v2. Pre-copy support allows the VFIO device data to be transferred while
the VM is running. This can improve performance and reduce migration
downtime. A full description can be found here [2].

The series then implements device dirty page tracking, which allows the
VFIO device to record its DMAs and report them back when needed. This is
part of VFIO migration and is used during the pre-copy phase to track the
RAM pages that the device has written to and mark them dirty, so they can
later be re-sent to the target.

Device dirty page tracking uses the DMA logging uAPI to discover device
capabilities, to start and stop tracking, and to get the dirty page bitmap
report. Extra details and the uAPI definition can be found here [3].

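For illustration, below is a minimal sketch of starting device DMA logging
over a single IOVA range through that uAPI. Struct and ioctl names follow
the kernel headers referenced in [3]; capability probing, the STOP/REPORT
operations, multi-range setups and error reporting are omitted, and the
helper itself is not part of this series.

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Start device DMA logging over one IOVA range on an open VFIO device fd. */
static int dma_logging_start(int device_fd, uint64_t iova, uint64_t length,
                             uint64_t page_size)
{
    size_t sz = sizeof(struct vfio_device_feature) +
                sizeof(struct vfio_device_feature_dma_logging_control);
    struct vfio_device_feature *feature = calloc(1, sz);
    struct vfio_device_feature_dma_logging_control *control;
    struct vfio_device_feature_dma_logging_range range = {
        .iova = iova,
        .length = length,
    };
    int ret;

    if (!feature) {
        return -ENOMEM;
    }

    control = (void *)feature->data;
    feature->argsz = sz;
    feature->flags = VFIO_DEVICE_FEATURE_SET |
                     VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
    control->page_size = page_size;
    control->num_ranges = 1;
    control->ranges = (uintptr_t)&range;

    ret = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
    ret = ret ? -errno : 0;
    free(feature);

    return ret;
}
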
Device dirty page tracking operates at VFIOContainer scope: when dirty
tracking is started or stopped, or when a dirty page report is queried,
the operation is applied to each device within the VFIOContainer in turn.

Device dirty page tracking is used only if all devices within a
VFIOContainer support it. Otherwise, VFIO IOMMU dirty page tracking is
used, and if that is not supported either, QEMU perpetually marks all
memory as dirty. Note that since VFIO IOMMU dirty page tracking has no HW
support, the last two options usually have the same effect of perpetually
marking all pages dirty.

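A minimal sketch of that selection order (the names below are purely
illustrative and are not the helpers added by this series):

#include <stdbool.h>

typedef enum {
    TRACK_DEVICE,      /* per-device DMA logging */
    TRACK_VFIO_IOMMU,  /* VFIO IOMMU dirty tracking (no HW support) */
    TRACK_ALL_DIRTY,   /* QEMU perpetually marks all pages dirty */
} DirtyTrackingMethod;

/* Evaluated per VFIOContainer. */
static DirtyTrackingMethod
select_dirty_tracking(bool all_devices_support_dma_logging,
                      bool iommu_dirty_tracking_supported)
{
    if (all_devices_support_dma_logging) {
        return TRACK_DEVICE;
    }

    return iommu_dirty_tracking_supported ? TRACK_VFIO_IOMMU : TRACK_ALL_DIRTY;
}
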
Normally, when asked to start dirty tracking, all the currently DMA
mapped ranges are tracked by device dirty page tracking. However, when a
vIOMMU is enabled, IOVA ranges are DMA mapped/unmapped on the fly as the
vIOMMU maps/unmaps them, and these ranges can potentially be mapped
anywhere in the vIOMMU IOVA space. Due to this dynamic nature of vIOMMU
mapping/unmapping, tracking only the currently DMA mapped IOVA ranges
doesn't work well.

Thus, when a vIOMMU is enabled, we try to track the entire vIOMMU IOVA
space. If that fails (the IOVA space can be rather big and we might hit a
HW limitation), we fall back to tracking a smaller range while marking the
untracked ranges dirty.

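The sketch below illustrates that fallback, reusing the hypothetical
dma_logging_start() helper from the sketch above. The window size and the
way untracked ranges are reported dirty are assumptions for illustration,
not the actual implementation in patches 14-18.

/* Illustrative only: try the full vIOMMU IOVA space first, then retry with
 * a smaller window on failure (e.g. a HW limit on the trackable range). */
static int viommu_dirty_tracking_start(int device_fd, uint64_t max_iova,
                                       uint64_t fallback_limit,
                                       uint64_t page_size)
{
    int ret = dma_logging_start(device_fd, 0, max_iova, page_size);

    if (!ret) {
        return 0;
    }

    /*
     * Track a smaller window; the caller must then report
     * [fallback_limit, max_iova) as perpetually dirty.
     */
    return dma_logging_start(device_fd, 0, fallback_limit, page_size);
}
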
Patch breakdown:
- Patches 1-3 add VFIO migration pre-copy support.
- Patches 4-10 fix bugs and do some preparatory work required prior to
  adding device dirty page tracking.
- Patches 11-13 implement device dirty page tracking.
- Patches 14-18 add vIOMMU support to device dirty page tracking.
- Patches 19-20 enable device dirty page tracking and document it.



Changes from v1 [4]:
- Rebased on latest master branch. As part of it, made some changes in
  pre-copy to adjust it to Juan's new patches:
  1. Added a new patch that passes threshold_size parameter to
     .state_pending_{estimate,exact}() handlers.
  2. Added a new patch that refactors vfio_save_block().
  3. Changed the pre-copy patch to cache and report pending pre-copy
     size in the .state_pending_estimate() handler.
- Removed unnecessary P2P code. This should be added later on when P2P
  support is added. (Alex)
- Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
  (patch #11). (Alex)
- Stored vfio_devices_all_device_dirty_tracking()'s value in a local
  variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
- Refactored the viommu device dirty tracking ranges creation code to
  make it clearer (patch #15).
- Changed overflow check in vfio_iommu_range_is_device_tracked() to
  emphasize that we specifically check for 2^64 wrap around (patch #15).
- Added R-bs / Acks.

Thanks.

[1]
https://lore.kernel.org/qemu-devel/167658846945.932837.1420176491103357684.stgit@omen/

[2]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

[3]
https://lore.kernel.org/netdev/20220908183448.195262-4-yishaih@nvidia.com/

Avihai Horon (14):
  migration: Pass threshold_size to .state_pending_{estimate,exact}()
  vfio/migration: Refactor vfio_save_block() to return saved data size
  vfio/migration: Add VFIO migration pre-copy support
  vfio/common: Fix error reporting in vfio_get_dirty_bitmap()
  vfio/common: Fix wrong %m usages
  vfio/common: Abort migration if dirty log start/stop/sync fails
  vfio/common: Add VFIOBitmap and (de)alloc functions
  vfio/common: Extract code from vfio_get_dirty_bitmap() to new function
  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  intel-iommu: Implement get_attr() method
  vfio/common: Support device dirty page tracking with vIOMMU
  vfio/common: Optimize device dirty page tracking with vIOMMU
  docs/devel: Document VFIO device dirty page tracking

Joao Martins (6):
  util: Add iova_tree_nnodes()
  util: Extend iova_tree_foreach() to take data argument
  vfio/common: Record DMA mapped IOVA ranges
  vfio/common: Add device dirty page tracking start/stop
  vfio/common: Add device dirty page bitmap sync
  vfio/migration: Query device dirty page tracking support

 docs/devel/vfio-migration.rst  |  85 ++-
 include/exec/memory.h          |   3 +-
 include/hw/vfio/vfio-common.h  |  10 +
 include/migration/register.h   |   7 +-
 include/qemu/iova-tree.h       |  19 +-
 migration/savevm.h             |   6 +-
 hw/i386/intel_iommu.c          |  18 +
 hw/s390x/s390-stattrib.c       |   4 +-
 hw/vfio/common.c               | 911 +++++++++++++++++++++++++++++----
 hw/vfio/migration.c            | 210 +++++++-
 migration/block-dirty-bitmap.c |   2 +-
 migration/block.c              |   4 +-
 migration/migration.c          |  12 +-
 migration/ram.c                |   6 +-
 migration/savevm.c             |  12 +-
 util/iova-tree.c               |  23 +-
 hw/vfio/trace-events           |   4 +-
 migration/trace-events         |   4 +-
 18 files changed, 1161 insertions(+), 179 deletions(-)

-- 
2.26.3




* [PATCH v2 01/20] migration: Pass threshold_size to .state_pending_{estimate, exact}()
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
@ 2023-02-22 17:48 ` Avihai Horon via
  2023-02-22 17:48 ` [PATCH v2 02/20] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon via @ 2023-02-22 17:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Pass threshold_size to .state_pending_{estimate,exact}().

This parameter will be used in the following patch by VFIO migration to
force the complete transmission of all VFIO pre-copy initial bytes prior
to moving to the stop-copy phase, which can reduce migration downtime.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/migration/register.h   |  7 ++++---
 migration/savevm.h             |  6 ++++--
 hw/s390x/s390-stattrib.c       |  4 ++--
 hw/vfio/migration.c            |  3 ++-
 migration/block-dirty-bitmap.c |  2 +-
 migration/block.c              |  4 ++--
 migration/migration.c          | 12 ++++++++----
 migration/ram.c                |  6 ++++--
 migration/savevm.c             | 12 ++++++++----
 migration/trace-events         |  4 ++--
 10 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..85d22931a7 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -61,11 +61,12 @@ typedef struct SaveVMHandlers {
      * pending data.
      */
     /* This estimates the remaining data to transfer */
-    void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
+    void (*state_pending_estimate)(void *opaque, uint64_t threshold_size,
+                                   uint64_t *must_precopy,
                                    uint64_t *can_postcopy);
     /* This calculate the exact remaining data to transfer */
-    void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
-                                uint64_t *can_postcopy);
+    void (*state_pending_exact)(void *opaque, uint64_t threshold_size,
+                                uint64_t *must_precopy, uint64_t *can_postcopy);
     LoadStateHandler *load_state;
     int (*load_setup)(QEMUFile *f, void *opaque);
     int (*load_cleanup)(void *opaque);
diff --git a/migration/savevm.h b/migration/savevm.h
index fb636735f0..c94d31f051 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -40,9 +40,11 @@ void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
                                        bool inactivate_disks);
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
+void qemu_savevm_state_pending_exact(uint64_t threshold_size,
+                                     uint64_t *must_precopy,
                                      uint64_t *can_postcopy);
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
+void qemu_savevm_state_pending_estimate(uint64_t threshold_size,
+                                        uint64_t *must_precopy,
                                         uint64_t *can_postcopy);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
 void qemu_savevm_send_open_return_path(QEMUFile *f);
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index aed919ad7d..f1d4064c09 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -182,8 +182,8 @@ static int cmma_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
-                               uint64_t *can_postcopy)
+static void cmma_state_pending(void *opaque, uint64_t threshold_size,
+                               uint64_t *must_precopy, uint64_t *can_postcopy)
 {
     S390StAttribState *sas = S390_STATTRIB(opaque);
     S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a2c3d9bade..4fb7d01532 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -314,7 +314,8 @@ static void vfio_save_cleanup(void *opaque)
  * repeatedly while pending RAM size is over the threshold, thus migration
  * can't converge and querying the VFIO device pending data size is useless.
  */
-static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
+static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
+                                     uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index fe73aa94b1..4fe0b83bc8 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -762,7 +762,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void dirty_bitmap_state_pending(void *opaque,
+static void dirty_bitmap_state_pending(void *opaque, uint64_t threshold_size,
                                        uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
 {
diff --git a/migration/block.c b/migration/block.c
index 426a25bb19..70438a299c 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -853,8 +853,8 @@ static int block_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void block_state_pending(void *opaque, uint64_t *must_precopy,
-                                uint64_t *can_postcopy)
+static void block_state_pending(void *opaque, uint64_t threshold_size,
+                                uint64_t *must_precopy, uint64_t *can_postcopy)
 {
     /* Estimate pending number of bytes to send */
     uint64_t pending;
diff --git a/migration/migration.c b/migration/migration.c
index ae2025d9d8..a0777d9848 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3866,15 +3866,19 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     uint64_t must_precopy, can_postcopy;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
-    qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
+    qemu_savevm_state_pending_estimate(s->threshold_size, &must_precopy,
+                                       &can_postcopy);
     uint64_t pending_size = must_precopy + can_postcopy;
 
-    trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
+    trace_migrate_pending_estimate(pending_size, s->threshold_size,
+                                   must_precopy, can_postcopy);
 
     if (must_precopy <= s->threshold_size) {
-        qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
+        qemu_savevm_state_pending_exact(s->threshold_size, &must_precopy,
+                                        &can_postcopy);
         pending_size = must_precopy + can_postcopy;
-        trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
+        trace_migrate_pending_exact(pending_size, s->threshold_size,
+                                    must_precopy, can_postcopy);
     }
 
     if (!pending_size || pending_size < s->threshold_size) {
diff --git a/migration/ram.c b/migration/ram.c
index 96e8a19a58..514a18b5d7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3489,7 +3489,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
+static void ram_state_pending_estimate(void *opaque, uint64_t threshold_size,
+                                       uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
 {
     RAMState **temp = opaque;
@@ -3505,7 +3506,8 @@ static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
     }
 }
 
-static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
+static void ram_state_pending_exact(void *opaque, uint64_t threshold_size,
+                                    uint64_t *must_precopy,
                                     uint64_t *can_postcopy)
 {
     RAMState **temp = opaque;
diff --git a/migration/savevm.c b/migration/savevm.c
index aa54a67fda..a642c0dd5a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1541,7 +1541,8 @@ flush:
  * the result is split into the amount for units that can and
  * for units that can't do postcopy.
  */
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
+void qemu_savevm_state_pending_estimate(uint64_t threshold_size,
+                                        uint64_t *must_precopy,
                                         uint64_t *can_postcopy)
 {
     SaveStateEntry *se;
@@ -1558,11 +1559,13 @@ void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
                 continue;
             }
         }
-        se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
+        se->ops->state_pending_estimate(se->opaque, threshold_size,
+                                        must_precopy, can_postcopy);
     }
 }
 
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
+void qemu_savevm_state_pending_exact(uint64_t threshold_size,
+                                     uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     SaveStateEntry *se;
@@ -1579,7 +1582,8 @@ void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
                 continue;
             }
         }
-        se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
+        se->ops->state_pending_exact(se->opaque, threshold_size, must_precopy,
+                                     can_postcopy);
     }
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 92161eeac5..b23c044f5e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -150,8 +150,8 @@ migrate_fd_cleanup(void) ""
 migrate_fd_error(const char *error_desc) "error=%s"
 migrate_fd_cancel(void) ""
 migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at 0x%zx len 0x%zx"
-migrate_pending_exact(uint64_t size, uint64_t pre, uint64_t post) "exact pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
-migrate_pending_estimate(uint64_t size, uint64_t pre, uint64_t post) "estimate pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
+migrate_pending_exact(uint64_t size, uint64_t threshold_size, uint64_t pre, uint64_t post) "exact pending size %" PRIu64 " threshold size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
+migrate_pending_estimate(uint64_t size, uint64_t threshold_size, uint64_t pre, uint64_t post) "estimate pending size %" PRIu64 " threshold size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
 migrate_send_rp_message(int msg_type, uint16_t len) "%d: len %d"
 migrate_send_rp_recv_bitmap(char *name, int64_t size) "block '%s' size 0x%"PRIi64
 migration_completion_file_err(void) ""
-- 
2.26.3




* [PATCH v2 02/20] vfio/migration: Refactor vfio_save_block() to return saved data size
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
  2023-02-22 17:48 ` [PATCH v2 01/20] migration: Pass threshold_size to .state_pending_{estimate, exact}() Avihai Horon via
@ 2023-02-22 17:48 ` Avihai Horon
  2023-02-27 14:10   ` Cédric Le Goater
  2023-02-22 17:48 ` [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Refactor vfio_save_block() to return the size of saved data on success
and -errno on error.

This will be used in the next patch to implement VFIO migration pre-copy
support.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4fb7d01532..94a4df73d0 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -240,8 +240,8 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
-/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
-static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+/* Returns the size of saved data on success and -errno on error */
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
     ssize_t data_size;
 
@@ -251,7 +251,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
         return -errno;
     }
     if (data_size == 0) {
-        return 1;
+        return 0;
     }
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
@@ -261,7 +261,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 
     trace_vfio_save_block(migration->vbasedev->name, data_size);
 
-    return qemu_file_get_error(f);
+    return qemu_file_get_error(f) ?: data_size;
 }
 
 /* ---------------------------------------------------------------------- */
@@ -335,6 +335,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
+    ssize_t data_size;
     int ret;
 
     /* We reach here with device state STOP only */
@@ -345,11 +346,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     }
 
     do {
-        ret = vfio_save_block(f, vbasedev->migration);
-        if (ret < 0) {
-            return ret;
+        data_size = vfio_save_block(f, vbasedev->migration);
+        if (data_size < 0) {
+            return data_size;
         }
-    } while (!ret);
+    } while (data_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
     ret = qemu_file_get_error(f);
-- 
2.26.3




* [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
  2023-02-22 17:48 ` [PATCH v2 01/20] migration: Pass threshold_size to .state_pending_{estimate, exact}() Avihai Horon via
  2023-02-22 17:48 ` [PATCH v2 02/20] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-02-22 17:48 ` Avihai Horon
  2023-02-22 20:58   ` Alex Williamson
  2023-02-22 17:48 ` [PATCH v2 04/20] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement it and use it for devices that support it. A full description
can be found here [1].

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  35 +++++--
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 175 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   4 +-
 5 files changed, 201 insertions(+), 22 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index c214c73e28..ba80b9150d 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices can choose to skip
+the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
+flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``load_setup`` function that sets the VFIO device on the destination in
   _RESUMING state.
 
+* A ``state_pending_estimate`` function that reports an estimate of the
+  remaining pre-copy data that the vendor driver has yet to save for the VFIO
+  device.
+
 * A ``state_pending_exact`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative pre-copy phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -95,8 +107,10 @@ Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -108,11 +122,12 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
-             If device is active, get pending_bytes by .state_pending_exact()
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 87524c64a4..ee55d442b4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,9 @@ typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
+    uint64_t mig_flags;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index bab83c0e55..6f5afe9f5a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -409,7 +409,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
                 return false;
             }
         }
@@ -438,7 +439,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 94a4df73d0..307983d57d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -67,6 +67,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "STOP_COPY";
     case VFIO_DEVICE_STATE_RESUMING:
         return "RESUMING";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
     default:
         return "UNKNOWN STATE";
     }
@@ -240,6 +242,23 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
+static int vfio_query_precopy_size(VFIOMigration *migration,
+                                   uint64_t *init_size, uint64_t *dirty_size)
+{
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
+
+    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+        return -errno;
+    }
+
+    *init_size = precopy.initial_bytes;
+    *dirty_size = precopy.dirty_bytes;
+
+    return 0;
+}
+
 /* Returns the size of saved data on success and -errno on error */
 static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
@@ -248,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 0;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -264,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     return qemu_file_get_error(f) ?: data_size;
 }
 
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+                                               uint64_t data_size)
+{
+    if (!data_size) {
+        /*
+         * Pre-copy emptied all the device state for now, update estimated sizes
+         * accordingly.
+         */
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        return;
+    }
+
+    if (migration->precopy_init_size) {
+        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+        migration->precopy_init_size -= init_size;
+        data_size -= init_size;
+    }
+
+    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+                                         data_size);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -284,6 +333,35 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        uint64_t init_size = 0, dirty_size = 0;
+        int ret;
+
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+
+            vfio_query_precopy_size(migration, &init_size, &dirty_size);
+            migration->precopy_init_size = init_size;
+            migration->precopy_dirty_size = dirty_size;
+
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+
+            migration->precopy_init_size = 0;
+            migration->precopy_dirty_size = 0;
+
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
+                                        uint64_t *must_precopy,
+                                        uint64_t *can_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+        return;
+    }
+
+    /*
+     * Initial size should be transferred during pre-copy phase so stop-copy
+     * phase will not be slowed down. Report threshold_size to force another
+     * pre-copy iteration.
+     */
+    *must_precopy += migration->precopy_init_size ?
+                         threshold_size :
+                         migration->precopy_dirty_size;
+
+    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+                                      *can_postcopy,
+                                      migration->precopy_init_size,
+                                      migration->precopy_dirty_size);
+}
+
 /*
  * Migration size of VFIO devices can be as little as a few KBs or as big as
  * many GBs. This value should be big enough to cover the worst case.
  */
 #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
 
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
 static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
                                      uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
 
     /*
@@ -328,8 +427,57 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
     *must_precopy += stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+        uint64_t init_size = 0, dirty_size = 0;
+
+        vfio_query_precopy_size(migration, &init_size, &dirty_size);
+        migration->precopy_init_size = init_size;
+        migration->precopy_dirty_size = dirty_size;
+
+        /*
+         * Initial size should be transferred during pre-copy phase so
+         * stop-copy phase will not be slowed down. Report threshold_size
+         * to force another pre-copy iteration.
+         */
+        *must_precopy += migration->precopy_init_size ?
+                             threshold_size :
+                             migration->precopy_dirty_size;
+    }
+
     trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   stop_copy_size);
+                                   stop_copy_size, migration->precopy_init_size,
+                                   migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    ssize_t data_size;
+
+    data_size = vfio_save_block(f, migration);
+    if (data_size < 0) {
+        return data_size;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    vfio_update_estimated_pending_data(migration, data_size);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +486,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     ssize_t data_size;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP);
     if (ret) {
@@ -457,7 +605,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .state_pending_estimate = vfio_state_pending_estimate,
     .state_pending_exact = vfio_state_pending_exact,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -470,13 +621,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     /*
@@ -590,6 +746,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 669d9fe07c..51613e02e6 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -161,6 +161,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3




* [PATCH v2 04/20] vfio/common: Fix error reporting in vfio_get_dirty_bitmap()
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (2 preceding siblings ...)
  2023-02-22 17:48 ` [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-02-22 17:48 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 05/20] vfio/common: Fix wrong %m usages Avihai Horon
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Return -errno instead of -1 if VFIO_IOMMU_DIRTY_PAGES ioctl fails in
vfio_get_dirty_bitmap().

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6f5afe9f5a..27db71427e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1337,6 +1337,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
     if (ret) {
+        ret = -errno;
         error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
                 " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
                 (uint64_t)range->size, errno);
-- 
2.26.3




* [PATCH v2 05/20] vfio/common: Fix wrong %m usages
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (3 preceding siblings ...)
  2023-02-22 17:48 ` [PATCH v2 04/20] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 06/20] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

There are several places where the %m conversion is used if one of
vfio_dma_map(), vfio_dma_unmap() or vfio_get_dirty_bitmap() fails.

The %m usage in these places is wrong, since %m relies on the errno value
while the above functions don't report errors via errno.

Fix it by using strerror() with the returned value instead.

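As a standalone illustration of the problem (not code from this series;
the helper below merely stands in for the QEMU functions, and %m is the
glibc printf extension that expands to strerror(errno)):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for vfio_dma_map() & co.: failure is reported only via the
 * return value, errno is left untouched. */
static int do_map(void)
{
    return -EINVAL;
}

int main(void)
{
    int ret;

    errno = 0;          /* whatever errno happens to hold at this point */
    ret = do_map();

    printf("with %%m      : %m\n");                 /* "Success" - misleading */
    printf("with strerror : %s\n", strerror(-ret)); /* "Invalid argument" */

    return 0;
}
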
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/common.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 27db71427e..930eda40a1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -705,17 +705,17 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                            read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         "0x%"HWADDR_PRIx", %p) = %d (%s)",
                          container, iova,
-                         iotlb->addr_mask + 1, vaddr, ret);
+                         iotlb->addr_mask + 1, vaddr, ret, strerror(-ret));
         }
     } else {
         ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         "0x%"HWADDR_PRIx") = %d (%s)",
                          container, iova,
-                         iotlb->addr_mask + 1, ret);
+                         iotlb->addr_mask + 1, ret, strerror(-ret));
         }
     }
 out:
@@ -1097,8 +1097,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
                        vaddr, section->readonly);
     if (ret) {
         error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
+                   "0x%"HWADDR_PRIx", %p) = %d (%s)",
+                   container, iova, int128_get64(llsize), vaddr, ret,
+                   strerror(-ret));
         if (memory_region_is_ram_device(section->mr)) {
             /* Allow unexpected mappings not to be fatal for RAM devices */
             error_report_err(err);
@@ -1230,16 +1231,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
             ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
             if (ret) {
                 error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                             "0x%"HWADDR_PRIx") = %d (%m)",
-                             container, iova, int128_get64(llsize), ret);
+                             "0x%"HWADDR_PRIx") = %d (%s)",
+                             container, iova, int128_get64(llsize), ret,
+                             strerror(-ret));
             }
             iova += int128_get64(llsize);
         }
         ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova, int128_get64(llsize), ret);
+                         "0x%"HWADDR_PRIx") = %d (%s)",
+                         container, iova, int128_get64(llsize), ret,
+                         strerror(-ret));
         }
     }
 
@@ -1386,9 +1389,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                                     translated_addr);
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova,
-                         iotlb->addr_mask + 1, ret);
+                         "0x%"HWADDR_PRIx") = %d (%s)",
+                         container, iova, iotlb->addr_mask + 1, ret,
+                         strerror(-ret));
         }
     }
     rcu_read_unlock();
-- 
2.26.3




* [PATCH v2 06/20] vfio/common: Abort migration if dirty log start/stop/sync fails
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (4 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 05/20] vfio/common: Fix wrong %m usages Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

If VFIO dirty pages log start/stop/sync fails during migration, the
migration should be aborted, as pages dirtied by VFIO devices might not
be reported properly.

This is not the case today: in such a scenario only an error is printed.

Fix it by aborting the migration in the above scenario.

Fixes: 758b96b61d5c ("vfio/migrate: Move switch of dirty tracking into vfio_memory_listener")
Fixes: b6dd6504e303 ("vfio: Add vfio_listener_log_sync to mark dirty pages")
Fixes: 9e7b0442f23a ("vfio: Add ioctl to get dirty pages bitmap during dma unmap")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/common.c | 53 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 45 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 930eda40a1..ac93b85632 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -42,6 +42,7 @@
 #include "migration/migration.h"
 #include "migration/misc.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 
 VFIOGroupList vfio_group_list =
@@ -390,6 +391,19 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }
 
+static void vfio_set_migration_error(int err)
+{
+    MigrationState *ms = migrate_get_current();
+
+    if (migration_is_setup_or_active(ms->state)) {
+        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
+            if (ms->to_dst_file) {
+                qemu_file_set_error(ms->to_dst_file, err);
+            }
+        }
+    }
+}
+
 static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 {
     VFIOGroup *group;
@@ -682,6 +696,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
                      iotlb->target_as->name ? iotlb->target_as->name : "none");
+        vfio_set_migration_error(-EINVAL);
         return;
     }
 
@@ -716,6 +731,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          "0x%"HWADDR_PRIx") = %d (%s)",
                          container, iova,
                          iotlb->addr_mask + 1, ret, strerror(-ret));
+            vfio_set_migration_error(ret);
         }
     }
 out:
@@ -1261,7 +1277,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
-static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
 {
     int ret;
     struct vfio_iommu_type1_dirty_bitmap dirty = {
@@ -1269,7 +1285,7 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     };
 
     if (!container->dirty_pages_supported) {
-        return;
+        return 0;
     }
 
     if (start) {
@@ -1280,23 +1296,34 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
     if (ret) {
+        ret = -errno;
         error_report("Failed to set dirty tracking flag 0x%x errno: %d",
                      dirty.flags, errno);
     }
+
+    return ret;
 }
 
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    int ret;
 
-    vfio_set_dirty_page_tracking(container, true);
+    ret = vfio_set_dirty_page_tracking(container, true);
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    int ret;
 
-    vfio_set_dirty_page_tracking(container, false);
+    ret = vfio_set_dirty_page_tracking(container, false);
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
 }
 
 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
@@ -1372,19 +1399,18 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     VFIOContainer *container = giommu->container;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
     ram_addr_t translated_addr;
+    int ret = -EINVAL;
 
     trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
                      iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
+        goto out;
     }
 
     rcu_read_lock();
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        int ret;
-
         ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
                                     translated_addr);
         if (ret) {
@@ -1395,6 +1421,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         }
     }
     rcu_read_unlock();
+
+out:
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
 }
 
 static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
@@ -1487,13 +1518,19 @@ static void vfio_listener_log_sync(MemoryListener *listener,
         MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    int ret;
 
     if (vfio_listener_skipped_section(section)) {
         return;
     }
 
     if (vfio_devices_all_dirty_tracking(container)) {
-        vfio_sync_dirty_bitmap(container, section);
+        ret = vfio_sync_dirty_bitmap(container, section);
+        if (ret) {
+            error_report("vfio: Failed to sync dirty bitmap, err: %d (%s)", ret,
+                         strerror(-ret));
+            vfio_set_migration_error(ret);
+        }
     }
 }
 
-- 
2.26.3




* [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (5 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 06/20] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 21:40   ` Alex Williamson
  2023-02-27 14:09   ` Cédric Le Goater
  2023-02-22 17:49 ` [PATCH v2 08/20] util: Add iova_tree_nnodes() Avihai Horon
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

There are already two places where dirty page bitmap allocation and
calculations are open-coded. With device dirty page tracking being added
in the next patches, there are going to be even more such places.

To avoid code duplication, introduce a VFIOBitmap struct and corresponding
alloc and dealloc functions, and use them where applicable.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 29 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ac93b85632..84f08bdbbb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+typedef struct {
+    unsigned long *bitmap;
+    hwaddr size;
+    hwaddr pages;
+} VFIOBitmap;
+
+static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
+{
+    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
+    if (!vbmap) {
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    vbmap->bitmap = g_try_malloc0(vbmap->size);
+    if (!vbmap->bitmap) {
+        g_free(vbmap);
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    return vbmap;
+}
+
+static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
+{
+    g_free(vbmap->bitmap);
+    g_free(vbmap);
+}
+
 bool vfio_mig_active(void)
 {
     VFIOGroup *group;
@@ -470,9 +505,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
 {
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
-    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    VFIOBitmap *vbmap;
     int ret;
 
+    vbmap = vfio_bitmap_alloc(size);
+    if (!vbmap) {
+        return -errno;
+    }
+
     unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
 
     unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
@@ -486,35 +526,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
      * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
      * to qemu_real_host_page_size.
      */
-
     bitmap->pgsize = qemu_real_host_page_size();
-    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                   BITS_PER_BYTE;
+    bitmap->size = vbmap->size;
+    bitmap->data = (__u64 *)vbmap->bitmap;
 
-    if (bitmap->size > container->max_dirty_bitmap_size) {
-        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
-                     (uint64_t)bitmap->size);
+    if (vbmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);
         ret = -E2BIG;
         goto unmap_exit;
     }
 
-    bitmap->data = g_try_malloc0(bitmap->size);
-    if (!bitmap->data) {
-        ret = -ENOMEM;
-        goto unmap_exit;
-    }
-
     ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
     if (!ret) {
-        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-                iotlb->translated_addr, pages);
+        cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap,
+                iotlb->translated_addr, vbmap->pages);
     } else {
         error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
     }
 
-    g_free(bitmap->data);
 unmap_exit:
     g_free(unmap);
+    vfio_bitmap_dealloc(vbmap);
+
     return ret;
 }
 
@@ -1331,7 +1364,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 {
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
-    uint64_t pages;
+    VFIOBitmap *vbmap;
     int ret;
 
     if (!container->dirty_pages_supported) {
@@ -1341,6 +1374,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         return 0;
     }
 
+    vbmap = vfio_bitmap_alloc(size);
+    if (!vbmap) {
+        return -errno;
+    }
+
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1355,15 +1393,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
      * to qemu_real_host_page_size.
      */
     range->bitmap.pgsize = qemu_real_host_page_size();
-
-    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size();
-    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                                         BITS_PER_BYTE;
-    range->bitmap.data = g_try_malloc0(range->bitmap.size);
-    if (!range->bitmap.data) {
-        ret = -ENOMEM;
-        goto err_out;
-    }
+    range->bitmap.size = vbmap->size;
+    range->bitmap.data = (__u64 *)vbmap->bitmap;
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
     if (ret) {
@@ -1374,14 +1405,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         goto err_out;
     }
 
-    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
-                                            ram_addr, pages);
+    cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
+                                           vbmap->pages);
 
     trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
                                 range->bitmap.size, ram_addr);
 err_out:
-    g_free(range->bitmap.data);
     g_free(dbitmap);
+    vfio_bitmap_dealloc(vbmap);
 
     return ret;
 }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 08/20] util: Add iova_tree_nnodes()
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (6 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 09/20] util: Extend iova_tree_foreach() to take data argument Avihai Horon
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Add iova_tree_nnodes() which returns the number of nodes in the IOVA
tree.
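
Not part of the patch; a trivial usage sketch, assuming a tree that was
previously populated with iova_tree_insert():

    IOVATree *tree = iova_tree_new();
    /* ... iova_tree_insert() calls ... */
    gint n = iova_tree_nnodes(tree);
    /* e.g. check whether all recorded ranges fit in one page worth of
     * vfio_device_feature_dma_logging_range entries */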

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/iova-tree.h | 11 +++++++++++
 util/iova-tree.c         |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 8528e5c98f..7bb80783ce 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -164,4 +164,15 @@ int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
  */
 void iova_tree_destroy(IOVATree *tree);
 
+/**
+ * iova_tree_nnodes:
+ *
+ * @tree: the iova tree to consult
+ *
+ * Returns the number of nodes in the iova tree
+ *
+ * Return: >=0 for the number of nodes.
+ */
+gint iova_tree_nnodes(IOVATree *tree);
+
 #endif
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..6141a6229b 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -280,3 +280,8 @@ void iova_tree_destroy(IOVATree *tree)
     g_tree_destroy(tree->tree);
     g_free(tree);
 }
+
+gint iova_tree_nnodes(IOVATree *tree)
+{
+    return g_tree_nnodes(tree->tree);
+}
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 09/20] util: Extend iova_tree_foreach() to take data argument
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (7 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 08/20] util: Add iova_tree_nnodes() Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Extend iova_tree_foreach() to take a data argument that is passed to and
used by the iterator.

While at it, fix a documentation error:
The documentation says iova_tree_foreach() returns a value even though
it is a void function.
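
Not part of the patch; a sketch of an iterator using the new data argument.
The callback name is hypothetical:

    static gboolean vfio_count_ro_mappings(DMAMap *map, gpointer data)
    {
        unsigned int *count = data;

        if (map->perm == IOMMU_RO) {
            (*count)++;
        }

        return false;   /* false == keep iterating */
    }

    unsigned int count = 0;

    iova_tree_foreach(tree, vfio_count_ro_mappings, &count);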

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/iova-tree.h |  8 +++++---
 util/iova-tree.c         | 18 ++++++++++++++----
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 7bb80783ce..1332dce014 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -38,7 +38,7 @@ typedef struct DMAMap {
     hwaddr size;                /* Inclusive */
     IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
-typedef gboolean (*iova_tree_iterator)(DMAMap *map);
+typedef gboolean (*iova_tree_iterator)(DMAMap *map, gpointer data);
 
 /**
  * iova_tree_new:
@@ -129,12 +129,14 @@ const DMAMap *iova_tree_find_address(const IOVATree *tree, hwaddr iova);
  *
  * @tree: the iova tree to iterate on
  * @iterator: the interator for the mappings, return true to stop
+ * @data: data to be passed to the iterator
  *
  * Iterate over the iova tree.
  *
- * Return: 1 if found any overlap, 0 if not, <0 if error.
+ * Return: None.
  */
-void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
+void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator,
+                       gpointer data);
 
 /**
  * iova_tree_alloc_map:
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 6141a6229b..9845427b86 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -42,6 +42,11 @@ typedef struct IOVATreeFindIOVAArgs {
     const DMAMap *result;
 } IOVATreeFindIOVAArgs;
 
+typedef struct IOVATreeIterator {
+    iova_tree_iterator fn;
+    gpointer data;
+} IOVATreeIterator;
+
 /**
  * Iterate args to the next hole
  *
@@ -151,17 +156,22 @@ int iova_tree_insert(IOVATree *tree, const DMAMap *map)
 static gboolean iova_tree_traverse(gpointer key, gpointer value,
                                 gpointer data)
 {
-    iova_tree_iterator iterator = data;
+    IOVATreeIterator *iterator = data;
     DMAMap *map = key;
 
     g_assert(key == value);
 
-    return iterator(map);
+    return iterator->fn(map, iterator->data);
 }
 
-void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator)
+void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator,
+                       gpointer data)
 {
-    g_tree_foreach(tree->tree, iova_tree_traverse, iterator);
+    IOVATreeIterator arg = {
+        .fn = iterator,
+        .data = data,
+    };
+    g_tree_foreach(tree->tree, iova_tree_traverse, &arg);
 }
 
 void iova_tree_remove(IOVATree *tree, DMAMap map)
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (8 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 09/20] util: Extend iova_tree_foreach() to take data argument Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 22:10   ` Alex Williamson
  2023-02-22 17:49 ` [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop Avihai Horon
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

According to the device DMA logging uAPI, IOVA ranges to be logged by
the device must be provided all at once upon DMA logging start.

As preparation for the following patches, which will add device dirty
page tracking, keep a record of all DMA mapped IOVA ranges so they can
later be used when DMA logging is started.

Note that when vIOMMU is enabled, DMA mapped IOVA ranges are not tracked.
This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
Following patches will address the vIOMMU case specifically.
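
Not part of the patch; the record/erase pairing added around DMA map/unmap,
in condensed form:

    /* on vfio_dma_map(): */
    DMAMap map = {
        .iova = iova,
        .size = size - 1,               /* IOVATree ranges are inclusive */
        .perm = readonly ? IOMMU_RO : IOMMU_RW,
    };
    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
        ret = iova_tree_insert(container->mappings, &map);
    }

    /* on vfio_dma_unmap(): */
    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
        iova_tree_remove(container->mappings, map);
    }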

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 86 +++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ee55d442b4..6f36876ce0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -23,6 +23,7 @@
 
 #include "exec/memory.h"
 #include "qemu/queue.h"
+#include "qemu/iova-tree.h"
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
@@ -92,6 +93,8 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    IOVATree *mappings;
+    QemuMutex mappings_mutex;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 84f08bdbbb..6041da6c7e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -44,6 +44,7 @@
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
+#include "qemu/iova-tree.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }
 
+static bool vfio_have_giommu(VFIOContainer *container)
+{
+    return !QLIST_EMPTY(&container->giommu_list);
+}
+
 static void vfio_set_migration_error(int err)
 {
     MigrationState *ms = migrate_get_current();
@@ -499,6 +505,51 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
     return true;
 }
 
+static int vfio_record_mapping(VFIOContainer *container, hwaddr iova,
+                               hwaddr size, bool readonly)
+{
+    DMAMap map = {
+        .iova = iova,
+        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
+        .perm = readonly ? IOMMU_RO : IOMMU_RW,
+    };
+    int ret;
+
+    if (vfio_have_giommu(container)) {
+        return 0;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        ret = iova_tree_insert(container->mappings, &map);
+        if (ret) {
+            if (ret == IOVA_ERR_INVALID) {
+                ret = -EINVAL;
+            } else if (ret == IOVA_ERR_OVERLAP) {
+                ret = -EEXIST;
+            }
+        }
+    }
+
+    return ret;
+}
+
+static void vfio_erase_mapping(VFIOContainer *container, hwaddr iova,
+                                hwaddr size)
+{
+    DMAMap map = {
+        .iova = iova,
+        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
+    };
+
+    if (vfio_have_giommu(container)) {
+        return;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        iova_tree_remove(container->mappings, map);
+    }
+}
+
 static int vfio_dma_unmap_bitmap(VFIOContainer *container,
                                  hwaddr iova, ram_addr_t size,
                                  IOMMUTLBEntry *iotlb)
@@ -599,6 +650,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
                                             DIRTY_CLIENTS_NOCODE);
     }
 
+    vfio_erase_mapping(container, iova, size);
+
     return 0;
 }
 
@@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .iova = iova,
         .size = size,
     };
+    int ret;
+
+    ret = vfio_record_mapping(container, iova, size, readonly);
+    if (ret) {
+        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
+                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
+                     iova, size, ret, strerror(-ret));
+
+        return ret;
+    }
 
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -628,8 +691,12 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }
 
+    ret = -errno;
     error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
+
+    vfio_erase_mapping(container, iova, size);
+
+    return ret;
 }
 
 static void vfio_host_win_add(VFIOContainer *container,
@@ -2183,16 +2250,23 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
     QLIST_INIT(&container->vrdl_list);
+    container->mappings = iova_tree_new();
+    if (!container->mappings) {
+        error_setg(errp, "Cannot allocate DMA mappings tree");
+        ret = -ENOMEM;
+        goto free_container_exit;
+    }
+    qemu_mutex_init(&container->mappings_mutex);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }
 
     ret = vfio_ram_block_discard_disable(container, true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }
 
     switch (container->iommu_type) {
@@ -2328,6 +2402,10 @@ listener_release_exit:
 enable_discards_exit:
     vfio_ram_block_discard_disable(container, false);
 
+destroy_mappings_exit:
+    qemu_mutex_destroy(&container->mappings_mutex);
+    iova_tree_destroy(container->mappings);
+
 free_container_exit:
     g_free(container);
 
@@ -2382,6 +2460,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }
 
         trace_vfio_disconnect_container(container->fd);
+        qemu_mutex_destroy(&container->mappings_mutex);
+        iova_tree_destroy(container->mappings);
         close(container->fd);
         g_free(container);
 
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (9 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 22:40   ` Alex Williamson
  2023-02-22 17:49 ` [PATCH v2 12/20] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function Avihai Horon
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Add device dirty page tracking start/stop functionality. This uses the
device DMA logging uAPI to start and stop dirty page tracking by the device.

Device dirty page tracking is used only if all devices within a
container support device dirty page tracking.
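
Not part of the patch; the shape of a single DMA_LOGGING_STOP call against
one device, condensed from the code below:

    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
                              sizeof(uint64_t))] = {};
    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;

    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_SET |
                     VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;

    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
        return -errno;
    }
    vbasedev->dirty_tracking = false;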

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/common.c              | 211 +++++++++++++++++++++++++++++++++-
 2 files changed, 211 insertions(+), 2 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 6f36876ce0..1f21e1fa43 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -149,6 +149,8 @@ typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    bool dirty_pages_supported;
+    bool dirty_tracking;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6041da6c7e..740153e7d7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -473,6 +473,22 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
+static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (!vbasedev->dirty_pages_supported) {
+                return false;
+            }
+        }
+    }
+
+    return true;
+}
+
 /*
  * Check if all VFIO devices are running and migration is active, which is
  * essentially equivalent to the migration being in pre-copy phase.
@@ -1404,13 +1420,192 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     return ret;
 }
 
+static int vfio_devices_dma_logging_set(VFIOContainer *container,
+                                        struct vfio_device_feature *feature)
+{
+    bool status = (feature->flags & VFIO_DEVICE_FEATURE_MASK) ==
+                  VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
+    VFIODevice *vbasedev;
+    VFIOGroup *group;
+    int ret = 0;
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->dirty_tracking == status) {
+                continue;
+            }
+
+            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+            if (ret) {
+                ret = -errno;
+                error_report("%s: Failed to set DMA logging %s, err %d (%s)",
+                             vbasedev->name, status ? "start" : "stop", ret,
+                             strerror(errno));
+                goto out;
+            }
+            vbasedev->dirty_tracking = status;
+        }
+    }
+
+out:
+    return ret;
+}
+
+static int vfio_devices_dma_logging_stop(VFIOContainer *container)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+    feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_SET;
+    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
+
+    return vfio_devices_dma_logging_set(container, feature);
+}
+
+static gboolean vfio_device_dma_logging_range_add(DMAMap *map, gpointer data)
+{
+    struct vfio_device_feature_dma_logging_range **out = data;
+    struct vfio_device_feature_dma_logging_range *range = *out;
+
+    range->iova = map->iova;
+    /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
+    range->length = map->size + 1;
+
+    *out = ++range;
+
+    return false;
+}
+
+static gboolean vfio_iova_tree_get_first(DMAMap *map, gpointer data)
+{
+    DMAMap *first = data;
+
+    first->iova = map->iova;
+    first->size = map->size;
+
+    return true;
+}
+
+static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
+{
+    DMAMap *last = data;
+
+    last->iova = map->iova;
+    last->size = map->size;
+
+    return false;
+}
+
+static struct vfio_device_feature *
+vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
+{
+    struct vfio_device_feature *feature;
+    size_t feature_size;
+    struct vfio_device_feature_dma_logging_control *control;
+    struct vfio_device_feature_dma_logging_range *ranges;
+    unsigned int max_ranges;
+    unsigned int cur_ranges;
+
+    feature_size = sizeof(struct vfio_device_feature) +
+                   sizeof(struct vfio_device_feature_dma_logging_control);
+    feature = g_malloc0(feature_size);
+    feature->argsz = feature_size;
+    feature->flags = VFIO_DEVICE_FEATURE_SET;
+    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
+
+    control = (struct vfio_device_feature_dma_logging_control *)feature->data;
+    control->page_size = qemu_real_host_page_size();
+
+    QEMU_LOCK_GUARD(&container->mappings_mutex);
+
+    /*
+     * DMA logging uAPI guarantees to support at least num_ranges that fits into
+     * a single host kernel page. To be on the safe side, use this as a limit
+     * from which to merge to a single range.
+     */
+    max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
+    cur_ranges = iova_tree_nnodes(container->mappings);
+    control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;
+    ranges = g_try_new0(struct vfio_device_feature_dma_logging_range,
+                        control->num_ranges);
+    if (!ranges) {
+        g_free(feature);
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    control->ranges = (uint64_t)ranges;
+    if (cur_ranges <= max_ranges) {
+        iova_tree_foreach(container->mappings,
+                          vfio_device_dma_logging_range_add, &ranges);
+    } else {
+        DMAMap first, last;
+
+        iova_tree_foreach(container->mappings, vfio_iova_tree_get_first,
+                          &first);
+        iova_tree_foreach(container->mappings, vfio_iova_tree_get_last, &last);
+        ranges->iova = first.iova;
+        /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
+        ranges->length = (last.iova - first.iova) + last.size + 1;
+    }
+
+    return feature;
+}
+
+static void vfio_device_feature_dma_logging_start_destroy(
+    struct vfio_device_feature *feature)
+{
+    struct vfio_device_feature_dma_logging_control *control =
+        (struct vfio_device_feature_dma_logging_control *)feature->data;
+    struct vfio_device_feature_dma_logging_range *ranges =
+        (struct vfio_device_feature_dma_logging_range *)control->ranges;
+
+    g_free(ranges);
+    g_free(feature);
+}
+
+static int vfio_devices_dma_logging_start(VFIOContainer *container)
+{
+    struct vfio_device_feature *feature;
+    int ret;
+
+    feature = vfio_device_feature_dma_logging_start_create(container);
+    if (!feature) {
+        return -errno;
+    }
+
+    ret = vfio_devices_dma_logging_set(container, feature);
+    if (ret) {
+        vfio_devices_dma_logging_stop(container);
+    }
+
+    vfio_device_feature_dma_logging_start_destroy(feature);
+
+    return ret;
+}
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;
 
-    ret = vfio_set_dirty_page_tracking(container, true);
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        if (vfio_have_giommu(container)) {
+            /* Device dirty page tracking currently doesn't support vIOMMU */
+            return;
+        }
+
+        ret = vfio_devices_dma_logging_start(container);
+    } else {
+        ret = vfio_set_dirty_page_tracking(container, true);
+    }
+
     if (ret) {
+        error_report("vfio: Could not start dirty page tracking, err: %d (%s)",
+                     ret, strerror(-ret));
         vfio_set_migration_error(ret);
     }
 }
@@ -1420,8 +1615,20 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;
 
-    ret = vfio_set_dirty_page_tracking(container, false);
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        if (vfio_have_giommu(container)) {
+            /* Device dirty page tracking currently doesn't support vIOMMU */
+            return;
+        }
+
+        ret = vfio_devices_dma_logging_stop(container);
+    } else {
+        ret = vfio_set_dirty_page_tracking(container, false);
+    }
+
     if (ret) {
+        error_report("vfio: Could not stop dirty page tracking, err: %d (%s)",
+                     ret, strerror(-ret));
         vfio_set_migration_error(ret);
     }
 }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 12/20] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (10 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 13/20] vfio/common: Add device dirty page bitmap sync Avihai Horon
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Extract the VFIO_IOMMU_DIRTY_PAGES ioctl code in vfio_get_dirty_bitmap()
to its own function.

This will help keep the code readable after the next patch adds device
dirty page bitmap sync functionality.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 53 ++++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 740153e7d7..3ab5d8d442 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1633,26 +1633,13 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     }
 }
 
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr)
+static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
+                                   hwaddr iova, hwaddr size)
 {
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
-    VFIOBitmap *vbmap;
     int ret;
 
-    if (!container->dirty_pages_supported) {
-        cpu_physical_memory_set_dirty_range(ram_addr, size,
-                                            tcg_enabled() ? DIRTY_CLIENTS_ALL :
-                                            DIRTY_CLIENTS_NOCODE);
-        return 0;
-    }
-
-    vbmap = vfio_bitmap_alloc(size);
-    if (!vbmap) {
-        return -errno;
-    }
-
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1676,16 +1663,42 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
                 " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
                 (uint64_t)range->size, errno);
-        goto err_out;
+    }
+
+    g_free(dbitmap);
+
+    return ret;
+}
+
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr)
+{
+    VFIOBitmap *vbmap;
+    int ret;
+
+    if (!container->dirty_pages_supported) {
+        cpu_physical_memory_set_dirty_range(ram_addr, size,
+                                            tcg_enabled() ? DIRTY_CLIENTS_ALL :
+                                            DIRTY_CLIENTS_NOCODE);
+        return 0;
+    }
+
+    vbmap = vfio_bitmap_alloc(size);
+    if (!vbmap) {
+        return -errno;
+    }
+
+    ret = vfio_query_dirty_bitmap(container, vbmap, iova, size);
+    if (ret) {
+        goto out;
     }
 
     cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
                                            vbmap->pages);
 
-    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
-                                range->bitmap.size, ram_addr);
-err_out:
-    g_free(dbitmap);
+    trace_vfio_get_dirty_bitmap(container->fd, iova, size, vbmap->size,
+                                ram_addr);
+out:
     vfio_bitmap_dealloc(vbmap);
 
     return ret;
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 13/20] vfio/common: Add device dirty page bitmap sync
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (11 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 12/20] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 14/20] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Avihai Horon
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Add device dirty page bitmap sync functionality. This uses the device
DMA logging uAPI to sync the dirty page bitmap from the device.

Device dirty page bitmap sync is used only if all devices within a
container support device dirty page tracking.
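
Not part of the patch; the per-device report query in condensed form. The
container-level sync below loops this over all devices and feeds the result
to cpu_physical_memory_set_dirty_lebitmap():

    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
                        sizeof(struct vfio_device_feature_dma_logging_report),
                        sizeof(uint64_t))] = {};
    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
    struct vfio_device_feature_dma_logging_report *report =
        (struct vfio_device_feature_dma_logging_report *)feature->data;

    report->iova = iova;
    report->length = size;
    report->page_size = qemu_real_host_page_size();
    report->bitmap = (uint64_t)vbmap->bitmap;

    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_GET |
                     VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;

    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
        return -errno;
    }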

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 95 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 86 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3ab5d8d442..797eb2c26e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -356,6 +356,9 @@ static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
     g_free(vbmap);
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr);
+
 bool vfio_mig_active(void)
 {
     VFIOGroup *group;
@@ -631,10 +634,16 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .iova = iova,
         .size = size,
     };
+    bool need_dirty_sync = false;
+    int ret;
 
-    if (iotlb && container->dirty_pages_supported &&
-        vfio_devices_all_running_and_mig_active(container)) {
-        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
+        if (!vfio_devices_all_device_dirty_tracking(container) &&
+            container->dirty_pages_supported) {
+            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+        }
+
+        need_dirty_sync = true;
     }
 
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
@@ -660,10 +669,12 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return -errno;
     }
 
-    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
-        cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
-                                            tcg_enabled() ? DIRTY_CLIENTS_ALL :
-                                            DIRTY_CLIENTS_NOCODE);
+    if (need_dirty_sync) {
+        ret = vfio_get_dirty_bitmap(container, iova, size,
+                                    iotlb->translated_addr);
+        if (ret) {
+            return ret;
+        }
     }
 
     vfio_erase_mapping(container, iova, size);
@@ -1633,6 +1644,65 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     }
 }
 
+static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
+                                          hwaddr size, void *bitmap)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                        sizeof(struct vfio_device_feature_dma_logging_report),
+                        sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+    struct vfio_device_feature_dma_logging_report *report =
+        (struct vfio_device_feature_dma_logging_report *)feature->data;
+
+    report->iova = iova;
+    report->length = size;
+    report->page_size = qemu_real_host_page_size();
+    report->bitmap = (uint64_t)bitmap;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        return -errno;
+    }
+
+    return 0;
+}
+
+static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
+                                           VFIOBitmap *vbmap, hwaddr iova,
+                                           hwaddr size)
+{
+    VFIODevice *vbasedev;
+    VFIOGroup *group;
+    int ret;
+
+    if (vfio_have_giommu(container)) {
+        /* Device dirty page tracking currently doesn't support vIOMMU */
+        bitmap_set(vbmap->bitmap, 0, vbmap->pages);
+
+        return 0;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
+                                                 vbmap->bitmap);
+            if (ret) {
+                error_report("%s: Failed to get DMA logging report, iova: "
+                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
+                             ", err: %d (%s)",
+                             vbasedev->name, iova, size, ret, strerror(-ret));
+
+                return ret;
+            }
+        }
+    }
+
+    return 0;
+}
+
 static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
                                    hwaddr iova, hwaddr size)
 {
@@ -1673,10 +1743,12 @@ static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
                                  uint64_t size, ram_addr_t ram_addr)
 {
+    bool all_device_dirty_tracking =
+        vfio_devices_all_device_dirty_tracking(container);
     VFIOBitmap *vbmap;
     int ret;
 
-    if (!container->dirty_pages_supported) {
+    if (!container->dirty_pages_supported && !all_device_dirty_tracking) {
         cpu_physical_memory_set_dirty_range(ram_addr, size,
                                             tcg_enabled() ? DIRTY_CLIENTS_ALL :
                                             DIRTY_CLIENTS_NOCODE);
@@ -1688,7 +1760,12 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         return -errno;
     }
 
-    ret = vfio_query_dirty_bitmap(container, vbmap, iova, size);
+    if (all_device_dirty_tracking) {
+        ret = vfio_devices_query_dirty_bitmap(container, vbmap, iova, size);
+    } else {
+        ret = vfio_query_dirty_bitmap(container, vbmap, iova, size);
+    }
+
     if (ret) {
         goto out;
     }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 14/20] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (12 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 13/20] vfio/common: Add device dirty page bitmap sync Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 15/20] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Extract vIOMMU code from vfio_sync_dirty_bitmap() to a new function and
restructure the code.

This is done as preparation for the following patches, which will add
vIOMMU support to device dirty page tracking. No functional changes are
intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 63 +++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 797eb2c26e..4a7fff6eeb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1866,37 +1866,50 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
                                                 &vrdl);
 }
 
+static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
+                                        MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu;
+    bool found = false;
+    Int128 llend;
+    vfio_giommu_dirty_notifier gdn;
+    int idx;
+
+    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+        if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
+            giommu->n.start == section->offset_within_region) {
+            found = true;
+            break;
+        }
+    }
+
+    if (!found) {
+        return 0;
+    }
+
+    gdn.giommu = giommu;
+    idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
+                                             MEMTXATTRS_UNSPECIFIED);
+
+    llend = int128_add(int128_make64(section->offset_within_region),
+                       section->size);
+    llend = int128_sub(llend, int128_one());
+
+    iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
+                        section->offset_within_region, int128_get64(llend),
+                        idx);
+    memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+
+    return 0;
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                                   MemoryRegionSection *section)
 {
     ram_addr_t ram_addr;
 
     if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
-                giommu->n.start == section->offset_within_region) {
-                Int128 llend;
-                vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
-                                                       MEMTXATTRS_UNSPECIFIED);
-
-                llend = int128_add(int128_make64(section->offset_within_region),
-                                   section->size);
-                llend = int128_sub(llend, int128_one());
-
-                iommu_notifier_init(&gdn.n,
-                                    vfio_iommu_map_dirty_notify,
-                                    IOMMU_NOTIFIER_MAP,
-                                    section->offset_within_region,
-                                    int128_get64(llend),
-                                    idx);
-                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
-                break;
-            }
-        }
-        return 0;
+        return vfio_sync_iommu_dirty_bitmap(container, section);
     } else if (memory_region_has_ram_discard_manager(section->mr)) {
         return vfio_sync_ram_discard_listener_dirty_bitmap(container, section);
     }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 15/20] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (13 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 14/20] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 16/20] intel-iommu: Implement get_attr() method Avihai Horon
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Add a new IOMMU attribute IOMMU_ATTR_MAX_IOVA which indicates the
maximal IOVA that an IOMMU can use.

This attribute will be used by VFIO device dirty page tracking so it can
track the entire IOVA space when needed (i.e. when vIOMMU is enabled).
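
Not part of the patch; the intended query pattern on the VFIO side:

    hwaddr max_iova;

    if (!memory_region_iommu_get_attr(giommu->iommu_mr, IOMMU_ATTR_MAX_IOVA,
                                      &max_iova)) {
        /* max_iova is the highest usable vIOMMU IOVA; track [0, max_iova] */
    }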

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2e602a2fad..cdd47fb79b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -316,7 +316,8 @@ typedef struct MemoryRegionClass {
 
 
 enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_MAX_IOVA,
 };
 
 /*
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 16/20] intel-iommu: Implement get_attr() method
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (14 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 15/20] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Implement get_attr() method and use the address width property to report
the IOMMU_ATTR_MAX_IOVA attribute.
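
As a worked example (assuming the default "aw-bits" value of 39 for the
emulated Intel IOMMU), the reported attribute would be:

    *max_iova = (1ULL << 39) - 1;   /* 0x7fffffffff, a 512 GiB IOVA space */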

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 98a5c304a7..b0068b0df4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3841,6 +3841,23 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
+static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
+                              enum IOMMUMemoryRegionAttr attr, void *data)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+
+    if (attr == IOMMU_ATTR_MAX_IOVA) {
+        hwaddr *max_iova = data;
+
+        *max_iova = (1ULL << s->aw_bits) - 1;
+
+        return 0;
+    }
+
+    return -EINVAL;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -4173,6 +4190,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = vtd_iommu_translate;
     imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
     imrc->replay = vtd_iommu_replay;
+    imrc->get_attr = vtd_iommu_get_attr;
 }
 
 static const TypeInfo vtd_iommu_memory_region_info = {
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (15 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 16/20] intel-iommu: Implement get_attr() method Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 23:34   ` Alex Williamson
  2023-02-22 17:49 ` [PATCH v2 18/20] vfio/common: Optimize " Avihai Horon
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Currently, device dirty page tracking with vIOMMU is not supported - RAM
pages are perpetually marked dirty in this case.

When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
the vIOMMU maps/unmaps them. These IOVA ranges can potentially be mapped
anywhere in the vIOMMU IOVA space.

Due to this dynamic nature of vIOMMU mapping/unmapping, tracking only
the currently mapped IOVA ranges, as done in the non-vIOMMU case,
doesn't work very well.

Instead, to support device dirty tracking when vIOMMU is enabled, track
the entire vIOMMU IOVA space. If that fails (the IOVA space can be rather
big and we might hit HW limitations), try tracking a smaller range while
marking untracked ranges dirty.
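
Not part of the patch; the retry ladder in condensed form, using the
iommu_max_iova, retry_iova and ram_size values computed in the code below.
Each entry is attempted in order until DMA logging start succeeds:

    hwaddr ranges[3] = {
        REAL_HOST_PAGE_ALIGN(iommu_max_iova),                  /* full space */
        REAL_HOST_PAGE_ALIGN(retry_iova),                      /* smaller    */
        REAL_HOST_PAGE_ALIGN(MIN(ram_size, retry_iova / 2)),   /* RAM-sized  */
    };

    for (i = 0; i < ARRAY_SIZE(ranges); i++) {
        container->giommu_tracked_range = ranges[i];
        ret = vfio_devices_dma_logging_start(container, true);
        if (!ret) {
            break;
        }
    }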

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/common.c              | 196 +++++++++++++++++++++++++++++++---
 2 files changed, 181 insertions(+), 17 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1f21e1fa43..1dc00cabcd 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -95,6 +95,8 @@ typedef struct VFIOContainer {
     unsigned int dma_max_mappings;
     IOVATree *mappings;
     QemuMutex mappings_mutex;
+    /* Represents the range [0, giommu_tracked_range) not inclusive */
+    hwaddr giommu_tracked_range;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4a7fff6eeb..1024788bcc 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -45,6 +45,8 @@
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 #include "qemu/iova-tree.h"
+#include "hw/boards.h"
+#include "hw/mem/memory-device.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -430,6 +432,38 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }
 
+static uint64_t vfio_get_ram_size(void)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    uint64_t plugged_size;
+
+    plugged_size = get_plugged_memory_size();
+    if (plugged_size == (uint64_t)-1) {
+        plugged_size = 0;
+    }
+
+    return ms->ram_size + plugged_size;
+}
+
+static int vfio_iommu_get_max_iova(VFIOContainer *container, hwaddr *max_iova)
+{
+    VFIOGuestIOMMU *giommu;
+    int ret;
+
+    giommu = QLIST_FIRST(&container->giommu_list);
+    if (!giommu) {
+        return -ENOENT;
+    }
+
+    ret = memory_region_iommu_get_attr(giommu->iommu_mr, IOMMU_ATTR_MAX_IOVA,
+                                       max_iova);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
 static bool vfio_have_giommu(VFIOContainer *container)
 {
     return !QLIST_EMPTY(&container->giommu_list);
@@ -1510,7 +1544,8 @@ static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
 }
 
 static struct vfio_device_feature *
-vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
+vfio_device_feature_dma_logging_start_create(VFIOContainer *container,
+                                             bool giommu)
 {
     struct vfio_device_feature *feature;
     size_t feature_size;
@@ -1529,6 +1564,16 @@ vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
     control = (struct vfio_device_feature_dma_logging_control *)feature->data;
     control->page_size = qemu_real_host_page_size();
 
+    if (giommu) {
+        ranges = g_malloc0(sizeof(*ranges));
+        ranges->iova = 0;
+        ranges->length = container->giommu_tracked_range;
+        control->num_ranges = 1;
+        control->ranges = (uint64_t)ranges;
+
+        return feature;
+    }
+
     QEMU_LOCK_GUARD(&container->mappings_mutex);
 
     /*
@@ -1578,12 +1623,12 @@ static void vfio_device_feature_dma_logging_start_destroy(
     g_free(feature);
 }
 
-static int vfio_devices_dma_logging_start(VFIOContainer *container)
+static int vfio_devices_dma_logging_start(VFIOContainer *container, bool giommu)
 {
     struct vfio_device_feature *feature;
     int ret;
 
-    feature = vfio_device_feature_dma_logging_start_create(container);
+    feature = vfio_device_feature_dma_logging_start_create(container, giommu);
     if (!feature) {
         return -errno;
     }
@@ -1598,18 +1643,128 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
     return ret;
 }
 
+typedef struct {
+    hwaddr *ranges;
+    unsigned int ranges_num;
+} VFIOGIOMMUDeviceDTRanges;
+
+/*
+ * This value is used in the second attempt to start device dirty tracking with
+ * vIOMMU, or if the giommu fails to report its max iova.
+ * It should be in the middle, not too big and not too small, allowing devices
+ * with HW limitations to do device dirty tracking while covering a fair amount
+ * of the IOVA space.
+ *
+ * This arbitrary value was chosen because it is the minimum value of Intel
+ * IOMMU max IOVA and mlx5 devices support tracking a range of this size.
+ */
+#define VFIO_IOMMU_DEFAULT_MAX_IOVA ((1ULL << 39) - 1)
+
+#define VFIO_IOMMU_RANGES_NUM 3
+static VFIOGIOMMUDeviceDTRanges *
+vfio_iommu_device_dirty_tracking_ranges_create(VFIOContainer *container)
+{
+    hwaddr iommu_max_iova = VFIO_IOMMU_DEFAULT_MAX_IOVA;
+    hwaddr retry_iova;
+    hwaddr ram_size = vfio_get_ram_size();
+    VFIOGIOMMUDeviceDTRanges *dt_ranges;
+    int ret;
+
+    dt_ranges = g_try_new0(VFIOGIOMMUDeviceDTRanges, 1);
+    if (!dt_ranges) {
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    dt_ranges->ranges_num = VFIO_IOMMU_RANGES_NUM;
+
+    dt_ranges->ranges = g_try_new0(hwaddr, dt_ranges->ranges_num);
+    if (!dt_ranges->ranges) {
+        g_free(dt_ranges);
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    /*
+     * With vIOMMU we try to track the entire IOVA space. As the IOVA space can
+     * be rather big, devices might not be able to track it due to HW
+     * limitations. In that case:
+     * (1) Retry tracking a smaller part of the IOVA space.
+     * (2) Retry tracking a range in the size of the physical memory.
+     */
+    ret = vfio_iommu_get_max_iova(container, &iommu_max_iova);
+    if (!ret) {
+        /* Check 2^64 wrap around */
+        if (!REAL_HOST_PAGE_ALIGN(iommu_max_iova)) {
+            iommu_max_iova -= qemu_real_host_page_size();
+        }
+    }
+
+    retry_iova = MIN(iommu_max_iova / 2, VFIO_IOMMU_DEFAULT_MAX_IOVA);
+
+    dt_ranges->ranges[0] = REAL_HOST_PAGE_ALIGN(iommu_max_iova);
+    dt_ranges->ranges[1] = REAL_HOST_PAGE_ALIGN(retry_iova);
+    dt_ranges->ranges[2] = REAL_HOST_PAGE_ALIGN(MIN(ram_size, retry_iova / 2));
+
+    return dt_ranges;
+}
+
+static void vfio_iommu_device_dirty_tracking_ranges_destroy(
+    VFIOGIOMMUDeviceDTRanges *dt_ranges)
+{
+    g_free(dt_ranges->ranges);
+    g_free(dt_ranges);
+}
+
+static int vfio_devices_start_dirty_page_tracking(VFIOContainer *container)
+{
+    VFIOGIOMMUDeviceDTRanges *dt_ranges;
+    int ret;
+    int i;
+
+    if (!vfio_have_giommu(container)) {
+        return vfio_devices_dma_logging_start(container, false);
+    }
+
+    dt_ranges = vfio_iommu_device_dirty_tracking_ranges_create(container);
+    if (!dt_ranges) {
+        return -errno;
+    }
+
+    for (i = 0; i < dt_ranges->ranges_num; i++) {
+        container->giommu_tracked_range = dt_ranges->ranges[i];
+        ret = vfio_devices_dma_logging_start(container, true);
+        if (!ret) {
+            break;
+        }
+
+        if (i < dt_ranges->ranges_num - 1) {
+            warn_report("Failed to start device dirty tracking with vIOMMU "
+                        "with range of size 0x%" HWADDR_PRIx
+                        ", err: %d. Retrying with range "
+                        "of size 0x%" HWADDR_PRIx,
+                        dt_ranges->ranges[i], ret, dt_ranges->ranges[i + 1]);
+        } else {
+            error_report("Failed to start device dirty tracking with vIOMMU "
+                         "with range of size 0x%" HWADDR_PRIx ", err: %d",
+                         dt_ranges->ranges[i], ret);
+        }
+    }
+
+    vfio_iommu_device_dirty_tracking_ranges_destroy(dt_ranges);
+
+    return ret;
+}
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;
 
     if (vfio_devices_all_device_dirty_tracking(container)) {
-        if (vfio_have_giommu(container)) {
-            /* Device dirty page tracking currently doesn't support vIOMMU */
-            return;
-        }
-
-        ret = vfio_devices_dma_logging_start(container);
+        ret = vfio_devices_start_dirty_page_tracking(container);
     } else {
         ret = vfio_set_dirty_page_tracking(container, true);
     }
@@ -1627,11 +1782,6 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     int ret;
 
     if (vfio_devices_all_device_dirty_tracking(container)) {
-        if (vfio_have_giommu(container)) {
-            /* Device dirty page tracking currently doesn't support vIOMMU */
-            return;
-        }
-
         ret = vfio_devices_dma_logging_stop(container);
     } else {
         ret = vfio_set_dirty_page_tracking(container, false);
@@ -1670,6 +1820,17 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
     return 0;
 }
 
+static bool vfio_iommu_range_is_device_tracked(VFIOContainer *container,
+                                               hwaddr iova, hwaddr size)
+{
+    /* Check for 2^64 wrap around */
+    if (!(iova + size)) {
+        return false;
+    }
+
+    return iova + size <= container->giommu_tracked_range;
+}
+
 static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
                                            VFIOBitmap *vbmap, hwaddr iova,
                                            hwaddr size)
@@ -1679,10 +1840,11 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
     int ret;
 
     if (vfio_have_giommu(container)) {
-        /* Device dirty page tracking currently doesn't support vIOMMU */
-        bitmap_set(vbmap->bitmap, 0, vbmap->pages);
+        if (!vfio_iommu_range_is_device_tracked(container, iova, size)) {
+            bitmap_set(vbmap->bitmap, 0, vbmap->pages);
 
-        return 0;
+            return 0;
+        }
     }
 
     QLIST_FOREACH(group, &container->group_list, container_next) {
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 18/20] vfio/common: Optimize device dirty page tracking with vIOMMU
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (16 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 19/20] vfio/migration: Query device dirty page tracking support Avihai Horon
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

When vIOMMU is enabled, syncing dirty page bitmaps is done by replaying
the vIOMMU mappings and querying the dirty bitmap for each mapping.

With device dirty tracking this causes a lot of overhead, since the HW
is queried many times (even with a small idle guest this can end up with
thousands of calls to HW).

Optimize this by decoupling the dirty bitmap query from the vIOMMU replay.
Now a single dirty bitmap is queried per vIOMMU MR section, which is
then used for all corresponding vIOMMU mappings within that MR section.
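
As a rough illustration (the variable names below are made up; the actual code
is in vfio_iommu_set_dirty_bitmap() in the patch), the per-mapping slice of
the section-wide bitmap boils down to:

    /* One bitmap was already queried for the whole MR section */
    copy_offset = (mapping_iova - section_start_iova) /
                  qemu_real_host_page_size();
    bitmap_copy_with_src_offset(mapping_bitmap, section_bitmap, copy_offset,
                                mapping_pages);

so the device is queried once per MR section instead of once per mapping.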

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1024788bcc..f16a57d42b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1946,8 +1946,42 @@ out:
 typedef struct {
     IOMMUNotifier n;
     VFIOGuestIOMMU *giommu;
+    VFIOBitmap *vbmap;
 } vfio_giommu_dirty_notifier;
 
+static int vfio_iommu_set_dirty_bitmap(VFIOContainer *container,
+                                       vfio_giommu_dirty_notifier *gdn,
+                                       hwaddr iova, hwaddr size,
+                                       ram_addr_t ram_addr)
+{
+    VFIOBitmap *vbmap = gdn->vbmap;
+    VFIOBitmap *dst_vbmap;
+    hwaddr start_iova = REAL_HOST_PAGE_ALIGN(gdn->n.start);
+    hwaddr copy_offset;
+
+    dst_vbmap = vfio_bitmap_alloc(size);
+    if (!dst_vbmap) {
+        return -errno;
+    }
+
+    if (!vfio_iommu_range_is_device_tracked(container, iova, size)) {
+        bitmap_set(dst_vbmap->bitmap, 0, dst_vbmap->pages);
+
+        goto out;
+    }
+
+    copy_offset = (iova - start_iova) / qemu_real_host_page_size();
+    bitmap_copy_with_src_offset(dst_vbmap->bitmap, vbmap->bitmap, copy_offset,
+                                dst_vbmap->pages);
+
+out:
+    cpu_physical_memory_set_dirty_lebitmap(dst_vbmap->bitmap, ram_addr,
+                                           dst_vbmap->pages);
+    vfio_bitmap_dealloc(dst_vbmap);
+
+    return 0;
+}
+
 static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     vfio_giommu_dirty_notifier *gdn = container_of(n,
@@ -1968,8 +2002,15 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 
     rcu_read_lock();
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
+        if (gdn->vbmap) {
+            ret = vfio_iommu_set_dirty_bitmap(container, gdn, iova,
+                                              iotlb->addr_mask + 1,
+                                              translated_addr);
+        } else {
+            ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+                                        translated_addr);
+        }
+
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%s)",
@@ -2033,6 +2074,7 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
 {
     VFIOGuestIOMMU *giommu;
     bool found = false;
+    VFIOBitmap *vbmap = NULL;
     Int128 llend;
     vfio_giommu_dirty_notifier gdn;
     int idx;
@@ -2050,6 +2092,7 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
     }
 
     gdn.giommu = giommu;
+    gdn.vbmap = NULL;
     idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
                                              MEMTXATTRS_UNSPECIFIED);
 
@@ -2057,11 +2100,49 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
                        section->size);
     llend = int128_sub(llend, int128_one());
 
+    /*
+     * Optimize device dirty tracking if the MR section is at least partially
+     * tracked. Optimization is done by querying a single dirty bitmap for the
+     * entire range instead of querying dirty bitmap for each vIOMMU mapping.
+     */
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        hwaddr start = REAL_HOST_PAGE_ALIGN(section->offset_within_region);
+        hwaddr end = int128_get64(llend);
+        hwaddr size;
+        int ret;
+
+        if (start >= container->giommu_tracked_range) {
+            goto notifier_init;
+        }
+
+        size = REAL_HOST_PAGE_ALIGN(
+            MIN(container->giommu_tracked_range - 1, end) - start);
+
+        vbmap = vfio_bitmap_alloc(size);
+        if (!vbmap) {
+            return -errno;
+        }
+
+        ret = vfio_devices_query_dirty_bitmap(container, vbmap, start, size);
+        if (ret) {
+            vfio_bitmap_dealloc(vbmap);
+
+            return ret;
+        }
+
+        gdn.vbmap = vbmap;
+    }
+
+notifier_init:
     iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
                         section->offset_within_region, int128_get64(llend),
                         idx);
     memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
 
+    if (vbmap) {
+        vfio_bitmap_dealloc(vbmap);
+    }
+
     return 0;
 }
 
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 19/20] vfio/migration: Query device dirty page tracking support
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (17 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 18/20] vfio/common: Optimize " Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-22 17:49 ` [PATCH v2 20/20] docs/devel: Document VFIO device dirty page tracking Avihai Horon
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Now that everything has been set up for device dirty page tracking,
query the device for device dirty page tracking support.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 307983d57d..ae2be3dd3a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -713,6 +713,19 @@ static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
     return 0;
 }
 
+static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_PROBE | VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
+
+    return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev)
 {
     int ret;
@@ -748,6 +761,8 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->data_fd = -1;
     migration->mig_flags = mig_flags;
 
+    vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
+
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
         path = g_strdup_printf("%s/vfio", oid);
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 20/20] docs/devel: Document VFIO device dirty page tracking
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (18 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 19/20] vfio/migration: Query device dirty page tracking support Avihai Horon
@ 2023-02-22 17:49 ` Avihai Horon
  2023-02-27 14:29   ` Cédric Le Goater
  2023-02-22 18:00 ` [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
  2023-02-22 20:55 ` Alex Williamson
  21 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 17:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Adjust the VFIO dirty page tracking documentation and add a section to
describe device dirty page tracking.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 50 ++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index ba80b9150d..a432cda081 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -71,22 +71,37 @@ System memory dirty pages tracking
 ----------------------------------
 
 A ``log_global_start`` and ``log_global_stop`` memory listener callback informs
-the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync``
-memory listener callback marks those system memory pages as dirty which are
-used for DMA by the VFIO device. The dirty pages bitmap is queried per
-container. All pages pinned by the vendor driver through external APIs have to
-be marked as dirty during migration. When there are CPU writes, CPU dirty page
-tracking can identify dirtied pages, but any page pinned by the vendor driver
-can also be written by the device. There is currently no device or IOMMU
-support for dirty page tracking in hardware.
+the VFIO dirty tracking module to start and stop dirty page tracking. A
+``log_sync`` memory listener callback queries the dirty page bitmap from the
+dirty tracking module and marks system memory pages which were DMA-ed by the
+VFIO device as dirty. The dirty page bitmap is queried per container.
+
+Currently there are two ways dirty page tracking can be done:
+(1) Device dirty tracking:
+In this method the device is responsible for logging and reporting its DMAs.
+This method can be used only if the device is capable of tracking its DMAs.
+Discovering device capability, starting and stopping dirty tracking, and
+syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
+More info about the uAPI can be found in the comments of the
+``vfio_device_feature_dma_logging_control`` and
+``vfio_device_feature_dma_logging_report`` structures in the header file
+linux-headers/linux/vfio.h.
+
+(2) VFIO IOMMU module:
+In this method dirty tracking is done by the IOMMU. However, there is currently
+no IOMMU support for dirty page tracking. For this reason, all pages are
+perpetually marked dirty, unless the device driver pins pages through external
+APIs in which case only those pinned pages are perpetually marked dirty.
+
+If the above two methods are not supported, all pages are perpetually marked
+dirty by QEMU.
 
 By default, dirty pages are tracked during pre-copy as well as stop-and-copy
-phase. So, a page pinned by the vendor driver will be copied to the destination
-in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
-it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
-finding dirty pages continuously, then it understands that even in stop-and-copy
-phase, it is likely to find dirty pages and can predict the downtime
-accordingly.
+phase. So, a page marked as dirty will be copied to the destination in both
+phases. Copying dirty pages in pre-copy phase helps QEMU to predict if it can
+achieve its downtime tolerances. If QEMU during pre-copy phase keeps finding
+dirty pages continuously, then it understands that even in stop-and-copy phase,
+it is likely to find dirty pages and can predict the downtime accordingly.
 
 QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
 which disables querying the dirty bitmap during pre-copy phase. If it is set to
@@ -97,10 +112,9 @@ System memory dirty pages tracking when vIOMMU is enabled
 ---------------------------------------------------------
 
 With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
-phase of migration. In that case, the unmap ioctl returns any dirty pages in
-that range and QEMU reports corresponding guest physical pages dirty. During
-stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
-pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
+phase of migration. In that case, dirty page bitmap for this range is queried
+and synced with QEMU. During stop-and-copy phase, an IOMMU notifier is used to
+get a callback for mapped pages and then dirty page bitmap is fetched for those
 mapped ranges.
 
 Flow of state changes during Live migration
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (19 preceding siblings ...)
  2023-02-22 17:49 ` [PATCH v2 20/20] docs/devel: Document VFIO device dirty page tracking Avihai Horon
@ 2023-02-22 18:00 ` Avihai Horon
  2023-02-22 20:55 ` Alex Williamson
  21 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-22 18:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 22/02/2023 19:48, Avihai Horon wrote:
> Changes from v1 [4]:
> - Rebased on latest master branch. As part of it, made some changes in
>    pre-copy to adjust it to Juan's new patches:
>    1. Added a new patch that passes threshold_size parameter to
>       .state_pending_{estimate,exact}() handlers.
>    2. Added a new patch that refactors vfio_save_block().
>    3. Changed the pre-copy patch to cache and report pending pre-copy
>       size in the .state_pending_estimate() handler.
> - Removed unnecessary P2P code. This should be added later on when P2P
>    support is added. (Alex)
> - Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
>    (patch #11). (Alex)
> - Stored vfio_devices_all_device_dirty_tracking()'s value in a local
>    variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
> - Refactored the viommu device dirty tracking ranges creation code to
>    make it clearer (patch #15).
> - Changed overflow check in vfio_iommu_range_is_device_tracked() to
>    emphasize that we specifically check for 2^64 wrap around (patch #15).
> - Added R-bs / Acks.
>
> Thanks.
>
> [1]
> https://lore.kernel.org/qemu-devel/167658846945.932837.1420176491103357684.stgit@omen/
>
> [2]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>
> [3]
> https://lore.kernel.org/netdev/20220908183448.195262-4-yishaih@nvidia.com/

and here is v1 link:
[4]
https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (20 preceding siblings ...)
  2023-02-22 18:00 ` [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
@ 2023-02-22 20:55 ` Alex Williamson
  2023-02-23 10:05   ` Cédric Le Goater
  2023-02-23 14:56   ` Avihai Horon
  21 siblings, 2 replies; 93+ messages in thread
From: Alex Williamson @ 2023-02-22 20:55 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


There are various errors running this through the CI on gitlab.

This one seems bogus but needs to be resolved regardless:

https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o 
2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0 -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote /builds/alex.williamson/qemu/include -iquote /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
2789 1772 |     if (ret) {
2790      |        ^

32-bit builds have some actual errors though:

https://gitlab.com/alex.williamson/qemu/-/jobs/3817940719
FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o 
2601cc -m32 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/sysprof-4 -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote /builds/alex.williamson/qemu/include -iquote /builds/alex.williamson/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
2602../hw/vfio/common.c: In function 'vfio_device_feature_dma_logging_start_create':
2603../hw/vfio/common.c:1572:27: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
2604 1572 |         control->ranges = (uint64_t)ranges;
2605      |                           ^
2606../hw/vfio/common.c:1596:23: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
2607 1596 |     control->ranges = (uint64_t)ranges;
2608      |                       ^
2609../hw/vfio/common.c: In function 'vfio_device_feature_dma_logging_start_destroy':
2610../hw/vfio/common.c:1620:9: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
2611 1620 |         (struct vfio_device_feature_dma_logging_range *)control->ranges;
2612      |         ^
2613../hw/vfio/common.c: In function 'vfio_device_dma_logging_report':
2614../hw/vfio/common.c:1810:22: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
2615 1810 |     report->bitmap = (uint64_t)bitmap;
2616      |                      ^
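
(For what it's worth, the usual way to make these casts 32-bit clean is to go
through uintptr_t instead of casting directly between a pointer and a u64,
e.g.:

    control->ranges = (uintptr_t)ranges;
    ranges = (struct vfio_device_feature_dma_logging_range *)
                 (uintptr_t)control->ranges;

untested here, just the common pattern.)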

Thanks,
Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-22 17:48 ` [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-02-22 20:58   ` Alex Williamson
  2023-02-23 15:25     ` Avihai Horon
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-22 20:58 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Wed, 22 Feb 2023 19:48:58 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
> 
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
> 
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  docs/devel/vfio-migration.rst |  35 +++++--
>  include/hw/vfio/vfio-common.h |   3 +
>  hw/vfio/common.c              |   6 +-
>  hw/vfio/migration.c           | 175 ++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events          |   4 +-
>  5 files changed, 201 insertions(+), 22 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index c214c73e28..ba80b9150d 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>  destination host. This document details how saving and restoring of VFIO
>  devices is done in QEMU.
>  
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, this
> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.

Or alternatively for the last sentence,

  VFIO devices opt-in to pre-copy support by reporting the
  VFIO_MIGRATION_PRE_COPY flag in the VFIO_DEVICE_FEATURE_MIGRATION
  ioctl.


>  Note that currently VFIO migration is supported only for a single device. This
>  is due to VFIO migration's lack of P2P support. However, P2P support is planned
> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>  * A ``load_setup`` function that sets the VFIO device on the destination in
>    _RESUMING state.
>  
> +* A ``state_pending_estimate`` function that reports an estimate of the
> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
> +  device.
> +
>  * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>    driver, which indicates the amount of data that the vendor driver has yet to
>    save for the VFIO device.
>  
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> +  active only when the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver during iterative pre-copy phase.
> +
>  * A ``save_state`` function to save the device config space if it is present.
>  
>  * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -95,8 +107,10 @@ Flow of state changes during Live migration
>  ===========================================
>  
>  Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
>  the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>  
>  Live migration save path
>  ------------------------
> @@ -108,11 +122,12 @@ Live migration save path
>                                    |
>                       migrate_init spawns migration_thread
>                  Migration thread then calls each device's .save_setup()
> -                       (RUNNING, _SETUP, _RUNNING)
> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>                                    |
> -                      (RUNNING, _ACTIVE, _RUNNING)
> -             If device is active, get pending_bytes by .state_pending_exact()
> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>            If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  [Data of VFIO device for pre-copy phase is copied]
>          Iterate till total pending bytes converge and are less than threshold
>                                    |
>    On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 87524c64a4..ee55d442b4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>      int data_fd;
>      void *data_buffer;
>      size_t data_buffer_size;
> +    uint64_t precopy_init_size;
> +    uint64_t precopy_dirty_size;

size_t?

> +    uint64_t mig_flags;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index bab83c0e55..6f5afe9f5a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -409,7 +409,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>              }
>  
>              if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>                  return false;
>              }
>          }
> @@ -438,7 +439,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                  return false;
>              }
>  
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>                  continue;
>              } else {
>                  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 94a4df73d0..307983d57d 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -67,6 +67,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>          return "STOP_COPY";
>      case VFIO_DEVICE_STATE_RESUMING:
>          return "RESUMING";
> +    case VFIO_DEVICE_STATE_PRE_COPY:
> +        return "PRE_COPY";
>      default:
>          return "UNKNOWN STATE";
>      }
> @@ -240,6 +242,23 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>      return 0;
>  }
>  
> +static int vfio_query_precopy_size(VFIOMigration *migration,
> +                                   uint64_t *init_size, uint64_t *dirty_size)

size_t?  Seems like a concern throughout.

> +{
> +    struct vfio_precopy_info precopy = {
> +        .argsz = sizeof(precopy),
> +    };
> +
> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
> +        return -errno;
> +    }
> +
> +    *init_size = precopy.initial_bytes;
> +    *dirty_size = precopy.dirty_bytes;
> +
> +    return 0;
> +}
> +
>  /* Returns the size of saved data on success and -errno on error */
>  static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>  {
> @@ -248,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>      data_size = read(migration->data_fd, migration->data_buffer,
>                       migration->data_buffer_size);
>      if (data_size < 0) {
> +        /* Pre-copy emptied all the device state for now */
> +        if (errno == ENOMSG) {
> +            return 0;
> +        }
> +
>          return -errno;
>      }
>      if (data_size == 0) {
> @@ -264,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>      return qemu_file_get_error(f) ?: data_size;
>  }
>  
> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> +                                               uint64_t data_size)
> +{
> +    if (!data_size) {
> +        /*
> +         * Pre-copy emptied all the device state for now, update estimated sizes
> +         * accordingly.
> +         */
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +
> +        return;
> +    }
> +
> +    if (migration->precopy_init_size) {
> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
> +
> +        migration->precopy_init_size -= init_size;
> +        data_size -= init_size;
> +    }
> +
> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
> +                                         data_size);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -284,6 +333,35 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return -ENOMEM;
>      }
>  
> +    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
> +        uint64_t init_size = 0, dirty_size = 0;
> +        int ret;
> +
> +        switch (migration->device_state) {
> +        case VFIO_DEVICE_STATE_RUNNING:
> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
> +                                           VFIO_DEVICE_STATE_RUNNING);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            vfio_query_precopy_size(migration, &init_size, &dirty_size);
> +            migration->precopy_init_size = init_size;
> +            migration->precopy_dirty_size = dirty_size;

Seems like we could do away with {init,dirty}_size, initialize
migration->precopy_{init,dirty}_size before the switch, pass them
directly to vfio_query_precopy_size() and remove all but the break from
the case below.  But then that also suggests we could redefine
vfio_query_precopy_size() to

static int vfio_update_precopy_info(VFIOMigration *migration)

which sets the fields directly since this is the only way it's used.
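
Roughly (an untested sketch, just folding the field updates into the existing
helper):

static int vfio_update_precopy_info(VFIOMigration *migration)
{
    struct vfio_precopy_info precopy = {
        .argsz = sizeof(precopy),
    };

    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
        return -errno;
    }

    migration->precopy_init_size = precopy.initial_bytes;
    migration->precopy_dirty_size = precopy.dirty_bytes;

    return 0;
}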

> +
> +            break;
> +        case VFIO_DEVICE_STATE_STOP:
> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
> +
> +            migration->precopy_init_size = 0;
> +            migration->precopy_dirty_size = 0;
> +
> +            break;
> +        default:
> +            return -EINVAL;
> +        }
> +    }
> +
>      trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>  
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> +                                        uint64_t *must_precopy,
> +                                        uint64_t *can_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> +        return;
> +    }
> +
> +    /*
> +     * Initial size should be transferred during pre-copy phase so stop-copy
> +     * phase will not be slowed down. Report threshold_size to force another
> +     * pre-copy iteration.
> +     */
> +    *must_precopy += migration->precopy_init_size ?
> +                         threshold_size :
> +                         migration->precopy_dirty_size;

This sure feels like we're feeding false data back to the iterator to
spoof it to run another iteration, when the vfio migration protocol
only recommends that initial_bytes reaches zero before proceeding to
stop-copy; it's not a requirement.  What benefit is actually observed
from this?  Why is this required for initial pre-copy support?  It
seems devious.

> +
> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> +                                      *can_postcopy,
> +                                      migration->precopy_init_size,
> +                                      migration->precopy_dirty_size);
> +}
> +
>  /*
>   * Migration size of VFIO devices can be as little as a few KBs or as big as
>   * many GBs. This value should be big enough to cover the worst case.
>   */
>  #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>  
> -/*
> - * Only exact function is implemented and not estimate function. The reason is
> - * that during pre-copy phase of migration the estimate function is called
> - * repeatedly while pending RAM size is over the threshold, thus migration
> - * can't converge and querying the VFIO device pending data size is useless.
> - */
>  static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>                                       uint64_t *must_precopy,
>                                       uint64_t *can_postcopy)
>  {
>      VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>      uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>  
>      /*
> @@ -328,8 +427,57 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>      vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>      *must_precopy += stop_copy_size;
>  
> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +        uint64_t init_size = 0, dirty_size = 0;
> +
> +        vfio_query_precopy_size(migration, &init_size, &dirty_size);
> +        migration->precopy_init_size = init_size;
> +        migration->precopy_dirty_size = dirty_size;

This is the only other caller of vfio_query_precopy_size(), following
the same pattern that could be simplified if the function filled the
migration fields itself.

> +
> +        /*
> +         * Initial size should be transferred during pre-copy phase so
> +         * stop-copy phase will not be slowed down. Report threshold_size
> +         * to force another pre-copy iteration.
> +         */
> +        *must_precopy += migration->precopy_init_size ?
> +                             threshold_size :
> +                             migration->precopy_dirty_size;
> +    }

Just as sketchy as above.  Thanks,

Alex

> +
>      trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   stop_copy_size);
> +                                   stop_copy_size, migration->precopy_init_size,
> +                                   migration->precopy_dirty_size);
> +}
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    ssize_t data_size;
> +
> +    data_size = vfio_save_block(f, migration);
> +    if (data_size < 0) {
> +        return data_size;
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    vfio_update_estimated_pending_data(migration, data_size);
> +
> +    trace_vfio_save_iterate(vbasedev->name);
> +
> +    /*
> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
> +     * Return 1 so following handlers will not be potentially blocked.
> +     */
> +    return 1;
>  }
>  
>  static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> @@ -338,7 +486,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      ssize_t data_size;
>      int ret;
>  
> -    /* We reach here with device state STOP only */
> +    /* We reach here with device state STOP or STOP_COPY only */
>      ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>                                     VFIO_DEVICE_STATE_STOP);
>      if (ret) {
> @@ -457,7 +605,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>  static const SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .state_pending_estimate = vfio_state_pending_estimate,
>      .state_pending_exact = vfio_state_pending_exact,
> +    .is_active_iterate = vfio_is_active_iterate,
> +    .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
>      .save_state = vfio_save_state,
>      .load_setup = vfio_load_setup,
> @@ -470,13 +621,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>  static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>      enum vfio_device_mig_state new_state;
>      int ret;
>  
>      if (running) {
>          new_state = VFIO_DEVICE_STATE_RUNNING;
>      } else {
> -        new_state = VFIO_DEVICE_STATE_STOP;
> +        new_state =
> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> +                VFIO_DEVICE_STATE_STOP_COPY :
> +                VFIO_DEVICE_STATE_STOP;
>      }
>  
>      /*
> @@ -590,6 +746,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>      migration->vbasedev = vbasedev;
>      migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>      migration->data_fd = -1;
> +    migration->mig_flags = mig_flags;
>  
>      oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
>      if (oid) {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 669d9fe07c..51613e02e6 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -161,6 +161,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>  vfio_save_cleanup(const char *name) " (%s)"
>  vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>  vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_iterate(const char *name) " (%s)"
>  vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>  vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-22 17:49 ` [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
@ 2023-02-22 21:40   ` Alex Williamson
  2023-02-23 15:27     ` Avihai Horon
  2023-02-27 14:09   ` Cédric Le Goater
  1 sibling, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-22 21:40 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Wed, 22 Feb 2023 19:49:02 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> There are already two places where dirty page bitmap allocation and
> calculations are done in open code. With device dirty page tracking
> being added in next patches, there are going to be even more places.
> 
> To avoid code duplication, introduce VFIOBitmap struct and corresponding
> alloc and dealloc functions and use them where applicable.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 60 insertions(+), 29 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ac93b85632..84f08bdbbb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +typedef struct {
> +    unsigned long *bitmap;
> +    hwaddr size;
> +    hwaddr pages;
> +} VFIOBitmap;
> +
> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
> +{
> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
> +    if (!vbmap) {
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> +    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                                         BITS_PER_BYTE;
> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
> +    if (!vbmap->bitmap) {
> +        g_free(vbmap);
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    return vbmap;
> +}
> +
> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
> +{
> +    g_free(vbmap->bitmap);
> +    g_free(vbmap);
> +}

Nit, '_alloc' and '_free' seems like a more standard convention.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-22 17:49 ` [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
@ 2023-02-22 22:10   ` Alex Williamson
  2023-02-23 10:37     ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-22 22:10 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Wed, 22 Feb 2023 19:49:05 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> According to the device DMA logging uAPI, IOVA ranges to be logged by
> the device must be provided all at once upon DMA logging start.
> 
> As preparation for the following patches which will add device dirty
> page tracking, keep a record of all DMA mapped IOVA ranges so later they
> can be used for DMA logging start.
> 
> Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
> This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
> Following patches will address the vIOMMU case specifically.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/hw/vfio/vfio-common.h |  3 ++
>  hw/vfio/common.c              | 86 +++++++++++++++++++++++++++++++++--
>  2 files changed, 86 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ee55d442b4..6f36876ce0 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -23,6 +23,7 @@
>  
>  #include "exec/memory.h"
>  #include "qemu/queue.h"
> +#include "qemu/iova-tree.h"
>  #include "qemu/notify.h"
>  #include "ui/console.h"
>  #include "hw/display/ramfb.h"
> @@ -92,6 +93,8 @@ typedef struct VFIOContainer {
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
>      unsigned int dma_max_mappings;
> +    IOVATree *mappings;
> +    QemuMutex mappings_mutex;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 84f08bdbbb..6041da6c7e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -44,6 +44,7 @@
>  #include "migration/blocker.h"
>  #include "migration/qemu-file.h"
>  #include "sysemu/tpm.h"
> +#include "qemu/iova-tree.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>      multiple_devices_migration_blocker = NULL;
>  }
>  
> +static bool vfio_have_giommu(VFIOContainer *container)
> +{
> +    return !QLIST_EMPTY(&container->giommu_list);
> +}
> +
>  static void vfio_set_migration_error(int err)
>  {
>      MigrationState *ms = migrate_get_current();
> @@ -499,6 +505,51 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>      return true;
>  }
>  
> +static int vfio_record_mapping(VFIOContainer *container, hwaddr iova,
> +                               hwaddr size, bool readonly)
> +{
> +    DMAMap map = {
> +        .iova = iova,
> +        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
> +        .perm = readonly ? IOMMU_RO : IOMMU_RW,
> +    };
> +    int ret;
> +
> +    if (vfio_have_giommu(container)) {
> +        return 0;
> +    }
> +
> +    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
> +        ret = iova_tree_insert(container->mappings, &map);
> +        if (ret) {
> +            if (ret == IOVA_ERR_INVALID) {
> +                ret = -EINVAL;
> +            } else if (ret == IOVA_ERR_OVERLAP) {
> +                ret = -EEXIST;
> +            }
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static void vfio_erase_mapping(VFIOContainer *container, hwaddr iova,
> +                                hwaddr size)
> +{
> +    DMAMap map = {
> +        .iova = iova,
> +        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
> +    };
> +
> +    if (vfio_have_giommu(container)) {
> +        return;
> +    }
> +
> +    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
> +        iova_tree_remove(container->mappings, map);
> +    }
> +}

Nit, 'insert' and 'remove' to match the IOVATree semantics?

>  static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>                                   hwaddr iova, ram_addr_t size,
>                                   IOMMUTLBEntry *iotlb)
> @@ -599,6 +650,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>                                              DIRTY_CLIENTS_NOCODE);
>      }
>  
> +    vfio_erase_mapping(container, iova, size);
> +
>      return 0;
>  }
>  
> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          .iova = iova,
>          .size = size,
>      };
> +    int ret;
> +
> +    ret = vfio_record_mapping(container, iova, size, readonly);
> +    if (ret) {
> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
> +                     iova, size, ret, strerror(-ret));
> +
> +        return ret;
> +    }

Is there no way to replay the mappings when a migration is started?
This seems like a horrible latency and bloat trade-off for the
possibility that the VM might migrate and the device might support
these features.  Our performance with vIOMMU is already terrible; I
can't help but believe this makes it worse.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-22 17:49 ` [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop Avihai Horon
@ 2023-02-22 22:40   ` Alex Williamson
  2023-02-23  2:02     ` Jason Gunthorpe
  2023-02-23 15:36     ` Avihai Horon
  0 siblings, 2 replies; 93+ messages in thread
From: Alex Williamson @ 2023-02-22 22:40 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Wed, 22 Feb 2023 19:49:06 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Add device dirty page tracking start/stop functionality. This uses the
> device DMA logging uAPI to start and stop dirty page tracking by device.
> 
> Device dirty page tracking is used only if all devices within a
> container support device dirty page tracking.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/hw/vfio/vfio-common.h |   2 +
>  hw/vfio/common.c              | 211 +++++++++++++++++++++++++++++++++-
>  2 files changed, 211 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 6f36876ce0..1f21e1fa43 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -149,6 +149,8 @@ typedef struct VFIODevice {
>      VFIOMigration *migration;
>      Error *migration_blocker;
>      OnOffAuto pre_copy_dirty_page_tracking;
> +    bool dirty_pages_supported;
> +    bool dirty_tracking;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6041da6c7e..740153e7d7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -473,6 +473,22 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>      return true;
>  }
>  
> +static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (!vbasedev->dirty_pages_supported) {
> +                return false;
> +            }
> +        }
> +    }
> +
> +    return true;
> +}
> +
>  /*
>   * Check if all VFIO devices are running and migration is active, which is
>   * essentially equivalent to the migration being in pre-copy phase.
> @@ -1404,13 +1420,192 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>      return ret;
>  }
>  
> +static int vfio_devices_dma_logging_set(VFIOContainer *container,
> +                                        struct vfio_device_feature *feature)
> +{
> +    bool status = (feature->flags & VFIO_DEVICE_FEATURE_MASK) ==
> +                  VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
> +    VFIODevice *vbasedev;
> +    VFIOGroup *group;
> +    int ret = 0;
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (vbasedev->dirty_tracking == status) {
> +                continue;
> +            }
> +
> +            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
> +            if (ret) {
> +                ret = -errno;
> +                error_report("%s: Failed to set DMA logging %s, err %d (%s)",
> +                             vbasedev->name, status ? "start" : "stop", ret,
> +                             strerror(errno));
> +                goto out;
> +            }
> +            vbasedev->dirty_tracking = status;
> +        }
> +    }
> +
> +out:
> +    return ret;
> +}
> +
> +static int vfio_devices_dma_logging_stop(VFIOContainer *container)
> +{
> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
> +                              sizeof(uint64_t))] = {};
> +    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
> +
> +    feature->argsz = sizeof(buf);
> +    feature->flags = VFIO_DEVICE_FEATURE_SET;
> +    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
> +
> +    return vfio_devices_dma_logging_set(container, feature);
> +}
> +
> +static gboolean vfio_device_dma_logging_range_add(DMAMap *map, gpointer data)
> +{
> +    struct vfio_device_feature_dma_logging_range **out = data;
> +    struct vfio_device_feature_dma_logging_range *range = *out;
> +
> +    range->iova = map->iova;
> +    /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
> +    range->length = map->size + 1;
> +
> +    *out = ++range;
> +
> +    return false;
> +}
> +
> +static gboolean vfio_iova_tree_get_first(DMAMap *map, gpointer data)
> +{
> +    DMAMap *first = data;
> +
> +    first->iova = map->iova;
> +    first->size = map->size;
> +
> +    return true;
> +}
> +
> +static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
> +{
> +    DMAMap *last = data;
> +
> +    last->iova = map->iova;
> +    last->size = map->size;
> +
> +    return false;
> +}
> +
> +static struct vfio_device_feature *
> +vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
> +{
> +    struct vfio_device_feature *feature;
> +    size_t feature_size;
> +    struct vfio_device_feature_dma_logging_control *control;
> +    struct vfio_device_feature_dma_logging_range *ranges;
> +    unsigned int max_ranges;
> +    unsigned int cur_ranges;
> +
> +    feature_size = sizeof(struct vfio_device_feature) +
> +                   sizeof(struct vfio_device_feature_dma_logging_control);
> +    feature = g_malloc0(feature_size);
> +    feature->argsz = feature_size;
> +    feature->flags = VFIO_DEVICE_FEATURE_SET;
> +    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
> +
> +    control = (struct vfio_device_feature_dma_logging_control *)feature->data;
> +    control->page_size = qemu_real_host_page_size();
> +
> +    QEMU_LOCK_GUARD(&container->mappings_mutex);
> +
> +    /*
> +     * DMA logging uAPI guarantees to support at least num_ranges that fits into
> +     * a single host kernel page. To be on the safe side, use this as a limit
> +     * from which to merge to a single range.
> +     */
> +    max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
> +    cur_ranges = iova_tree_nnodes(container->mappings);
> +    control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;

This makes me suspicious that we're implementing to the characteristics
of a specific device rather than strictly to the vfio migration API.
Are we just trying to avoid the error handling needed to support the
try-and-fall-back-to-a-single-range behavior (see the sketch after the
quoted patch below)?  If we want to make a simplification, then document
it as such.  The "[t]o be on the safe side" phrasing above could later
be interpreted as avoiding an issue and might discourage a more complete
implementation.  Thanks,

Alex

> +    ranges = g_try_new0(struct vfio_device_feature_dma_logging_range,
> +                        control->num_ranges);
> +    if (!ranges) {
> +        g_free(feature);
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    control->ranges = (uint64_t)ranges;
> +    if (cur_ranges <= max_ranges) {
> +        iova_tree_foreach(container->mappings,
> +                          vfio_device_dma_logging_range_add, &ranges);
> +    } else {
> +        DMAMap first, last;
> +
> +        iova_tree_foreach(container->mappings, vfio_iova_tree_get_first,
> +                          &first);
> +        iova_tree_foreach(container->mappings, vfio_iova_tree_get_last, &last);
> +        ranges->iova = first.iova;
> +        /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
> +        ranges->length = (last.iova - first.iova) + last.size + 1;
> +    }
> +
> +    return feature;
> +}
> +
> +static void vfio_device_feature_dma_logging_start_destroy(
> +    struct vfio_device_feature *feature)
> +{
> +    struct vfio_device_feature_dma_logging_control *control =
> +        (struct vfio_device_feature_dma_logging_control *)feature->data;
> +    struct vfio_device_feature_dma_logging_range *ranges =
> +        (struct vfio_device_feature_dma_logging_range *)control->ranges;
> +
> +    g_free(ranges);
> +    g_free(feature);
> +}
> +
> +static int vfio_devices_dma_logging_start(VFIOContainer *container)
> +{
> +    struct vfio_device_feature *feature;
> +    int ret;
> +
> +    feature = vfio_device_feature_dma_logging_start_create(container);
> +    if (!feature) {
> +        return -errno;
> +    }
> +
> +    ret = vfio_devices_dma_logging_set(container, feature);
> +    if (ret) {
> +        vfio_devices_dma_logging_stop(container);
> +    }
> +
> +    vfio_device_feature_dma_logging_start_destroy(feature);
> +
> +    return ret;
> +}
> +
>  static void vfio_listener_log_global_start(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      int ret;
>  
> -    ret = vfio_set_dirty_page_tracking(container, true);
> +    if (vfio_devices_all_device_dirty_tracking(container)) {
> +        if (vfio_have_giommu(container)) {
> +            /* Device dirty page tracking currently doesn't support vIOMMU */
> +            return;
> +        }
> +
> +        ret = vfio_devices_dma_logging_start(container);
> +    } else {
> +        ret = vfio_set_dirty_page_tracking(container, true);
> +    }
> +
>      if (ret) {
> +        error_report("vfio: Could not start dirty page tracking, err: %d (%s)",
> +                     ret, strerror(-ret));
>          vfio_set_migration_error(ret);
>      }
>  }
> @@ -1420,8 +1615,20 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      int ret;
>  
> -    ret = vfio_set_dirty_page_tracking(container, false);
> +    if (vfio_devices_all_device_dirty_tracking(container)) {
> +        if (vfio_have_giommu(container)) {
> +            /* Device dirty page tracking currently doesn't support vIOMMU */
> +            return;
> +        }
> +
> +        ret = vfio_devices_dma_logging_stop(container);
> +    } else {
> +        ret = vfio_set_dirty_page_tracking(container, false);
> +    }
> +
>      if (ret) {
> +        error_report("vfio: Could not stop dirty page tracking, err: %d (%s)",
> +                     ret, strerror(-ret));
>          vfio_set_migration_error(ret);
>      }
>  }
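
A hedged sketch (not part of the posted series) of the try-and-fall-back
flow questioned above, assuming vfio_device_feature_dma_logging_start_create()
built the full per-mapping range list, and assuming a hypothetical
vfio_device_feature_dma_logging_start_create_merged() variant that builds a
single merged [first, last] range:

static int vfio_devices_dma_logging_start_with_fallback(VFIOContainer *container)
{
    struct vfio_device_feature *feature;
    int ret;

    /* First attempt: one logging range per currently recorded mapping. */
    feature = vfio_device_feature_dma_logging_start_create(container);
    if (!feature) {
        return -errno;
    }
    ret = vfio_devices_dma_logging_set(container, feature);
    vfio_device_feature_dma_logging_start_destroy(feature);
    if (!ret) {
        return 0;
    }

    /* Second attempt: collapse everything into a single range and retry. */
    vfio_devices_dma_logging_stop(container);
    feature = vfio_device_feature_dma_logging_start_create_merged(container);
    if (!feature) {
        return -errno;
    }
    ret = vfio_devices_dma_logging_set(container, feature);
    if (ret) {
        vfio_devices_dma_logging_stop(container);
    }
    vfio_device_feature_dma_logging_start_destroy(feature);

    return ret;
}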




* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-22 17:49 ` [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
@ 2023-02-22 23:34   ` Alex Williamson
  2023-02-23  2:08     ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-22 23:34 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Wed, 22 Feb 2023 19:49:12 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Currently, device dirty page tracking with vIOMMU is not supported - RAM
> pages are perpetually marked dirty in this case.
> 
> When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
> the vIOMMU maps/unmaps them. These IOVA ranges can potentially be mapped
> anywhere in the vIOMMU IOVA space.
> 
> Due to this dynamic nature of vIOMMU mapping/unmapping, tracking only
> the currently mapped IOVA ranges, as done in the non-vIOMMU case,
> doesn't work very well.
> 
> Instead, to support device dirty tracking when vIOMMU is enabled, track
> the entire vIOMMU IOVA space. If that fails (IOVA space can be rather
> big and we might hit HW limitation), try tracking smaller range while
> marking untracked ranges dirty.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/hw/vfio/vfio-common.h |   2 +
>  hw/vfio/common.c              | 196 +++++++++++++++++++++++++++++++---
>  2 files changed, 181 insertions(+), 17 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1f21e1fa43..1dc00cabcd 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -95,6 +95,8 @@ typedef struct VFIOContainer {
>      unsigned int dma_max_mappings;
>      IOVATree *mappings;
>      QemuMutex mappings_mutex;
> +    /* Represents the range [0, giommu_tracked_range) not inclusive */
> +    hwaddr giommu_tracked_range;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4a7fff6eeb..1024788bcc 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -45,6 +45,8 @@
>  #include "migration/qemu-file.h"
>  #include "sysemu/tpm.h"
>  #include "qemu/iova-tree.h"
> +#include "hw/boards.h"
> +#include "hw/mem/memory-device.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -430,6 +432,38 @@ void vfio_unblock_multiple_devices_migration(void)
>      multiple_devices_migration_blocker = NULL;
>  }
>  
> +static uint64_t vfio_get_ram_size(void)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    uint64_t plugged_size;
> +
> +    plugged_size = get_plugged_memory_size();
> +    if (plugged_size == (uint64_t)-1) {
> +        plugged_size = 0;
> +    }
> +
> +    return ms->ram_size + plugged_size;
> +}
> +
> +static int vfio_iommu_get_max_iova(VFIOContainer *container, hwaddr *max_iova)
> +{
> +    VFIOGuestIOMMU *giommu;
> +    int ret;
> +
> +    giommu = QLIST_FIRST(&container->giommu_list);
> +    if (!giommu) {
> +        return -ENOENT;
> +    }
> +
> +    ret = memory_region_iommu_get_attr(giommu->iommu_mr, IOMMU_ATTR_MAX_IOVA,
> +                                       max_iova);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>  static bool vfio_have_giommu(VFIOContainer *container)
>  {
>      return !QLIST_EMPTY(&container->giommu_list);
> @@ -1510,7 +1544,8 @@ static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
>  }
>  
>  static struct vfio_device_feature *
> -vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
> +vfio_device_feature_dma_logging_start_create(VFIOContainer *container,
> +                                             bool giommu)
>  {
>      struct vfio_device_feature *feature;
>      size_t feature_size;
> @@ -1529,6 +1564,16 @@ vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
>      control = (struct vfio_device_feature_dma_logging_control *)feature->data;
>      control->page_size = qemu_real_host_page_size();
>  
> +    if (giommu) {
> +        ranges = g_malloc0(sizeof(*ranges));
> +        ranges->iova = 0;
> +        ranges->length = container->giommu_tracked_range;
> +        control->num_ranges = 1;
> +        control->ranges = (uint64_t)ranges;
> +
> +        return feature;
> +    }
> +
>      QEMU_LOCK_GUARD(&container->mappings_mutex);
>  
>      /*
> @@ -1578,12 +1623,12 @@ static void vfio_device_feature_dma_logging_start_destroy(
>      g_free(feature);
>  }
>  
> -static int vfio_devices_dma_logging_start(VFIOContainer *container)
> +static int vfio_devices_dma_logging_start(VFIOContainer *container, bool giommu)
>  {
>      struct vfio_device_feature *feature;
>      int ret;
>  
> -    feature = vfio_device_feature_dma_logging_start_create(container);
> +    feature = vfio_device_feature_dma_logging_start_create(container, giommu);
>      if (!feature) {
>          return -errno;
>      }
> @@ -1598,18 +1643,128 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>      return ret;
>  }
>  
> +typedef struct {
> +    hwaddr *ranges;
> +    unsigned int ranges_num;
> +} VFIOGIOMMUDeviceDTRanges;
> +
> +/*
> + * This value is used in the second attempt to start device dirty tracking with
> + * vIOMMU, or if the giommu fails to report its max iova.
> + * It should be in the middle, not too big and not too small, allowing devices
> + * with HW limitations to do device dirty tracking while covering a fair amount
> + * of the IOVA space.
> + *
> + * This arbitrary value was chosen because it is the minimum value of Intel
> + * IOMMU max IOVA and mlx5 devices support tracking a range of this size.
> + */
> +#define VFIO_IOMMU_DEFAULT_MAX_IOVA ((1ULL << 39) - 1)
> +
> +#define VFIO_IOMMU_RANGES_NUM 3
> +static VFIOGIOMMUDeviceDTRanges *
> +vfio_iommu_device_dirty_tracking_ranges_create(VFIOContainer *container)
> +{
> +    hwaddr iommu_max_iova = VFIO_IOMMU_DEFAULT_MAX_IOVA;
> +    hwaddr retry_iova;
> +    hwaddr ram_size = vfio_get_ram_size();
> +    VFIOGIOMMUDeviceDTRanges *dt_ranges;
> +    int ret;
> +
> +    dt_ranges = g_try_new0(VFIOGIOMMUDeviceDTRanges, 1);
> +    if (!dt_ranges) {
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    dt_ranges->ranges_num = VFIO_IOMMU_RANGES_NUM;
> +
> +    dt_ranges->ranges = g_try_new0(hwaddr, dt_ranges->ranges_num);
> +    if (!dt_ranges->ranges) {
> +        g_free(dt_ranges);
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    /*
> +     * With vIOMMU we try to track the entire IOVA space. As the IOVA space can
> +     * be rather big, devices might not be able to track it due to HW
> +     * limitations. In that case:
> +     * (1) Retry tracking a smaller part of the IOVA space.
> +     * (2) Retry tracking a range in the size of the physical memory.

This looks really sketchy; why do we think there's a "good enough"
value here?  If we get it wrong, the device potentially has access to
IOVA space that we're not tracking, right?

I'd think the only viable fallback if the vIOMMU doesn't report its max
IOVA is the full 64-bit address space, otherwise it seems like we need
to add a migration blocker (see the sketch after the quoted patch
below).

BTW, virtio-iommu is actively working to support vfio devices; we
should include support for it as well as VT-d.  Thanks,

Alex

> +     */
> +    ret = vfio_iommu_get_max_iova(container, &iommu_max_iova);
> +    if (!ret) {
> +        /* Check 2^64 wrap around */
> +        if (!REAL_HOST_PAGE_ALIGN(iommu_max_iova)) {
> +            iommu_max_iova -= qemu_real_host_page_size();
> +        }
> +    }
> +
> +    retry_iova = MIN(iommu_max_iova / 2, VFIO_IOMMU_DEFAULT_MAX_IOVA);
> +
> +    dt_ranges->ranges[0] = REAL_HOST_PAGE_ALIGN(iommu_max_iova);
> +    dt_ranges->ranges[1] = REAL_HOST_PAGE_ALIGN(retry_iova);
> +    dt_ranges->ranges[2] = REAL_HOST_PAGE_ALIGN(MIN(ram_size, retry_iova / 2));
> +
> +    return dt_ranges;
> +}
> +
> +static void vfio_iommu_device_dirty_tracking_ranges_destroy(
> +    VFIOGIOMMUDeviceDTRanges *dt_ranges)
> +{
> +    g_free(dt_ranges->ranges);
> +    g_free(dt_ranges);
> +}
> +
> +static int vfio_devices_start_dirty_page_tracking(VFIOContainer *container)
> +{
> +    VFIOGIOMMUDeviceDTRanges *dt_ranges;
> +    int ret;
> +    int i;
> +
> +    if (!vfio_have_giommu(container)) {
> +        return vfio_devices_dma_logging_start(container, false);
> +    }
> +
> +    dt_ranges = vfio_iommu_device_dirty_tracking_ranges_create(container);
> +    if (!dt_ranges) {
> +        return -errno;
> +    }
> +
> +    for (i = 0; i < dt_ranges->ranges_num; i++) {
> +        container->giommu_tracked_range = dt_ranges->ranges[i];
> +        ret = vfio_devices_dma_logging_start(container, true);
> +        if (!ret) {
> +            break;
> +        }
> +
> +        if (i < dt_ranges->ranges_num - 1) {
> +            warn_report("Failed to start device dirty tracking with vIOMMU "
> +                        "with range of size 0x%" HWADDR_PRIx
> +                        ", err: %d. Retrying with range "
> +                        "of size 0x%" HWADDR_PRIx,
> +                        dt_ranges->ranges[i], ret, dt_ranges->ranges[i + 1]);
> +        } else {
> +            error_report("Failed to start device dirty tracking with vIOMMU "
> +                         "with range of size 0x%" HWADDR_PRIx ", err: %d",
> +                         dt_ranges->ranges[i], ret);
> +        }
> +    }
> +
> +    vfio_iommu_device_dirty_tracking_ranges_destroy(dt_ranges);
> +
> +    return ret;
> +}
> +
>  static void vfio_listener_log_global_start(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      int ret;
>  
>      if (vfio_devices_all_device_dirty_tracking(container)) {
> -        if (vfio_have_giommu(container)) {
> -            /* Device dirty page tracking currently doesn't support vIOMMU */
> -            return;
> -        }
> -
> -        ret = vfio_devices_dma_logging_start(container);
> +        ret = vfio_devices_start_dirty_page_tracking(container);
>      } else {
>          ret = vfio_set_dirty_page_tracking(container, true);
>      }
> @@ -1627,11 +1782,6 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
>      int ret;
>  
>      if (vfio_devices_all_device_dirty_tracking(container)) {
> -        if (vfio_have_giommu(container)) {
> -            /* Device dirty page tracking currently doesn't support vIOMMU */
> -            return;
> -        }
> -
>          ret = vfio_devices_dma_logging_stop(container);
>      } else {
>          ret = vfio_set_dirty_page_tracking(container, false);
> @@ -1670,6 +1820,17 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
>      return 0;
>  }
>  
> +static bool vfio_iommu_range_is_device_tracked(VFIOContainer *container,
> +                                               hwaddr iova, hwaddr size)
> +{
> +    /* Check for 2^64 wrap around */
> +    if (!(iova + size)) {
> +        return false;
> +    }
> +
> +    return iova + size <= container->giommu_tracked_range;
> +}
> +
>  static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
>                                             VFIOBitmap *vbmap, hwaddr iova,
>                                             hwaddr size)
> @@ -1679,10 +1840,11 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
>      int ret;
>  
>      if (vfio_have_giommu(container)) {
> -        /* Device dirty page tracking currently doesn't support vIOMMU */
> -        bitmap_set(vbmap->bitmap, 0, vbmap->pages);
> +        if (!vfio_iommu_range_is_device_tracked(container, iova, size)) {
> +            bitmap_set(vbmap->bitmap, 0, vbmap->pages);
>  
> -        return 0;
> +            return 0;
> +        }
>      }
>  
>      QLIST_FOREACH(group, &container->group_list, container_next) {
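
As a rough illustration (not from the posted series) of the fallback
suggested above: if the vIOMMU does not report IOMMU_ATTR_MAX_IOVA, cover
the whole 64-bit IOVA space or fail so the caller can add a migration
blocker, instead of guessing a smaller range.  The helper name below is
hypothetical and the 2^64 wrap-around handling from the patch is elided:

static void vfio_giommu_tracked_range_init(VFIOContainer *container)
{
    hwaddr max_iova;

    if (!vfio_iommu_get_max_iova(container, &max_iova)) {
        /* giommu_tracked_range represents [0, giommu_tracked_range). */
        container->giommu_tracked_range = REAL_HOST_PAGE_ALIGN(max_iova);
        return;
    }

    /*
     * No max IOVA reported: track the full 64-bit space, aligned down to
     * a host page boundary.  If the device cannot log a range this large,
     * the caller would register a migration blocker rather than silently
     * tracking a smaller, guessed range.
     */
    container->giommu_tracked_range =
        UINT64_MAX & ~((hwaddr)qemu_real_host_page_size() - 1);
}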




* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-22 22:40   ` Alex Williamson
@ 2023-02-23  2:02     ` Jason Gunthorpe
  2023-02-23 19:27       ` Alex Williamson
  2023-02-23 15:36     ` Avihai Horon
  1 sibling, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-23  2:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, Feb 22, 2023 at 03:40:43PM -0700, Alex Williamson wrote:
> > +    /*
> > +     * DMA logging uAPI guarantees to support at least num_ranges that fits into
> > +     * a single host kernel page. To be on the safe side, use this as a limit
> > +     * from which to merge to a single range.
> > +     */
> > +    max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
> > +    cur_ranges = iova_tree_nnodes(container->mappings);
> > +    control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;
> 
> This makes me suspicious that we're implementing to the characteristics
> of a specific device rather than strictly to the vfio migration API.
> Are we just trying to avoid the error handling to support the try and
> fall back to a single range behavior?

This was what we agreed to when making the kernel patches. Userspace
is restricted to send one page of range list to the kernel, and the
kernel will always adjust that to whatever smaller list the device needs.

We added this limit only because we don't want to have a way for
userspace to consume a lot of kernel memory.

See LOG_MAX_RANGES in vfio_main.c

If qemu is in viommu mode and it has a huge number of ranges, then it
must cut the list down before passing it to the kernel.
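
A generic, hedged sketch of that "cut it down" step (hypothetical helper
names; the guaranteed minimum comes from the one-host-page rule above):
compute how many ranges fit in one page and collapse any overflow into the
last kept entry.  The merged tail over-approximates the tracked area, which
is safe for dirty tracking:

#include <stddef.h>
#include <stdint.h>

struct dma_logging_range {
    uint64_t iova;
    uint64_t length;
};

/* How many ranges are guaranteed to be accepted: one host page worth. */
static size_t dma_logging_max_ranges(size_t host_page_size)
{
    return host_page_size / sizeof(struct dma_logging_range);
}

/*
 * Clamp a sorted, non-overlapping range list to at most max_ranges entries
 * by extending the last kept entry to the end of the final range.
 */
static size_t dma_logging_clamp_ranges(struct dma_logging_range *r, size_t n,
                                       size_t max_ranges)
{
    if (max_ranges == 0 || n <= max_ranges) {
        return n;
    }

    r[max_ranges - 1].length =
        r[n - 1].iova + r[n - 1].length - r[max_ranges - 1].iova;

    return max_ranges;
}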

Jason



* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-22 23:34   ` Alex Williamson
@ 2023-02-23  2:08     ` Jason Gunthorpe
  2023-02-23 20:06       ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-23  2:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, Feb 22, 2023 at 04:34:39PM -0700, Alex Williamson wrote:
> > +    /*
> > +     * With vIOMMU we try to track the entire IOVA space. As the IOVA space can
> > +     * be rather big, devices might not be able to track it due to HW
> > +     * limitations. In that case:
> > +     * (1) Retry tracking a smaller part of the IOVA space.
> > +     * (2) Retry tracking a range in the size of the physical memory.
> 
> This looks really sketchy, why do we think there's a "good enough"
> value here?  If we get it wrong, the device potentially has access to
> IOVA space that we're not tracking, right?

The idea was the untracked range becomes permanently dirty, so at
worst this means the migration never converges.
 
#2 is the presumption that the guest is using an identity map.

> I'd think the only viable fallback if the vIOMMU doesn't report its max
> IOVA is the full 64-bit address space, otherwise it seems like we need
> to add a migration blocker.

This is basically saying vIOMMU doesn't work with migration, and we've
heard that this isn't OK. There are cases where vIOMMU is on but the
guest always uses identity maps, e.g. for virtual interrupt remapping.

We also have a future problem in that nested translation is
incompatible with device dirty tracking.

Jason



* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-22 20:55 ` Alex Williamson
@ 2023-02-23 10:05   ` Cédric Le Goater
  2023-02-23 15:07     ` Avihai Horon
  2023-02-23 14:56   ` Avihai Horon
  1 sibling, 1 reply; 93+ messages in thread
From: Cédric Le Goater @ 2023-02-23 10:05 UTC (permalink / raw)
  To: Alex Williamson, Avihai Horon
  Cc: qemu-devel, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 2/22/23 21:55, Alex Williamson wrote:
> 
> There are various errors running this through the CI on gitlab.
> 
> This one seems bogus but needs to be resolved regardless:
> 
> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0 -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote /builds/alex.williamson/qemu/include -iquote /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 2789 1772 |     if (ret) {
> 2790      |        ^


The routine to fix is vfio_devices_start_dirty_page_tracking(). The compiler
is doing some inlining.

Thanks,
C.


* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-22 22:10   ` Alex Williamson
@ 2023-02-23 10:37     ` Joao Martins
  2023-02-23 21:05       ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-02-23 10:37 UTC (permalink / raw)
  To: Alex Williamson, Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 22/02/2023 22:10, Alex Williamson wrote:
> On Wed, 22 Feb 2023 19:49:05 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>> From: Joao Martins <joao.m.martins@oracle.com>
>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          .iova = iova,
>>          .size = size,
>>      };
>> +    int ret;
>> +
>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>> +    if (ret) {
>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>> +                     iova, size, ret, strerror(-ret));
>> +
>> +        return ret;
>> +    }
> 
> Is there no way to replay the mappings when a migration is started?
> This seems like a horrible latency and bloat trade-off for the
> possibility that the VM might migrate and the device might support
> these features.  Our performance with vIOMMU is already terrible, I
> can't help but believe this makes it worse.  Thanks,
> 

It is a nop if the vIOMMU is being used (entries in container->giommu_list),
as that case uses a max-iova based IOVA range. So this is really for IOMMU
identity mapping and the no-vIOMMU case. We could replay the mappings if they
were tracked/stored anywhere.

I suppose we could move vfio_devices_all_device_dirty_tracking() into this
patch and then call vfio_{record,erase}_mapping() conditionally, skipping it
when we are passing through a device that doesn't have live-migration
support? Would that address the impact you're concerned about for
non-live-migratable devices?

On the other hand, the hypothetical PCI device hotplug case makes this a bit
more complicated, as we can still attempt to hotplug a device before
migration is even attempted. Meaning that we could start with
live-migratable devices and add the tracking, up to hotplugging a device
without such support (which adds a blocker), leaving the recorded mappings
there with no further use. So it felt simpler to always track, and to skip
recording mappings only if the vIOMMU is in active use?
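
A minimal sketch of that conditional recording (hypothetical helper names,
reusing the predicates already introduced in this series):

/*
 * Only pay the IOVATree cost when the recorded mappings can actually be
 * consumed by device dirty tracking: every device supports it and no
 * vIOMMU (which uses a max-IOVA based range instead) is in use.
 */
static bool vfio_should_record_mappings(VFIOContainer *container)
{
    return !vfio_have_giommu(container) &&
           vfio_devices_all_device_dirty_tracking(container);
}

static int vfio_record_mapping_if_needed(VFIOContainer *container, hwaddr iova,
                                         ram_addr_t size, bool readonly)
{
    if (!vfio_should_record_mappings(container)) {
        return 0;
    }

    return vfio_record_mapping(container, iova, size, readonly);
}

The catch, as noted above, is that a container which starts out with only
live-migratable devices could later hotplug one without dirty tracking
support, leaving the decision stale, which is why the posted patch records
unconditionally.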



* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-22 20:55 ` Alex Williamson
  2023-02-23 10:05   ` Cédric Le Goater
@ 2023-02-23 14:56   ` Avihai Horon
  2023-02-24 19:26     ` Joao Martins
  1 sibling, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-23 14:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 22/02/2023 22:55, Alex Williamson wrote:
>
>
> There are various errors running this through the CI on gitlab.
>
> This one seems bogus but needs to be resolved regardless:
>
> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0 -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote /builds/alex.williamson/qemu/include -iquote /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> 2789 1772 |     if (ret) {
> 2790      |        ^
>
> 32-bit builds have some actual errors though:
>
> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940719
> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
> 2601cc -m32 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/sysprof-4 -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote /builds/alex.williamson/qemu/include -iquote /builds/alex.williamson/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
> 2602../hw/vfio/common.c: In function 'vfio_device_feature_dma_logging_start_create':
> 2603../hw/vfio/common.c:1572:27: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
> 2604 1572 |         control->ranges = (uint64_t)ranges;
> 2605      |                           ^
> 2606../hw/vfio/common.c:1596:23: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
> 2607 1596 |     control->ranges = (uint64_t)ranges;
> 2608      |                       ^
> 2609../hw/vfio/common.c: In function 'vfio_device_feature_dma_logging_start_destroy':
> 2610../hw/vfio/common.c:1620:9: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
> 2611 1620 |         (struct vfio_device_feature_dma_logging_range *)control->ranges;
> 2612      |         ^
> 2613../hw/vfio/common.c: In function 'vfio_device_dma_logging_report':
> 2614../hw/vfio/common.c:1810:22: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
> 2615 1810 |     report->bitmap = (uint64_t)bitmap;
> 2616      |                      ^

Sure, I will fix these errors.

Thanks.



* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-23 10:05   ` Cédric Le Goater
@ 2023-02-23 15:07     ` Avihai Horon
  2023-02-27 10:24       ` Cédric Le Goater
  0 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-23 15:07 UTC (permalink / raw)
  To: Cédric Le Goater, Alex Williamson
  Cc: qemu-devel, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/02/2023 12:05, Cédric Le Goater wrote:
>
>
> On 2/22/23 21:55, Alex Williamson wrote:
>>
>> There are various errors running this through the CI on gitlab.
>>
>> This one seems bogus but needs to be resolved regardless:
>>
>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. 
>> -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader 
>> -I/usr/include/pixman-1 -I/usr/include/capstone 
>> -I/usr/include/glib-2.0 -I/usr/lib/s390x-linux-gnu/glib-2.0/include 
>> -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 
>> -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem 
>> linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote 
>> /builds/alex.williamson/qemu/include -iquote 
>> /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE 
>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
>> -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef 
>> -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes 
>> -Wredundant-decls -Wold-style-declaration -Wold-style-definition 
>> -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self 
>> -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels 
>> -Wexpansion-to-defined -Wimplicit-fallthrough=2 
>> -Wmissing-format-attribute -Wno-missing-include-dirs 
>> -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE 
>> -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H 
>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' 
>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ 
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF 
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o 
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
>> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used 
>> uninitialized in this function [-Werror=maybe-uninitialized]
>> 2789 1772 |     if (ret) {
>> 2790      |        ^
>
>
> The routine to fix is vfio_devices_start_dirty_page_tracking(). The 
> compiler
> is doing some inlining.
>
I don't think I understand how inlining could cause it.
Could you elaborate on this?

I thought that the compiler just missed the initialization of ret because it
happens in the if/else statement, and that simply doing "int ret = 0;" would
solve it.

Thanks.




* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-22 20:58   ` Alex Williamson
@ 2023-02-23 15:25     ` Avihai Horon
  2023-02-23 21:16       ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-23 15:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 22/02/2023 22:58, Alex Williamson wrote:
>
>
> On Wed, 22 Feb 2023 19:48:58 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Pre-copy support allows the VFIO device data to be transferred while the
>> VM is running. This helps to accommodate VFIO devices that have a large
>> amount of data that needs to be transferred, and it can reduce migration
>> downtime.
>>
>> Pre-copy support is optional in VFIO migration protocol v2.
>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>> that support it. Full description of it can be found here [1].
>>
>> [1]
>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst |  35 +++++--
>>   include/hw/vfio/vfio-common.h |   3 +
>>   hw/vfio/common.c              |   6 +-
>>   hw/vfio/migration.c           | 175 ++++++++++++++++++++++++++++++++--
>>   hw/vfio/trace-events          |   4 +-
>>   5 files changed, 201 insertions(+), 22 deletions(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> index c214c73e28..ba80b9150d 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>>   destination host. This document details how saving and restoring of VFIO
>>   devices is done in QEMU.
>>
>> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
>> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
>> -data is transferred to the destination.
>> -
>> -The pre-copy phase of migration is currently not supported for VFIO devices.
>> -Support for VFIO pre-copy will be added later on.
>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
>> +accommodate VFIO devices that have a large amount of data that needs to be
>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>> +continue whilst the VFIO device state is transferred to the destination, this
>> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
>> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
>> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
> Or alternatively for the last sentence,
>
>    VFIO devices opt-in to pre-copy support by reporting the
>    VFIO_MIGRATION_PRE_COPY flag in the VFIO_DEVICE_FEATURE_MIGRATION
>    ioctl.

Sounds good, I will change it.

>
>>   Note that currently VFIO migration is supported only for a single device. This
>>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>>   * A ``load_setup`` function that sets the VFIO device on the destination in
>>     _RESUMING state.
>>
>> +* A ``state_pending_estimate`` function that reports an estimate of the
>> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
>> +  device.
>> +
>>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>     driver, which indicates the amount of data that the vendor driver has yet to
>>     save for the VFIO device.
>>
>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>> +  active only when the VFIO device is in pre-copy states.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver during iterative pre-copy phase.
>> +
>>   * A ``save_state`` function to save the device config space if it is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>> @@ -95,8 +107,10 @@ Flow of state changes during Live migration
>>   ===========================================
>>
>>   Below is the flow of state change during live migration.
>> -The values in the brackets represent the VM state, the migration state, and
>> +The values in the parentheses represent the VM state, the migration state, and
>>   the VFIO device state, respectively.
>> +The text in the square brackets represents the flow if the VFIO device supports
>> +pre-copy.
>>
>>   Live migration save path
>>   ------------------------
>> @@ -108,11 +122,12 @@ Live migration save path
>>                                     |
>>                        migrate_init spawns migration_thread
>>                   Migration thread then calls each device's .save_setup()
>> -                       (RUNNING, _SETUP, _RUNNING)
>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>                                     |
>> -                      (RUNNING, _ACTIVE, _RUNNING)
>> -             If device is active, get pending_bytes by .state_pending_exact()
>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  [Data of VFIO device for pre-copy phase is copied]
>>           Iterate till total pending bytes converge and are less than threshold
>>                                     |
>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 87524c64a4..ee55d442b4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>       int data_fd;
>>       void *data_buffer;
>>       size_t data_buffer_size;
>> +    uint64_t precopy_init_size;
>> +    uint64_t precopy_dirty_size;
> size_t?
>
>> +    uint64_t mig_flags;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index bab83c0e55..6f5afe9f5a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -409,7 +409,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>               }
>>
>>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>>                   return false;
>>               }
>>           }
>> @@ -438,7 +439,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>                   continue;
>>               } else {
>>                   return false;
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 94a4df73d0..307983d57d 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -67,6 +67,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>           return "STOP_COPY";
>>       case VFIO_DEVICE_STATE_RESUMING:
>>           return "RESUMING";
>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>> +        return "PRE_COPY";
>>       default:
>>           return "UNKNOWN STATE";
>>       }
>> @@ -240,6 +242,23 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>       return 0;
>>   }
>>
>> +static int vfio_query_precopy_size(VFIOMigration *migration,
>> +                                   uint64_t *init_size, uint64_t *dirty_size)
> size_t?  Seems like a concern throughout.

Yes, I will change it in all places.

>> +{
>> +    struct vfio_precopy_info precopy = {
>> +        .argsz = sizeof(precopy),
>> +    };
>> +
>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
>> +        return -errno;
>> +    }
>> +
>> +    *init_size = precopy.initial_bytes;
>> +    *dirty_size = precopy.dirty_bytes;
>> +
>> +    return 0;
>> +}
>> +
>>   /* Returns the size of saved data on success and -errno on error */
>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>   {
>> @@ -248,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       data_size = read(migration->data_fd, migration->data_buffer,
>>                        migration->data_buffer_size);
>>       if (data_size < 0) {
>> +        /* Pre-copy emptied all the device state for now */
>> +        if (errno == ENOMSG) {
>> +            return 0;
>> +        }
>> +
>>           return -errno;
>>       }
>>       if (data_size == 0) {
>> @@ -264,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       return qemu_file_get_error(f) ?: data_size;
>>   }
>>
>> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>> +                                               uint64_t data_size)
>> +{
>> +    if (!data_size) {
>> +        /*
>> +         * Pre-copy emptied all the device state for now, update estimated sizes
>> +         * accordingly.
>> +         */
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +
>> +        return;
>> +    }
>> +
>> +    if (migration->precopy_init_size) {
>> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>> +
>> +        migration->precopy_init_size -= init_size;
>> +        data_size -= init_size;
>> +    }
>> +
>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>> +                                         data_size);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -284,6 +333,35 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>           return -ENOMEM;
>>       }
>>
>> +    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
>> +        uint64_t init_size = 0, dirty_size = 0;
>> +        int ret;
>> +
>> +        switch (migration->device_state) {
>> +        case VFIO_DEVICE_STATE_RUNNING:
>> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
>> +                                           VFIO_DEVICE_STATE_RUNNING);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +
>> +            vfio_query_precopy_size(migration, &init_size, &dirty_size);
>> +            migration->precopy_init_size = init_size;
>> +            migration->precopy_dirty_size = dirty_size;
> Seems like we could do away with {init,dirty}_size, initialize
> migration->precopy_{init,dirty}_size before the switch, pass them
> directly to vfio_query_precopy_size() and remove all but the break from
> the case below.  But then that also suggests we could redefine
> vfio_query_precopy_size() to
>
> static int vfio_update_precopy_info(VFIOMigration *migration)
>
> which sets the fields directly since this is the only way it's used.

You are right, I will change it.
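
For illustration, that helper could end up looking roughly like this (a
sketch based on the vfio_query_precopy_size() quoted above, not the final
code):

static int vfio_update_precopy_info(VFIOMigration *migration)
{
    struct vfio_precopy_info precopy = {
        .argsz = sizeof(precopy),
    };

    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
        return -errno;
    }

    /* Store the estimates directly; both callers want exactly this. */
    migration->precopy_init_size = precopy.initial_bytes;
    migration->precopy_dirty_size = precopy.dirty_bytes;

    return 0;
}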

>> +
>> +            break;
>> +        case VFIO_DEVICE_STATE_STOP:
>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>> +
>> +            migration->precopy_init_size = 0;
>> +            migration->precopy_dirty_size = 0;
>> +
>> +            break;
>> +        default:
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>>
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>
>> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
>> +                                        uint64_t *must_precopy,
>> +                                        uint64_t *can_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Initial size should be transferred during pre-copy phase so stop-copy
>> +     * phase will not be slowed down. Report threshold_size to force another
>> +     * pre-copy iteration.
>> +     */
>> +    *must_precopy += migration->precopy_init_size ?
>> +                         threshold_size :
>> +                         migration->precopy_dirty_size;
> This sure feels like we're feeding false data back to the iterator to
> spoof it to run another iteration, when the vfio migration protocol
> only recommends that initial_bytes reaches zero before proceeding to
> stop-copy, it's not a requirement.  What benefit is actually observed
> from this?  Why is this required for initial pre-copy support?  It
> seems devious.

As previously discussed in the thread that added the pre-copy uAPI [1],
initial_bytes can be used by drivers to reduce downtime.
For example, mlx5 transfers some metadata to the target so that it can
pre-allocate resources there, etc.

[1] 
https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/

Thanks!

>> +
>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>> +                                      *can_postcopy,
>> +                                      migration->precopy_init_size,
>> +                                      migration->precopy_dirty_size);
>> +}
>> +
>>   /*
>>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>>    * many GBs. This value should be big enough to cover the worst case.
>>    */
>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>
>> -/*
>> - * Only exact function is implemented and not estimate function. The reason is
>> - * that during pre-copy phase of migration the estimate function is called
>> - * repeatedly while pending RAM size is over the threshold, thus migration
>> - * can't converge and querying the VFIO device pending data size is useless.
>> - */
>>   static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>>                                        uint64_t *must_precopy,
>>                                        uint64_t *can_postcopy)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>
>>       /*
>> @@ -328,8 +427,57 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>       *must_precopy += stop_copy_size;
>>
>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>> +        uint64_t init_size = 0, dirty_size = 0;
>> +
>> +        vfio_query_precopy_size(migration, &init_size, &dirty_size);
>> +        migration->precopy_init_size = init_size;
>> +        migration->precopy_dirty_size = dirty_size;
> This is the only other caller of vfio_query_precopy_size(), following
> the same pattern that could be simplified if the function filled the
> migration fields itself.
>
>> +
>> +        /*
>> +         * Initial size should be transferred during pre-copy phase so
>> +         * stop-copy phase will not be slowed down. Report threshold_size
>> +         * to force another pre-copy iteration.
>> +         */
>> +        *must_precopy += migration->precopy_init_size ?
>> +                             threshold_size :
>> +                             migration->precopy_dirty_size;
>> +    }
> Just as sketchy as above.  Thanks,
>
> Alex
>
>> +
>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>> -                                   stop_copy_size);
>> +                                   stop_copy_size, migration->precopy_init_size,
>> +                                   migration->precopy_dirty_size);
>> +}
>> +
>> +static bool vfio_is_active_iterate(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    ssize_t data_size;
>> +
>> +    data_size = vfio_save_block(f, migration);
>> +    if (data_size < 0) {
>> +        return data_size;
>> +    }
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    vfio_update_estimated_pending_data(migration, data_size);
>> +
>> +    trace_vfio_save_iterate(vbasedev->name);
>> +
>> +    /*
>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
>> +     * Return 1 so following handlers will not be potentially blocked.
>> +     */
>> +    return 1;
>>   }
>>
>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> @@ -338,7 +486,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       ssize_t data_size;
>>       int ret;
>>
>> -    /* We reach here with device state STOP only */
>> +    /* We reach here with device state STOP or STOP_COPY only */
>>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>                                      VFIO_DEVICE_STATE_STOP);
>>       if (ret) {
>> @@ -457,7 +605,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>       .state_pending_exact = vfio_state_pending_exact,
>> +    .is_active_iterate = vfio_is_active_iterate,
>> +    .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>       .save_state = vfio_save_state,
>>       .load_setup = vfio_load_setup,
>> @@ -470,13 +621,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       enum vfio_device_mig_state new_state;
>>       int ret;
>>
>>       if (running) {
>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>       } else {
>> -        new_state = VFIO_DEVICE_STATE_STOP;
>> +        new_state =
>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>> +                VFIO_DEVICE_STATE_STOP_COPY :
>> +                VFIO_DEVICE_STATE_STOP;
>>       }
>>
>>       /*
>> @@ -590,6 +746,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       migration->vbasedev = vbasedev;
>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>       migration->data_fd = -1;
>> +    migration->mig_flags = mig_flags;
>>
>>       oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
>>       if (oid) {
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 669d9fe07c..51613e02e6 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -161,6 +161,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>   vfio_save_cleanup(const char *name) " (%s)"
>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>   vfio_save_device_config_state(const char *name) " (%s)"
>> +vfio_save_iterate(const char *name) " (%s)"
>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-22 21:40   ` Alex Williamson
@ 2023-02-23 15:27     ` Avihai Horon
  0 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-23 15:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 22/02/2023 23:40, Alex Williamson wrote:
> On Wed, 22 Feb 2023 19:49:02 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> There are already two places where dirty page bitmap allocation and
>> calculations are done in open code. With device dirty page tracking
>> being added in next patches, there are going to be even more places.
>>
>> To avoid code duplication, introduce VFIOBitmap struct and corresponding
>> alloc and dealloc functions and use them where applicable.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 60 insertions(+), 29 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ac93b85632..84f08bdbbb 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>>    * Device state interfaces
>>    */
>>
>> +typedef struct {
>> +    unsigned long *bitmap;
>> +    hwaddr size;
>> +    hwaddr pages;
>> +} VFIOBitmap;
>> +
>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>> +{
>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
>> +    if (!vbmap) {
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
>> +    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
>> +                                         BITS_PER_BYTE;
>> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
>> +    if (!vbmap->bitmap) {
>> +        g_free(vbmap);
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    return vbmap;
>> +}
>> +
>> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
>> +{
>> +    g_free(vbmap->bitmap);
>> +    g_free(vbmap);
>> +}
> Nit, '_alloc' and '_free' seems like a more standard convention.

Sure, will change.

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-22 22:40   ` Alex Williamson
  2023-02-23  2:02     ` Jason Gunthorpe
@ 2023-02-23 15:36     ` Avihai Horon
  1 sibling, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-23 15:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/02/2023 0:40, Alex Williamson wrote:
> On Wed, 22 Feb 2023 19:49:06 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> Add device dirty page tracking start/stop functionality. This uses the
>> device DMA logging uAPI to start and stop dirty page tracking by device.
>>
>> Device dirty page tracking is used only if all devices within a
>> container support device dirty page tracking.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   include/hw/vfio/vfio-common.h |   2 +
>>   hw/vfio/common.c              | 211 +++++++++++++++++++++++++++++++++-
>>   2 files changed, 211 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 6f36876ce0..1f21e1fa43 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -149,6 +149,8 @@ typedef struct VFIODevice {
>>       VFIOMigration *migration;
>>       Error *migration_blocker;
>>       OnOffAuto pre_copy_dirty_page_tracking;
>> +    bool dirty_pages_supported;
>> +    bool dirty_tracking;
>>   } VFIODevice;
>>
>>   struct VFIODeviceOps {
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 6041da6c7e..740153e7d7 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -473,6 +473,22 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>       return true;
>>   }
>>
>> +static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>> +{
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            if (!vbasedev->dirty_pages_supported) {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>   /*
>>    * Check if all VFIO devices are running and migration is active, which is
>>    * essentially equivalent to the migration being in pre-copy phase.
>> @@ -1404,13 +1420,192 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>>       return ret;
>>   }
>>
>> +static int vfio_devices_dma_logging_set(VFIOContainer *container,
>> +                                        struct vfio_device_feature *feature)
>> +{
>> +    bool status = (feature->flags & VFIO_DEVICE_FEATURE_MASK) ==
>> +                  VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
>> +    VFIODevice *vbasedev;
>> +    VFIOGroup *group;
>> +    int ret = 0;
>> +
>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            if (vbasedev->dirty_tracking == status) {
>> +                continue;
>> +            }
>> +
>> +            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>> +            if (ret) {
>> +                ret = -errno;
>> +                error_report("%s: Failed to set DMA logging %s, err %d (%s)",
>> +                             vbasedev->name, status ? "start" : "stop", ret,
>> +                             strerror(errno));
>> +                goto out;
>> +            }
>> +            vbasedev->dirty_tracking = status;
>> +        }
>> +    }
>> +
>> +out:
>> +    return ret;
>> +}
>> +
>> +static int vfio_devices_dma_logging_stop(VFIOContainer *container)
>> +{
>> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
>> +                              sizeof(uint64_t))] = {};
>> +    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
>> +
>> +    feature->argsz = sizeof(buf);
>> +    feature->flags = VFIO_DEVICE_FEATURE_SET;
>> +    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
>> +
>> +    return vfio_devices_dma_logging_set(container, feature);
>> +}
>> +
>> +static gboolean vfio_device_dma_logging_range_add(DMAMap *map, gpointer data)
>> +{
>> +    struct vfio_device_feature_dma_logging_range **out = data;
>> +    struct vfio_device_feature_dma_logging_range *range = *out;
>> +
>> +    range->iova = map->iova;
>> +    /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
>> +    range->length = map->size + 1;
>> +
>> +    *out = ++range;
>> +
>> +    return false;
>> +}
>> +
>> +static gboolean vfio_iova_tree_get_first(DMAMap *map, gpointer data)
>> +{
>> +    DMAMap *first = data;
>> +
>> +    first->iova = map->iova;
>> +    first->size = map->size;
>> +
>> +    return true;
>> +}
>> +
>> +static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
>> +{
>> +    DMAMap *last = data;
>> +
>> +    last->iova = map->iova;
>> +    last->size = map->size;
>> +
>> +    return false;
>> +}
>> +
>> +static struct vfio_device_feature *
>> +vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
>> +{
>> +    struct vfio_device_feature *feature;
>> +    size_t feature_size;
>> +    struct vfio_device_feature_dma_logging_control *control;
>> +    struct vfio_device_feature_dma_logging_range *ranges;
>> +    unsigned int max_ranges;
>> +    unsigned int cur_ranges;
>> +
>> +    feature_size = sizeof(struct vfio_device_feature) +
>> +                   sizeof(struct vfio_device_feature_dma_logging_control);
>> +    feature = g_malloc0(feature_size);
>> +    feature->argsz = feature_size;
>> +    feature->flags = VFIO_DEVICE_FEATURE_SET;
>> +    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
>> +
>> +    control = (struct vfio_device_feature_dma_logging_control *)feature->data;
>> +    control->page_size = qemu_real_host_page_size();
>> +
>> +    QEMU_LOCK_GUARD(&container->mappings_mutex);
>> +
>> +    /*
>> +     * DMA logging uAPI guarantees to support at least num_ranges that fits into
>> +     * a single host kernel page. To be on the safe side, use this as a limit
>> +     * from which to merge to a single range.
>> +     */
>> +    max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
>> +    cur_ranges = iova_tree_nnodes(container->mappings);
>> +    control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;
> This makes me suspicious that we're implementing to the characteristics
> of a specific device rather than strictly to the vfio migration API.
> Are we just trying to avoid the error handling to support the try and
> fall back to a single range behavior?  If we want to make a
> simplification, then document it as such.  The "[t]o be on the safe
> side" phrasing above could later be interpreted as avoiding an issue
> and might discourage a more complete implementation.

Yes, it was mainly to make things simple.
I will replace the "To be on the safe side..." phrasing.

Thanks.

>> +    ranges = g_try_new0(struct vfio_device_feature_dma_logging_range,
>> +                        control->num_ranges);
>> +    if (!ranges) {
>> +        g_free(feature);
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    control->ranges = (uint64_t)ranges;
>> +    if (cur_ranges <= max_ranges) {
>> +        iova_tree_foreach(container->mappings,
>> +                          vfio_device_dma_logging_range_add, &ranges);
>> +    } else {
>> +        DMAMap first, last;
>> +
>> +        iova_tree_foreach(container->mappings, vfio_iova_tree_get_first,
>> +                          &first);
>> +        iova_tree_foreach(container->mappings, vfio_iova_tree_get_last, &last);
>> +        ranges->iova = first.iova;
>> +        /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
>> +        ranges->length = (last.iova - first.iova) + last.size + 1;
>> +    }
>> +
>> +    return feature;
>> +}
>> +
>> +static void vfio_device_feature_dma_logging_start_destroy(
>> +    struct vfio_device_feature *feature)
>> +{
>> +    struct vfio_device_feature_dma_logging_control *control =
>> +        (struct vfio_device_feature_dma_logging_control *)feature->data;
>> +    struct vfio_device_feature_dma_logging_range *ranges =
>> +        (struct vfio_device_feature_dma_logging_range *)control->ranges;
>> +
>> +    g_free(ranges);
>> +    g_free(feature);
>> +}
>> +
>> +static int vfio_devices_dma_logging_start(VFIOContainer *container)
>> +{
>> +    struct vfio_device_feature *feature;
>> +    int ret;
>> +
>> +    feature = vfio_device_feature_dma_logging_start_create(container);
>> +    if (!feature) {
>> +        return -errno;
>> +    }
>> +
>> +    ret = vfio_devices_dma_logging_set(container, feature);
>> +    if (ret) {
>> +        vfio_devices_dma_logging_stop(container);
>> +    }
>> +
>> +    vfio_device_feature_dma_logging_start_destroy(feature);
>> +
>> +    return ret;
>> +}
>> +
>>   static void vfio_listener_log_global_start(MemoryListener *listener)
>>   {
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>       int ret;
>>
>> -    ret = vfio_set_dirty_page_tracking(container, true);
>> +    if (vfio_devices_all_device_dirty_tracking(container)) {
>> +        if (vfio_have_giommu(container)) {
>> +            /* Device dirty page tracking currently doesn't support vIOMMU */
>> +            return;
>> +        }
>> +
>> +        ret = vfio_devices_dma_logging_start(container);
>> +    } else {
>> +        ret = vfio_set_dirty_page_tracking(container, true);
>> +    }
>> +
>>       if (ret) {
>> +        error_report("vfio: Could not start dirty page tracking, err: %d (%s)",
>> +                     ret, strerror(-ret));
>>           vfio_set_migration_error(ret);
>>       }
>>   }
>> @@ -1420,8 +1615,20 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>       int ret;
>>
>> -    ret = vfio_set_dirty_page_tracking(container, false);
>> +    if (vfio_devices_all_device_dirty_tracking(container)) {
>> +        if (vfio_have_giommu(container)) {
>> +            /* Device dirty page tracking currently doesn't support vIOMMU */
>> +            return;
>> +        }
>> +
>> +        ret = vfio_devices_dma_logging_stop(container);
>> +    } else {
>> +        ret = vfio_set_dirty_page_tracking(container, false);
>> +    }
>> +
>>       if (ret) {
>> +        error_report("vfio: Could not stop dirty page tracking, err: %d (%s)",
>> +                     ret, strerror(-ret));
>>           vfio_set_migration_error(ret);
>>       }
>>   }


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-23  2:02     ` Jason Gunthorpe
@ 2023-02-23 19:27       ` Alex Williamson
  2023-02-23 19:30         ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 19:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, 22 Feb 2023 22:02:24 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Feb 22, 2023 at 03:40:43PM -0700, Alex Williamson wrote:
> > > +    /*
> > > +     * DMA logging uAPI guarantees to support at least num_ranges that fits into
> > > +     * a single host kernel page. To be on the safe side, use this as a limit
> > > +     * from which to merge to a single range.
> > > +     */
> > > +    max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
> > > +    cur_ranges = iova_tree_nnodes(container->mappings);
> > > +    control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;  
> > 
> > This makes me suspicious that we're implementing to the characteristics
> > of a specific device rather than strictly to the vfio migration API.
> > Are we just trying to avoid the error handling to support the try and
> > fall back to a single range behavior?  
> 
> This was what we agreed to when making the kernel patches. Userspace
> is restricted to sending one page's worth of ranges to the kernel, and
> the kernel will always adjust that to whatever smaller list the device needs.
> 
> We added this limit only because we don't want to have a way for
> userspace to consume a lot of kernel memory.
> 
> See LOG_MAX_RANGES in vfio_main.c
> 
> If qemu is in viommu mode and has a huge number of ranges, then it must
> cut the list down before passing it to the kernel.

Ok, that's the kernel implementation, but the uAPI states:

 * The core kernel code guarantees to support by minimum num_ranges that fit
 * into a single kernel page. User space can try higher values but should give
 * up if the above can't be achieved as of some driver limitations.

So again, I think I'm just looking for a better comment that doesn't
add FUD to the reasoning behind switching to a single range, ie. a)
it's easier to deal with given the kernel guarantee and b) the current
kernel implementation imposes a hard limit at page size anyway.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-23 19:27       ` Alex Williamson
@ 2023-02-23 19:30         ` Jason Gunthorpe
  2023-02-23 20:16           ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-23 19:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, Feb 23, 2023 at 12:27:23PM -0700, Alex Williamson wrote:
> So again, I think I'm just looking for a better comment that doesn't
> add FUD to the reasoning behind switching to a single range, 

It isn't a single range, it is a single page of ranges, right?

The comment should say

"Keep the implementation simple and use at most a PAGE_SIZE of ranges
because the kernel is guaranteed to be able to parse that"

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-23  2:08     ` Jason Gunthorpe
@ 2023-02-23 20:06       ` Alex Williamson
  2023-02-23 20:55         ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 20:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, 22 Feb 2023 22:08:33 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Feb 22, 2023 at 04:34:39PM -0700, Alex Williamson wrote:
> > > +    /*
> > > +     * With vIOMMU we try to track the entire IOVA space. As the IOVA space can
> > > +     * be rather big, devices might not be able to track it due to HW
> > > +     * limitations. In that case:
> > > +     * (1) Retry tracking a smaller part of the IOVA space.
> > > +     * (2) Retry tracking a range in the size of the physical memory.  
> > 
> > This looks really sketchy, why do we think there's a "good enough"
> > value here?  If we get it wrong, the device potentially has access to
> > IOVA space that we're not tracking, right?  
> 
> The idea was the untracked range becomes permanently dirty, so at
> worst this means the migration never converges.

I didn't spot the mechanics where that's implemented, I'll look again.
 
> #2 is the presumption that the guest is using an identity map.

This is a dangerous assumption.

> > I'd think the only viable fallback if the vIOMMU doesn't report its max
> > IOVA is the full 64-bit address space, otherwise it seems like we need
> > to add a migration blocker.  
> 
> This is basically saying vIOMMU doesn't work with migration, and we've
> heard that this isn't OK. There are cases where vIOMMU is on but the
> guest always uses identity maps. eg for virtual interrupt remapping.

Yes, the vIOMMU can be automatically added to a VM when we exceed 255
vCPUs, but I don't see how we can therefore deduce anything about the
usage mode of the vIOMMU.  Users also make use of vfio with vIOMMU for
nested assignment, ie. userspace drivers running within the guest,
where making assumptions about the IOVA extents of the userspace driver
seems dangerous.

Let's backup though, if a device doesn't support the full address width
of the platform, it's the responsibility of the device driver to
implement a DMA mask such that the device is never asked to DMA outside
of its address space support.  Therefore how could a device ever dirty
pages outside of its own limitations?

Isn't it reasonable to require that a device support dirty tracking for
the entire extent of its DMA address width in order to support this
feature?

If we can make those assumptions, then the vfio driver should happily
accept a range exceeding the device's DMA address width capabilities,
knowing that the device cannot dirty anything beyond its addressable
range.

> We also have future problems that nested translation is incompatible
> with device dirty tracking..

:-\  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-23 19:30         ` Jason Gunthorpe
@ 2023-02-23 20:16           ` Alex Williamson
  2023-02-23 20:54             ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 20:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 23 Feb 2023 15:30:28 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Feb 23, 2023 at 12:27:23PM -0700, Alex Williamson wrote:
> > So again, I think I'm just looking for a better comment that doesn't
> > add FUD to the reasoning behind switching to a single range,   
> 
> It isn't a single range, it is a single page of ranges, right?

Exceeding a single page of ranges is the inflection point at which we
switch to a single range.
 
> The comment should say
> 
> "Keep the implementation simple and use at most a PAGE_SIZE of ranges
> because the kernel is guaranteed to be able to parse that"

Something along those lines, yeah.  And bonus points for noting that
the kernel implementation is currently hard coded at this limit, so
there's no point in trying larger arrays as implied in the uAPI.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-23 20:16           ` Alex Williamson
@ 2023-02-23 20:54             ` Jason Gunthorpe
  2023-02-26 16:54               ` Avihai Horon
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-23 20:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, Feb 23, 2023 at 01:16:40PM -0700, Alex Williamson wrote:
> On Thu, 23 Feb 2023 15:30:28 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Feb 23, 2023 at 12:27:23PM -0700, Alex Williamson wrote:
> > > So again, I think I'm just looking for a better comment that doesn't
> > > add FUD to the reasoning behind switching to a single range,   
> > 
> > It isn't a single range, it is a single page of ranges, right?
> 
> Exceeding a single page of ranges is the inflection point at which we
> switch to a single range.

Oh, that isn't what it should do - it should cut it back to fit in a
page..
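
Concretely, "cut it back" could be a simple coalescing pass over the (sorted,
non-overlapping) range list before handing it to the kernel, with max being
qemu_real_host_page_size() / sizeof(*r) -- a hypothetical sketch, not what the
posted patch does:

    /*
     * Collapse 'n' sorted ranges into at most 'max' ranges by merging
     * groups of neighbours into covering ranges. A merged range also
     * covers the gaps in between, so some extra IOVA space gets tracked,
     * but nothing is lost.
     */
    static unsigned int vfio_merge_ranges(
        struct vfio_device_feature_dma_logging_range *r,
        unsigned int n, unsigned int max)
    {
        unsigned int per_out = DIV_ROUND_UP(n, max);
        unsigned int in, out;

        for (in = 0, out = 0; in < n; out++) {
            unsigned int last = MIN(in + per_out, n) - 1;

            r[out].iova = r[in].iova;
            r[out].length = r[last].iova + r[last].length - r[in].iova;
            in = last + 1;
        }

        return out; /* <= max */
    }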

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-23 20:06       ` Alex Williamson
@ 2023-02-23 20:55         ` Jason Gunthorpe
  2023-02-23 21:30           ` Joao Martins
  2023-02-23 22:33           ` Alex Williamson
  0 siblings, 2 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-23 20:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> > #2 is the presumption that the guest is using an identity map.
> 
> This is a dangerous assumption.
> 
> > > I'd think the only viable fallback if the vIOMMU doesn't report its max
> > > IOVA is the full 64-bit address space, otherwise it seems like we need
> > > to add a migration blocker.  
> > 
> > This is basically saying vIOMMU doesn't work with migration, and we've
> > heard that this isn't OK. There are cases where vIOMMU is on but the
> > guest always uses identity maps. eg for virtual interrupt remapping.
> 
> Yes, the vIOMMU can be automatically added to a VM when we exceed 255
> vCPUs, but I don't see how we can therefore deduce anything about the
> usage mode of the vIOMMU.  

We just lose optimizations. Any mappings that are established outside
the dirty tracking range are permanently dirty. So at worst the guest
can block migration by establishing bad mappings. It is not exactly
production quality but it is still useful for a closed environment
with known guest configurations.
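
Concretely, the "permanently dirty" behavior would amount to something like
this in the dirty bitmap query path (a rough sketch; the helper name is
invented):

    /*
     * Hypothetical: if this IOVA range was never handed to the device
     * tracker (it lies outside the ranges we managed to start), report
     * every page in it as dirty rather than querying the device.
     */
    if (!vfio_iova_range_is_device_tracked(container, iova, size)) {
        bitmap_set(vbmap->bitmap, 0, vbmap->pages);
        return 0;
    }

Correctness is preserved; only convergence suffers for those ranges.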

> nested assignment, ie. userspace drivers running within the guest,
> where making assumptions about the IOVA extents of the userspace driver
> seems dangerous.
>
> Let's backup though, if a device doesn't support the full address width
> of the platform, it's the responsibility of the device driver to
> implement a DMA mask such that the device is never asked to DMA outside
> of its address space support.  Therefore how could a device ever dirty
> pages outside of its own limitations?

The device always supports the full address space. We can't enforce
any kind of limit on the VM

It just can't dirty track it all.

> Isn't it reasonable to require that a device support dirty tracking for
> > the entire extent of its DMA address width in order to support this
> feature?

No, 2**64 is too big a number to be reasonable.

Ideally we'd work it the other way and tell the vIOMMU that the vHW
only supports a limited number of address bits for the translation, eg
through the ACPI tables. Then the dirty tracking could safely cover
the larger of all system memory or the limited IOVA address space.

Or even better figure out how to get interrupt remapping without IOMMU
support :\

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-23 10:37     ` Joao Martins
@ 2023-02-23 21:05       ` Alex Williamson
  2023-02-23 21:19         ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 21:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, 23 Feb 2023 10:37:10 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 22/02/2023 22:10, Alex Williamson wrote:
> > On Wed, 22 Feb 2023 19:49:05 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:  
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>          .iova = iova,
> >>          .size = size,
> >>      };
> >> +    int ret;
> >> +
> >> +    ret = vfio_record_mapping(container, iova, size, readonly);
> >> +    if (ret) {
> >> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
> >> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
> >> +                     iova, size, ret, strerror(-ret));
> >> +
> >> +        return ret;
> >> +    }  
> > 
> > Is there no way to replay the mappings when a migration is started?
> > This seems like a horrible latency and bloat trade-off for the
> > possibility that the VM might migrate and the device might support
> > these features.  Our performance with vIOMMU is already terrible, I
> > can't help but believe this makes it worse.  Thanks,
> >   
> 
> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
> that uses a max-iova based IOVA range. So this is really for iommu identity
> mapping and no-VIOMMU.

Ok, yes, there are no mappings recorded for any containers that have a
non-empty giommu_list.

> We could replay them if they were tracked/stored anywhere.

Rather than piggybacking on vfio_memory_listener, why not simply
register a new MemoryListener when migration is started?  That will
replay all the existing ranges and allow tracking to happen separate
from mapping, and only when needed.
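
Something along these lines (a rough sketch with invented names --
tracking_listener and vfio_dirty_tracking_record() -- just to show the shape):

    static void vfio_dirty_tracking_add(MemoryListener *listener,
                                        MemoryRegionSection *section)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer,
                                                tracking_listener);
        hwaddr iova = section->offset_within_address_space;
        hwaddr end = iova + int128_get64(section->size);

        if (!memory_region_is_ram(section->mr)) {
            return;
        }

        /* stash [iova, end) so the DMA logging start ioctl can use it */
        vfio_dirty_tracking_record(container, iova, end);
    }

    /* in the log_global_start path, before starting device tracking */
    container->tracking_listener = (MemoryListener) {
        .name = "vfio-dirty-tracking",
        .region_add = vfio_dirty_tracking_add,
    };
    memory_listener_register(&container->tracking_listener,
                             container->space->as);
    /* all existing sections have been replayed into region_add by now */
    ret = vfio_devices_dma_logging_start(container);
    memory_listener_unregister(&container->tracking_listener);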

> I suppose we could move the vfio_devices_all_device_dirty_tracking() into this
> patch and then conditionally call this vfio_{record,erase}_mapping() in case we
> are passing through a device that doesn't have live-migration support? Would
> that address the impact you're concerned wrt to non-live-migrateable devices?
> 
> On the other hand, the PCI device hotplug hypothetical even makes this a bit
> complicated as we can still attempt to hotplug a device before migration is even
> attempted. Meaning that we start with live-migrateable devices, and we added the
> tracking, up to hotpluging a device without such support (adding a blocker)
> leaving the mappings there with no further use. So it felt simpler to just track
> always and avoid any mappings recording if the vIOMMU is in active use?

My preference would be that there's no runtime overhead for migration
support until a migration is initiated.  I currently don't see why we
can't achieve that by dynamically adding a new MemoryListener around
migration for that purpose.  Do you?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-23 15:25     ` Avihai Horon
@ 2023-02-23 21:16       ` Alex Williamson
  2023-02-26 16:43         ` Avihai Horon
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 21:16 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, 23 Feb 2023 17:25:12 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 22/02/2023 22:58, Alex Williamson wrote:
> > On Wed, 22 Feb 2023 19:48:58 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
> >>       trace_vfio_save_cleanup(vbasedev->name);
> >>   }
> >>
> >> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> >> +                                        uint64_t *must_precopy,
> >> +                                        uint64_t *can_postcopy)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +
> >> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Initial size should be transferred during pre-copy phase so stop-copy
> >> +     * phase will not be slowed down. Report threshold_size to force another
> >> +     * pre-copy iteration.
> >> +     */
> >> +    *must_precopy += migration->precopy_init_size ?
> >> +                         threshold_size :
> >> +                         migration->precopy_dirty_size;  
> > This sure feels like we're feeding false data back to the iterator to
> > spoof it to run another iteration, when the vfio migration protocol
> > only recommends that initial_bytes reaches zero before proceeding to
> > stop-copy, it's not a requirement.  What benefit is actually observed
> > from this?  Why is this required for initial pre-copy support?  It
> > seems devious.  
> 
> As previously discussed in the thread that added the pre-copy uAPI [1], 
> the init_bytes can be used by drivers to reduce the downtime.
> For example, mlx5 transfers some metadata to the target so it will be 
> able to pre-allocate resources etc.
> 
> [1] 
> https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/

Yes, but how does that become a requirement on QEMU that it must
iterate until the initial segment is complete?  Especially when we need
to trigger that behavior via such nefarious means.  AIUI, QEMU should
be allowed to move to stop-copy at any point.  We should make efforts
to ensure QEMU never decides on its own to move from pre-copy to
stop-copy without completing the init_bytes (which sounds suspiciously
like the purpose of @must_precopy), but if, for instance, a user forces a
transition to stop-copy, I don't see that we have any business imposing
a policy that delays it until the init_bytes is complete.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-23 21:05       ` Alex Williamson
@ 2023-02-23 21:19         ` Joao Martins
  2023-02-23 21:50           ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-02-23 21:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 23/02/2023 21:05, Alex Williamson wrote:
> On Thu, 23 Feb 2023 10:37:10 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 22/02/2023 22:10, Alex Williamson wrote:
>>> On Wed, 22 Feb 2023 19:49:05 +0200
>>> Avihai Horon <avihaih@nvidia.com> wrote:  
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>          .iova = iova,
>>>>          .size = size,
>>>>      };
>>>> +    int ret;
>>>> +
>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>>>> +    if (ret) {
>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>>>> +                     iova, size, ret, strerror(-ret));
>>>> +
>>>> +        return ret;
>>>> +    }  
>>>
>>> Is there no way to replay the mappings when a migration is started?
>>> This seems like a horrible latency and bloat trade-off for the
>>> possibility that the VM might migrate and the device might support
>>> these features.  Our performance with vIOMMU is already terrible, I
>>> can't help but believe this makes it worse.  Thanks,
>>>   
>>
>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
>> that uses a max-iova based IOVA range. So this is really for iommu identity
>> mapping and no-VIOMMU.
> 
> Ok, yes, there are no mappings recorded for any containers that have a
> non-empty giommu_list.
> 
>> We could replay them if they were tracked/stored anywhere.
> 
> Rather than piggybacking on vfio_memory_listener, why not simply
> register a new MemoryListener when migration is started?  That will
> replay all the existing ranges and allow tracking to happen separate
> from mapping, and only when needed.
> 

The problem with that is that *starting* dirty tracking needs to have all the
ranges up front; we aren't supposed to start each range separately. So in a
memory listener callback we can't tell when we are dealing with the last
range, can we?

>> I suppose we could move the vfio_devices_all_device_dirty_tracking() into this
>> patch and then conditionally call this vfio_{record,erase}_mapping() in case we
>> are passing through a device that doesn't have live-migration support? Would
>> that address the impact you're concerned wrt to non-live-migrateable devices?
>>
>> On the other hand, the PCI device hotplug hypothetical even makes this a bit
>> complicated as we can still attempt to hotplug a device before migration is even
>> attempted. Meaning that we start with live-migrateable devices, and we added the
>> tracking, up to hotpluging a device without such support (adding a blocker)
>> leaving the mappings there with no further use. So it felt simpler to just track
>> always and avoid any mappings recording if the vIOMMU is in active use?
> 
> My preference would be that there's no runtime overhead for migration
> support until a migration is initiated.  I currently don't see why we
> can't achieve that by dynamically adding a new MemoryListener around
> migration for that purpose.  Do you?  Thanks,

I definitely agree with the general sentiment of being more dynamic, but perhaps
I am not seeing how.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-23 20:55         ` Jason Gunthorpe
@ 2023-02-23 21:30           ` Joao Martins
  2023-02-23 22:33           ` Alex Williamson
  1 sibling, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-02-23 21:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 23/02/2023 20:55, Jason Gunthorpe wrote:
> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
>>> #2 is the presumption that the guest is using an identity map.
>> Isn't it reasonable to require that a device support dirty tracking for
>> the entire extent if its DMA address width in order to support this
>> feature?
> 
> No, 2**64 is too big a number to be reasonable.
> 
+1

> Ideally we'd work it the other way and tell the vIOMMU that the vHW
> only supports a limited number of address bits for the translation, eg
> through the ACPI tables. Then the dirty tracking could safely cover
> the larger of all system memory or the limited IOVA address space.
> 
> Or even better figure out how to get interrupt remapping without IOMMU
> support :\

FWIW That's generally my use of `iommu=pt` because all I want is interrupt
remapping, not the DMA remapping part. And this is going to be especially
relevant with these new boxes that easily surpass the >255 dedicated physical
CPUs mark with just two sockets.

The only other alternative I could see is to rely on an IOMMU attribute for DMA
translation. Today you can actually toggle that 'off' in VT-d (and I can imagine
the same thing working for AMD-vIOMMU). In Intel it just omits the 39-bit
address-width cap, which means it doesn't have virtual addressing. Similar to
what Avihai already does for MAX_IOVA, we would do the same for DMA_TRANSLATION,
and let each vIOMMU implementation support that.

But to be honest I am not sure how robust relying on that is, as it doesn't
really represent a hardware implementation. Without vIOMMU you have a (KVM) PV
op in new *guest* kernels that (ab)uses some unused bits in the IOAPIC for a
24-bit DestID. But this only works on new guests and hypervisors; old *guests*
running kernels older than 5.15 won't work.

... So iommu=pt really is the most convenient right now :/

	Joao


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-23 21:19         ` Joao Martins
@ 2023-02-23 21:50           ` Alex Williamson
  2023-02-23 21:54             ` Joao Martins
  2023-02-28 12:11             ` Joao Martins
  0 siblings, 2 replies; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 21:50 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, 23 Feb 2023 21:19:12 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 23/02/2023 21:05, Alex Williamson wrote:
> > On Thu, 23 Feb 2023 10:37:10 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 22/02/2023 22:10, Alex Williamson wrote:  
> >>> On Wed, 22 Feb 2023 19:49:05 +0200
> >>> Avihai Horon <avihaih@nvidia.com> wrote:    
> >>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>>>          .iova = iova,
> >>>>          .size = size,
> >>>>      };
> >>>> +    int ret;
> >>>> +
> >>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
> >>>> +    if (ret) {
> >>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
> >>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
> >>>> +                     iova, size, ret, strerror(-ret));
> >>>> +
> >>>> +        return ret;
> >>>> +    }    
> >>>
> >>> Is there no way to replay the mappings when a migration is started?
> >>> This seems like a horrible latency and bloat trade-off for the
> >>> possibility that the VM might migrate and the device might support
> >>> these features.  Our performance with vIOMMU is already terrible, I
> >>> can't help but believe this makes it worse.  Thanks,
> >>>     
> >>
> >> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
> >> that uses a max-iova based IOVA range. So this is really for iommu identity
> >> mapping and no-VIOMMU.  
> > 
> > Ok, yes, there are no mappings recorded for any containers that have a
> > non-empty giommu_list.
> >   
> >> We could replay them if they were tracked/stored anywhere.  
> > 
> > Rather than piggybacking on vfio_memory_listener, why not simply
> > register a new MemoryListener when migration is started?  That will
> > replay all the existing ranges and allow tracking to happen separate
> > from mapping, and only when needed.
> >   
> 
> The problem with that is that *starting* dirty tracking needs to have all the
> ranges up front; we aren't supposed to start each range separately. So in a
> memory listener callback we can't tell when we are dealing with the last
> range, can we?

As soon as memory_listener_register() returns, all your callbacks to
build the IOVATree have been called and you can act on the result the
same as if you were relying on the vfio mapping MemoryListener.  I'm
not seeing the problem.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-23 21:50           ` Alex Williamson
@ 2023-02-23 21:54             ` Joao Martins
  2023-02-28 12:11             ` Joao Martins
  1 sibling, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-02-23 21:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 23/02/2023 21:50, Alex Williamson wrote:
> On Thu, 23 Feb 2023 21:19:12 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 23/02/2023 21:05, Alex Williamson wrote:
>>> On Thu, 23 Feb 2023 10:37:10 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 22/02/2023 22:10, Alex Williamson wrote:  
>>>>> On Wed, 22 Feb 2023 19:49:05 +0200
>>>>> Avihai Horon <avihaih@nvidia.com> wrote:    
>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>>>          .iova = iova,
>>>>>>          .size = size,
>>>>>>      };
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>>>>>> +    if (ret) {
>>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>>>>>> +                     iova, size, ret, strerror(-ret));
>>>>>> +
>>>>>> +        return ret;
>>>>>> +    }    
>>>>>
>>>>> Is there no way to replay the mappings when a migration is started?
>>>>> This seems like a horrible latency and bloat trade-off for the
>>>>> possibility that the VM might migrate and the device might support
>>>>> these features.  Our performance with vIOMMU is already terrible, I
>>>>> can't help but believe this makes it worse.  Thanks,
>>>>>     
>>>>
>>>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
>>>> that uses a max-iova based IOVA range. So this is really for iommu identity
>>>> mapping and no-VIOMMU.  
>>>
>>> Ok, yes, there are no mappings recorded for any containers that have a
>>> non-empty giommu_list.
>>>   
>>>> We could replay them if they were tracked/stored anywhere.  
>>>
>>> Rather than piggybacking on vfio_memory_listener, why not simply
>>> register a new MemoryListener when migration is started?  That will
>>> replay all the existing ranges and allow tracking to happen separate
>>> from mapping, and only when needed.
>>>   
>>
>> The problem with that is that *starting* dirty tracking needs to have all the
>> ranges up front; we aren't supposed to start each range separately. So in a
>> memory listener callback we can't tell when we are dealing with the last
>> range, can we?
> 
> As soon as memory_listener_register() returns, all your callbacks to
> build the IOVATree have been called and you can act on the result the
> same as if you were relying on the vfio mapping MemoryListener.  I'm
> not seeing the problem.  Thanks,

I was just checking memory_global_dirty_log_start() (i.e. where dirty tracking
gets enabled) and yes, you're definitely right.

I thought this was asynchronous given that there are so many MRs, but I must be
confusing it with something else.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-23 20:55         ` Jason Gunthorpe
  2023-02-23 21:30           ` Joao Martins
@ 2023-02-23 22:33           ` Alex Williamson
  2023-02-23 23:26             ` Jason Gunthorpe
  1 sibling, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-23 22:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 23 Feb 2023 16:55:54 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> > > #2 is the presumption that the guest is using an identity map.  
> > 
> > This is a dangerous assumption.
> >   
> > > > I'd think the only viable fallback if the vIOMMU doesn't report its max
> > > > IOVA is the full 64-bit address space, otherwise it seems like we need
> > > > to add a migration blocker.    
> > > 
> > > This is basically saying vIOMMU doesn't work with migration, and we've
> > > heard that this isn't OK. There are cases where vIOMMU is on but the
> > > guest always uses identity maps. eg for virtual interrupt remapping.  
> > 
> > Yes, the vIOMMU can be automatically added to a VM when we exceed 255
> > vCPUs, but I don't see how we can therefore deduce anything about the
> > usage mode of the vIOMMU.    
> 
> > We just lose optimizations. Any mappings that are established outside
> the dirty tracking range are permanently dirty. So at worst the guest
> can block migration by establishing bad mappings. It is not exactly
> production quality but it is still useful for a closed environment
> with known guest configurations.

That doesn't seem to be what happens in this series, nor does it really
make sense to me that userspace would simply decide to truncate the
dirty tracking ranges array.

> > nested assignment, ie. userspace drivers running within the guest,
> > where making assumptions about the IOVA extents of the userspace driver
> > seems dangerous.
> >
> > Let's backup though, if a device doesn't support the full address width
> > of the platform, it's the responsibility of the device driver to
> > implement a DMA mask such that the device is never asked to DMA outside
> > of its address space support.  Therefore how could a device ever dirty
> > pages outside of its own limitations?  
> 
> The device always supports the full address space. We can't enforce
> any kind of limit on the VM
> 
> It just can't dirty track it all.
> 
> > Isn't it reasonable to require that a device support dirty tracking for
> > the entire extent of its DMA address width in order to support this
> > feature?  
> 
> No, 2**64 is too big a number to be reasonable.

So what are the actual restrictions we're dealing with here?  I think it
would help us collaborate on a solution if we didn't have these
device-specific restrictions sprinkled through the base implementation.

> Ideally we'd work it the other way and tell the vIOMMU that the vHW
> only supports a limited number of address bits for the translation, eg
> through the ACPI tables. Then the dirty tracking could safely cover
> the larger of all system memory or the limited IOVA address space.

Why can't we do that?  Hotplug is an obvious issue, but maybe it's not
vHW telling the vIOMMU a restriction, maybe it's a QEMU machine or
vIOMMU option and if it's not set to something the device can support,
migration is blocked.
 
> Or even better figure out how to get interrupt remapping without IOMMU
> support :\

-machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
-device intel-iommu,caching-mode=on,intremap=on

Thanks,
Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-23 22:33           ` Alex Williamson
@ 2023-02-23 23:26             ` Jason Gunthorpe
  2023-02-24 11:25               ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-23 23:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:
> On Thu, 23 Feb 2023 16:55:54 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> > > > #2 is the presumption that the guest is using an identity map.  
> > > 
> > > This is a dangerous assumption.
> > >   
> > > > > I'd think the only viable fallback if the vIOMMU doesn't report its max
> > > > > IOVA is the full 64-bit address space, otherwise it seems like we need
> > > > > to add a migration blocker.    
> > > > 
> > > > This is basically saying vIOMMU doesn't work with migration, and we've
> > > > heard that this isn't OK. There are cases where vIOMMU is on but the
> > > > guest always uses identity maps. eg for virtual interrupt remapping.  
> > > 
> > > Yes, the vIOMMU can be automatically added to a VM when we exceed 255
> > > vCPUs, but I don't see how we can therefore deduce anything about the
> > > usage mode of the vIOMMU.    
> > 
> > We just lose optimizations. Any mappings that are established outside
> > the dirty tracking range are permanently dirty. So at worst the guest
> > can block migration by establishing bad mappings. It is not exactly
> > production quality but it is still useful for a closed environment
> > with known guest configurations.
> 
> That doesn't seem to be what happens in this series, 

Seems like something is missed then

> nor does it really make sense to me that userspace would simply
> decide to truncate the dirty tracking ranges array.

Who else would do it?

> > No, 2**64 is too big a number to be reasonable.
> 
> So what are the actual restrictions we're dealing with here?  I think it
> would help us collaborate on a solution if we didn't have these
> device-specific restrictions sprinkled through the base implementation.

Hmm? It was always like this: the driver gets to decide whether it accepts
the proposed tracking ranges or not. Given how the implementation has
to work, there is no device that could do 2**64...

At least for mlx5 it is in the multi-TB range. Enough for physical
memory on any real server.

> > Ideally we'd work it the other way and tell the vIOMMU that the vHW
> > only supports a limited number of address bits for the translation, eg
> > through the ACPI tables. Then the dirty tracking could safely cover
> > the larger of all system memory or the limited IOVA address space.
> 
> Why can't we do that?  Hotplug is an obvious issue, but maybe it's not
> vHW telling the vIOMMU a restriction, maybe it's a QEMU machine or
> vIOMMU option and if it's not set to something the device can support,
> migration is blocked.

I don't know, maybe we should if we can.

> > Or even better figure out how to get interrupt remapping without IOMMU
> > support :\
> 
> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
> -device intel-iommu,caching-mode=on,intremap=on

Joao?

If this works, let's just block migration if the vIOMMU is turned on...

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-23 23:26             ` Jason Gunthorpe
@ 2023-02-24 11:25               ` Joao Martins
  2023-02-24 12:53                 ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-02-24 11:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 23/02/2023 23:26, Jason Gunthorpe wrote:
> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:
>> On Thu, 23 Feb 2023 16:55:54 -0400
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
>>> Or even better figure out how to get interrupt remapping without IOMMU
>>> support :\
>>
>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
>> -device intel-iommu,caching-mode=on,intremap=on
> 
> Joao?
> 
> If this works lets just block migration if the vIOMMU is turned on..

At first glance, this looked like my regular iommu incantation.

But reading the code, this ::bypass_iommu (new to me) apparently controls
whether the vIOMMU is bypassed for the PCI devices, all the way to avoiding
their enumeration in the IVRS/DMAR ACPI tables. And I see VFIO double-checks
whether a PCI device is within the IOMMU address space (or bypassed) prior to
DMA maps and such.

You can see from the other email that all of the other options in my head were
either a bit inconvenient or risky. I wasn't aware of this option, for what
it's worth -- much simpler, should work!

And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add a live
migration blocker if `bypass_iommu` is off for any PCI device.
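
For illustration, a minimal sketch of what such a blocker could look like,
assuming the check is done when the device is realized, that
pci_device_iommu_address_space() is a reasonable way to detect the bypass
case, and that the per-device migration_blocker field and the function name
are hypothetical -- this is not the actual patch:

    /* Sketch: block migration when the device sits behind a vIOMMU. */
    static int vfio_block_migration_if_viommu(VFIOPCIDevice *vdev, Error **errp)
    {
        PCIDevice *pdev = &vdev->pdev;

        /* A bypassed device uses the plain system memory address space. */
        if (pci_device_iommu_address_space(pdev) == &address_space_memory) {
            return 0;
        }

        error_setg(&vdev->migration_blocker,   /* hypothetical field */
                   "%s: migration is not supported with vIOMMU translation",
                   vdev->vbasedev.name);
        return migrate_add_blocker(vdev->migration_blocker, errp);
    }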

	Joao


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-24 11:25               ` Joao Martins
@ 2023-02-24 12:53                 ` Joao Martins
  2023-02-24 15:47                   ` Jason Gunthorpe
  2023-02-24 15:56                   ` Alex Williamson
  0 siblings, 2 replies; 93+ messages in thread
From: Joao Martins @ 2023-02-24 12:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 24/02/2023 11:25, Joao Martins wrote:
> On 23/02/2023 23:26, Jason Gunthorpe wrote:
>> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:
>>> On Thu, 23 Feb 2023 16:55:54 -0400
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
>>>> Or even better figure out how to get interrupt remapping without IOMMU
>>>> support :\
>>>
>>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
>>> -device intel-iommu,caching-mode=on,intremap=on
>>
>> Joao?
>>
>> If this works lets just block migration if the vIOMMU is turned on..
> 
> At a first glance, this looked like my regular iommu incantation.
> 
> But reading the code this ::bypass_iommu (new to me) apparently tells that
> vIOMMU is bypassed or not for the PCI devices all the way to avoiding
> enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks whether
> PCI device is within the IOMMU address space (or bypassed) prior to DMA maps and
> such.
> 
> You can see from the other email that all of the other options in my head were
> either bit inconvenient or risky. I wasn't aware of this option for what is
> worth -- much simpler, should work!
>

I say *should*, but on second thought interrupt remapping may still be
required for one of these IOMMU-bypassed devices. Say, to set affinities for
vCPUs above 255? I was trying this out with more than 255 vCPUs and a couple
of VFs, and at first glance these VFs fail to probe (these are CX6 VFs).

It is a working setup without the parameter, but adding
default_bus_bypass_iommu=on now fails to init the VFs:

[   32.412733] mlx5_core 0000:00:02.0: Rate limit: 127 rates are supported,
range: 0Mbps to 97656Mbps
[   32.416242] mlx5_core 0000:00:02.0: mlx5_load:1204:(pid 3361): Failed to
alloc IRQs
[   33.227852] mlx5_core 0000:00:02.0: probe_one:1684:(pid 3361): mlx5_init_one
failed with error code -19
[   33.242182] mlx5_core 0000:00:03.0: firmware version: 22.31.1660
[   33.415876] mlx5_core 0000:00:03.0: Rate limit: 127 rates are supported,
range: 0Mbps to 97656Mbps
[   33.448016] mlx5_core 0000:00:03.0: mlx5_load:1204:(pid 3361): Failed to
alloc IRQs
[   34.207532] mlx5_core 0000:00:03.0: probe_one:1684:(pid 3361): mlx5_init_one
failed with error code -19

I haven't dug into why it fails yet.

> And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add a live
> migration blocker if `bypass_iommu` is off for any PCI device.
> 

Still, for starters we could have a live migration blocker until we revisit
the vIOMMU case ... or should we deem default_bus_bypass_iommu=on and the
others I suggested to be non-options?


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-24 12:53                 ` Joao Martins
@ 2023-02-24 15:47                   ` Jason Gunthorpe
  2023-02-24 15:56                   ` Alex Williamson
  1 sibling, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-24 15:47 UTC (permalink / raw)
  To: Joao Martins
  Cc: Alex Williamson, Avihai Horon, qemu-devel, Cédric Le Goater,
	Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Peter Xu, Jason Wang, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On Fri, Feb 24, 2023 at 12:53:26PM +0000, Joao Martins wrote:
> > But reading the code this ::bypass_iommu (new to me) apparently tells that
> > vIOMMU is bypassed or not for the PCI devices all the way to avoiding
> > enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks whether
> > PCI device is within the IOMMU address space (or bypassed) prior to DMA maps and
> > such.
> > 
> > You can see from the other email that all of the other options in my head were
> > either bit inconvenient or risky. I wasn't aware of this option for what is
> > worth -- much simpler, should work!
> >
> 
> I say *should*, but on a second thought interrupt remapping may still be
> required to one of these devices that are IOMMU-bypassed. Say to put affinities
> to vcpus above 255? I was trying this out with more than 255 vcpus with a couple
> VFs and at a first glance these VFs fail to probe (these are CX6
> VFs).

It is pretty bizarre, but the Intel iommu driver is responsible for
installing the interrupt remapping irq driver on the devices.

So if there is no iommu driver bound, then there won't be any interrupt
remapping capability for the device, even if the interrupt remapping HW is
otherwise set up.

The only reason Avihai is touching this is to try and keep the interrupt
remapping emulation usable; we could certainly punt on that for now if it
looks too ugly.

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-24 12:53                 ` Joao Martins
  2023-02-24 15:47                   ` Jason Gunthorpe
@ 2023-02-24 15:56                   ` Alex Williamson
  2023-02-24 19:16                     ` Joao Martins
  1 sibling, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-24 15:56 UTC (permalink / raw)
  To: Joao Martins
  Cc: Jason Gunthorpe, Avihai Horon, qemu-devel, Cédric Le Goater,
	Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Peter Xu, Jason Wang, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On Fri, 24 Feb 2023 12:53:26 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 24/02/2023 11:25, Joao Martins wrote:
> > On 23/02/2023 23:26, Jason Gunthorpe wrote:  
> >> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:  
> >>> On Thu, 23 Feb 2023 16:55:54 -0400
> >>> Jason Gunthorpe <jgg@nvidia.com> wrote:  
> >>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
> >>>> Or even better figure out how to get interrupt remapping without IOMMU
> >>>> support :\  
> >>>
> >>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
> >>> -device intel-iommu,caching-mode=on,intremap=on  
> >>
> >> Joao?
> >>
> >> If this works lets just block migration if the vIOMMU is turned on..  
> > 
> > At a first glance, this looked like my regular iommu incantation.
> > 
> > But reading the code this ::bypass_iommu (new to me) apparently tells that
> > vIOMMU is bypassed or not for the PCI devices all the way to avoiding
> > enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks whether
> > PCI device is within the IOMMU address space (or bypassed) prior to DMA maps and
> > such.
> > 
> > You can see from the other email that all of the other options in my head were
> > either bit inconvenient or risky. I wasn't aware of this option for what is
> > worth -- much simpler, should work!
> >  
> 
> I say *should*, but on a second thought interrupt remapping may still be
> required to one of these devices that are IOMMU-bypassed. Say to put affinities
> to vcpus above 255? I was trying this out with more than 255 vcpus with a couple
> VFs and at a first glance these VFs fail to probe (these are CX6 VFs).
> 
> It is a working setup without the parameter, but now adding a
> default_bus_bypass_iommu=on fails to init VFs:
> 
> [   32.412733] mlx5_core 0000:00:02.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [   32.416242] mlx5_core 0000:00:02.0: mlx5_load:1204:(pid 3361): Failed to
> alloc IRQs
> [   33.227852] mlx5_core 0000:00:02.0: probe_one:1684:(pid 3361): mlx5_init_one
> failed with error code -19
> [   33.242182] mlx5_core 0000:00:03.0: firmware version: 22.31.1660
> [   33.415876] mlx5_core 0000:00:03.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [   33.448016] mlx5_core 0000:00:03.0: mlx5_load:1204:(pid 3361): Failed to
> alloc IRQs
> [   34.207532] mlx5_core 0000:00:03.0: probe_one:1684:(pid 3361): mlx5_init_one
> failed with error code -19
> 
> I haven't dived yet into why it fails.

Hmm, I was thinking this would only affect DMA, but on second thought
I think the DRHD also describes the interrupt remapping hardware and
while interrupt remapping is an optional feature of the DRHD, DMA
remapping is always supported afaict.  I saw IR vectors in
/proc/interrupts and thought it worked, but indeed an assigned device
is having trouble getting vectors.

> 
> > And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add a live
> > migration blocker if `bypass_iommu` is off for any PCI device.
> >   
> 
> Still we could have for starters a live migration blocker until we revisit the
> vIOMMU case ... should we deem that the default_bus_bypass_iommu=on or the
> others I suggested as non-options?

I'm very uncomfortable presuming a vIOMMU usage model, especially when
it leads to potentially untracked DMA if our assumptions are violated.
We could use a MemoryListener on the IOVA space to record a high-water
mark, but we'd need to continue to monitor that mark while we're in
pre-copy, and I don't think anyone would agree that a migratable VM
suddenly becoming unmigratable due to a random IOVA allocation would be
supportable.  That leads me to think that a machine option to limit the
vIOMMU address space, and testing that against the device prior to
declaring migration support of the device, is possibly our best option.
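
As a reference point, a rough sketch of that high-water-mark idea, assuming a
dedicated MemoryListener registered on the container's IOVA address space; the
iova_watermark_listener and max_seen_iova container fields are made up for
illustration:

    /* Sketch: remember the highest IOVA that was ever mapped. */
    static void vfio_iova_watermark_region_add(MemoryListener *listener,
                                               MemoryRegionSection *section)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer,
                                                iova_watermark_listener); /* hypothetical */
        hwaddr end = section->offset_within_address_space +
                     int128_get64(section->size) - 1;

        if (end > container->max_seen_iova) {   /* hypothetical */
            container->max_seen_iova = end;
        }
    }

The catch, as said above, is that this mark can keep growing while we're in
pre-copy, so the tracking ranges would have to follow it.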

Is that feasible?  Do all the vIOMMU models have a means to limit the
IOVA space?  How does QEMU learn a limit for a given device?  We
probably need to think about whether there are devices that can even
support the guest physical memory ranges when we start relocating RAM
to arbitrary addresses (ex. hypertransport).  Can we infer anything
from the vCPU virtual address space or is that still an unreasonable
range to track for devices?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
  2023-02-24 15:56                   ` Alex Williamson
@ 2023-02-24 19:16                     ` Joao Martins
  0 siblings, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-02-24 19:16 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 24/02/2023 15:56, Alex Williamson wrote:
> On Fri, 24 Feb 2023 12:53:26 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 24/02/2023 11:25, Joao Martins wrote:
>>> On 23/02/2023 23:26, Jason Gunthorpe wrote:  
>>>> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:  
>>>>> On Thu, 23 Feb 2023 16:55:54 -0400
>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:  
>>>>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
>>>>>> Or even better figure out how to get interrupt remapping without IOMMU
>>>>>> support :\  
>>>>>
>>>>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
>>>>> -device intel-iommu,caching-mode=on,intremap=on  
>>>>
>>>> Joao?
>>>>
>>>> If this works lets just block migration if the vIOMMU is turned on..  
>>>
>>> At a first glance, this looked like my regular iommu incantation.
>>>
>>> But reading the code this ::bypass_iommu (new to me) apparently tells that
>>> vIOMMU is bypassed or not for the PCI devices all the way to avoiding
>>> enumerating in the IVRS/DMAR ACPI tables. And I see VFIO double-checks whether
>>> PCI device is within the IOMMU address space (or bypassed) prior to DMA maps and
>>> such.
>>>
>>> You can see from the other email that all of the other options in my head were
>>> either bit inconvenient or risky. I wasn't aware of this option for what is
>>> worth -- much simpler, should work!
>>>  
>>
>> I say *should*, but on a second thought interrupt remapping may still be
>> required to one of these devices that are IOMMU-bypassed. Say to put affinities
>> to vcpus above 255? I was trying this out with more than 255 vcpus with a couple
>> VFs and at a first glance these VFs fail to probe (these are CX6 VFs).
>>
>> It is a working setup without the parameter, but now adding a
>> default_bus_bypass_iommu=on fails to init VFs:
>>
>> [   32.412733] mlx5_core 0000:00:02.0: Rate limit: 127 rates are supported,
>> range: 0Mbps to 97656Mbps
>> [   32.416242] mlx5_core 0000:00:02.0: mlx5_load:1204:(pid 3361): Failed to
>> alloc IRQs
>> [   33.227852] mlx5_core 0000:00:02.0: probe_one:1684:(pid 3361): mlx5_init_one
>> failed with error code -19
>> [   33.242182] mlx5_core 0000:00:03.0: firmware version: 22.31.1660
>> [   33.415876] mlx5_core 0000:00:03.0: Rate limit: 127 rates are supported,
>> range: 0Mbps to 97656Mbps
>> [   33.448016] mlx5_core 0000:00:03.0: mlx5_load:1204:(pid 3361): Failed to
>> alloc IRQs
>> [   34.207532] mlx5_core 0000:00:03.0: probe_one:1684:(pid 3361): mlx5_init_one
>> failed with error code -19
>>
>> I haven't dived yet into why it fails.
> 
> Hmm, I was thinking this would only affect DMA, but on second thought
> I think the DRHD also describes the interrupt remapping hardware and
> while interrupt remapping is an optional feature of the DRHD, DMA
> remapping is always supported afaict.  I saw IR vectors in
> /proc/interrupts and thought it worked, but indeed an assigned device
> is having trouble getting vectors.
> 

AMD/IVRS might be a little different.

I also tried disabling the DMA translation IOMMU feature, as I had mentioned
in another email, and that renders the same result as default_bus_bypass_iommu.

So it's either this KVM pv-op (which is not really interrupt remapping, and is
x86 specific) or the full vIOMMU. The PV op[*] has the natural disadvantage of
requiring a compatible guest kernel.

[*] See, KVM_FEATURE_MSI_EXT_DEST_ID.

>>
>>> And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add a live
>>> migration blocker if `bypass_iommu` is off for any PCI device.
>>>   
>>
>> Still we could have for starters a live migration blocker until we revisit the
>> vIOMMU case ... should we deem that the default_bus_bypass_iommu=on or the
>> others I suggested as non-options?
> 
> I'm very uncomfortable presuming a vIOMMU usage model, especially when
> it leads to potentially untracked DMA if our assumptions are violated.

We can track DMA that got dirtied, but that doesn't mean said DMA is mapped.
I don't think VFIO ties those two together? You can ask to track certain
ranges, but if a range isn't mapped in the IOMMU the device gets a target
abort. Starting dirty tracking doesn't imply that you allow such DMA.

With a vIOMMU, anything that falls outside the IOMMU-mapped ranges (or the
identity map) is always marked dirty if it wasn't armed in the device dirty
tracker. It's best effort -- I don't think supporting vIOMMU has a ton of
options without a more significant compromise. If the vIOMMU is in passthrough
mode, then things work just as if no vIOMMU were there. Avihai's code reflects
that.

Considering your earlier suggestion that we only start dirty tracking and
record ranges *when* the dirty tracking start operation happens ... this gets
further simplified. We also have to take into account that there is no
guarantee we can change the ranges under tracking dynamically.

For improving the vIOMMU case we either track up to MAX_IOVA or we compose an
artificial range based on the max IOVA of the current vIOMMU maps.

> We could use a MemoryListener on the IOVA space to record a high level
> mark, but we'd need to continue to monitor that mark while we're in
> pre-copy and I don't think anyone would agree that a migratable VM can
> suddenly become unmigratable due to a random IOVA allocation would be
> supportable.  That leads me to think that a machine option to limit the
> vIOMMU address space, and testing that against the device prior to
> declaring migration support of the device is possibly our best option.
> 
> Is that feasible?  Do all the vIOMMU models have a means to limit the
> IOVA space? 

I can say that *at least* AMD and Intel support that. Intel supports either
39- or 48-bit address-width modes (only those two values, as I understand it).
AMD supposedly has more granular management of VASize and PASize.

I have no idea about smmuv3 or virtio-iommu.
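
For the Intel case there is already a device property for this, so assuming
the 39/48 restriction above holds, a capped configuration would look something
like the line below; whether the other vIOMMU models expose an equivalent knob
is the open question:

    -device intel-iommu,intremap=on,caching-mode=on,aw-bits=39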

But isn't this actually what Avihai does in the series, minus the device
part? The address width is fetched directly from the vIOMMU model via
IOMMU_ATTR_MAX_IOVA, and one of the options is to compose a range based on the
max vIOMMU range.

> How does QEMU learn a limit for a given device? 

IOMMU_ATTR_MAX_IOVA for the vIOMMU.

For the device this is not described in ACPI or any other place that I know
of :/ without getting into VF specifics.
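
A minimal sketch of the vIOMMU-side query, using the IOMMU_ATTR_MAX_IOVA
attribute this series introduces together with the existing
memory_region_iommu_get_attr() helper; the wrapper name is made up:

    /* Sketch: ask the vIOMMU memory region for its maximum IOVA. */
    static hwaddr vfio_viommu_max_iova(IOMMUMemoryRegion *iommu_mr)
    {
        hwaddr max_iova = 0;

        if (memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MAX_IOVA,
                                         &max_iova)) {
            /* The vIOMMU model doesn't implement the attribute. */
            return HWADDR_MAX;
        }

        return max_iova;
    }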

> We
> probably need to think about whether there are devices that can even
> support the guest physical memory ranges when we start relocating RAM
> to arbitrary addresses (ex. hypertransport). 

In theory we require one bit more in the device DMA engine, so instead of a
max of 39 bits we require 40 bits for a 1T guest. GPUs and modern NICs are
64-bit DMA capable devices, but it's a bit hard to learn this as it's device
specific.

> Can we infer anything
> from the vCPU virtual address space or is that still an unreasonable
> range to track for devices?  Thanks,
> 
We sort of rely on that for the iommu=pt or no-vIOMMU case, where the vCPU
address space matches the IOVA space, but I'm not sure how much the vCPU
address space would give you that the vIOMMU mappings don't already.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-23 14:56   ` Avihai Horon
@ 2023-02-24 19:26     ` Joao Martins
  2023-02-26 17:00       ` Avihai Horon
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-02-24 19:26 UTC (permalink / raw)
  To: Avihai Horon, Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta



On 23/02/2023 14:56, Avihai Horon wrote:
> On 22/02/2023 22:55, Alex Williamson wrote:
>> There are various errors running this through the CI on gitlab.
>>
>> This one seems bogus but needs to be resolved regardless:
>>
>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I..
>> -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader
>> -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0
>> -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall
>> -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem
>> /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote .
>> -iquote /builds/alex.williamson/qemu -iquote
>> /builds/alex.williamson/qemu/include -iquote
>> /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE
>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>> -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security
>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs
>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value
>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>> -isystemlinux-headers -DNEED_CPU_H
>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
>> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this
>> function [-Werror=maybe-uninitialized]
>> 2789 1772 |     if (ret) {
>> 2790      |        ^
>>
>> 32-bit builds have some actual errors though:
>>
>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940719
>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>> 2601cc -m32 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm
>> -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1
>> -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/sysprof-4
>> -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g
>> -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers
>> -iquote . -iquote /builds/alex.williamson/qemu -iquote
>> /builds/alex.williamson/qemu/include -iquote
>> /builds/alex.williamson/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE
>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>> -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security
>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs
>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value
>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>> -isystemlinux-headers -DNEED_CPU_H
>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>> 2602../hw/vfio/common.c: In function
>> 'vfio_device_feature_dma_logging_start_create':
>> 2603../hw/vfio/common.c:1572:27: error: cast from pointer to integer of
>> different size [-Werror=pointer-to-int-cast]
>> 2604 1572 |         control->ranges = (uint64_t)ranges;
>> 2605      |                           ^
>> 2606../hw/vfio/common.c:1596:23: error: cast from pointer to integer of
>> different size [-Werror=pointer-to-int-cast]
>> 2607 1596 |     control->ranges = (uint64_t)ranges;
>> 2608      |                       ^
>> 2609../hw/vfio/common.c: In function
>> 'vfio_device_feature_dma_logging_start_destroy':
>> 2610../hw/vfio/common.c:1620:9: error: cast to pointer from integer of
>> different size [-Werror=int-to-pointer-cast]
>> 2611 1620 |         (struct vfio_device_feature_dma_logging_range
>> *)control->ranges;
>> 2612      |         ^
>> 2613../hw/vfio/common.c: In function 'vfio_device_dma_logging_report':
>> 2614../hw/vfio/common.c:1810:22: error: cast from pointer to integer of
>> different size [-Werror=pointer-to-int-cast]
>> 2615 1810 |     report->bitmap = (uint64_t)bitmap;
>> 2616      |                      ^
> 
> Sure, I will fix these errors.
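
For the 32-bit cast errors, the usual QEMU pattern is to go through uintptr_t
when stuffing a host pointer into a __u64 uAPI field, roughly along the lines
below; a sketch of one common fix, not the final patch:

    /* Pointer -> __u64: widen via uintptr_t so 32-bit hosts don't complain. */
    control->ranges = (uintptr_t)ranges;

    /* __u64 -> pointer: narrow back through uintptr_t the same way. */
    ranges = (struct vfio_device_feature_dma_logging_range *)(uintptr_t)control->ranges;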

Just a thought: should the pre-copy patches be moved towards the end of this
series, given that they're more of a downtime improvement than a must-have
like dirty tracking?

	Joao


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-23 21:16       ` Alex Williamson
@ 2023-02-26 16:43         ` Avihai Horon
  2023-02-27 16:14           ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-26 16:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/02/2023 23:16, Alex Williamson wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, 23 Feb 2023 17:25:12 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> On 22/02/2023 22:58, Alex Williamson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Wed, 22 Feb 2023 19:48:58 +0200
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>
>>>> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
>>>>        trace_vfio_save_cleanup(vbasedev->name);
>>>>    }
>>>>
>>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
>>>> +                                        uint64_t *must_precopy,
>>>> +                                        uint64_t *can_postcopy)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Initial size should be transferred during pre-copy phase so stop-copy
>>>> +     * phase will not be slowed down. Report threshold_size to force another
>>>> +     * pre-copy iteration.
>>>> +     */
>>>> +    *must_precopy += migration->precopy_init_size ?
>>>> +                         threshold_size :
>>>> +                         migration->precopy_dirty_size;
>>> This sure feels like we're feeding false data back to the iterator to
>>> spoof it to run another iteration, when the vfio migration protocol
>>> only recommends that initial_bytes reaches zero before proceeding to
>>> stop-copy, it's not a requirement.  What benefit is actually observed
>>> from this?  Why is this required for initial pre-copy support?  It
>>> seems devious.
>> As previously discussed in the thread that added the pre-copy uAPI [1],
>> the init_bytes can be used by drivers to reduce the downtime.
>> For example, mlx5 transfers some metadata to the target so it will be
>> able to pre-allocate resources etc.
>>
>> [1]
>> https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/
> Yes, but how does that become a requirement to QEMU that it must
> iterate until the initial segment is complete?  Especially when we need
> to trigger that behavior via such nefarious means.  AIUI, QEMU should
> be allowed to move to stop-copy at any point.  We should make efforts
> that QEMU would never decide on its own to move from pre-copy to
> stop-copy without completing the init_bytes (which sounds suspiciously
> like the purpose of @must_precopy),

@must_precopy represents the pending bytes that must be transferred 
during pre-copy or stop-copy. If it's under the threshold, then 
migration will move to stop-copy and be completed.
So simply adding init_bytes to @must_precopy will not guarantee that we 
send all init_bytes before moving to stop-copy, since the transition to 
stop-copy can happen when @must_precopy != 0.

>   but if, for instance a user forces a
> transition to stop-copy, I don't see that we have any business to
> impose a policy to delay that until the init_bytes is complete.

Is there a way a user can force the migration to move to stop-copy?
Looking at the migration code, it seems that the only way to move to 
stop-copy is if @must_precopy is below the threshold.
If so, then this is our effort to make QEMU send all init_bytes before 
moving to stop-copy, and we can only benefit from it.

Regarding how to do it -- maybe instead of spoofing @must_precopy we can 
introduce a new parameter in the upper migration layer (e.g., @init_precopy) 
and add another condition in the migration layer that it must be zero to 
move to stop-copy.
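
A rough sketch of that idea, treating @init_precopy as a hypothetical extra
output of the pending handlers and simplifying the migration-core decision;
this is not existing QEMU code:

    /* Simplified decision logic; the real code lives in migration_iteration_run(). */
    qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy, &init_precopy);

    if (must_precopy <= s->threshold_size && init_precopy == 0) {
        /* Only now is it OK to move to stop-copy and complete the migration. */
        migration_completion(s);
    } else {
        /* Otherwise run another pre-copy iteration. */
        qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
    }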

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop
  2023-02-23 20:54             ` Jason Gunthorpe
@ 2023-02-26 16:54               ` Avihai Horon
  0 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-02-26 16:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 23/02/2023 22:54, Jason Gunthorpe wrote:
> On Thu, Feb 23, 2023 at 01:16:40PM -0700, Alex Williamson wrote:
>> On Thu, 23 Feb 2023 15:30:28 -0400
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Thu, Feb 23, 2023 at 12:27:23PM -0700, Alex Williamson wrote:
>>>> So again, I think I'm just looking for a better comment that doesn't
>>>> add FUD to the reasoning behind switching to a single range,
>>> It isn't a single range, it is a single page of ranges, right?
>> Exceeding a single page of ranges is the inflection point at which we
>> switch to a single range.
> Oh, that isn't what it should do - it should cut it back to fit in a
> page..

Sure, I will change it accordingly (and rephrase the comment).

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-24 19:26     ` Joao Martins
@ 2023-02-26 17:00       ` Avihai Horon
  2023-02-27 13:50         ` Cédric Le Goater
  0 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-02-26 17:00 UTC (permalink / raw)
  To: Joao Martins, Alex Williamson
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta


On 24/02/2023 21:26, Joao Martins wrote:
> External email: Use caution opening links or attachments
>
>
> On 23/02/2023 14:56, Avihai Horon wrote:
>> On 22/02/2023 22:55, Alex Williamson wrote:
>>> There are various errors running this through the CI on gitlab.
>>>
>>> This one seems bogus but needs to be resolved regardless:
>>>
>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I..
>>> -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader
>>> -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0
>>> -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall
>>> -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem
>>> /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote .
>>> -iquote /builds/alex.williamson/qemu -iquote
>>> /builds/alex.williamson/qemu/include -iquote
>>> /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE
>>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
>>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security
>>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs
>>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value
>>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>>> -isystemlinux-headers -DNEED_CPU_H
>>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>>> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
>>> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this
>>> function [-Werror=maybe-uninitialized]
>>> 2789 1772 |     if (ret) {
>>> 2790      |        ^
>>>
>>> 32-bit builds have some actual errors though:
>>>
>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940719
>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>> 2601cc -m32 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm
>>> -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1
>>> -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/sysprof-4
>>> -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g
>>> -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers
>>> -iquote . -iquote /builds/alex.williamson/qemu -iquote
>>> /builds/alex.williamson/qemu/include -iquote
>>> /builds/alex.williamson/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE
>>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
>>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security
>>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs
>>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value
>>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>>> -isystemlinux-headers -DNEED_CPU_H
>>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>>> 2602../hw/vfio/common.c: In function
>>> 'vfio_device_feature_dma_logging_start_create':
>>> 2603../hw/vfio/common.c:1572:27: error: cast from pointer to integer of
>>> different size [-Werror=pointer-to-int-cast]
>>> 2604 1572 |         control->ranges = (uint64_t)ranges;
>>> 2605      |                           ^
>>> 2606../hw/vfio/common.c:1596:23: error: cast from pointer to integer of
>>> different size [-Werror=pointer-to-int-cast]
>>> 2607 1596 |     control->ranges = (uint64_t)ranges;
>>> 2608      |                       ^
>>> 2609../hw/vfio/common.c: In function
>>> 'vfio_device_feature_dma_logging_start_destroy':
>>> 2610../hw/vfio/common.c:1620:9: error: cast to pointer from integer of
>>> different size [-Werror=int-to-pointer-cast]
>>> 2611 1620 |         (struct vfio_device_feature_dma_logging_range
>>> *)control->ranges;
>>> 2612      |         ^
>>> 2613../hw/vfio/common.c: In function 'vfio_device_dma_logging_report':
>>> 2614../hw/vfio/common.c:1810:22: error: cast from pointer to integer of
>>> different size [-Werror=pointer-to-int-cast]
>>> 2615 1810 |     report->bitmap = (uint64_t)bitmap;
>>> 2616      |                      ^
>> Sure, I will fix these errors.
> Just a thought: should the pre-copy series be moved towards the end of this
> series, given that it's more of an improvement of downtime than a must-have like
> dirty tracking?

Given recent discussion, maybe it would be better to split this series 
and go one step at a time:
Start with basic support for device dirty tracking (without vIOMMU 
support), then add pre-copy and then add vIOMMU support to device dirty 
tracking.

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-23 15:07     ` Avihai Horon
@ 2023-02-27 10:24       ` Cédric Le Goater
  0 siblings, 0 replies; 93+ messages in thread
From: Cédric Le Goater @ 2023-02-27 10:24 UTC (permalink / raw)
  To: Avihai Horon, Alex Williamson
  Cc: qemu-devel, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 2/23/23 16:07, Avihai Horon wrote:
> 
> On 23/02/2023 12:05, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 2/22/23 21:55, Alex Williamson wrote:
>>>
>>> There are various errors running this through the CI on gitlab.
>>>
>>> This one seems bogus but needs to be resolved regardless:
>>>
>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0 -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/alex.williamson/qemu -iquote /builds/alex.williamson/qemu/include -iquote /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 
>>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers -isystemlinux-headers -DNEED_CPU_H '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"' '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>>> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
>>> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
>>> 2789 1772 |     if (ret) {
>>> 2790      |        ^
>>
>>
>> The routine to fix is vfio_devices_start_dirty_page_tracking(). The compiler
>> is doing some inlining.
>>
> I don't think I understand how inlining could cause it.
> Could you elaborate on this?

The compiler reports an error in routine 'vfio_listener_log_global_start',
but the fix should be in 'vfio_devices_start_dirty_page_tracking', surely
because the compiler optimization inlines the latter.

> 
I thought that the compiler just missed the initialization of ret because it happens in the if/else statement, and that simply doing "int ret = 0;" would solve it.

Yes. This will work.
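
i.e. initializing the variable where it is declared in
vfio_devices_start_dirty_page_tracking(), whatever the rest of the function
looks like in the posted series:

    int ret = 0;    /* quiets -Werror=maybe-uninitialized on the s390x cross build */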

Thanks,

C.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-26 17:00       ` Avihai Horon
@ 2023-02-27 13:50         ` Cédric Le Goater
  2023-03-01 19:04           ` Avihai Horon
  0 siblings, 1 reply; 93+ messages in thread
From: Cédric Le Goater @ 2023-02-27 13:50 UTC (permalink / raw)
  To: Avihai Horon, Joao Martins, Alex Williamson
  Cc: qemu-devel, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 2/26/23 18:00, Avihai Horon wrote:
> 
> On 24/02/2023 21:26, Joao Martins wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 23/02/2023 14:56, Avihai Horon wrote:
>>> On 22/02/2023 22:55, Alex Williamson wrote:
>>>> There are various errors running this through the CI on gitlab.
>>>>
>>>> This one seems bogus but needs to be resolved regardless:
>>>>
>>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
>>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>>> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I..
>>>> -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader
>>>> -I/usr/include/pixman-1 -I/usr/include/capstone -I/usr/include/glib-2.0
>>>> -I/usr/lib/s390x-linux-gnu/glib-2.0/include -fdiagnostics-color=auto -Wall
>>>> -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem
>>>> /builds/alex.williamson/qemu/linux-headers -isystem linux-headers -iquote .
>>>> -iquote /builds/alex.williamson/qemu -iquote
>>>> /builds/alex.williamson/qemu/include -iquote
>>>> /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE
>>>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
>>>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>>>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security
>>>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs
>>>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>>>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value
>>>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>>>> -isystemlinux-headers -DNEED_CPU_H
>>>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>>>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>>>> 2787../hw/vfio/common.c: In function ‘vfio_listener_log_global_start’:
>>>> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used uninitialized in this
>>>> function [-Werror=maybe-uninitialized]
>>>> 2789 1772 |     if (ret) {
>>>> 2790      |        ^
>>>>
>>>> 32-bit builds have some actual errors though:
>>>>
>>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940719
>>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>>> 2601cc -m32 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm
>>>> -I../target/arm -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/pixman-1
>>>> -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/sysprof-4
>>>> -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g
>>>> -isystem /builds/alex.williamson/qemu/linux-headers -isystem linux-headers
>>>> -iquote . -iquote /builds/alex.williamson/qemu -iquote
>>>> /builds/alex.williamson/qemu/include -iquote
>>>> /builds/alex.williamson/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE
>>>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
>>>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>>>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security
>>>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs
>>>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>>>> -Wmissing-format-attribute -Wno-missing-include-dirs -Wno-shift-negative-value
>>>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>>>> -isystemlinux-headers -DNEED_CPU_H
>>>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>>>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c ../hw/vfio/common.c
>>>> 2602../hw/vfio/common.c: In function
>>>> 'vfio_device_feature_dma_logging_start_create':
>>>> 2603../hw/vfio/common.c:1572:27: error: cast from pointer to integer of
>>>> different size [-Werror=pointer-to-int-cast]
>>>> 2604 1572 |         control->ranges = (uint64_t)ranges;
>>>> 2605      |                           ^
>>>> 2606../hw/vfio/common.c:1596:23: error: cast from pointer to integer of
>>>> different size [-Werror=pointer-to-int-cast]
>>>> 2607 1596 |     control->ranges = (uint64_t)ranges;
>>>> 2608      |                       ^
>>>> 2609../hw/vfio/common.c: In function
>>>> 'vfio_device_feature_dma_logging_start_destroy':
>>>> 2610../hw/vfio/common.c:1620:9: error: cast to pointer from integer of
>>>> different size [-Werror=int-to-pointer-cast]
>>>> 2611 1620 |         (struct vfio_device_feature_dma_logging_range
>>>> *)control->ranges;
>>>> 2612      |         ^
>>>> 2613../hw/vfio/common.c: In function 'vfio_device_dma_logging_report':
>>>> 2614../hw/vfio/common.c:1810:22: error: cast from pointer to integer of
>>>> different size [-Werror=pointer-to-int-cast]
>>>> 2615 1810 |     report->bitmap = (uint64_t)bitmap;
>>>> 2616      |                      ^
>>> Sure, I will fix these errors.
>> Just a thought: should the pre-copy series be moved towards the end of this
>> series, given that it's more of an improvement of downtime than a must-have like
>> dirty tracking?
> 
> Given recent discussion, maybe it would be better to split this series and go one step at a time:
> Start with basic support for device dirty tracking (without vIOMMU support), then add pre-copy and then add vIOMMU support to device dirty tracking.

and add the fixes first in the series. They could be merged quickly.

Thanks,

C.




^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-22 17:49 ` [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
  2023-02-22 21:40   ` Alex Williamson
@ 2023-02-27 14:09   ` Cédric Le Goater
  2023-03-01 18:56     ` Avihai Horon
  2023-03-02 13:24     ` Joao Martins
  1 sibling, 2 replies; 93+ messages in thread
From: Cédric Le Goater @ 2023-02-27 14:09 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 2/22/23 18:49, Avihai Horon wrote:
> There are already two places where dirty page bitmap allocation and
> calculations are done in open code. With device dirty page tracking
> being added in next patches, there are going to be even more places.
> 
> To avoid code duplication, introduce VFIOBitmap struct and corresponding
> alloc and dealloc functions and use them where applicable.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
>   1 file changed, 60 insertions(+), 29 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ac93b85632..84f08bdbbb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>    * Device state interfaces
>    */
>   
> +typedef struct {
> +    unsigned long *bitmap;
> +    hwaddr size;
> +    hwaddr pages;
> +} VFIOBitmap;
> +
> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
> +{
> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);

I think using g_malloc0() for the VFIOBitmap should be fine. If QEMU cannot
allocate a couple of bytes, we are in trouble anyway.
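
i.e. the struct allocation could become infallible while the potentially
large bitmap allocation stays fallible; a sketch of the suggestion, keeping
the size/pages math as posted:

    static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
    {
        VFIOBitmap *vbmap = g_new0(VFIOBitmap, 1);  /* aborts on OOM, fine for a few bytes */

        vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
        vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
                      BITS_PER_BYTE;
        vbmap->bitmap = g_try_malloc0(vbmap->size);
        if (!vbmap->bitmap) {
            g_free(vbmap);
            errno = ENOMEM;
            return NULL;
        }

        return vbmap;
    }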

Thanks,

C.


> +    if (!vbmap) {
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> +    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                                         BITS_PER_BYTE;
> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
> +    if (!vbmap->bitmap) {
> +        g_free(vbmap);
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    return vbmap;
> +}
> +
> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
> +{
> +    g_free(vbmap->bitmap);
> +    g_free(vbmap);
> +}
> +
>   bool vfio_mig_active(void)
>   {
>       VFIOGroup *group;
> @@ -470,9 +505,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>   {
>       struct vfio_iommu_type1_dma_unmap *unmap;
>       struct vfio_bitmap *bitmap;
> -    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> +    VFIOBitmap *vbmap;
>       int ret;
>   
> +    vbmap = vfio_bitmap_alloc(size);
> +    if (!vbmap) {
> +        return -errno;
> +    }
> +
>       unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
>   
>       unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> @@ -486,35 +526,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>        * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
>        * to qemu_real_host_page_size.
>        */
> -
>       bitmap->pgsize = qemu_real_host_page_size();
> -    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> -                   BITS_PER_BYTE;
> +    bitmap->size = vbmap->size;
> +    bitmap->data = (__u64 *)vbmap->bitmap;
>   
> -    if (bitmap->size > container->max_dirty_bitmap_size) {
> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
> -                     (uint64_t)bitmap->size);
> +    if (vbmap->size > container->max_dirty_bitmap_size) {
> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);
>           ret = -E2BIG;
>           goto unmap_exit;
>       }
>   
> -    bitmap->data = g_try_malloc0(bitmap->size);
> -    if (!bitmap->data) {
> -        ret = -ENOMEM;
> -        goto unmap_exit;
> -    }
> -
>       ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
>       if (!ret) {
> -        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
> -                iotlb->translated_addr, pages);
> +        cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap,
> +                iotlb->translated_addr, vbmap->pages);
>       } else {
>           error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
>       }
>   
> -    g_free(bitmap->data);
>   unmap_exit:
>       g_free(unmap);
> +    vfio_bitmap_dealloc(vbmap);
> +
>       return ret;
>   }
>   
> @@ -1331,7 +1364,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>   {
>       struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>       struct vfio_iommu_type1_dirty_bitmap_get *range;
> -    uint64_t pages;
> +    VFIOBitmap *vbmap;
>       int ret;
>   
>       if (!container->dirty_pages_supported) {
> @@ -1341,6 +1374,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>           return 0;
>       }
>   
> +    vbmap = vfio_bitmap_alloc(size);
> +    if (!vbmap) {
> +        return -errno;
> +    }
> +
>       dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
>   
>       dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> @@ -1355,15 +1393,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>        * to qemu_real_host_page_size.
>        */
>       range->bitmap.pgsize = qemu_real_host_page_size();
> -
> -    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size();
> -    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> -                                         BITS_PER_BYTE;
> -    range->bitmap.data = g_try_malloc0(range->bitmap.size);
> -    if (!range->bitmap.data) {
> -        ret = -ENOMEM;
> -        goto err_out;
> -    }
> +    range->bitmap.size = vbmap->size;
> +    range->bitmap.data = (__u64 *)vbmap->bitmap;
>   
>       ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
>       if (ret) {
> @@ -1374,14 +1405,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>           goto err_out;
>       }
>   
> -    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
> -                                            ram_addr, pages);
> +    cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
> +                                           vbmap->pages);
>   
>       trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
>                                   range->bitmap.size, ram_addr);
>   err_out:
> -    g_free(range->bitmap.data);
>       g_free(dbitmap);
> +    vfio_bitmap_dealloc(vbmap);
>   
>       return ret;
>   }



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/20] vfio/migration: Refactor vfio_save_block() to return saved data size
  2023-02-22 17:48 ` [PATCH v2 02/20] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-02-27 14:10   ` Cédric Le Goater
  0 siblings, 0 replies; 93+ messages in thread
From: Cédric Le Goater @ 2023-02-27 14:10 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 2/22/23 18:48, Avihai Horon wrote:
> Refactor vfio_save_block() to return the size of saved data on success
> and -errno on error.
> 
> This will be used in next patch to implement VFIO migration pre-copy
> support.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

LGTM

Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.

> ---
>   hw/vfio/migration.c | 17 +++++++++--------
>   1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4fb7d01532..94a4df73d0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -240,8 +240,8 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>       return 0;
>   }
>   
> -/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
> -static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> +/* Returns the size of saved data on success and -errno on error */
> +static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>   {
>       ssize_t data_size;
>   
> @@ -251,7 +251,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>           return -errno;
>       }
>       if (data_size == 0) {
> -        return 1;
> +        return 0;
>       }
>   
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> @@ -261,7 +261,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>   
>       trace_vfio_save_block(migration->vbasedev->name, data_size);
>   
> -    return qemu_file_get_error(f);
> +    return qemu_file_get_error(f) ?: data_size;
>   }
>   
>   /* ---------------------------------------------------------------------- */
> @@ -335,6 +335,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
> +    ssize_t data_size;
>       int ret;
>   
>       /* We reach here with device state STOP only */
> @@ -345,11 +346,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       }
>   
>       do {
> -        ret = vfio_save_block(f, vbasedev->migration);
> -        if (ret < 0) {
> -            return ret;
> +        data_size = vfio_save_block(f, vbasedev->migration);
> +        if (data_size < 0) {
> +            return data_size;
>           }
> -    } while (!ret);
> +    } while (data_size);
>   
>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>       ret = qemu_file_get_error(f);



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 20/20] docs/devel: Document VFIO device dirty page tracking
  2023-02-22 17:49 ` [PATCH v2 20/20] docs/devel: Document VFIO device dirty page tracking Avihai Horon
@ 2023-02-27 14:29   ` Cédric Le Goater
  0 siblings, 0 replies; 93+ messages in thread
From: Cédric Le Goater @ 2023-02-27 14:29 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 2/22/23 18:49, Avihai Horon wrote:
> Adjust the VFIO dirty page tracking documentation and add a section to
> describe device dirty page tracking.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.

> ---
>   docs/devel/vfio-migration.rst | 50 ++++++++++++++++++++++-------------
>   1 file changed, 32 insertions(+), 18 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index ba80b9150d..a432cda081 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -71,22 +71,37 @@ System memory dirty pages tracking
>   ----------------------------------
>   
>   A ``log_global_start`` and ``log_global_stop`` memory listener callback informs
> -the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync``
> -memory listener callback marks those system memory pages as dirty which are
> -used for DMA by the VFIO device. The dirty pages bitmap is queried per
> -container. All pages pinned by the vendor driver through external APIs have to
> -be marked as dirty during migration. When there are CPU writes, CPU dirty page
> -tracking can identify dirtied pages, but any page pinned by the vendor driver
> -can also be written by the device. There is currently no device or IOMMU
> -support for dirty page tracking in hardware.
> +the VFIO dirty tracking module to start and stop dirty page tracking. A
> +``log_sync`` memory listener callback queries the dirty page bitmap from the
> +dirty tracking module and marks system memory pages which were DMA-ed by the
> +VFIO device as dirty. The dirty page bitmap is queried per container.
> +
> +Currently there are two ways dirty page tracking can be done:
> +(1) Device dirty tracking:
> +In this method the device is responsible to log and report its DMAs. This
> +method can be used only if the device is capable of tracking its DMAs.
> +Discovering device capability, starting and stopping dirty tracking, and
> +syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
> +More info about the uAPI can be found in the comments of the
> +``vfio_device_feature_dma_logging_control`` and
> +``vfio_device_feature_dma_logging_report`` structures in the header file
> +linux-headers/linux/vfio.h.
> +
> +(2) VFIO IOMMU module:
> +In this method dirty tracking is done by IOMMU. However, there is currently no
> +IOMMU support for dirty page tracking. For this reason, all pages are
> +perpetually marked dirty, unless the device driver pins pages through external
> +APIs in which case only those pinned pages are perpetually marked dirty.
> +
> +If the above two methods are not supported, all pages are perpetually marked
> +dirty by QEMU.
>   
>   By default, dirty pages are tracked during pre-copy as well as stop-and-copy
> -phase. So, a page pinned by the vendor driver will be copied to the destination
> -in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
> -it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
> -finding dirty pages continuously, then it understands that even in stop-and-copy
> -phase, it is likely to find dirty pages and can predict the downtime
> -accordingly.
> +phase. So, a page marked as dirty will be copied to the destination in both
> +phases. Copying dirty pages in pre-copy phase helps QEMU to predict if it can
> +achieve its downtime tolerances. If QEMU during pre-copy phase keeps finding
> +dirty pages continuously, then it understands that even in stop-and-copy phase,
> +it is likely to find dirty pages and can predict the downtime accordingly.
>   
>   QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
>   which disables querying the dirty bitmap during pre-copy phase. If it is set to
> @@ -97,10 +112,9 @@ System memory dirty pages tracking when vIOMMU is enabled
>   ---------------------------------------------------------
>   
>   With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
> -phase of migration. In that case, the unmap ioctl returns any dirty pages in
> -that range and QEMU reports corresponding guest physical pages dirty. During
> -stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
> -pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
> +phase of migration. In that case, dirty page bitmap for this range is queried
> +and synced with QEMU. During stop-and-copy phase, an IOMMU notifier is used to
> +get a callback for mapped pages and then dirty page bitmap is fetched for those
>   mapped ranges.
>   
>   Flow of state changes during Live migration



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-26 16:43         ` Avihai Horon
@ 2023-02-27 16:14           ` Alex Williamson
  2023-02-27 17:26             ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-27 16:14 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Sun, 26 Feb 2023 18:43:50 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 23/02/2023 23:16, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Thu, 23 Feb 2023 17:25:12 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> On 22/02/2023 22:58, Alex Williamson wrote:  
> >>> External email: Use caution opening links or attachments
> >>>
> >>>
> >>> On Wed, 22 Feb 2023 19:48:58 +0200
> >>> Avihai Horon <avihaih@nvidia.com> wrote:
> >>>  
> >>>> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
> >>>>        trace_vfio_save_cleanup(vbasedev->name);
> >>>>    }
> >>>>
> >>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> >>>> +                                        uint64_t *must_precopy,
> >>>> +                                        uint64_t *can_postcopy)
> >>>> +{
> >>>> +    VFIODevice *vbasedev = opaque;
> >>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>> +
> >>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    /*
> >>>> +     * Initial size should be transferred during pre-copy phase so stop-copy
> >>>> +     * phase will not be slowed down. Report threshold_size to force another
> >>>> +     * pre-copy iteration.
> >>>> +     */
> >>>> +    *must_precopy += migration->precopy_init_size ?
> >>>> +                         threshold_size :
> >>>> +                         migration->precopy_dirty_size;  
> >>> This sure feels like we're feeding false data back to the iterator to
> >>> spoof it to run another iteration, when the vfio migration protocol
> >>> only recommends that initial_bytes reaches zero before proceeding to
> >>> stop-copy, it's not a requirement.  What benefit is actually observed
> >>> from this?  Why is this required for initial pre-copy support?  It
> >>> seems devious.  
> >> As previously discussed in the thread that added the pre-copy uAPI [1],
> >> the init_bytes can be used by drivers to reduce the downtime.
> >> For example, mlx5 transfers some metadata to the target so it will be
> >> able to pre-allocate resources etc.
> >>
> >> [1]
> >> https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/  
> > Yes, but how does that become a requirement to QEMU that it must
> > iterate until the initial segment is complete?  Especially when we need
> > to trigger that behavior via such nefarious means.  AIUI, QEMU should
> > be allowed to move to stop-copy at any point.  We should make efforts
> > that QEMU would never decide on its own to move from pre-copy to
> > stop-copy without completing the init_bytes (which sounds suspiciously
> > like the purpose of @must_precopy),  
> 
> @must_precopy represents the pending bytes that must be transferred 
> during pre-copy or stop-copy. If it's under the threshold, then 
> migration will move to stop-copy and be completed.
> So simply adding init_bytes to @must_precopy will not guarantee that we 
> send all init_bytes before moving to stop-copy, since the transition to 
> stop-copy can happen when @must_precopy != 0.

But we have no requirement to send all init_bytes before stop-copy.
This is a hack to achieve a theoretical benefit that a driver might be
able to improve the latency on the target by completing another
iteration.  If drivers are filling in a "must_precopy" arg, it sounds
like even if migration moves to stop-copy, that data should be migrated
first and deferring stop-copy could potentially extend the migration in
other areas.

> >   but if, for instance a user forces a
> > transition to stop-copy, I don't see that we have any business to
> > impose a policy to delay that until the init_bytes is complete.  
> 
> Is there a way a user can force the migration to move to stop-copy?
> Looking at migration code, it seems that the only way to move to 
> stop-copy is if @must_precopy is below the threshold.
> If so, then this is our effort to make QEMU send all init_bytes before
> moving to stop-copy, and we can only benefit from it.

But we have no requirement to send all init_bytes before stop-copy.
This is a hack to achieve a theoretical benefit that a driver might be
able to improve the latency on the target by completing another
iteration.  If drivers are filling in a "must_precopy" arg, it sounds
like even if migration moves to stop-copy, that data should be migrated
first and deferring stop-copy could potentially extend the migration in
other areas.
 
> Regarding how to do it -- maybe instead of spoofing @must_precopy we can
> introduce a new parameter in the upper migration layer (e.g., @init_precopy)
> and add another condition in the migration layer that it must be zero to
> move to stop-copy.

Why not just move to stop-copy but transfer all must_precopy data
first?  That would seem to align with the naming to me.  I don't think
the device actually cares if the transfer happens while the device is
running or stopped, it just wants it at the target device early enough
to start configuration, right?

I'd drop this for an initial implementation, the uAPI does not require
that QEMU complete init_bytes before transitioning to stop-copy and
this is clearly not a very clean or well justified means to try to
achieve that as a non-requirement.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-27 16:14           ` Alex Williamson
@ 2023-02-27 17:26             ` Jason Gunthorpe
  2023-02-27 17:43               ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-02-27 17:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:

> But we have no requirement to send all init_bytes before stop-copy.
> This is a hack to achieve a theoretical benefit that a driver might be
> able to improve the latency on the target by completing another
> iteration.

I think this is another half-step at this point..

The goal is to not stop the VM until the target VFIO driver has
completed loading initial_bytes.

This signals that the time consuming pre-setup is completed in the
device and we don't have to use downtime to do that work.

We've measured this in our devices and the time-shift can be
significant, like seconds levels of time removed from the downtime
period.

Stopping the VM before this pre-setup is done is simply extending the
stopped VM downtime.

Really what we want is to have the far side acknowledge that
initial_bytes has completed loading.

To remind, what mlx5 is doing here with precopy is time-shifting work,
not data. We want to put expensive work (ie time) into the period when
the VM is still running and have less downtime.

This challenges the assumption built into QEMU that all data has equal
time and that it can estimate downtime simply by scaling the estimated
data. We have a data-size independent time component to deal with as
well.

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-27 17:26             ` Jason Gunthorpe
@ 2023-02-27 17:43               ` Alex Williamson
  2023-03-01 18:49                 ` Avihai Horon
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-27 17:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Mon, 27 Feb 2023 13:26:00 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
> 
> > But we have no requirement to send all init_bytes before stop-copy.
> > This is a hack to achieve a theoretical benefit that a driver might be
> > able to improve the latency on the target by completing another
> > iteration.  
> 
> I think this is another half-step at this point..
> 
> The goal is to not stop the VM until the target VFIO driver has
> completed loading initial_bytes.
> 
> This signals that the time consuming pre-setup is completed in the
> device and we don't have to use downtime to do that work.
> 
> We've measured this in our devices and the time-shift can be
> significant, like seconds levels of time removed from the downtime
> period.
> 
> Stopping the VM before this pre-setup is done is simply extending the
> stopped VM downtime.
> 
> Really what we want is to have the far side acknowledge that
> initial_bytes has completed loading.
> 
> To remind, what mlx5 is doing here with precopy is time-shifting work,
> not data. We want to put expensive work (ie time) into the period when
> the VM is still running and have less downtime.
> 
> This challenges the assumption built into qmeu that all data has equal
> time and it can estimate downtime time simply by scaling the estimated
> data. We have a data-size independent time component to deal with as
> well.

As I mentioned before, I understand the motivation, but imo the
implementation is exploiting the interface it extended in order to force
a device driven policy which is specifically not a requirement of the
vfio migration uAPI.  It sounds like there's more work required in the
QEMU migration interfaces to properly factor this information into the
algorithm.  Until then, this seems like a follow-on improvement unless
you can convince the migration maintainers that providing false
information in order to force another pre-copy iteration is a valid use
of passing the threshold value to the driver.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-23 21:50           ` Alex Williamson
  2023-02-23 21:54             ` Joao Martins
@ 2023-02-28 12:11             ` Joao Martins
  2023-02-28 20:36               ` Alex Williamson
  1 sibling, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-02-28 12:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 23/02/2023 21:50, Alex Williamson wrote:
> On Thu, 23 Feb 2023 21:19:12 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 23/02/2023 21:05, Alex Williamson wrote:
>>> On Thu, 23 Feb 2023 10:37:10 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 22/02/2023 22:10, Alex Williamson wrote:  
>>>>> On Wed, 22 Feb 2023 19:49:05 +0200
>>>>> Avihai Horon <avihaih@nvidia.com> wrote:    
>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>>>          .iova = iova,
>>>>>>          .size = size,
>>>>>>      };
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>>>>>> +    if (ret) {
>>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>>>>>> +                     iova, size, ret, strerror(-ret));
>>>>>> +
>>>>>> +        return ret;
>>>>>> +    }    
>>>>>
>>>>> Is there no way to replay the mappings when a migration is started?
>>>>> This seems like a horrible latency and bloat trade-off for the
>>>>> possibility that the VM might migrate and the device might support
>>>>> these features.  Our performance with vIOMMU is already terrible, I
>>>>> can't help but believe this makes it worse.  Thanks,
>>>>>     
>>>>
>>>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
>>>> that uses a max-iova based IOVA range. So this is really for iommu identity
>>>> mapping and no-VIOMMU.  
>>>
>>> Ok, yes, there are no mappings recorded for any containers that have a
>>> non-empty giommu_list.
>>>   
>>>> We could replay them if they were tracked/stored anywhere.  
>>>
>>> Rather than piggybacking on vfio_memory_listener, why not simply
>>> register a new MemoryListener when migration is started?  That will
>>> replay all the existing ranges and allow tracking to happen separate
>>> from mapping, and only when needed.
>>>   
>>
>> The problem with that is that *starting* dirty tracking needs to have all the
>> range, we aren't supposed to start each range separately. So on a memory
>> listener callback you don't have introspection when you are dealing with the
>> last range, do we?
> 
> As soon as memory_listener_register() returns, all your callbacks to
> build the IOVATree have been called and you can act on the result the
> same as if you were relying on the vfio mapping MemoryListener.  I'm
> not seeing the problem.  Thanks,
> 

While doing these changes, the nice thing about the current patch is that
whatever changes apply to vfio_listener_region_add() will be reflected in the
mappings tree that stores what we will dirty track. If we move the mappings
calculation to the point where dirty tracking is started, we will have to
duplicate the same checks, and open the door to bugs where we ask for ranges
to be dirty tracked that haven't been DMA mapped. These two aren't necessarily
tied, but I felt I should raise the potential duplication of the checks (and
the same thing applies for handling virtio-mem and whatnot).

I understand that if we were going to store *a lot* of mappings, this would
add up in space requirements. But for the no-vIOMMU (or iommu=pt) case this is
only about 12 ranges or so, so it is much simpler to piggyback on the existing
listener. Would you still want to move this to its own dedicated memory
listener?

	Joao
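
For illustration, below is a minimal sketch of the dedicated-listener
approach being discussed: register a MemoryListener only when dirty tracking
starts, so that all existing sections are replayed into min/max IOVA marks.
The VFIODirtyRanges type and function names are made up for the sketch and
are not taken from the series:

  typedef struct VFIODirtyRanges {
      MemoryListener listener;
      hwaddr min;
      hwaddr max;
  } VFIODirtyRanges;

  static void vfio_dirty_tracking_region_add(MemoryListener *listener,
                                             MemoryRegionSection *section)
  {
      VFIODirtyRanges *range = container_of(listener, VFIODirtyRanges,
                                            listener);
      hwaddr iova = section->offset_within_address_space;
      hwaddr end = iova + int128_get64(section->size) - 1;

      /* Widen the marks that will later be handed to the device. */
      range->min = MIN(range->min, iova);
      range->max = MAX(range->max, end);
  }

  static void vfio_dirty_tracking_init(VFIOContainer *container,
                                       VFIODirtyRanges *range)
  {
      range->min = HWADDR_MAX;
      range->max = 0;
      range->listener = (MemoryListener) {
          .region_add = vfio_dirty_tracking_region_add,
      };
      /* Registering replays all current sections, so the ranges are known
       * before DMA logging is started. */
      memory_listener_register(&range->listener, container->space->as);
      memory_listener_unregister(&range->listener);
  }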


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-28 12:11             ` Joao Martins
@ 2023-02-28 20:36               ` Alex Williamson
  2023-03-02  0:07                 ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-02-28 20:36 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Tue, 28 Feb 2023 12:11:06 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 23/02/2023 21:50, Alex Williamson wrote:
> > On Thu, 23 Feb 2023 21:19:12 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 23/02/2023 21:05, Alex Williamson wrote:  
> >>> On Thu, 23 Feb 2023 10:37:10 +0000
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>> On 22/02/2023 22:10, Alex Williamson wrote:    
> >>>>> On Wed, 22 Feb 2023 19:49:05 +0200
> >>>>> Avihai Horon <avihaih@nvidia.com> wrote:      
> >>>>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>>>>>          .iova = iova,
> >>>>>>          .size = size,
> >>>>>>      };
> >>>>>> +    int ret;
> >>>>>> +
> >>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
> >>>>>> +    if (ret) {
> >>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
> >>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
> >>>>>> +                     iova, size, ret, strerror(-ret));
> >>>>>> +
> >>>>>> +        return ret;
> >>>>>> +    }      
> >>>>>
> >>>>> Is there no way to replay the mappings when a migration is started?
> >>>>> This seems like a horrible latency and bloat trade-off for the
> >>>>> possibility that the VM might migrate and the device might support
> >>>>> these features.  Our performance with vIOMMU is already terrible, I
> >>>>> can't help but believe this makes it worse.  Thanks,
> >>>>>       
> >>>>
> >>>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
> >>>> that uses a max-iova based IOVA range. So this is really for iommu identity
> >>>> mapping and no-VIOMMU.    
> >>>
> >>> Ok, yes, there are no mappings recorded for any containers that have a
> >>> non-empty giommu_list.
> >>>     
> >>>> We could replay them if they were tracked/stored anywhere.    
> >>>
> >>> Rather than piggybacking on vfio_memory_listener, why not simply
> >>> register a new MemoryListener when migration is started?  That will
> >>> replay all the existing ranges and allow tracking to happen separate
> >>> from mapping, and only when needed.
> >>>     
> >>
> >> The problem with that is that *starting* dirty tracking needs to have all the
> >> range, we aren't supposed to start each range separately. So on a memory
> >> listener callback you don't have introspection when you are dealing with the
> >> last range, do we?  
> > 
> > As soon as memory_listener_register() returns, all your callbacks to
> > build the IOVATree have been called and you can act on the result the
> > same as if you were relying on the vfio mapping MemoryListener.  I'm
> > not seeing the problem.  Thanks,
> >   
> 
> While doing these changes, the nice thing about the current patch is that
> whatever changes apply to vfio_listener_region_add() will be reflected in the
> mappings tree that stores what we will dirty track. If we move the mappings
> calculation to the point where dirty tracking is started, we will have to
> duplicate the same checks, and open the door to bugs where we ask for ranges
> to be dirty tracked that haven't been DMA mapped. These two aren't necessarily
> tied, but I felt I should raise the potential duplication of the checks (and
> the same thing applies for handling virtio-mem and whatnot).
> 
> I understand that if we were going to store *a lot* of mappings, this would
> add up in space requirements. But for the no-vIOMMU (or iommu=pt) case this is
> only about 12 ranges or so, so it is much simpler to piggyback on the existing
> listener. Would you still want to move this to its own dedicated memory
> listener?

Code duplication and bugs are good points, but while typically we're
only seeing a few handfuls of ranges, doesn't virtio-mem in particular
allow that we could be seeing quite a lot more?

We used to be limited to a fairly small number of KVM memory slots,
which effectively bounded non-vIOMMU DMA mappings, but that value is
now 2^15, so we need to anticipate that we could see many more than a
dozen mappings.

Can we make the same argument that the overhead is negligible if a VM
makes use of 10s of GB of virtio-mem with 2MB block size?

But then on a 4KB host we're limited to 256 tracking entries, so
wasting all that time and space on a runtime IOVATree is even more
dubious.

In fact, it doesn't really matter that vfio_listener_region_add and
this potentially new listener come to the same result, as long as the
new listener is a superset of the existing listener.  So I think we can
simplify out a lot of the places we'd see duplication and bugs.  I'm
not even really sure why we wouldn't simplify things further and only
record a single range covering the low and high memory marks for
non-vIOMMU VMs, or potentially an approximation removing gaps of 1GB or
more, for example.  Thanks,

Alex
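
As a rough illustration of the single-range / gap-removal approximation
mentioned above (purely a sketch; the 1GB threshold, type and helper name
are illustrative, not from the series):

  #define GAP_THRESHOLD   (1ULL << 30)    /* illustrative: merge gaps < 1GB */

  typedef struct {
      hwaddr iova;
      hwaddr end;     /* inclusive */
  } DirtyRange;

  /*
   * Coalesce a sorted array of DMA-mapped ranges, swallowing any gap smaller
   * than GAP_THRESHOLD, so the device only needs to track a handful of large
   * ranges.  Returns the number of ranges kept.
   */
  static int coalesce_ranges(DirtyRange *r, int n)
  {
      int out = 0;

      for (int i = 1; i < n; i++) {
          if (r[i].iova - r[out].end < GAP_THRESHOLD) {
              r[out].end = r[i].end;      /* small gap: extend previous range */
          } else {
              r[++out] = r[i];            /* large gap: keep a separate range */
          }
      }
      return n ? out + 1 : 0;
  }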



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-02-27 17:43               ` Alex Williamson
@ 2023-03-01 18:49                 ` Avihai Horon
  2023-03-01 19:55                   ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Avihai Horon @ 2023-03-01 18:49 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 27/02/2023 19:43, Alex Williamson wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, 27 Feb 2023 13:26:00 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
>>
>>> But we have no requirement to send all init_bytes before stop-copy.
>>> This is a hack to achieve a theoretical benefit that a driver might be
>>> able to improve the latency on the target by completing another
>>> iteration.
>> I think this is another half-step at this point..
>>
>> The goal is to not stop the VM until the target VFIO driver has
>> completed loading initial_bytes.
>>
>> This signals that the time consuming pre-setup is completed in the
>> device and we don't have to use downtime to do that work.
>>
>> We've measured this in our devices and the time-shift can be
>> significant, like seconds levels of time removed from the downtime
>> period.
>>
>> Stopping the VM before this pre-setup is done is simply extending the
>> stopped VM downtime.
>>
>> Really what we want is to have the far side acknowledge that
>> initial_bytes has completed loading.
>>
>> To remind, what mlx5 is doing here with precopy is time-shifting work,
>> not data. We want to put expensive work (ie time) into the period when
>> the VM is still running and have less downtime.
>>
>> This challenges the assumption built into QEMU that all data has equal
>> time and that it can estimate downtime simply by scaling the estimated
>> data. We have a data-size independent time component to deal with as
>> well.
> As I mentioned before, I understand the motivation, but imo the
> implementation is exploiting the interface it extended in order to force
> a device driven policy which is specifically not a requirement of the
> vfio migration uAPI.  It sounds like there's more work required in the
> QEMU migration interfaces to properly factor this information into the
> algorithm.  Until then, this seems like a follow-on improvement unless
> you can convince the migration maintainers that providing false
> information in order to force another pre-copy iteration is a valid use
> of passing the threshold value to the driver.

In my previous message I suggested dropping this exploit and instead 
changing the QEMU migration API to introduce the concept of pre-copy 
initial bytes -- data that must be transferred before the source VM 
stops (which is different from the current @must_precopy, which represents 
data that can be transferred even while the VM is stopped).
We could do it by adding a new parameter "init_precopy_size" to the 
state_pending_{estimate,exact} handlers and every migration user could 
use it (RAM, block, etc).
We will also change the migration algorithm to take this new parameter 
into account when deciding to move to stop-copy.
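
To make the idea concrete, a rough sketch of what such an extension could 
look like (the extra parameter and its plumbing are hypothetical -- the 
exact shape would be up to the migration maintainers):

  /* Hypothetical extension of the pending-state handlers (sketch only): */
  static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
                                          uint64_t *must_precopy,
                                          uint64_t *can_postcopy,
                                          uint64_t *init_precopy_size)
  {
      VFIODevice *vbasedev = opaque;
      VFIOMigration *migration = vbasedev->migration;

      *must_precopy += migration->precopy_dirty_size;
      /* Data that has to be sent while the source VM is still running. */
      *init_precopy_size += migration->precopy_init_size;
  }

  /*
   * The migration core would then require both conditions before switching
   * to stop-copy (sketch):
   *
   *     if (init_precopy_size == 0 && must_precopy <= threshold_size) {
   *         ... move to stop-copy ...
   *     }
   */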

Of course this will have to be approved by migration maintainers first, 
but if it's done in a standard way such as above, via the migration API, 
would it be OK by you to go this way?

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-27 14:09   ` Cédric Le Goater
@ 2023-03-01 18:56     ` Avihai Horon
  2023-03-02 13:24     ` Joao Martins
  1 sibling, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-03-01 18:56 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 27/02/2023 16:09, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 2/22/23 18:49, Avihai Horon wrote:
>> There are already two places where dirty page bitmap allocation and
>> calculations are done in open code. With device dirty page tracking
>> being added in next patches, there are going to be even more places.
>>
>> To avoid code duplication, introduce VFIOBitmap struct and corresponding
>> alloc and dealloc functions and use them where applicable.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 60 insertions(+), 29 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ac93b85632..84f08bdbbb 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>>    * Device state interfaces
>>    */
>>
>> +typedef struct {
>> +    unsigned long *bitmap;
>> +    hwaddr size;
>> +    hwaddr pages;
>> +} VFIOBitmap;
>> +
>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>> +{
>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
>
> I think using g_malloc0() for the VFIOBitmap should be fine. If QEMU
> cannot allocate a couple of bytes, we are in trouble anyway.
>
Sure, this will simplify the code a bit. I will change it.

Thanks.
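
For illustration, one possible shape of the helper after that change: the 
tiny struct allocated with plain g_new0(), while the potentially large 
bitmap keeps the fallible allocation. This is only a sketch, not the final 
code:

  static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
  {
      VFIOBitmap *vbmap = g_new0(VFIOBitmap, 1); /* small struct, abort on OOM */

      vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
      vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
                    BITS_PER_BYTE;
      vbmap->bitmap = g_try_malloc0(vbmap->size); /* can be large, keep fallible */
      if (!vbmap->bitmap) {
          g_free(vbmap);
          errno = ENOMEM;
          return NULL;
      }

      return vbmap;
  }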

>
>
>> +    if (!vbmap) {
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / 
>> qemu_real_host_page_size();
>> +    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * 
>> BITS_PER_BYTE) /
>> +                                         BITS_PER_BYTE;
>> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
>> +    if (!vbmap->bitmap) {
>> +        g_free(vbmap);
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    return vbmap;
>> +}
>> +
>> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
>> +{
>> +    g_free(vbmap->bitmap);
>> +    g_free(vbmap);
>> +}
>> +
>>   bool vfio_mig_active(void)
>>   {
>>       VFIOGroup *group;
>> @@ -470,9 +505,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer 
>> *container,
>>   {
>>       struct vfio_iommu_type1_dma_unmap *unmap;
>>       struct vfio_bitmap *bitmap;
>> -    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / 
>> qemu_real_host_page_size();
>> +    VFIOBitmap *vbmap;
>>       int ret;
>>
>> +    vbmap = vfio_bitmap_alloc(size);
>> +    if (!vbmap) {
>> +        return -errno;
>> +    }
>> +
>>       unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
>>
>>       unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
>> @@ -486,35 +526,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer 
>> *container,
>>        * qemu_real_host_page_size to mark those dirty. Hence set 
>> bitmap_pgsize
>>        * to qemu_real_host_page_size.
>>        */
>> -
>>       bitmap->pgsize = qemu_real_host_page_size();
>> -    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>> -                   BITS_PER_BYTE;
>> +    bitmap->size = vbmap->size;
>> +    bitmap->data = (__u64 *)vbmap->bitmap;
>>
>> -    if (bitmap->size > container->max_dirty_bitmap_size) {
>> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>> -                     (uint64_t)bitmap->size);
>> +    if (vbmap->size > container->max_dirty_bitmap_size) {
>> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, 
>> vbmap->size);
>>           ret = -E2BIG;
>>           goto unmap_exit;
>>       }
>>
>> -    bitmap->data = g_try_malloc0(bitmap->size);
>> -    if (!bitmap->data) {
>> -        ret = -ENOMEM;
>> -        goto unmap_exit;
>> -    }
>> -
>>       ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
>>       if (!ret) {
>> -        cpu_physical_memory_set_dirty_lebitmap((unsigned long 
>> *)bitmap->data,
>> -                iotlb->translated_addr, pages);
>> + cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap,
>> +                iotlb->translated_addr, vbmap->pages);
>>       } else {
>>           error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
>>       }
>>
>> -    g_free(bitmap->data);
>>   unmap_exit:
>>       g_free(unmap);
>> +    vfio_bitmap_dealloc(vbmap);
>> +
>>       return ret;
>>   }
>>
>> @@ -1331,7 +1364,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
>> *container, uint64_t iova,
>>   {
>>       struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>>       struct vfio_iommu_type1_dirty_bitmap_get *range;
>> -    uint64_t pages;
>> +    VFIOBitmap *vbmap;
>>       int ret;
>>
>>       if (!container->dirty_pages_supported) {
>> @@ -1341,6 +1374,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
>> *container, uint64_t iova,
>>           return 0;
>>       }
>>
>> +    vbmap = vfio_bitmap_alloc(size);
>> +    if (!vbmap) {
>> +        return -errno;
>> +    }
>> +
>>       dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
>>
>>       dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
>> @@ -1355,15 +1393,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer 
>> *container, uint64_t iova,
>>        * to qemu_real_host_page_size.
>>        */
>>       range->bitmap.pgsize = qemu_real_host_page_size();
>> -
>> -    pages = REAL_HOST_PAGE_ALIGN(range->size) / 
>> qemu_real_host_page_size();
>> -    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * 
>> BITS_PER_BYTE) /
>> -                                         BITS_PER_BYTE;
>> -    range->bitmap.data = g_try_malloc0(range->bitmap.size);
>> -    if (!range->bitmap.data) {
>> -        ret = -ENOMEM;
>> -        goto err_out;
>> -    }
>> +    range->bitmap.size = vbmap->size;
>> +    range->bitmap.data = (__u64 *)vbmap->bitmap;
>>
>>       ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
>>       if (ret) {
>> @@ -1374,14 +1405,14 @@ static int 
>> vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>>           goto err_out;
>>       }
>>
>> -    cpu_physical_memory_set_dirty_lebitmap((unsigned long 
>> *)range->bitmap.data,
>> -                                            ram_addr, pages);
>> +    cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
>> +                                           vbmap->pages);
>>
>>       trace_vfio_get_dirty_bitmap(container->fd, range->iova, 
>> range->size,
>>                                   range->bitmap.size, ram_addr);
>>   err_out:
>> -    g_free(range->bitmap.data);
>>       g_free(dbitmap);
>> +    vfio_bitmap_dealloc(vbmap);
>>
>>       return ret;
>>   }
>


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
  2023-02-27 13:50         ` Cédric Le Goater
@ 2023-03-01 19:04           ` Avihai Horon
  0 siblings, 0 replies; 93+ messages in thread
From: Avihai Horon @ 2023-03-01 19:04 UTC (permalink / raw)
  To: Cédric Le Goater, Joao Martins, Alex Williamson
  Cc: qemu-devel, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Peter Xu, Jason Wang, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta


On 27/02/2023 15:50, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 2/26/23 18:00, Avihai Horon wrote:
>>
>> On 24/02/2023 21:26, Joao Martins wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On 23/02/2023 14:56, Avihai Horon wrote:
>>>> On 22/02/2023 22:55, Alex Williamson wrote:
>>>>> There are various errors running this through the CI on gitlab.
>>>>>
>>>>> This one seems bogus but needs to be resolved regardless:
>>>>>
>>>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940731
>>>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>>>> 2786s390x-linux-gnu-gcc -m64 -Ilibqemu-aarch64-softmmu.fa.p -I. -I..
>>>>> -Itarget/arm -I../target/arm -Iqapi -Itrace -Iui -Iui/shader
>>>>> -I/usr/include/pixman-1 -I/usr/include/capstone 
>>>>> -I/usr/include/glib-2.0
>>>>> -I/usr/lib/s390x-linux-gnu/glib-2.0/include 
>>>>> -fdiagnostics-color=auto -Wall
>>>>> -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem
>>>>> /builds/alex.williamson/qemu/linux-headers -isystem linux-headers 
>>>>> -iquote .
>>>>> -iquote /builds/alex.williamson/qemu -iquote
>>>>> /builds/alex.williamson/qemu/include -iquote
>>>>> /builds/alex.williamson/qemu/tcg/s390x -pthread -U_FORTIFY_SOURCE
>>>>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
>>>>> -D_LARGEFILE_SOURCE
>>>>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>>>>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>>>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits 
>>>>> -Wformat-security
>>>>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body 
>>>>> -Wnested-externs
>>>>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>>>>> -Wmissing-format-attribute -Wno-missing-include-dirs 
>>>>> -Wno-shift-negative-value
>>>>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>>>>> -isystemlinux-headers -DNEED_CPU_H
>>>>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>>>>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c 
>>>>> ../hw/vfio/common.c
>>>>> 2787../hw/vfio/common.c: In function 
>>>>> ‘vfio_listener_log_global_start’:
>>>>> 2788../hw/vfio/common.c:1772:8: error: ‘ret’ may be used 
>>>>> uninitialized in this
>>>>> function [-Werror=maybe-uninitialized]
>>>>> 2789 1772 |     if (ret) {
>>>>> 2790      |        ^
>>>>>
>>>>> 32-bit builds have some actual errors though:
>>>>>
>>>>> https://gitlab.com/alex.williamson/qemu/-/jobs/3817940719
>>>>> FAILED: libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o
>>>>> 2601cc -m32 -Ilibqemu-aarch64-softmmu.fa.p -I. -I.. -Itarget/arm
>>>>> -I../target/arm -Iqapi -Itrace -Iui -Iui/shader 
>>>>> -I/usr/include/pixman-1
>>>>> -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include 
>>>>> -I/usr/include/sysprof-4
>>>>> -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 
>>>>> -O2 -g
>>>>> -isystem /builds/alex.williamson/qemu/linux-headers -isystem 
>>>>> linux-headers
>>>>> -iquote . -iquote /builds/alex.williamson/qemu -iquote
>>>>> /builds/alex.williamson/qemu/include -iquote
>>>>> /builds/alex.williamson/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE
>>>>> -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
>>>>> -D_LARGEFILE_SOURCE
>>>>> -fno-strict-aliasing -fno-common -fwrapv -Wundef -Wwrite-strings
>>>>> -Wmissing-prototypes -Wstrict-prototypes -Wredundant-decls
>>>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits 
>>>>> -Wformat-security
>>>>> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body 
>>>>> -Wnested-externs
>>>>> -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2
>>>>> -Wmissing-format-attribute -Wno-missing-include-dirs 
>>>>> -Wno-shift-negative-value
>>>>> -Wno-psabi -fstack-protector-strong -fPIE -isystem../linux-headers
>>>>> -isystemlinux-headers -DNEED_CPU_H
>>>>> '-DCONFIG_TARGET="aarch64-softmmu-config-target.h"'
>>>>> '-DCONFIG_DEVICES="aarch64-softmmu-config-devices.h"' -MD -MQ
>>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -MF
>>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o.d -o
>>>>> libqemu-aarch64-softmmu.fa.p/hw_vfio_common.c.o -c 
>>>>> ../hw/vfio/common.c
>>>>> 2602../hw/vfio/common.c: In function
>>>>> 'vfio_device_feature_dma_logging_start_create':
>>>>> 2603../hw/vfio/common.c:1572:27: error: cast from pointer to 
>>>>> integer of
>>>>> different size [-Werror=pointer-to-int-cast]
>>>>> 2604 1572 |         control->ranges = (uint64_t)ranges;
>>>>> 2605      |                           ^
>>>>> 2606../hw/vfio/common.c:1596:23: error: cast from pointer to 
>>>>> integer of
>>>>> different size [-Werror=pointer-to-int-cast]
>>>>> 2607 1596 |     control->ranges = (uint64_t)ranges;
>>>>> 2608      |                       ^
>>>>> 2609../hw/vfio/common.c: In function
>>>>> 'vfio_device_feature_dma_logging_start_destroy':
>>>>> 2610../hw/vfio/common.c:1620:9: error: cast to pointer from 
>>>>> integer of
>>>>> different size [-Werror=int-to-pointer-cast]
>>>>> 2611 1620 |         (struct vfio_device_feature_dma_logging_range
>>>>> *)control->ranges;
>>>>> 2612      |         ^
>>>>> 2613../hw/vfio/common.c: In function 
>>>>> 'vfio_device_dma_logging_report':
>>>>> 2614../hw/vfio/common.c:1810:22: error: cast from pointer to 
>>>>> integer of
>>>>> different size [-Werror=pointer-to-int-cast]
>>>>> 2615 1810 |     report->bitmap = (uint64_t)bitmap;
>>>>> 2616      |                      ^
>>>> Sure, I will fix these errors.
>>> Just a thought: should the pre-copy series be moved towards the end 
>>> of this
>>> series, given that it's more of an improvement of downtime than a 
>>> must-have like
>>> dirty tracking?
>>
>> Given recent discussion, maybe it would be better to split this 
>> series and go one step at a time:
>> Start with basic support for device dirty tracking (without vIOMMU 
>> support), then add pre-copy and then add vIOMMU support to device 
>> dirty tracking.
>
> and add the fixes first in the series. They could be merged quickly.

Yes, of course. I will add them.

Thanks.



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-03-01 18:49                 ` Avihai Horon
@ 2023-03-01 19:55                   ` Alex Williamson
  2023-03-01 21:12                     ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-03-01 19:55 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Jason Gunthorpe, qemu-devel, Cédric Le Goater,
	Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Peter Xu, Jason Wang, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, 1 Mar 2023 20:49:28 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/02/2023 19:43, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Mon, 27 Feb 2023 13:26:00 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >  
> >> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
> >>  
> >>> But we have no requirement to send all init_bytes before stop-copy.
> >>> This is a hack to achieve a theoretical benefit that a driver might be
> >>> able to improve the latency on the target by completing another
> >>> iteration.  
> >> I think this is another half-step at this point..
> >>
> >> The goal is to not stop the VM until the target VFIO driver has
> >> completed loading initial_bytes.
> >>
> >> This signals that the time consuming pre-setup is completed in the
> >> device and we don't have to use downtime to do that work.
> >>
> >> We've measured this in our devices and the time-shift can be
> >> significant, like seconds levels of time removed from the downtime
> >> period.
> >>
> >> Stopping the VM before this pre-setup is done is simply extending the
> >> stopped VM downtime.
> >>
> >> Really what we want is to have the far side acknowledge that
> >> initial_bytes has completed loading.
> >>
> >> To remind, what mlx5 is doing here with precopy is time-shifting work,
> >> not data. We want to put expensive work (ie time) into the period when
> >> the VM is still running and have less downtime.
> >>
> >> This challenges the assumption built into QEMU that all data has equal
> >> time and that it can estimate downtime simply by scaling the estimated
> >> data. We have a data-size independent time component to deal with as
> >> well.
> > As I mentioned before, I understand the motivation, but imo the
> > implementation is exploiting the interface it extended in order to force
> > a device driven policy which is specifically not a requirement of the
> > vfio migration uAPI.  It sounds like there's more work required in the
> > QEMU migration interfaces to properly factor this information into the
> > algorithm.  Until then, this seems like a follow-on improvement unless
> > you can convince the migration maintainers that providing false
> > information in order to force another pre-copy iteration is a valid use
> > of passing the threshold value to the driver.  
> 
> In my previous message I suggested dropping this exploit and instead 
> changing the QEMU migration API to introduce the concept of pre-copy 
> initial bytes -- data that must be transferred before the source VM 
> stops (which is different from the current @must_precopy, which represents 
> data that can be transferred even while the VM is stopped).
> We could do it by adding a new parameter "init_precopy_size" to the 
> state_pending_{estimate,exact} handlers and every migration user could 
> use it (RAM, block, etc).
> We will also change the migration algorithm to take this new parameter 
> into account when deciding to move to stop-copy.
> 
> Of course this will have to be approved by migration maintainers first, 
> but if it's done in a standard way such as above, via the migration API, 
> would it be OK by you to go this way?

I still think we're conflating information and requirements by allowing
a device to impose a policy which keeps QEMU in pre-copy.  AIUI, what
we're trying to do is maximize the time separation between the
initial_bytes from the device and the end-of-stream.  But knowing the
data size of initial_bytes is not really all that useful.

If we think about the limits of network bandwidth, all data transfers
approach zero time, but the startup latency of the target device that
we're trying to maximize here is fixed.  By prioritizing initial_bytes,
we're separating in space the beginning of target device setup from the
end-of-stream, but that's only an approximation of time, which is what
QEMU really needs to know to honor downtime requirements.

So it seems like what we need here is both a preface buffer size and a
target device latency.  The QEMU pre-copy algorithm should factor both
the remaining data size and the device latency into deciding when to
transition to stop-copy, thereby allowing the device to feed actually
relevant data into the algorithm rather than dictate its behavior.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-03-01 19:55                   ` Alex Williamson
@ 2023-03-01 21:12                     ` Jason Gunthorpe
  2023-03-01 22:39                       ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Jason Gunthorpe @ 2023-03-01 21:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:

> So it seems like what we need here is both a preface buffer size and a
> target device latency.  The QEMU pre-copy algorithm should factor both
> the remaining data size and the device latency into deciding when to
> transition to stop-copy, thereby allowing the device to feed actually
> relevant data into the algorithm rather than dictate its behavior.

I don't know that we can realistically estimate startup latency,
especially having the sender estimate latency on the receiver.

I feel like trying to overlap the device startup with the STOP phase
is an unnecessary optimization? How do you see it benefiting?

I've been thinking of this from the perspective that we should always
ensure device startup is completed, it is time that has to be paid,
why pay it during STOP?

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-03-01 21:12                     ` Jason Gunthorpe
@ 2023-03-01 22:39                       ` Alex Williamson
  2023-03-06 19:01                         ` Jason Gunthorpe
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-03-01 22:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, 1 Mar 2023 17:12:51 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:
> 
> > So it seems like what we need here is both a preface buffer size and a
> > target device latency.  The QEMU pre-copy algorithm should factor both
> > the remaining data size and the device latency into deciding when to
> > transition to stop-copy, thereby allowing the device to feed actually
> > relevant data into the algorithm rather than dictate its behavior.  
> 
> I don't know that we can realistically estimate startup latency,
> especially having the sender estimate latency on the receiver.

Knowing that the target device is compatible with the source is a point
towards making an educated guess.

> I feel like trying to overlap the device startup with the STOP phase
> is an unnecessary optimization? How do you see it benefiting?

If we can't guarantee that there's some time difference between sending
initial bytes immediately at the end of pre-copy vs immediately at the
beginning of stop-copy, does that mean any handling of initial bytes is
an unnecessary optimization?

I'm imagining that completing initial bytes triggers some
initialization sequence in the target host driver which runs in
parallel to the remaining data stream, so in practice, even if sent at
the beginning of stop-copy, the target device gets a head start.

> I've been thinking of this from the perspective that we should always
> ensure device startup is completed, it is time that has to be paid,
> why pay it during STOP?

Creating a policy for QEMU to send initial bytes in a given phase
doesn't ensure startup is complete.  There's no guaranteed time
difference between sending that data and the beginning of stop-copy.

QEMU is trying to achieve a downtime goal, where it estimates network
bandwidth to get a data size threshold, and then polls devices for
remaining data.  That downtime goal might exceed the startup latency of
the target device anyway, in which case it's the operator's choice to pay
that time in stop-copy or while stalled on the target.

But if we actually want to ensure startup of the target is complete,
then drivers should be able to return both data size and estimated time
for the target device to initialize.  That time estimate should be
updated by the driver based on if/when initial_bytes is drained.  The
decision whether to continue iterating pre-copy would then be based on
both the maximum remaining device startup time and the calculated time
based on remaining data size.
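
To make that concrete, here is a minimal sketch of such a convergence check,
assuming a hypothetical per-device estimate of target startup time reported
alongside the remaining data size (the names are illustrative, not an existing
QEMU API):

#include <stdbool.h>
#include <stdint.h>

/*
 * Stop iterating pre-copy only if transferring the remaining data and
 * starting up the slowest target device both fit in the downtime goal.
 * If device startup cannot overlap the remaining transfer, the max()
 * below would become a sum instead.
 */
static bool precopy_can_converge(uint64_t remaining_bytes,
                                 uint64_t bandwidth_bytes_per_ms,
                                 uint64_t max_device_startup_ms,
                                 uint64_t downtime_limit_ms)
{
    uint64_t bw = bandwidth_bytes_per_ms ? bandwidth_bytes_per_ms : 1;
    uint64_t transfer_ms = remaining_bytes / bw;
    uint64_t worst_ms = transfer_ms > max_device_startup_ms ?
                        transfer_ms : max_device_startup_ms;

    return worst_ms <= downtime_limit_ms;
}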

I think this provides a better guarantee than anything based simply on
transferring a given chunk of data in a specific phase of the process.
Thoughts?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-02-28 20:36               ` Alex Williamson
@ 2023-03-02  0:07                 ` Joao Martins
  2023-03-02  0:13                   ` Joao Martins
  2023-03-02 18:42                   ` Alex Williamson
  0 siblings, 2 replies; 93+ messages in thread
From: Joao Martins @ 2023-03-02  0:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 28/02/2023 20:36, Alex Williamson wrote:
> On Tue, 28 Feb 2023 12:11:06 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 23/02/2023 21:50, Alex Williamson wrote:
>>> On Thu, 23 Feb 2023 21:19:12 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 23/02/2023 21:05, Alex Williamson wrote:  
>>>>> On Thu, 23 Feb 2023 10:37:10 +0000
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>> On 22/02/2023 22:10, Alex Williamson wrote:    
>>>>>>> On Wed, 22 Feb 2023 19:49:05 +0200
>>>>>>> Avihai Horon <avihaih@nvidia.com> wrote:      
>>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>>>>>          .iova = iova,
>>>>>>>>          .size = size,
>>>>>>>>      };
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>>>>>>>> +    if (ret) {
>>>>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>>>>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>>>>>>>> +                     iova, size, ret, strerror(-ret));
>>>>>>>> +
>>>>>>>> +        return ret;
>>>>>>>> +    }      
>>>>>>>
>>>>>>> Is there no way to replay the mappings when a migration is started?
>>>>>>> This seems like a horrible latency and bloat trade-off for the
>>>>>>> possibility that the VM might migrate and the device might support
>>>>>>> these features.  Our performance with vIOMMU is already terrible, I
>>>>>>> can't help but believe this makes it worse.  Thanks,
>>>>>>>       
>>>>>>
>>>>>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
>>>>>> that uses a max-iova based IOVA range. So this is really for iommu identity
>>>>>> mapping and no-VIOMMU.    
>>>>>
>>>>> Ok, yes, there are no mappings recorded for any containers that have a
>>>>> non-empty giommu_list.
>>>>>     
>>>>>> We could replay them if they were tracked/stored anywhere.    
>>>>>
>>>>> Rather than piggybacking on vfio_memory_listener, why not simply
>>>>> register a new MemoryListener when migration is started?  That will
>>>>> replay all the existing ranges and allow tracking to happen separate
>>>>> from mapping, and only when needed.
>>>>
>>>> The problem with that is that *starting* dirty tracking needs to have all the
>>>> range, we aren't supposed to start each range separately. So on a memory
>>>> listener callback you don't have introspection when you are dealing with the
>>>> last range, do we?  
>>>
>>> As soon as memory_listener_register() returns, all your callbacks to
>>> build the IOVATree have been called and you can act on the result the
>>> same as if you were relying on the vfio mapping MemoryListener.  I'm
>>> not seeing the problem.  Thanks,
>>>   
>>
>> While doing these changes, the nice thing of the current patch is that whatever
>> changes apply to vfio_listener_region_add() will be reflected in the mappings
>> tree that stores what we will dirty track. If we move the mappings calculation
>> necessary for dirty tracking only when we start, we will have to duplicate the
>> same checks, and open for bugs where we ask things to be dirty track-ed that
>> haven't been DMA mapped. These two aren't necessarily tied, but felt like I
>> should raise the potentially duplication of the checks (and the same thing
>> applies for handling virtio-mem and what not).
>>
>> I understand that if we were going to store *a lot* of mappings that this would
>> add up in space requirements. But for no-vIOMMU (or iommu=pt) case this is only
>> about 12ranges or so, it is much simpler to piggyback the existing listener.
>> Would you still want to move this to its own dedicated memory listener?
> 
> Code duplication and bugs are good points, but while typically we're
> only seeing a few handfuls of ranges, doesn't virtio-mem in particular
> allow that we could be seeing quite a lot more?
> 
Ugh yes, it could be.

> We used to be limited to a fairly small number of KVM memory slots,
> which effectively bounded non-vIOMMU DMA mappings, but that value is
> now 2^15, so we need to anticipate that we could see many more than a
> dozen mappings.
> 

Even with 32k memory slots, today we are still limited to a handful. The
hv-balloon and virtio-mem approaches, though, are the ones that may stress
such a limit, IIUC, prior to starting migration.

> Can we make the same argument that the overhead is negligible if a VM
> makes use of 10s of GB of virtio-mem with 2MB block size?
> 
> But then on a 4KB host we're limited to 256 tracking entries, so
> wasting all that time and space on a runtime IOVATree is even more
> dubious.
>
> In fact, it doesn't really matter that vfio_listener_region_add and
> this potentially new listener come to the same result, as long as the
> new listener is a superset of the existing listener. 

I am trying to structure this in a way that's not too ugly while reusing as
much as possible between vfio_listener_region_add() and vfio_migration_mapping_add().

To give you an idea, here's how it looks thus far:

https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking

Particularly this one:

https://github.com/jpemartins/qemu/commit/3b11fa0e4faa0f9c0f42689a7367284a25d1b585

vfio_get_section_iova_range() is where most of these checks live; they are
essentially a subset of the ones in vfio_listener_region_add().

> So I think we can
> simplify out a lot of the places we'd see duplication and bugs.  I'm
> not even really sure why we wouldn't simplify things further and only
> record a single range covering the low and high memory marks for a
> non-vIOMMU VMs, or potentially an approximation removing gaps of 1GB or
> more, for example.  Thanks,

Yes, for Qemu, having one single artificial range with a computed min IOVA and
max IOVA is the simplest to implement. It would avoid us maintaining an
IOVATree, as you would only track a min/max pair (maybe max_below).

My concern with a reduced single range is 1) big holes in the address space
leading to asking for more than you need[0] and 2) device dirty tracking limits,
e.g. hardware may have upper limits, so you may prematurely exercise those. So
giving more choice to the vfio drivers to decide how to cope with the mapped
address space description looks to have a bit more longevity.

Anyway, the temptation with having a single range is that this can all go away
if vfio_listener_region_add() tracks just a min/max IOVA pair.

Below the scissors mark is how this patch looks in the commit above, keeping a
full list of mappings. It's also stored here:

https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking

I'll respond here with a patch on what it looks like with the range watermark
approach.

	Joao

[0] The AMD 1T boundary is what comes to mind, which on Qemu relocates memory
above 4G to after 1T.

------------------>8--------------------

From: Joao Martins <joao.m.martins@oracle.com>
Date: Wed, 22 Feb 2023 19:49:05 +0200
Subject: [PATCH wip 7/12] vfio/common: Record DMA mapped IOVA ranges

According to the device DMA logging uAPI, IOVA ranges to be logged by
the device must be provided all at once upon DMA logging start.

As preparation for the following patches which will add device dirty
page tracking, keep a record of all DMA mapped IOVA ranges so later they
can be used for DMA logging start.

Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              | 147 +++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events          |   2 +
 include/hw/vfio/vfio-common.h |   4 +
 3 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 655e8dbb74d4..17971e6dbaeb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -44,6 +44,7 @@
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
+#include "qemu/iova-tree.h"

 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }

+static bool vfio_have_giommu(VFIOContainer *container)
+{
+    return !QLIST_EMPTY(&container->giommu_list);
+}
+
 static void vfio_set_migration_error(int err)
 {
     MigrationState *ms = migrate_get_current();
@@ -610,6 +616,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .iova = iova,
         .size = size,
     };
+    int ret;

     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -626,8 +633,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }

+    ret = -errno;
     error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
+
+    return ret;
 }

 static void vfio_host_win_add(VFIOContainer *container,
@@ -1326,11 +1335,127 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     return ret;
 }

+static bool vfio_get_section_iova_range(VFIOContainer *container,
+                                        MemoryRegionSection *section,
+                                        hwaddr *out_iova, hwaddr *out_end)
+{
+    Int128 llend, llsize;
+    hwaddr iova, end;
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return false;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    if (memory_region_is_iommu(section->mr) ||
+        memory_region_has_ram_discard_manager(section->mr)) {
+        return false;
+    }
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        VFIOHostDMAWindow *hostwin;
+        hwaddr pgmask;
+
+        hostwin = vfio_find_hostwin(container, iova, end);
+        if (!hostwin) {
+            return false;
+        }
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            return false;
+        }
+    }
+
+    *out_iova = iova;
+    *out_end = int128_get64(llend);
+    return true;
+}
+
+static void vfio_migration_add_mapping(MemoryListener *listener,
+                                       MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, mappings_listener);
+    hwaddr end = 0;
+    DMAMap map;
+    int ret;
+
+    if (vfio_have_giommu(container)) {
+        vfio_set_migration_error(-EOPNOTSUPP);
+        return;
+    }
+
+    if (!vfio_listener_valid_section(section) ||
+        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
+        return;
+    }
+
+    map.size = end - map.iova - 1; /* IOVATree is inclusive, so subtract 1 from size */
+    map.perm = section->readonly ? IOMMU_RO : IOMMU_RW;
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        ret = iova_tree_insert(container->mappings, &map);
+        if (ret) {
+            if (ret == IOVA_ERR_INVALID) {
+                ret = -EINVAL;
+            } else if (ret == IOVA_ERR_OVERLAP) {
+                ret = -EEXIST;
+            }
+        }
+    }
+
+    trace_vfio_migration_mapping_add(map.iova, map.iova + map.size, ret);
+
+    if (ret)
+        vfio_set_migration_error(ret);
+    return;
+}
+
+static void vfio_migration_remove_mapping(MemoryListener *listener,
+                                          MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, mappings_listener);
+    hwaddr end = 0;
+    DMAMap map;
+
+    if (vfio_have_giommu(container)) {
+        vfio_set_migration_error(-EOPNOTSUPP);
+        return;
+    }
+
+    if (!vfio_listener_valid_section(section) ||
+        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
+        return;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        iova_tree_remove(container->mappings, map);
+    }
+
+    trace_vfio_migration_mapping_del(map.iova, map.iova + map.size);
+}
+
+
+static const MemoryListener vfio_dirty_tracking_listener = {
+    .name = "vfio-migration",
+    .region_add = vfio_migration_add_mapping,
+    .region_del = vfio_migration_remove_mapping,
+};
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;

+    memory_listener_register(&container->mappings_listener, container->space->as);
+
     ret = vfio_set_dirty_page_tracking(container, true);
     if (ret) {
         vfio_set_migration_error(ret);
@@ -1346,6 +1471,8 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     if (ret) {
         vfio_set_migration_error(ret);
     }
+
+    memory_listener_unregister(&container->mappings_listener);
 }

 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
@@ -2172,16 +2299,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
     QLIST_INIT(&container->vrdl_list);
+    container->mappings = iova_tree_new();
+    if (!container->mappings) {
+        error_setg(errp, "Cannot allocate DMA mappings tree");
+        ret = -ENOMEM;
+        goto free_container_exit;
+    }
+    qemu_mutex_init(&container->mappings_mutex);
+    container->mappings_listener = vfio_dirty_tracking_listener;

     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }

     ret = vfio_ram_block_discard_disable(container, true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }

     switch (container->iommu_type) {
@@ -2317,6 +2452,10 @@ listener_release_exit:
 enable_discards_exit:
     vfio_ram_block_discard_disable(container, false);

+destroy_mappings_exit:
+    qemu_mutex_destroy(&container->mappings_mutex);
+    iova_tree_destroy(container->mappings);
+
 free_container_exit:
     g_free(container);

@@ -2371,6 +2510,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }

         trace_vfio_disconnect_container(container->fd);
+        qemu_mutex_destroy(&container->mappings_mutex);
+        iova_tree_destroy(container->mappings);
         close(container->fd);
         g_free(container);

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 669d9fe07cd9..c92eaadcc7c4 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -104,6 +104,8 @@ vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t offset_wi
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
+vfio_migration_mapping_add(uint64_t start, uint64_t end, int err) "mapping_add 0x%"PRIx64" - 0x%"PRIx64" err=%d"
+vfio_migration_mapping_del(uint64_t start, uint64_t end) "mapping_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 87524c64a443..48951da11ab4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -23,6 +23,7 @@

 #include "exec/memory.h"
 #include "qemu/queue.h"
+#include "qemu/iova-tree.h"
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
@@ -81,6 +82,7 @@ typedef struct VFIOContainer {
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
     MemoryListener prereg_listener;
+    MemoryListener mappings_listener;
     unsigned iommu_type;
     Error *error;
     bool initialized;
@@ -89,6 +91,8 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    IOVATree *mappings;
+    QemuMutex mappings_mutex;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
--
2.17.2


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-02  0:07                 ` Joao Martins
@ 2023-03-02  0:13                   ` Joao Martins
  2023-03-02 18:42                   ` Alex Williamson
  1 sibling, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-03-02  0:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 02/03/2023 00:07, Joao Martins wrote:
> On 28/02/2023 20:36, Alex Williamson wrote:

[...]

>> Can we make the same argument that the overhead is negligible if a VM
>> makes use of 10s of GB of virtio-mem with 2MB block size?
>>
>> But then on a 4KB host we're limited to 256 tracking entries, so
>> wasting all that time and space on a runtime IOVATree is even more
>> dubious.
>>
>> In fact, it doesn't really matter that vfio_listener_region_add and
>> this potentially new listener come to the same result, as long as the
>> new listener is a superset of the existing listener. 
> 
> I am trying to put this in a way that's not too ugly to reuse the most between
> vfio_listener_region_add() and the vfio_migration_mapping_add().
> 
> For you to have an idea, here's so far how it looks thus far:
> 
> https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking
> 
> Particularly this one:
> 
> https://github.com/jpemartins/qemu/commit/3b11fa0e4faa0f9c0f42689a7367284a25d1b585
> 
> vfio_get_section_iova_range() is where most of these checks are that are sort of
> a subset of the ones in vfio_listener_region_add().
> 
>> So I think we can
>> simplify out a lot of the places we'd see duplication and bugs.  I'm
>> not even really sure why we wouldn't simplify things further and only
>> record a single range covering the low and high memory marks for a
>> non-vIOMMU VMs, or potentially an approximation removing gaps of 1GB or
>> more, for example.  Thanks,
> 
> Yes, for Qemu, to have one single artificial range with a computed min IOVA and
> max IOVA is the simplest to get it implemented. It would avoid us maintaining an
> IOVATree as you would only track min/max pair (maybe max_below).
> 
> My concern with a reduced single range is 1) big holes in address space leading
> to asking more than you need[*] and then 2) device dirty tracking limits e.g.
> hardware may have upper limits, so you may prematurely exercise those. So giving
> more choice to the vfio drivers to decide how to cope with the mapped address
> space description looks to have a bit more longevity.
> 
> Anyway the temptation with having a single range is that this can all go away if
> the vfio_listener_region_add() tracks just min/max IOVA pair.
> 
> Below scissors mark it's how this patch is looking like in the commit above
> while being a full list of mappings. It's also stored here:
> 
> https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking
> 
> I'll respond here with a patch on what it looks like with the range watermark
> approach.
> 

... Which is here:

https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking-range

And below the scissors mark at the end is this patch in the series. It is
smaller; most of the churn is the new checks. I need to adjust the commit
messages, depending on which way the group decides to go, so take those with a
grain of salt.

> 
> [0] AMD 1T boundary is what comes to mind, which on Qemu relocates memory above
> 4G into after 1T.

---------------->8-----------------

From: Joao Martins <joao.m.martins@oracle.com>
Date: Wed, 22 Feb 2023 19:49:05 +0200
Subject: [PATCH wip 7/12] vfio/common: Record DMA mapped IOVA ranges

According to the device DMA logging uAPI, IOVA ranges to be logged by
the device must be provided all at once upon DMA logging start.

As preparation for the following patches which will add device dirty
page tracking, keep a record of all DMA mapped IOVA ranges so later they
can be used for DMA logging start.

Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              | 110 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   1 +
 include/hw/vfio/vfio-common.h |   5 ++
 3 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 655e8dbb74d4..ff4a2aa0e14b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -44,6 +44,7 @@
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
+#include "qemu/iova-tree.h"

 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }

+static bool vfio_have_giommu(VFIOContainer *container)
+{
+    return !QLIST_EMPTY(&container->giommu_list);
+}
+
 static void vfio_set_migration_error(int err)
 {
     MigrationState *ms = migrate_get_current();
@@ -610,6 +616,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .iova = iova,
         .size = size,
     };
+    int ret;

     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -626,8 +633,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }

+    ret = -errno;
     error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
+
+    return ret;
 }

 static void vfio_host_win_add(VFIOContainer *container,
@@ -1326,11 +1335,93 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     return ret;
 }

+static bool vfio_get_section_iova_range(VFIOContainer *container,
+                                        MemoryRegionSection *section,
+                                        hwaddr *out_iova, hwaddr *out_end)
+{
+    Int128 llend, llsize;
+    hwaddr iova, end;
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return false;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    if (memory_region_is_iommu(section->mr) ||
+        memory_region_has_ram_discard_manager(section->mr)) {
+        return false;
+    }
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        VFIOHostDMAWindow *hostwin;
+        hwaddr pgmask;
+
+        hostwin = vfio_find_hostwin(container, iova, end);
+        if (!hostwin) {
+            return false;
+        }
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            return false;
+        }
+    }
+
+    *out_iova = iova;
+    *out_end = int128_get64(llend);
+    return true;
+}
+
+static void vfio_dma_tracking_update(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, mappings_listener);
+    hwaddr iova, end;
+
+    if (vfio_have_giommu(container)) {
+        vfio_set_migration_error(-EOPNOTSUPP);
+        return;
+    }
+
+    if (!vfio_listener_valid_section(section) ||
+        !vfio_get_section_iova_range(container, section, &iova, &end)) {
+        return;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        if (container->min_tracking_iova > iova) {
+            container->min_tracking_iova = iova;
+        }
+        if (container->max_tracking_iova < end) {
+            container->max_tracking_iova = end;
+        }
+    }
+
+    trace_vfio_dma_tracking_update(iova, end,
+                                   container->min_tracking_iova,
+                                   container->max_tracking_iova);
+    return;
+}
+
+static const MemoryListener vfio_dirty_tracking_listener = {
+    .name = "vfio-tracking",
+    .region_add = vfio_dma_tracking_update,
+};
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;

+    memory_listener_register(&container->mappings_listener, container->space->as);
+
     ret = vfio_set_dirty_page_tracking(container, true);
     if (ret) {
         vfio_set_migration_error(ret);
@@ -1346,6 +1437,13 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     if (ret) {
         vfio_set_migration_error(ret);
     }
+
+    memory_listener_unregister(&container->mappings_listener);
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        container->min_tracking_iova = 0;
+        container->max_tracking_iova = 0;
+    }
 }

 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
@@ -2172,16 +2270,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
     QLIST_INIT(&container->vrdl_list);
+    qemu_mutex_init(&container->mappings_mutex);
+    container->mappings_listener = vfio_dirty_tracking_listener;

     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }

     ret = vfio_ram_block_discard_disable(container, true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }

     switch (container->iommu_type) {
@@ -2317,7 +2417,8 @@ listener_release_exit:
 enable_discards_exit:
     vfio_ram_block_discard_disable(container, false);

-free_container_exit:
+destroy_mappings_exit:
+    qemu_mutex_destroy(&container->mappings_mutex);
     g_free(container);

 close_fd_exit:
@@ -2371,6 +2472,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }

         trace_vfio_disconnect_container(container->fd);
+        qemu_mutex_destroy(&container->mappings_mutex);
         close(container->fd);
         g_free(container);

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 669d9fe07cd9..8591f660595b 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -104,6 +104,7 @@ vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t offset_wi
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
+vfio_dma_tracking_update(uint64_t start, uint64_t end, uint64_t min, uint64_t max) "tracking_update 0x%"PRIx64" - 0x%"PRIx64" -> [0x%"PRIx64" - 0x%"PRIx64"]"
 vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 87524c64a443..bb54f204ab8b 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -23,6 +23,7 @@

 #include "exec/memory.h"
 #include "qemu/queue.h"
+#include "qemu/iova-tree.h"
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
@@ -81,6 +82,7 @@ typedef struct VFIOContainer {
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
     MemoryListener prereg_listener;
+    MemoryListener mappings_listener;
     unsigned iommu_type;
     Error *error;
     bool initialized;
@@ -89,6 +91,9 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    hwaddr min_tracking_iova;
+    hwaddr max_tracking_iova;
+    QemuMutex mappings_mutex;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
--
2.17.2



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-27 14:09   ` Cédric Le Goater
  2023-03-01 18:56     ` Avihai Horon
@ 2023-03-02 13:24     ` Joao Martins
  2023-03-02 14:52       ` Cédric Le Goater
  1 sibling, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-03-02 13:24 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon
  Cc: Alex Williamson, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 27/02/2023 14:09, Cédric Le Goater wrote:
> On 2/22/23 18:49, Avihai Horon wrote:
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>>    * Device state interfaces
>>    */
>>   +typedef struct {
>> +    unsigned long *bitmap;
>> +    hwaddr size;
>> +    hwaddr pages;
>> +} VFIOBitmap;
>> +
>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>> +{
>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
> 
> I think using g_malloc0() for the VFIOBitmap should be fine. If QEMU can
> not allocate a couple of bytes, we are in trouble anyway.
> 

OOM situations are rather unpredictable, and switching to g_malloc0 means we
will exit ungracefully in the middle of fetching dirty bitmaps. And this
function (vfio_bitmap_alloc) overall will be allocating megabytes for terabyte
guests.

It would be OK if we were initializing, but this is at runtime when we do
migration. I think we should stick with g_try_new0. Exit on failure should be
reserved for failure to switch the kernel migration state, where we are likely
dealing with a hardware failure and thus something more drastic is required.
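
As a rough illustration of the difference (not the exact code in the patch):
g_new0()/g_malloc0() abort the whole process when the allocation fails, while
the g_try_*() variants return NULL so the caller can fail just the migration:

#include <glib.h>

/* g_malloc0() calls g_error() and aborts QEMU on failure.  With
 * g_try_malloc0() we get NULL back and can propagate -ENOMEM, so only
 * the migration fails while the VM keeps running. */
static unsigned long *alloc_dirty_bitmap(gsize size)
{
    unsigned long *bitmap = g_try_malloc0(size);

    if (!bitmap) {
        /* caller reports the error and aborts the migration, not QEMU */
        return NULL;
    }

    return bitmap;
}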

	Joao


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-03-02 13:24     ` Joao Martins
@ 2023-03-02 14:52       ` Cédric Le Goater
  2023-03-02 16:30         ` Joao Martins
  2023-03-04  0:23         ` Joao Martins
  0 siblings, 2 replies; 93+ messages in thread
From: Cédric Le Goater @ 2023-03-02 14:52 UTC (permalink / raw)
  To: Joao Martins, Avihai Horon
  Cc: Alex Williamson, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

Hello Joao,

On 3/2/23 14:24, Joao Martins wrote:
> On 27/02/2023 14:09, Cédric Le Goater wrote:
>> On 2/22/23 18:49, Avihai Horon wrote:
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>>>     * Device state interfaces
>>>     */
>>>    +typedef struct {
>>> +    unsigned long *bitmap;
>>> +    hwaddr size;
>>> +    hwaddr pages;
>>> +} VFIOBitmap;
>>> +
>>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>>> +{
>>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
>>
>> I think using g_malloc0() for the VFIOBitmap should be fine. If QEMU can
>> not allocate a couple of bytes, we are in trouble anyway.
>>
> 
> OOM situations are rather unpredictable, and switching to g_malloc0 means we
> will exit ungracefully in the middle of fetching dirty bitmaps. And this
> function (vfio_bitmap_alloc) overall will be allocating megabytes for terabyte
> guests.
> 
> It would be ok if we are initializing, but this is at runtime when we do
> migration. I think we should stick with g_try_new0. exit on failure should be
> reserved to failure to switch the kernel migration state whereby we are likely
> to be dealing with a hardware failure and thus requires something more drastic.

I agree for large allocation :

     vbmap->bitmap = g_try_malloc0(vbmap->size);

but not for the smaller ones, like VFIOBitmap. You would have to
convert some other g_malloc0() calls, like the one allocating 'unmap'
in vfio_dma_unmap_bitmap(), to be consistent.

Given the size of VFIOBitmap, I think it could live on the stack in
routine vfio_dma_unmap_bitmap() and routine vfio_get_dirty_bitmap()
since the reference is not kept.

The 'vbmap' attribute of vfio_giommu_dirty_notifier does not need
to be a pointer either.

vfio_bitmap_alloc(hwaddr size) could then become
vfio_bitmap_init(VFIOBitmap *vbmap, hwaddr size).

Anyhow, this is minor. It would simplify a bit the exit path
and error handling.
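
For instance, with the VFIOBitmap living in the caller's stack frame, the
helper could look roughly like this (a sketch of the suggestion, assuming
QEMU's usual REAL_HOST_PAGE_ALIGN/ROUND_UP/BITS_PER_BYTE helpers, not the
code in the series):

typedef struct {
    unsigned long *bitmap;
    hwaddr size;
    hwaddr pages;
} VFIOBitmap;

static int vfio_bitmap_init(VFIOBitmap *vbmap, hwaddr size)
{
    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
                                         BITS_PER_BYTE;
    vbmap->bitmap = g_try_malloc0(vbmap->size);
    if (!vbmap->bitmap) {
        return -ENOMEM;
    }

    return 0;
}

The caller would then keep the VFIOBitmap on the stack and only needs to
g_free(vbmap.bitmap) on the exit path.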

Thanks,

C.





^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-03-02 14:52       ` Cédric Le Goater
@ 2023-03-02 16:30         ` Joao Martins
  2023-03-04  0:23         ` Joao Martins
  1 sibling, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-03-02 16:30 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon
  Cc: Alex Williamson, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 02/03/2023 14:52, Cédric Le Goater wrote:
> Hello Joao,
> On 3/2/23 14:24, Joao Martins wrote:
>> On 27/02/2023 14:09, Cédric Le Goater wrote:
>>> On 2/22/23 18:49, Avihai Horon wrote:
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>>>>     * Device state interfaces
>>>>     */
>>>>    +typedef struct {
>>>> +    unsigned long *bitmap;
>>>> +    hwaddr size;
>>>> +    hwaddr pages;
>>>> +} VFIOBitmap;
>>>> +
>>>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>>>> +{
>>>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
>>>
>>> I think using g_malloc0() for the VFIOBitmap should be fine. If QEMU can
>>> not allocate a couple of bytes, we are in trouble anyway.
>>>
>>
>> OOM situations are rather unpredictable, and switching to g_malloc0 means we
>> will exit ungracefully in the middle of fetching dirty bitmaps. And this
>> function (vfio_bitmap_alloc) overall will be allocating megabytes for terabyte
>> guests.
>>
>> It would be ok if we are initializing, but this is at runtime when we do
>> migration. I think we should stick with g_try_new0. exit on failure should be
>> reserved to failure to switch the kernel migration state whereby we are likely
>> to be dealing with a hardware failure and thus requires something more drastic.
> 
> I agree for large allocation :
> 
>     vbmap->bitmap = g_try_malloc0(vbmap->size);
> 
> but not for the smaller ones, like VFIOBitmap. You would have to
> convert some other g_malloc0() calls, like the one allocating 'unmap'
> in vfio_dma_unmap_bitmap(), to be consistent.
> 
> Given the size of VFIOBitmap, I think it could live on the stack in
> routine vfio_dma_unmap_bitmap() and routine vfio_get_dirty_bitmap()
> since the reference is not kept.
> 

Both good points. Especially the g_malloc0 ones, though the DMA unmap path
wouldn't be in use for a device that supports dirty tracking. But there's one we
add by mistake, and that's vfio_device_feature_dma_logging_start_create(); it
shouldn't be g_malloc0 there either. The rest, except the dma_unmap and
type1-iommu get_dirty_bitmap functions, arguably only happen during
initialization.

> The 'vbmap' attribute of vfio_giommu_dirty_notifier does not need
> to be a pointer either.
> 
> vfio_bitmap_alloc(hwaddr size) could then become
> vfio_bitmap_init(VFIOBitmap *vbmap, hwaddr size).
> 
> Anyhow, this is minor. It would simplify a bit the exit path
> and error handling.
> 
By simplify, presumably it's because vfio_bitmap_free() would be a single line,
thus avoiding the new helper, and we would just live with vfio_bitmap_alloc(). I
am in two minds about alloc vs init, considering we are still allocating the
actual bitmap. Still leaning more toward staying with alloc than init.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-02  0:07                 ` Joao Martins
  2023-03-02  0:13                   ` Joao Martins
@ 2023-03-02 18:42                   ` Alex Williamson
  2023-03-03  0:19                     ` Joao Martins
  1 sibling, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-03-02 18:42 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Thu, 2 Mar 2023 00:07:35 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 28/02/2023 20:36, Alex Williamson wrote:
> > On Tue, 28 Feb 2023 12:11:06 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 23/02/2023 21:50, Alex Williamson wrote:  
> >>> On Thu, 23 Feb 2023 21:19:12 +0000
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>> On 23/02/2023 21:05, Alex Williamson wrote:    
> >>>>> On Thu, 23 Feb 2023 10:37:10 +0000
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:      
> >>>>>> On 22/02/2023 22:10, Alex Williamson wrote:      
> >>>>>>> On Wed, 22 Feb 2023 19:49:05 +0200
> >>>>>>> Avihai Horon <avihaih@nvidia.com> wrote:        
> >>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
> >>>>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>>>>>>>          .iova = iova,
> >>>>>>>>          .size = size,
> >>>>>>>>      };
> >>>>>>>> +    int ret;
> >>>>>>>> +
> >>>>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
> >>>>>>>> +    if (ret) {
> >>>>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
> >>>>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
> >>>>>>>> +                     iova, size, ret, strerror(-ret));
> >>>>>>>> +
> >>>>>>>> +        return ret;
> >>>>>>>> +    }        
> >>>>>>>
> >>>>>>> Is there no way to replay the mappings when a migration is started?
> >>>>>>> This seems like a horrible latency and bloat trade-off for the
> >>>>>>> possibility that the VM might migrate and the device might support
> >>>>>>> these features.  Our performance with vIOMMU is already terrible, I
> >>>>>>> can't help but believe this makes it worse.  Thanks,
> >>>>>>>         
> >>>>>>
> >>>>>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
> >>>>>> that uses a max-iova based IOVA range. So this is really for iommu identity
> >>>>>> mapping and no-VIOMMU.      
> >>>>>
> >>>>> Ok, yes, there are no mappings recorded for any containers that have a
> >>>>> non-empty giommu_list.
> >>>>>       
> >>>>>> We could replay them if they were tracked/stored anywhere.      
> >>>>>
> >>>>> Rather than piggybacking on vfio_memory_listener, why not simply
> >>>>> register a new MemoryListener when migration is started?  That will
> >>>>> replay all the existing ranges and allow tracking to happen separate
> >>>>> from mapping, and only when needed.  
> >>>>
> >>>> The problem with that is that *starting* dirty tracking needs to have all the
> >>>> range, we aren't supposed to start each range separately. So on a memory
> >>>> listener callback you don't have introspection when you are dealing with the
> >>>> last range, do we?    
> >>>
> >>> As soon as memory_listener_register() returns, all your callbacks to
> >>> build the IOVATree have been called and you can act on the result the
> >>> same as if you were relying on the vfio mapping MemoryListener.  I'm
> >>> not seeing the problem.  Thanks,
> >>>     
> >>
> >> While doing these changes, the nice thing of the current patch is that whatever
> >> changes apply to vfio_listener_region_add() will be reflected in the mappings
> >> tree that stores what we will dirty track. If we move the mappings calculation
> >> necessary for dirty tracking only when we start, we will have to duplicate the
> >> same checks, and open for bugs where we ask things to be dirty track-ed that
> >> haven't been DMA mapped. These two aren't necessarily tied, but felt like I
> >> should raise the potentially duplication of the checks (and the same thing
> >> applies for handling virtio-mem and what not).
> >>
> >> I understand that if we were going to store *a lot* of mappings that this would
> >> add up in space requirements. But for no-vIOMMU (or iommu=pt) case this is only
> >> about 12ranges or so, it is much simpler to piggyback the existing listener.
> >> Would you still want to move this to its own dedicated memory listener?  
> > 
> > Code duplication and bugs are good points, but while typically we're
> > only seeing a few handfuls of ranges, doesn't virtio-mem in particular
> > allow that we could be seeing quite a lot more?
> >   
> Ugh yes, it could be.
> 
> > We used to be limited to a fairly small number of KVM memory slots,
> > which effectively bounded non-vIOMMU DMA mappings, but that value is
> > now 2^15, so we need to anticipate that we could see many more than a
> > dozen mappings.
> >   
> 
> Even with 32k memory slots today we are still reduced on a handful. hv-balloon
> and virtio-mem approaches though are the ones that may stress such limit IIUC
> prior to starting migration.
> 
> > Can we make the same argument that the overhead is negligible if a VM
> > makes use of 10s of GB of virtio-mem with 2MB block size?
> > 
> > But then on a 4KB host we're limited to 256 tracking entries, so
> > wasting all that time and space on a runtime IOVATree is even more
> > dubious.
> >
> > In fact, it doesn't really matter that vfio_listener_region_add and
> > this potentially new listener come to the same result, as long as the
> > new listener is a superset of the existing listener.   
> 
> I am trying to put this in a way that's not too ugly to reuse the most between
> vfio_listener_region_add() and the vfio_migration_mapping_add().
> 
> For you to have an idea, here's so far how it looks thus far:
> 
> https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking
> 
> Particularly this one:
> 
> https://github.com/jpemartins/qemu/commit/3b11fa0e4faa0f9c0f42689a7367284a25d1b585
> 
> vfio_get_section_iova_range() is where most of these checks are that are sort of
> a subset of the ones in vfio_listener_region_add().
> 
> > So I think we can
> > simplify out a lot of the places we'd see duplication and bugs.  I'm
> > not even really sure why we wouldn't simplify things further and only
> > record a single range covering the low and high memory marks for a
> > non-vIOMMU VMs, or potentially an approximation removing gaps of 1GB or
> > more, for example.  Thanks,  
> 
> Yes, for Qemu, to have one single artificial range with a computed min IOVA and
> max IOVA is the simplest to get it implemented. It would avoid us maintaining an
> IOVATree as you would only track min/max pair (maybe max_below).
> 
> My concern with a reduced single range is 1) big holes in address space leading
> to asking more than you need[*] and then 2) device dirty tracking limits e.g.
> hardware may have upper limits, so you may prematurely exercise those. So giving
> more choice to the vfio drivers to decide how to cope with the mapped address
> space description looks to have a bit more longevity.

The fact that we don't know anything about the device dirty tracking
limits worries me.  If QEMU reports the VM is migratable, ie. lacking
migration blockers, then we really shouldn't have non-extraordinary
things like the VM actually having a bigger address space than the
device can support, or enabling a vIOMMU, suddenly make the VM
non-migratable.

If we only needed to worry about scenarios like the AMD hypertransport
memory relocation, then tracking ranges for 32-bit and 64-bit RAM
separately would be an easy solution, we always have 1-2 ranges for the
device to track.  That's still a big simplification from tracking every
DMA mappings.

AIUI, we really can't even rely on the device supporting a full host
page size worth of mappings; the uAPI only stipulates that the core
kernel code will support such a request.  So it seems prudent that
userspace should conserve entries wherever it can.  For the
alternative, to provide ranges that closely match actual mappings, I
think we'd need to be able to collapse IOVATree entries with the
smallest gap when we reach the limit, and continue to collapse each time
the driver rejects the number of ranges provided.  That's obviously
much more complicated and I'd prefer to avoid it if there are easier
approximations.
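
For reference, the collapse step itself is not much code once the recorded
ranges are kept sorted; a rough sketch over a flat array (the series would have
to do the equivalent on top of the IOVATree, re-running it whenever the driver
rejects the number of ranges):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t start, end;    /* half-open [start, end) */
} DirtyRange;

/*
 * Merge the pair of adjacent ranges separated by the smallest gap until
 * no more than 'limit' ranges remain.  Assumes 'r' is sorted and
 * non-overlapping; the merged gaps simply become extra tracked space.
 */
static size_t collapse_ranges(DirtyRange *r, size_t n, size_t limit)
{
    while (n > limit && n > 1) {
        size_t best = 0;
        uint64_t best_gap = UINT64_MAX;
        size_t i;

        for (i = 0; i + 1 < n; i++) {
            uint64_t gap = r[i + 1].start - r[i].end;

            if (gap < best_gap) {
                best_gap = gap;
                best = i;
            }
        }

        r[best].end = r[best + 1].end;
        memmove(&r[best + 1], &r[best + 2], (n - best - 2) * sizeof(*r));
        n--;
    }

    return n;
}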

> Anyway the temptation with having a single range is that this can all go away if
> the vfio_listener_region_add() tracks just min/max IOVA pair.
> 
> Below scissors mark it's how this patch is looking like in the commit above
> while being a full list of mappings. It's also stored here:
> 
> https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking
> 
> I'll respond here with a patch on what it looks like with the range watermark
> approach.
> 
> 	Joao
> 
> [0] AMD 1T boundary is what comes to mind, which on Qemu relocates memory above
> 4G into after 1T.
> 
> ------------------>8--------------------  
> 
> From: Joao Martins <joao.m.martins@oracle.com>
> Date: Wed, 22 Feb 2023 19:49:05 +0200
> Subject: [PATCH wip 7/12] vfio/common: Record DMA mapped IOVA ranges
> 
> According to the device DMA logging uAPI, IOVA ranges to be logged by
> the device must be provided all at once upon DMA logging start.
> 
> As preparation for the following patches which will add device dirty
> page tracking, keep a record of all DMA mapped IOVA ranges so later they
> can be used for DMA logging start.
> 
> Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
> This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c              | 147 +++++++++++++++++++++++++++++++++-
>  hw/vfio/trace-events          |   2 +
>  include/hw/vfio/vfio-common.h |   4 +
>  3 files changed, 150 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 655e8dbb74d4..17971e6dbaeb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -44,6 +44,7 @@
>  #include "migration/blocker.h"
>  #include "migration/qemu-file.h"
>  #include "sysemu/tpm.h"
> +#include "qemu/iova-tree.h"
> 
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>      multiple_devices_migration_blocker = NULL;
>  }
> 
> +static bool vfio_have_giommu(VFIOContainer *container)
> +{
> +    return !QLIST_EMPTY(&container->giommu_list);
> +}

I think it's the case, but can you confirm we build the giommu_list
regardless of whether the vIOMMU is actually enabled?

> +
>  static void vfio_set_migration_error(int err)
>  {
>      MigrationState *ms = migrate_get_current();
> @@ -610,6 +616,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          .iova = iova,
>          .size = size,
>      };
> +    int ret;
> 
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
> @@ -626,8 +633,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
> 
> +    ret = -errno;
>      error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> -    return -errno;
> +
> +    return ret;
>  }
> 
>  static void vfio_host_win_add(VFIOContainer *container,
> @@ -1326,11 +1335,127 @@ static int vfio_set_dirty_page_tracking(VFIOContainer
> *container, bool start)
>      return ret;
>  }
> 
> +static bool vfio_get_section_iova_range(VFIOContainer *container,
> +                                        MemoryRegionSection *section,
> +                                        hwaddr *out_iova, hwaddr *out_end)
> +{
> +    Int128 llend, llsize;
> +    hwaddr iova, end;
> +
> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
> +
> +    if (int128_ge(int128_make64(iova), llend)) {
> +        return false;
> +    }
> +    end = int128_get64(int128_sub(llend, int128_one()));
> +
> +    if (memory_region_is_iommu(section->mr) ||

Shouldn't there already be a migration blocker in place preventing this
from being possible?

> +        memory_region_has_ram_discard_manager(section->mr)) {

Are we claiming not to support virtio-mem VMs as well?  The current
comment in vfio/common.c that states we only want to map actually
populated parts seems like it doesn't apply here, we'd want dirty
tracking ranges to include these regardless.  Unless there's some
reason virtio-mem changes are blocked during pre-copy.

> +	return false;
> +    }
> +
> +    llsize = int128_sub(llend, int128_make64(iova));
> +
> +    if (memory_region_is_ram_device(section->mr)) {
> +        VFIOHostDMAWindow *hostwin;
> +        hwaddr pgmask;
> +
> +        hostwin = vfio_find_hostwin(container, iova, end);
> +        if (!hostwin) {
> +            return false;
> +        }
> +
> +	pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
> +            return false;
> +        }
> +    }

ram_device is intended to be an address range on another device, so do
we really need it in DMA dirty tracking?  ex. we don't include device
BARs in the dirty bitmap, we expect modified device state to be
reported by the device, so it seems like there's no case where we'd
include this in the device dirty tracking ranges.

> +
> +    *out_iova = iova;
> +    *out_end = int128_get64(llend);
> +    return true;
> +}
> +
> +static void vfio_migration_add_mapping(MemoryListener *listener,
> +                                       MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> mappings_listener);
> +    hwaddr end = 0;
> +    DMAMap map;
> +    int ret;
> +
> +    if (vfio_have_giommu(container)) {
> +        vfio_set_migration_error(-EOPNOTSUPP);

There should be a migration blocker that prevents this from ever being
called in this case.

> +        return;
> +    }
> +
> +    if (!vfio_listener_valid_section(section) ||
> +        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
> +        return;
> +    }
> +
> +    map.size = end - map.iova - 1; // IOVATree is inclusive, so subtract 1 from
> size
> +    map.perm = section->readonly ? IOMMU_RO : IOMMU_RW;
> +
> +    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
> +        ret = iova_tree_insert(container->mappings, &map);
> +        if (ret) {
> +            if (ret == IOVA_ERR_INVALID) {
> +                ret = -EINVAL;
> +            } else if (ret == IOVA_ERR_OVERLAP) {
> +                ret = -EEXIST;
> +            }
> +        }
> +    }
> +
> +    trace_vfio_migration_mapping_add(map.iova, map.iova + map.size, ret);
> +
> +    if (ret)
> +        vfio_set_migration_error(ret);
> +    return;
> +}
> +
> +static void vfio_migration_remove_mapping(MemoryListener *listener,
> +                                          MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> mappings_listener);
> +    hwaddr end = 0;
> +    DMAMap map;
> +
> +    if (vfio_have_giommu(container)) {
> +        vfio_set_migration_error(-EOPNOTSUPP);
> +        return;
> +    }
> +
> +    if (!vfio_listener_valid_section(section) ||
> +        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
> +        return;
> +    }
> +
> +    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
> +        iova_tree_remove(container->mappings, map);
> +    }
> +
> +    trace_vfio_migration_mapping_del(map.iova, map.iova + map.size);
> +}

Why do we need a region_del callback?  We don't support modifying the
dirty tracking ranges we've provided to the device.

> +
> +
> +static const MemoryListener vfio_dirty_tracking_listener = {
> +    .name = "vfio-migration",
> +    .region_add = vfio_migration_add_mapping,
> +    .region_del = vfio_migration_remove_mapping,
> +};
> +
>  static void vfio_listener_log_global_start(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>      int ret;
> 
> +    memory_listener_register(&container->mappings_listener, container->space->as);
> +
>      ret = vfio_set_dirty_page_tracking(container, true);
>      if (ret) {
>          vfio_set_migration_error(ret);
> @@ -1346,6 +1471,8 @@ static void vfio_listener_log_global_stop(MemoryListener
> *listener)
>      if (ret) {
>          vfio_set_migration_error(ret);
>      }
> +
> +    memory_listener_unregister(&container->mappings_listener);

We don't have a way to update dirty tracking ranges for a device once
dirty tracking is enabled, so what's the point of this listener running
in more than a one-shot mode?  The only purpose of a listener
continuing to run seems like it would be to generate an error for
untracked ranges and either generate a migration error or mark them
perpetually dirty.

>  }
> 
>  static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> @@ -2172,16 +2299,24 @@ static int vfio_connect_container(VFIOGroup *group,
> AddressSpace *as,
>      QLIST_INIT(&container->giommu_list);
>      QLIST_INIT(&container->hostwin_list);
>      QLIST_INIT(&container->vrdl_list);
> +    container->mappings = iova_tree_new();
> +    if (!container->mappings) {
> +        error_setg(errp, "Cannot allocate DMA mappings tree");
> +        ret = -ENOMEM;
> +        goto free_container_exit;
> +    }
> +    qemu_mutex_init(&container->mappings_mutex);
> +    container->mappings_listener = vfio_dirty_tracking_listener;

This all seems like code that would only be necessary before starting
the listener.

> 
>      ret = vfio_init_container(container, group->fd, errp);
>      if (ret) {
> -        goto free_container_exit;
> +        goto destroy_mappings_exit;
>      }
> 
>      ret = vfio_ram_block_discard_disable(container, true);
>      if (ret) {
>          error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> -        goto free_container_exit;
> +        goto destroy_mappings_exit;
>      }
> 
>      switch (container->iommu_type) {
> @@ -2317,6 +2452,10 @@ listener_release_exit:
>  enable_discards_exit:
>      vfio_ram_block_discard_disable(container, false);
> 
> +destroy_mappings_exit:
> +    qemu_mutex_destroy(&container->mappings_mutex);
> +    iova_tree_destroy(container->mappings);
> +
>  free_container_exit:
>      g_free(container);
> 
> @@ -2371,6 +2510,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          }
> 
>          trace_vfio_disconnect_container(container->fd);
> +        qemu_mutex_destroy(&container->mappings_mutex);
> +        iova_tree_destroy(container->mappings);

The IOVATree should be destroyed as soon as we're done processing the
result upon starting logging.  It serves no purpose to keep it around.

Comparing with the follow-up that sets {min,max}_tracking_iova, many of
the same comments apply.  Both of these are only preparing for the
question of what we actually do with this data.  In the IOVATree
approach we have more fine-grained information, but we can also exceed
what the device supports and we need to be able to handle that.  If our
fallback is to simply identify the min and max based on the IOVATree,
and we expect that to work better than the more granular approach, why
not start with just min/max?  If we expect there's value in the more
granular approach, then why not proceed to collapse the IOVATree until
we find a set of ranges the device can support?  Thanks,

Alex

>          close(container->fd);
>          g_free(container);
> 
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 669d9fe07cd9..c92eaadcc7c4 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -104,6 +104,8 @@ vfio_known_safe_misalignment(const char *name, uint64_t
> iova, uint64_t offset_wi
>  vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t
> size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not
> aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING
> region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64"
> - 0x%"PRIx64
> +vfio_migration_mapping_add(uint64_t start, uint64_t end, int err) "mapping_add
> 0x%"PRIx64" - 0x%"PRIx64" err=%d"
> +vfio_migration_mapping_del(uint64_t start, uint64_t end) "mapping_del
> 0x%"PRIx64" - 0x%"PRIx64
>  vfio_disconnect_container(int fd) "close container->fd=%d"
>  vfio_put_group(int fd) "close group->fd=%d"
>  vfio_get_device(const char * name, unsigned int flags, unsigned int
> num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 87524c64a443..48951da11ab4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -23,6 +23,7 @@
> 
>  #include "exec/memory.h"
>  #include "qemu/queue.h"
> +#include "qemu/iova-tree.h"
>  #include "qemu/notify.h"
>  #include "ui/console.h"
>  #include "hw/display/ramfb.h"
> @@ -81,6 +82,7 @@ typedef struct VFIOContainer {
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
>      MemoryListener prereg_listener;
> +    MemoryListener mappings_listener;
>      unsigned iommu_type;
>      Error *error;
>      bool initialized;
> @@ -89,6 +91,8 @@ typedef struct VFIOContainer {
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
>      unsigned int dma_max_mappings;
> +    IOVATree *mappings;
> +    QemuMutex mappings_mutex;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> --
> 2.17.2
> 




* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-02 18:42                   ` Alex Williamson
@ 2023-03-03  0:19                     ` Joao Martins
  2023-03-03 16:58                       ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-03-03  0:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 02/03/2023 18:42, Alex Williamson wrote:
> On Thu, 2 Mar 2023 00:07:35 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 28/02/2023 20:36, Alex Williamson wrote:
>>> On Tue, 28 Feb 2023 12:11:06 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 23/02/2023 21:50, Alex Williamson wrote:  
>>>>> On Thu, 23 Feb 2023 21:19:12 +0000
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>> On 23/02/2023 21:05, Alex Williamson wrote:    
>>>>>>> On Thu, 23 Feb 2023 10:37:10 +0000
>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:      
>>>>>>>> On 22/02/2023 22:10, Alex Williamson wrote:      
>>>>>>>>> On Wed, 22 Feb 2023 19:49:05 +0200
>>>>>>>>> Avihai Horon <avihaih@nvidia.com> wrote:        
>>>>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>>>>>>>          .iova = iova,
>>>>>>>>>>          .size = size,
>>>>>>>>>>      };
>>>>>>>>>> +    int ret;
>>>>>>>>>> +
>>>>>>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>>>>>>>>>> +    if (ret) {
>>>>>>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>>>>>>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>>>>>>>>>> +                     iova, size, ret, strerror(-ret));
>>>>>>>>>> +
>>>>>>>>>> +        return ret;
>>>>>>>>>> +    }        
>>>>>>>>>
>>>>>>>>> Is there no way to replay the mappings when a migration is started?
>>>>>>>>> This seems like a horrible latency and bloat trade-off for the
>>>>>>>>> possibility that the VM might migrate and the device might support
>>>>>>>>> these features.  Our performance with vIOMMU is already terrible, I
>>>>>>>>> can't help but believe this makes it worse.  Thanks,
>>>>>>>>>         
>>>>>>>>
>>>>>>>> It is a nop if the vIOMMU is being used (entries in container->giommu_list) as
>>>>>>>> that uses a max-iova based IOVA range. So this is really for iommu identity
>>>>>>>> mapping and no-VIOMMU.      
>>>>>>>
>>>>>>> Ok, yes, there are no mappings recorded for any containers that have a
>>>>>>> non-empty giommu_list.
>>>>>>>       
>>>>>>>> We could replay them if they were tracked/stored anywhere.      
>>>>>>>
>>>>>>> Rather than piggybacking on vfio_memory_listener, why not simply
>>>>>>> register a new MemoryListener when migration is started?  That will
>>>>>>> replay all the existing ranges and allow tracking to happen separate
>>>>>>> from mapping, and only when needed.  
>>>>>>
>>>>>> The problem with that is that *starting* dirty tracking needs to have all the
>>>>>> range, we aren't supposed to start each range separately. So on a memory
>>>>>> listener callback you don't have introspection when you are dealing with the
>>>>>> last range, do we?    
>>>>>
>>>>> As soon as memory_listener_register() returns, all your callbacks to
>>>>> build the IOVATree have been called and you can act on the result the
>>>>> same as if you were relying on the vfio mapping MemoryListener.  I'm
>>>>> not seeing the problem.  Thanks,
>>>>>     
>>>>
>>>> While doing these changes, the nice thing of the current patch is that whatever
>>>> changes apply to vfio_listener_region_add() will be reflected in the mappings
>>>> tree that stores what we will dirty track. If we move the mappings calculation
>>>> necessary for dirty tracking only when we start, we will have to duplicate the
>>>> same checks, and open for bugs where we ask things to be dirty track-ed that
>>>> haven't been DMA mapped. These two aren't necessarily tied, but felt like I
>>>> should raise the potentially duplication of the checks (and the same thing
>>>> applies for handling virtio-mem and what not).
>>>>
>>>> I understand that if we were going to store *a lot* of mappings that this would
>>>> add up in space requirements. But for no-vIOMMU (or iommu=pt) case this is only
>>>> about 12ranges or so, it is much simpler to piggyback the existing listener.
>>>> Would you still want to move this to its own dedicated memory listener?  
>>>
>>> Code duplication and bugs are good points, but while typically we're
>>> only seeing a few handfuls of ranges, doesn't virtio-mem in particular
>>> allow that we could be seeing quite a lot more?
>>>   
>> Ugh yes, it could be.
>>
>>> We used to be limited to a fairly small number of KVM memory slots,
>>> which effectively bounded non-vIOMMU DMA mappings, but that value is
>>> now 2^15, so we need to anticipate that we could see many more than a
>>> dozen mappings.
>>>   
>>
>> Even with 32k memory slots today we are still reduced on a handful. hv-balloon
>> and virtio-mem approaches though are the ones that may stress such limit IIUC
>> prior to starting migration.
>>
>>> Can we make the same argument that the overhead is negligible if a VM
>>> makes use of 10s of GB of virtio-mem with 2MB block size?
>>>
>>> But then on a 4KB host we're limited to 256 tracking entries, so
>>> wasting all that time and space on a runtime IOVATree is even more
>>> dubious.
>>>
>>> In fact, it doesn't really matter that vfio_listener_region_add and
>>> this potentially new listener come to the same result, as long as the
>>> new listener is a superset of the existing listener.   
>>
>> I am trying to put this in a way that's not too ugly to reuse the most between
>> vfio_listener_region_add() and the vfio_migration_mapping_add().
>>
>> For you to have an idea, here's so far how it looks thus far:
>>
>> https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking
>>
>> Particularly this one:
>>
>> https://github.com/jpemartins/qemu/commit/3b11fa0e4faa0f9c0f42689a7367284a25d1b585
>>
>> vfio_get_section_iova_range() is where most of these checks are that are sort of
>> a subset of the ones in vfio_listener_region_add().
>>
>>> So I think we can
>>> simplify out a lot of the places we'd see duplication and bugs.  I'm
>>> not even really sure why we wouldn't simplify things further and only
>>> record a single range covering the low and high memory marks for a
>>> non-vIOMMU VMs, or potentially an approximation removing gaps of 1GB or
>>> more, for example.  Thanks,  
>>
>> Yes, for Qemu, to have one single artificial range with a computed min IOVA and
>> max IOVA is the simplest to get it implemented. It would avoid us maintaining an
>> IOVATree as you would only track min/max pair (maybe max_below).
>>
>> My concern with a reduced single range is 1) big holes in address space leading
>> to asking more than you need[*] and then 2) device dirty tracking limits e.g.
>> hardware may have upper limits, so you may prematurely exercise those. So giving
>> more choice to the vfio drivers to decide how to cope with the mapped address
>> space description looks to have a bit more longevity.
> 
> The fact that we don't know anything about the device dirty tracking
> limits worries me.  If QEMU reports the VM is migratable, ie. lacking
> migration blockers, then we really shouldn't have non-extraordinary
> things like the VM actually having a bigger address space than the
> device can support, or enabling a vIOMMU, suddenly make the VM
> non-migratable.
> 
Makes sense.

> If we only needed to worry about scenarios like the AMD hypertransport
> memory relocation, 

So far that's my major concern, considering it's still a 1T gap.

> then tracking ranges for 32-bit and 64-bit RAM
> separately would be an easy solution, we always have 1-2 ranges for the
> device to track.  That's still a big simplification from tracking every
> DMA mappings.
> 
I was too stuck on the naming in the PC code, so I thought it would be too x86
specific. But in reality we would just be covering 32-bit and 64-bit limits, so
it should cover everybody without being specific to a target.

The kernel (and device) will ultimately grab that info and adjust it to its own
limits if it needs to be split, as long as it can track what's requested.

That should work, and greatly simplify things here as you say. And later on for
vIOMMU we can expand the max limit to cover the 39-bit/48-bit Intel max (or any
equivalent max defined by another IOMMU).
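
Roughly, the bookkeeping could end up looking something like this (only a
sketch; the struct, field and helper names are made up, and the min fields
are assumed to start at HWADDR_MAX with the max fields at 0 before the
listener replay):

    typedef struct VFIODirtyRanges {
        hwaddr min32, max32;    /* DMA mapped below 4G */
        hwaddr min64, max64;    /* DMA mapped at or above 4G */
    } VFIODirtyRanges;

    static void vfio_dirty_tracking_update(VFIODirtyRanges *ranges,
                                           hwaddr iova, hwaddr end)
    {
        hwaddr *min, *max;

        /* Pick the 32-bit or 64-bit bucket based on where the section ends. */
        if (end <= UINT32_MAX) {
            min = &ranges->min32;
            max = &ranges->max32;
        } else {
            min = &ranges->min64;
            max = &ranges->max64;
        }

        *min = MIN(*min, iova);
        *max = MAX(*max, end);
    }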

> AIUI, we really can't even rely on the device supporting a full host
> page size worth of mappings, the uAPI only stipulates that the core
> kernel code will support such a request. 

Yes. Limited to LOG_MAX_RANGES, which is 256 ranges IIRC.

> So it seems prudent that
> userspace should conserve entries wherever it can.

Indeed.

> For the
> alternative, to provide ranges that closely match actual mappings, I
> think we'd need to be able to collapse IOVATree entries with the
> smallest gap when we reach the limit, and continue to collapse each time
> the driver rejects the number of ranges provided.  That's obviously
> much more complicated and I'd prefer to avoid it if there are easier
> approximations.
>

I am starting to like the min/max {32,64} range limit approach, given the much
lower runtime overhead and complexity to maintain [while supporting the same
things].

>> Anyway the temptation with having a single range is that this can all go away if
>> the vfio_listener_region_add() tracks just min/max IOVA pair.
>>
>> Below scissors mark it's how this patch is looking like in the commit above
>> while being a full list of mappings. It's also stored here:
>>
>> https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking
>>
>> I'll respond here with a patch on what it looks like with the range watermark
>> approach.
>>
>> 	Joao
>>
>> [0] AMD 1T boundary is what comes to mind, which on Qemu relocates memory above
>> 4G into after 1T.
>>
>> ------------------>8--------------------  
>>
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Date: Wed, 22 Feb 2023 19:49:05 +0200
>> Subject: [PATCH wip 7/12] vfio/common: Record DMA mapped IOVA ranges
>>
>> According to the device DMA logging uAPI, IOVA ranges to be logged by
>> the device must be provided all at once upon DMA logging start.
>>
>> As preparation for the following patches which will add device dirty
>> page tracking, keep a record of all DMA mapped IOVA ranges so later they
>> can be used for DMA logging start.
>>
>> Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
>> This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>  hw/vfio/common.c              | 147 +++++++++++++++++++++++++++++++++-
>>  hw/vfio/trace-events          |   2 +
>>  include/hw/vfio/vfio-common.h |   4 +
>>  3 files changed, 150 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 655e8dbb74d4..17971e6dbaeb 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -44,6 +44,7 @@
>>  #include "migration/blocker.h"
>>  #include "migration/qemu-file.h"
>>  #include "sysemu/tpm.h"
>> +#include "qemu/iova-tree.h"
>>
>>  VFIOGroupList vfio_group_list =
>>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>>      multiple_devices_migration_blocker = NULL;
>>  }
>>
>> +static bool vfio_have_giommu(VFIOContainer *container)
>> +{
>> +    return !QLIST_EMPTY(&container->giommu_list);
>> +}
> 
> I think it's the case, but can you confirm we build the giommu_list
> regardless of whether the vIOMMU is actually enabled?
> 
I think that is only non-empty once we have the first IOVA mappings; e.g. in
IOMMU passthrough mode *I think* it's empty. Let me confirm.

Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
the VM was configured with a vIOMMU or not. That is to create the LM blocker.

>> +
>>  static void vfio_set_migration_error(int err)
>>  {
>>      MigrationState *ms = migrate_get_current();
>> @@ -610,6 +616,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          .iova = iova,
>>          .size = size,
>>      };
>> +    int ret;
>>
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>> @@ -626,8 +633,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          return 0;
>>      }
>>
>> +    ret = -errno;
>>      error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
>> -    return -errno;
>> +
>> +    return ret;
>>  }
>>
>>  static void vfio_host_win_add(VFIOContainer *container,
>> @@ -1326,11 +1335,127 @@ static int vfio_set_dirty_page_tracking(VFIOContainer
>> *container, bool start)
>>      return ret;
>>  }
>>
>> +static bool vfio_get_section_iova_range(VFIOContainer *container,
>> +                                        MemoryRegionSection *section,
>> +                                        hwaddr *out_iova, hwaddr *out_end)
>> +{
>> +    Int128 llend, llsize;
>> +    hwaddr iova, end;
>> +
>> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
>> +    llend = int128_make64(section->offset_within_address_space);
>> +    llend = int128_add(llend, section->size);
>> +    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
>> +
>> +    if (int128_ge(int128_make64(iova), llend)) {
>> +        return false;
>> +    }
>> +    end = int128_get64(int128_sub(llend, int128_one()));
>> +
>> +    if (memory_region_is_iommu(section->mr) ||
> 
> Shouldn't there already be a migration blocker in place preventing this
> from being possible?
> 
Yes. As mentioned in my previous comment, I am still working it out.

>> +        memory_region_has_ram_discard_manager(section->mr)) {
> 
> Are we claiming not to support virtio-mem VMs as well? 

That was not the intention. From the explanation below, I likely misunderstood
the handling of unpopulated parts and included that check.

> The current
> comment in vfio/common.c that states we only want to map actually
> populated parts seems like it doesn't apply here, we'd want dirty
> tracking ranges to include these regardless.  Unless there's some
> reason virtio-mem changes are blocked during pre-copy.
> 
As far as I am aware, virtio-mem is deemed busy when !migration_is_idle(), hence
plug and unplug requests are blocked throughout migration (device add/del is also
blocked during migration, so hotplug of memory/CPUs/devices is blocked as well).

Anyway, your point still holds regardless; I'll drop the check.

>> +	return false;
>> +    }
>> +
>> +    llsize = int128_sub(llend, int128_make64(iova));
>> +
>> +    if (memory_region_is_ram_device(section->mr)) {
>> +        VFIOHostDMAWindow *hostwin;
>> +        hwaddr pgmask;
>> +
>> +        hostwin = vfio_find_hostwin(container, iova, end);
>> +        if (!hostwin) {
>> +            return false;
>> +        }
>> +
>> +	pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
>> +        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
>> +            return false;
>> +        }
>> +    }
> 
> ram_device is intended to be an address range on another device, so do
> we really need it in DMA dirty tracking?

I don't think so.

> ex. we don't include device
> BARs in the dirty bitmap, we expect modified device state to be
> reported by the device, so it seems like there's no case where we'd
> include this in the device dirty tracking ranges.
> 
/me nods

>> +
>> +    *out_iova = iova;
>> +    *out_end = int128_get64(llend);
>> +    return true;
>> +}
>> +
>> +static void vfio_migration_add_mapping(MemoryListener *listener,
>> +                                       MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> mappings_listener);
>> +    hwaddr end = 0;
>> +    DMAMap map;
>> +    int ret;
>> +
>> +    if (vfio_have_giommu(container)) {
>> +        vfio_set_migration_error(-EOPNOTSUPP);
> 
> There should be a migration blocker that prevents this from ever being
> called in this case.
> 
Correct.

>> +        return;
>> +    }
>> +
>> +    if (!vfio_listener_valid_section(section) ||
>> +        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
>> +        return;
>> +    }
>> +
>> +    map.size = end - map.iova - 1; // IOVATree is inclusive, so subtract 1 from
>> size
>> +    map.perm = section->readonly ? IOMMU_RO : IOMMU_RW;
>> +
>> +    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
>> +        ret = iova_tree_insert(container->mappings, &map);
>> +        if (ret) {
>> +            if (ret == IOVA_ERR_INVALID) {
>> +                ret = -EINVAL;
>> +            } else if (ret == IOVA_ERR_OVERLAP) {
>> +                ret = -EEXIST;
>> +            }
>> +        }
>> +    }
>> +
>> +    trace_vfio_migration_mapping_add(map.iova, map.iova + map.size, ret);
>> +
>> +    if (ret)
>> +        vfio_set_migration_error(ret);
>> +    return;
>> +}
>> +
>> +static void vfio_migration_remove_mapping(MemoryListener *listener,
>> +                                          MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> mappings_listener);
>> +    hwaddr end = 0;
>> +    DMAMap map;
>> +
>> +    if (vfio_have_giommu(container)) {
>> +        vfio_set_migration_error(-EOPNOTSUPP);
>> +        return;
>> +    }
>> +
>> +    if (!vfio_listener_valid_section(section) ||
>> +        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
>> +        return;
>> +    }
>> +
>> +    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
>> +        iova_tree_remove(container->mappings, map);
>> +    }
>> +
>> +    trace_vfio_migration_mapping_del(map.iova, map.iova + map.size);
>> +}
> 
> Why do we need a region_del callback?  We don't support modifying the
> dirty tracking ranges we've provided to the device.
> 

My intention with a region_del callback was the simple case where migration
fails or is cancelled and you want to try again later on, which could mean a
different min/max on each retry depending on the setup. Given that most
operations that change the address space are blocked, this would work.

In the range alternative I was only clearing the min/max back to zero.

>> +
>> +
>> +static const MemoryListener vfio_dirty_tracking_listener = {
>> +    .name = "vfio-migration",
>> +    .region_add = vfio_migration_add_mapping,
>> +    .region_del = vfio_migration_remove_mapping,
>> +};
>> +
>>  static void vfio_listener_log_global_start(MemoryListener *listener)
>>  {
>>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>      int ret;
>>
>> +    memory_listener_register(&container->mappings_listener, container->space->as);
>> +
>>      ret = vfio_set_dirty_page_tracking(container, true);
>>      if (ret) {
>>          vfio_set_migration_error(ret);
>> @@ -1346,6 +1471,8 @@ static void vfio_listener_log_global_stop(MemoryListener
>> *listener)
>>      if (ret) {
>>          vfio_set_migration_error(ret);
>>      }
>> +
>> +    memory_listener_unregister(&container->mappings_listener);
> 
> We don't have a way to update dirty tracking ranges for a device once
> dirty tracking is enabled, so what's the point of this listener running
> in more than a one-shot mode? 

See above.

> The only purpose of a listener
> continuing to run seems like it would be to generate an error for
> untracked ranges and either generate a migration error or mark them
> perpetually dirty.
> 
True for vIOMMU. But my intention was the "migration failure and later retry"
case. I am not sure we should be outright deleting what we requested tracking
for, but we certainly shouldn't be changing it, yes.

I can delete the region_del callback, and on migration (re)start clear/destroy
the mapping tracking info if it is already there.

>>  }
>>
>>  static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> @@ -2172,16 +2299,24 @@ static int vfio_connect_container(VFIOGroup *group,
>> AddressSpace *as,
>>      QLIST_INIT(&container->giommu_list);
>>      QLIST_INIT(&container->hostwin_list);
>>      QLIST_INIT(&container->vrdl_list);
>> +    container->mappings = iova_tree_new();
>> +    if (!container->mappings) {
>> +        error_setg(errp, "Cannot allocate DMA mappings tree");
>> +        ret = -ENOMEM;
>> +        goto free_container_exit;
>> +    }
>> +    qemu_mutex_init(&container->mappings_mutex);
>> +    container->mappings_listener = vfio_dirty_tracking_listener;
> 
> This all seems like code that would only be necessary before starting
> the listener.
> 
I can move it there.

>>
>>      ret = vfio_init_container(container, group->fd, errp);
>>      if (ret) {
>> -        goto free_container_exit;
>> +        goto destroy_mappings_exit;
>>      }
>>
>>      ret = vfio_ram_block_discard_disable(container, true);
>>      if (ret) {
>>          error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
>> -        goto free_container_exit;
>> +        goto destroy_mappings_exit;
>>      }
>>
>>      switch (container->iommu_type) {
>> @@ -2317,6 +2452,10 @@ listener_release_exit:
>>  enable_discards_exit:
>>      vfio_ram_block_discard_disable(container, false);
>>
>> +destroy_mappings_exit:
>> +    qemu_mutex_destroy(&container->mappings_mutex);
>> +    iova_tree_destroy(container->mappings);
>> +
>>  free_container_exit:
>>      g_free(container);
>>
>> @@ -2371,6 +2510,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>          }
>>
>>          trace_vfio_disconnect_container(container->fd);
>> +        qemu_mutex_destroy(&container->mappings_mutex);
>> +        iova_tree_destroy(container->mappings);
> 
> The IOVATree should be destroyed as soon as we're done processing the
> result upon starting logging.  It serves no purpose to keep it around.
> 
OK

> Comparing with the follow-up, setting {min,max}_tracking_iova, many of
> the same comments apply.

/me nods

> Both of these are only preparing for the
> question of what do we actually do with this data.  In the IOVATree
> approach, we have more fine grained information, but we can also exceed
> what the device supports and we need to be able to handle that.  If our
> fallback is to simply identify the min and max based on the IOVATree,
> and we expect that to work better than the more granular approach, why
> not start with just min/max?  If we expect there's value to the more
> granular approach, then when not proceed to collapse the IOVATree until
> we find a set of ranges the device can support?  Thanks,
> 

Your proposed solution for the HT address space gap handles my bigger concern.
I think it makes more sense to adopt a simplistic approach, especially as we
know it is also applicable to the vIOMMU case that a later series will handle,
and it is less fragile against the limited number of ranges in the uAPI. On
second consideration, while a two-range watermark isn't faithful to the real
guest address space, it still lets the VFIO driver pick and split it to
accommodate its own requirements. The range approach also doesn't affect how we
request the logging reports, which still honor the real address space (hence no
difference in runtime bitmap sizes).

With the tree approach, I fear that collapsing the tree already adds too much
complexity to accommodate the reduced number of ranges we can pass, in addition
to the runtime overhead it implies (especially for virtio-mem-like cases). I am
happy to resurrect that approach, should it be deemed necessary.
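
For reference, the collapsing would amount to something like the below (a rough
sketch over a plain sorted array rather than the IOVATree API, with made-up
types), merging the two neighbours separated by the smallest gap until the
device's range limit is met:

    typedef struct { hwaddr iova; hwaddr end; } TrackRange;    /* made-up type */

    static void collapse_ranges(TrackRange *ranges, unsigned *nr, unsigned limit)
    {
        while (*nr > limit) {
            unsigned best = 0;
            hwaddr best_gap = HWADDR_MAX;

            for (unsigned i = 0; i + 1 < *nr; i++) {
                hwaddr gap = ranges[i + 1].iova - ranges[i].end;
                if (gap < best_gap) {
                    best_gap = gap;
                    best = i;
                }
            }

            /* Merge ranges[best] and ranges[best + 1], absorbing the gap. */
            ranges[best].end = ranges[best + 1].end;
            memmove(&ranges[best + 1], &ranges[best + 2],
                    (*nr - best - 2) * sizeof(*ranges));
            (*nr)--;
        }
    }

The merge step itself isn't much code; the complexity I worry about is keeping
the tree coherent at runtime and retrying whenever the driver rejects the number
of ranges provided.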

Unless there are objections, we should be able to send v3 of this series
addressing all the comments above (and those already given earlier in the
series) using the min/max range approach, *I hope* no later than tomorrow.

> Alex
> 
>>          close(container->fd);
>>          g_free(container);
>>
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 669d9fe07cd9..c92eaadcc7c4 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -104,6 +104,8 @@ vfio_known_safe_misalignment(const char *name, uint64_t
>> iova, uint64_t offset_wi
>>  vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t
>> size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not
>> aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING
>> region_del 0x%"PRIx64" - 0x%"PRIx64
>>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64"
>> - 0x%"PRIx64
>> +vfio_migration_mapping_add(uint64_t start, uint64_t end, int err) "mapping_add
>> 0x%"PRIx64" - 0x%"PRIx64" err=%d"
>> +vfio_migration_mapping_del(uint64_t start, uint64_t end) "mapping_del
>> 0x%"PRIx64" - 0x%"PRIx64
>>  vfio_disconnect_container(int fd) "close container->fd=%d"
>>  vfio_put_group(int fd) "close group->fd=%d"
>>  vfio_get_device(const char * name, unsigned int flags, unsigned int
>> num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 87524c64a443..48951da11ab4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -23,6 +23,7 @@
>>
>>  #include "exec/memory.h"
>>  #include "qemu/queue.h"
>> +#include "qemu/iova-tree.h"
>>  #include "qemu/notify.h"
>>  #include "ui/console.h"
>>  #include "hw/display/ramfb.h"
>> @@ -81,6 +82,7 @@ typedef struct VFIOContainer {
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener listener;
>>      MemoryListener prereg_listener;
>> +    MemoryListener mappings_listener;
>>      unsigned iommu_type;
>>      Error *error;
>>      bool initialized;
>> @@ -89,6 +91,8 @@ typedef struct VFIOContainer {
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>>      unsigned int dma_max_mappings;
>> +    IOVATree *mappings;
>> +    QemuMutex mappings_mutex;
>>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>      QLIST_HEAD(, VFIOGroup) group_list;
>> --
>> 2.17.2
>>
> 



* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03  0:19                     ` Joao Martins
@ 2023-03-03 16:58                       ` Joao Martins
  2023-03-03 17:05                         ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-03-03 16:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 03/03/2023 00:19, Joao Martins wrote:
> On 02/03/2023 18:42, Alex Williamson wrote:
>> On Thu, 2 Mar 2023 00:07:35 +0000
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>>>      multiple_devices_migration_blocker = NULL;
>>>  }
>>>
>>> +static bool vfio_have_giommu(VFIOContainer *container)
>>> +{
>>> +    return !QLIST_EMPTY(&container->giommu_list);
>>> +}
>>
>> I think it's the case, but can you confirm we build the giommu_list
>> regardless of whether the vIOMMU is actually enabled?
>>
> I think that is only non-empty when we have the first IOVA mappings e.g. on
> IOMMU passthrough mode *I think* it's empty. Let me confirm.
> 
Yeap, it's empty.

> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
> 
I am trying it this way, with something like the snippet below, but neither
x86_iommu_get_default() nor the below is really working out yet. I am a little
afraid of having to add the live migration blocker in each machine_init_done
hook, unless there's a more obvious way. vfio_realize should run at a much later
stage, so I am surprised that an IOMMU object doesn't exist at that time.

@@ -416,9 +421,26 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }

-static bool vfio_have_giommu(VFIOContainer *container)
+int vfio_block_giommu_migration(Error **errp)
 {
-    return !QLIST_EMPTY(&container->giommu_list);
+    int ret;
+
+    if (!object_resolve_path_type("", TYPE_INTEL_IOMMU_DEVICE, NULL) ||
+        !object_resolve_path_type("", TYPE_AMD_IOMMU_DEVICE, NULL) ||
+        !object_resolve_path_type("", TYPE_ARM_SMMU, NULL) ||
+        !object_resolve_path_type("", TYPE_VIRTIO_IOMMU, NULL)) {
+       return 0;
+    }
+
+    error_setg(&giommu_migration_blocker,
+               "Migration is currently not supported with vIOMMU enabled");
+    ret = migrate_add_blocker(giommu_migration_blocker, errp);
+    if (ret < 0) {
+        error_free(giommu_migration_blocker);
+        giommu_migration_blocker = NULL;
+    }
+
+    return ret;
 }

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8981ae71a6f8..127a44ccaf19 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -649,6 +649,11 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
         return ret;
     }

+    ret = vfio_block_giommu_migration(errp);
+    if (ret) {
+        return ret;
+    }
+
     trace_vfio_migration_probe(vbasedev->name);
     return 0;



* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 16:58                       ` Joao Martins
@ 2023-03-03 17:05                         ` Alex Williamson
  2023-03-03 19:14                           ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-03-03 17:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Fri, 3 Mar 2023 16:58:55 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 03/03/2023 00:19, Joao Martins wrote:
> > On 02/03/2023 18:42, Alex Williamson wrote:  
> >> On Thu, 2 Mar 2023 00:07:35 +0000
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
> >>>      multiple_devices_migration_blocker = NULL;
> >>>  }
> >>>
> >>> +static bool vfio_have_giommu(VFIOContainer *container)
> >>> +{
> >>> +    return !QLIST_EMPTY(&container->giommu_list);
> >>> +}  
> >>
> >> I think it's the case, but can you confirm we build the giommu_list
> >> regardless of whether the vIOMMU is actually enabled?
> >>  
> > I think that is only non-empty when we have the first IOVA mappings e.g. on
> > IOMMU passthrough mode *I think* it's empty. Let me confirm.
> >   
> Yeap, it's empty.
> 
> > Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
> > the VM was configured with a vIOMMU or not. That is to create the LM blocker.
> >   
> I am trying this way, with something like this, but neither
> x86_iommu_get_default() nor below is really working out yet. A little afraid of
> having to add the live migration blocker on each machine_init_done hook, unless
> t here's a more obvious way. vfio_realize should be at a much later stage, so I
> am surprised how an IOMMU object doesn't exist at that time.

Can we just test whether the container address space is system_memory?
Thanks,

Alex




* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 17:05                         ` Alex Williamson
@ 2023-03-03 19:14                           ` Joao Martins
  2023-03-03 19:40                             ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-03-03 19:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 03/03/2023 17:05, Alex Williamson wrote:
> On Fri, 3 Mar 2023 16:58:55 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 03/03/2023 00:19, Joao Martins wrote:
>>> On 02/03/2023 18:42, Alex Williamson wrote:  
>>>> On Thu, 2 Mar 2023 00:07:35 +0000
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>>>>>      multiple_devices_migration_blocker = NULL;
>>>>>  }
>>>>>
>>>>> +static bool vfio_have_giommu(VFIOContainer *container)
>>>>> +{
>>>>> +    return !QLIST_EMPTY(&container->giommu_list);
>>>>> +}  
>>>>
>>>> I think it's the case, but can you confirm we build the giommu_list
>>>> regardless of whether the vIOMMU is actually enabled?
>>>>  
>>> I think that is only non-empty when we have the first IOVA mappings e.g. on
>>> IOMMU passthrough mode *I think* it's empty. Let me confirm.
>>>   
>> Yeap, it's empty.
>>
>>> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
>>> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
>>>   
>> I am trying this way, with something like this, but neither
>> x86_iommu_get_default() nor below is really working out yet. A little afraid of
>> having to add the live migration blocker on each machine_init_done hook, unless
>> t here's a more obvious way. vfio_realize should be at a much later stage, so I
>> am surprised how an IOMMU object doesn't exist at that time.
> 
> Can we just test whether the container address space is system_memory?

IIUC, it doesn't work (see the snippet below).

The problem is that you start as a regular VFIO guest, and it is only when the
guest boots that new mappings get established/invalidated and propagated into
the listeners (vfio_listener_region_add), and they morph into having a giommu.
That is when you can figure out in the higher layers that 'you have a vIOMMU',
as that is when the address space gets changed, without being specific to a
particular IOMMU model. Maybe region_add is where to add it, but then it depends
on the guest.

I was going to attempt it in vtd_machine_done_notify_one()?

@@ -416,9 +416,25 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }

-static bool vfio_have_giommu(VFIOContainer *container)
+static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+
+int vfio_block_giommu_migration(Error **errp)
 {
-    return !QLIST_EMPTY(&container->giommu_list);
+    int ret;
+
+    if (vfio_get_address_space(&address_space_memory)) {
+        return 0;
+    }
+
+    error_setg(&giommu_migration_blocker,
+               "Migration is currently not supported with vIOMMU enabled");
+    ret = migrate_add_blocker(giommu_migration_blocker, errp);
+    if (ret < 0) {
+        error_free(giommu_migration_blocker);
+        giommu_migration_blocker = NULL;
+    }
+
+    return ret;
 }



* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 19:14                           ` Joao Martins
@ 2023-03-03 19:40                             ` Alex Williamson
  2023-03-03 20:16                               ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-03-03 19:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Fri, 3 Mar 2023 19:14:50 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 03/03/2023 17:05, Alex Williamson wrote:
> > On Fri, 3 Mar 2023 16:58:55 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 03/03/2023 00:19, Joao Martins wrote:  
> >>> On 02/03/2023 18:42, Alex Williamson wrote:    
> >>>> On Thu, 2 Mar 2023 00:07:35 +0000
> >>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
> >>>>>      multiple_devices_migration_blocker = NULL;
> >>>>>  }
> >>>>>
> >>>>> +static bool vfio_have_giommu(VFIOContainer *container)
> >>>>> +{
> >>>>> +    return !QLIST_EMPTY(&container->giommu_list);
> >>>>> +}    
> >>>>
> >>>> I think it's the case, but can you confirm we build the giommu_list
> >>>> regardless of whether the vIOMMU is actually enabled?
> >>>>    
> >>> I think that is only non-empty when we have the first IOVA mappings e.g. on
> >>> IOMMU passthrough mode *I think* it's empty. Let me confirm.
> >>>     
> >> Yeap, it's empty.
> >>  
> >>> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
> >>> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
> >>>     
> >> I am trying this way, with something like this, but neither
> >> x86_iommu_get_default() nor below is really working out yet. A little afraid of
> >> having to add the live migration blocker on each machine_init_done hook, unless
> >> t here's a more obvious way. vfio_realize should be at a much later stage, so I
> >> am surprised how an IOMMU object doesn't exist at that time.  
> > 
> > Can we just test whether the container address space is system_memory?  
> 
> IIUC, it doesn't work (see below snippet).
> 
> The problem is that you start as a regular VFIO guest, and when the guest boot
> is when new mappings get established/invalidated and propagated into listeners
> (vfio_listener_region_add) and they morph into having a giommu. And that's when
> you can figure out in higher layers that 'you have a vIOMMU' as that's when the
> address space gets changed? That is without being specific to a particular IOMMU
> model. Maybe region_add is where to add, but then it then depends on the guest.

This doesn't seem right to me; look for instance at
pci_device_iommu_address_space(), which returns address_space_memory
when there is no vIOMMU.  If devices share an address space, they can
share a container.  When a vIOMMU is present (not even enabled), each
device gets its own container due to the fact that it's in its own
address space (modulo devices within the same address space due to
aliasing).  Thanks,

Alex




* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 19:40                             ` Alex Williamson
@ 2023-03-03 20:16                               ` Joao Martins
  2023-03-03 23:47                                 ` Alex Williamson
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-03-03 20:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 03/03/2023 19:40, Alex Williamson wrote:
> On Fri, 3 Mar 2023 19:14:50 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 03/03/2023 17:05, Alex Williamson wrote:
>>> On Fri, 3 Mar 2023 16:58:55 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> On 03/03/2023 00:19, Joao Martins wrote:  
>>>>> On 02/03/2023 18:42, Alex Williamson wrote:    
>>>>>> On Thu, 2 Mar 2023 00:07:35 +0000
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>>>>>>>      multiple_devices_migration_blocker = NULL;
>>>>>>>  }
>>>>>>>
>>>>>>> +static bool vfio_have_giommu(VFIOContainer *container)
>>>>>>> +{
>>>>>>> +    return !QLIST_EMPTY(&container->giommu_list);
>>>>>>> +}    
>>>>>>
>>>>>> I think it's the case, but can you confirm we build the giommu_list
>>>>>> regardless of whether the vIOMMU is actually enabled?
>>>>>>    
>>>>> I think that is only non-empty when we have the first IOVA mappings e.g. on
>>>>> IOMMU passthrough mode *I think* it's empty. Let me confirm.
>>>>>     
>>>> Yeap, it's empty.
>>>>  
>>>>> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
>>>>> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
>>>>>     
>>>> I am trying this way, with something like this, but neither
>>>> x86_iommu_get_default() nor below is really working out yet. A little afraid of
>>>> having to add the live migration blocker on each machine_init_done hook, unless
>>>> t here's a more obvious way. vfio_realize should be at a much later stage, so I
>>>> am surprised how an IOMMU object doesn't exist at that time.  
>>>
>>> Can we just test whether the container address space is system_memory?  
>>
>> IIUC, it doesn't work (see below snippet).
>>
>> The problem is that you start as a regular VFIO guest, and when the guest boot
>> is when new mappings get established/invalidated and propagated into listeners
>> (vfio_listener_region_add) and they morph into having a giommu. And that's when
>> you can figure out in higher layers that 'you have a vIOMMU' as that's when the
>> address space gets changed? That is without being specific to a particular IOMMU
>> model. Maybe region_add is where to add, but then it then depends on the guest.
> 
> This doesn't seem right to me, look for instance at
> pci_device_iommu_address_space() which returns address_space_memory
> when there is no vIOMMU.  If devices share an address space, they can
> share a container.  When a vIOMMU is present (not even enabled), each
> device gets it's own container due to the fact that it's in its own
> address space (modulo devices within the same address space due to
> aliasing).

You're obviously right, I was reading this whole thing wrong. This works as far
as I have tested with an iommu=pt guest (and without a vIOMMU).

I am going to shape this up and hopefully submit v3 overnight.

@@ -416,9 +416,26 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }

-static bool vfio_have_giommu(VFIOContainer *container)
+static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+
+int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
 {
-    return !QLIST_EMPTY(&container->giommu_list);
+    int ret;
+
+    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI &&
+       !vfio_has_iommu(vbasedev)) {
+       return 0;
+    }
+
+    error_setg(&giommu_migration_blocker,
+               "Migration is currently not supported with vIOMMU enabled");
+    ret = migrate_add_blocker(giommu_migration_blocker, errp);
+    if (ret < 0) {
+        error_free(giommu_migration_blocker);
+        giommu_migration_blocker = NULL;
+    }
+
+    return ret;
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 939dcc3d4a9e..f4cf0b41a157 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2843,6 +2843,15 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }

+bool vfio_has_iommu(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    AddressSpace *as = &address_space_memory;
+
+    return !(pci_device_iommu_address_space(pdev) == as);
+}
+




* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 20:16                               ` Joao Martins
@ 2023-03-03 23:47                                 ` Alex Williamson
  2023-03-03 23:57                                   ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Alex Williamson @ 2023-03-03 23:47 UTC (permalink / raw)
  To: Joao Martins
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On Fri, 3 Mar 2023 20:16:19 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 03/03/2023 19:40, Alex Williamson wrote:
> > On Fri, 3 Mar 2023 19:14:50 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 03/03/2023 17:05, Alex Williamson wrote:  
> >>> On Fri, 3 Mar 2023 16:58:55 +0000
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> On 03/03/2023 00:19, Joao Martins wrote:    
> >>>>> On 02/03/2023 18:42, Alex Williamson wrote:      
> >>>>>> On Thu, 2 Mar 2023 00:07:35 +0000
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:      
> >>>>>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
> >>>>>>>      multiple_devices_migration_blocker = NULL;
> >>>>>>>  }
> >>>>>>>
> >>>>>>> +static bool vfio_have_giommu(VFIOContainer *container)
> >>>>>>> +{
> >>>>>>> +    return !QLIST_EMPTY(&container->giommu_list);
> >>>>>>> +}      
> >>>>>>
> >>>>>> I think it's the case, but can you confirm we build the giommu_list
> >>>>>> regardless of whether the vIOMMU is actually enabled?
> >>>>>>      
> >>>>> I think that is only non-empty when we have the first IOVA mappings e.g. on
> >>>>> IOMMU passthrough mode *I think* it's empty. Let me confirm.
> >>>>>       
> >>>> Yeap, it's empty.
> >>>>    
> >>>>> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
> >>>>> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
> >>>>>       
> >>>> I am trying this way, with something like this, but neither
> >>>> x86_iommu_get_default() nor below is really working out yet. A little afraid of
> >>>> having to add the live migration blocker on each machine_init_done hook, unless
> >>>> t here's a more obvious way. vfio_realize should be at a much later stage, so I
> >>>> am surprised how an IOMMU object doesn't exist at that time.    
> >>>
> >>> Can we just test whether the container address space is system_memory?    
> >>
> >> IIUC, it doesn't work (see below snippet).
> >>
> >> The problem is that you start as a regular VFIO guest, and when the guest boot
> >> is when new mappings get established/invalidated and propagated into listeners
> >> (vfio_listener_region_add) and they morph into having a giommu. And that's when
> >> you can figure out in higher layers that 'you have a vIOMMU' as that's when the
> >> address space gets changed? That is without being specific to a particular IOMMU
> >> model. Maybe region_add is where to add, but then it then depends on the guest.  
> > 
> > This doesn't seem right to me, look for instance at
> > pci_device_iommu_address_space() which returns address_space_memory
> > when there is no vIOMMU.  If devices share an address space, they can
> > share a container.  When a vIOMMU is present (not even enabled), each
> > device gets it's own container due to the fact that it's in its own
> > address space (modulo devices within the same address space due to
> > aliasing).  
> 
> You're obviously right, I was reading this whole thing wrong. This works as far
> as I tested with an iommu=pt guest (and without an vIOMMU).
> 
> I am gonna shape this up, and hopefully submit v3 during over night.
> 
> @@ -416,9 +416,26 @@ void vfio_unblock_multiple_devices_migration(void)
>      multiple_devices_migration_blocker = NULL;
>  }
> 
> -static bool vfio_have_giommu(VFIOContainer *container)
> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
> +
> +int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
>  {
> -    return !QLIST_EMPTY(&container->giommu_list);
> +    int ret;
> +
> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI &&
> +       !vfio_has_iommu(vbasedev)) {
> +       return 0;
> +    }
> +
> +    error_setg(&giommu_migration_blocker,
> +               "Migration is currently not supported with vIOMMU enabled");
> +    ret = migrate_add_blocker(giommu_migration_blocker, errp);
> +    if (ret < 0) {
> +        error_free(giommu_migration_blocker);
> +        giommu_migration_blocker = NULL;
> +    }
> +
> +    return ret;
>  }
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 939dcc3d4a9e..f4cf0b41a157 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2843,6 +2843,15 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>      vdev->req_enabled = false;
>  }
> 
> +bool vfio_has_iommu(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    AddressSpace *as = &address_space_memory;
> +
> +    return !(pci_device_iommu_address_space(pdev) == as);
> +}


Shouldn't this be something non-PCI specific like:

    return vbasedev->group->container->space->as != &address_space_memory;
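
i.e. slotted into the blocker helper from your earlier snippet, roughly
(untested sketch, not necessarily what v3 should look like verbatim):

    static Error *giommu_migration_blocker;     /* file scope, as before */

    int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
    {
        int ret;

        if (vbasedev->group->container->space->as == &address_space_memory) {
            return 0;   /* no vIOMMU backing this container */
        }

        error_setg(&giommu_migration_blocker,
                   "Migration is currently not supported with vIOMMU enabled");
        ret = migrate_add_blocker(giommu_migration_blocker, errp);
        if (ret < 0) {
            error_free(giommu_migration_blocker);
            giommu_migration_blocker = NULL;
        }

        return ret;
    }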

Thanks,
Alex




* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 23:47                                 ` Alex Williamson
@ 2023-03-03 23:57                                   ` Joao Martins
  2023-03-04  0:21                                     ` Joao Martins
  0 siblings, 1 reply; 93+ messages in thread
From: Joao Martins @ 2023-03-03 23:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 03/03/2023 23:47, Alex Williamson wrote:
> On Fri, 3 Mar 2023 20:16:19 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 03/03/2023 19:40, Alex Williamson wrote:
>>> On Fri, 3 Mar 2023 19:14:50 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> On 03/03/2023 17:05, Alex Williamson wrote:  
>>>>> On Fri, 3 Mar 2023 16:58:55 +0000
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>     
>>>>>> On 03/03/2023 00:19, Joao Martins wrote:    
>>>>>>> On 02/03/2023 18:42, Alex Williamson wrote:      
>>>>>>>> On Thu, 2 Mar 2023 00:07:35 +0000
>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:      
>>>>>>>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>>>>>>>>>      multiple_devices_migration_blocker = NULL;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +static bool vfio_have_giommu(VFIOContainer *container)
>>>>>>>>> +{
>>>>>>>>> +    return !QLIST_EMPTY(&container->giommu_list);
>>>>>>>>> +}      
>>>>>>>>
>>>>>>>> I think it's the case, but can you confirm we build the giommu_list
>>>>>>>> regardless of whether the vIOMMU is actually enabled?
>>>>>>>>      
>>>>>>> I think that is only non-empty when we have the first IOVA mappings e.g. on
>>>>>>> IOMMU passthrough mode *I think* it's empty. Let me confirm.
>>>>>>>       
>>>>>> Yeap, it's empty.
>>>>>>    
>>>>>>> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
>>>>>>> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
>>>>>>>       
>>>>>> I am trying this way, with something like this, but neither
>>>>>> x86_iommu_get_default() nor below is really working out yet. A little afraid of
>>>>>> having to add the live migration blocker on each machine_init_done hook, unless
>>>>>> t here's a more obvious way. vfio_realize should be at a much later stage, so I
>>>>>> am surprised how an IOMMU object doesn't exist at that time.    
>>>>>
>>>>> Can we just test whether the container address space is system_memory?    
>>>>
>>>> IIUC, it doesn't work (see below snippet).
>>>>
>>>> The problem is that you start as a regular VFIO guest, and when the guest boot
>>>> is when new mappings get established/invalidated and propagated into listeners
>>>> (vfio_listener_region_add) and they morph into having a giommu. And that's when
>>>> you can figure out in higher layers that 'you have a vIOMMU' as that's when the
>>>> address space gets changed? That is without being specific to a particular IOMMU
>>>> model. Maybe region_add is where to add, but then it then depends on the guest.  
>>>
>>> This doesn't seem right to me, look for instance at
>>> pci_device_iommu_address_space() which returns address_space_memory
>>> when there is no vIOMMU.  If devices share an address space, they can
>>> share a container.  When a vIOMMU is present (not even enabled), each
>>> device gets it's own container due to the fact that it's in its own
>>> address space (modulo devices within the same address space due to
>>> aliasing).  
>>
>> You're obviously right, I was reading this whole thing wrong. This works as far
>> as I tested with an iommu=pt guest (and without an vIOMMU).
>>
>> I am gonna shape this up, and hopefully submit v3 during over night.
>>
>> @@ -416,9 +416,26 @@ void vfio_unblock_multiple_devices_migration(void)
>>      multiple_devices_migration_blocker = NULL;
>>  }
>>
>> -static bool vfio_have_giommu(VFIOContainer *container)
>> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>> +
>> +int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
>>  {
>> -    return !QLIST_EMPTY(&container->giommu_list);
>> +    int ret;
>> +
>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI &&
>> +       !vfio_has_iommu(vbasedev)) {
>> +       return 0;
>> +    }
>> +
>> +    error_setg(&giommu_migration_blocker,
>> +               "Migration is currently not supported with vIOMMU enabled");
>> +    ret = migrate_add_blocker(giommu_migration_blocker, errp);
>> +    if (ret < 0) {
>> +        error_free(giommu_migration_blocker);
>> +        giommu_migration_blocker = NULL;
>> +    }
>> +
>> +    return ret;
>>  }
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 939dcc3d4a9e..f4cf0b41a157 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2843,6 +2843,15 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>      vdev->req_enabled = false;
>>  }
>>
>> +bool vfio_has_iommu(VFIODevice *vbasedev)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    AddressSpace *as = &address_space_memory;
>> +
>> +    return !(pci_device_iommu_address_space(pdev) == as);
>> +}
> 
> 
> Shouldn't this be something non-PCI specific like:
> 
>     return vbasedev->group->container->space != &address_space_memory;
> 

Yes, much better, I've applied the following (partial diff below):

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6cd0100bbe09..60af3c3018dc 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -421,8 +421,7 @@ int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
 {
     int ret;

-    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI &&
-       !vfio_has_iommu(vbasedev)) {
+    if (vbasedev->group->container->space->as == &address_space_memory) {
        return 0;
     }
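
For reference, putting the two hunks together, the resulting function
reads roughly as below (a sketch assembled from the diffs quoted above,
assuming the blocker handling stays as in the earlier snippet):

/*
 * Sketch only: combined view of vfio_block_giommu_migration() with the
 * container address space check applied.
 */
int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
{
    int ret;

    /* No vIOMMU: the device's container is backed by the system address space. */
    if (vbasedev->group->container->space->as == &address_space_memory) {
        return 0;
    }

    error_setg(&giommu_migration_blocker,
               "Migration is currently not supported with vIOMMU enabled");
    ret = migrate_add_blocker(giommu_migration_blocker, errp);
    if (ret < 0) {
        error_free(giommu_migration_blocker);
        giommu_migration_blocker = NULL;
    }

    return ret;
}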



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
  2023-03-03 23:57                                   ` Joao Martins
@ 2023-03-04  0:21                                     ` Joao Martins
  0 siblings, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-03-04  0:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 03/03/2023 23:57, Joao Martins wrote:
> On 03/03/2023 23:47, Alex Williamson wrote:
>> On Fri, 3 Mar 2023 20:16:19 +0000
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> On 03/03/2023 19:40, Alex Williamson wrote:
>>>> On Fri, 3 Mar 2023 19:14:50 +0000
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>   
>>>>> On 03/03/2023 17:05, Alex Williamson wrote:  
>>>>>> On Fri, 3 Mar 2023 16:58:55 +0000
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>     
>>>>>>> On 03/03/2023 00:19, Joao Martins wrote:    
>>>>>>>> On 02/03/2023 18:42, Alex Williamson wrote:      
>>>>>>>>> On Thu, 2 Mar 2023 00:07:35 +0000
>>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:      
>>>>>>>>>> @@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
>>>>>>>>>>      multiple_devices_migration_blocker = NULL;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static bool vfio_have_giommu(VFIOContainer *container)
>>>>>>>>>> +{
>>>>>>>>>> +    return !QLIST_EMPTY(&container->giommu_list);
>>>>>>>>>> +}      
>>>>>>>>>
>>>>>>>>> I think it's the case, but can you confirm we build the giommu_list
>>>>>>>>> regardless of whether the vIOMMU is actually enabled?
>>>>>>>>>      
>>>>>>>> I think that is only non-empty when we have the first IOVA mappings e.g. on
>>>>>>>> IOMMU passthrough mode *I think* it's empty. Let me confirm.
>>>>>>>>       
>>>>>>> Yeap, it's empty.
>>>>>>>    
>>>>>>>> Otherwise I'll have to find a TYPE_IOMMU_MEMORY_REGION object to determine if
>>>>>>>> the VM was configured with a vIOMMU or not. That is to create the LM blocker.
>>>>>>>>       
>>>>>>> I am trying this way, with something like this, but neither
>>>>>>> x86_iommu_get_default() nor below is really working out yet. A little afraid of
>>>>>>> having to add the live migration blocker on each machine_init_done hook, unless
>>>>>>> t here's a more obvious way. vfio_realize should be at a much later stage, so I
>>>>>>> am surprised how an IOMMU object doesn't exist at that time.    
>>>>>>
>>>>>> Can we just test whether the container address space is system_memory?    
>>>>>
>>>>> IIUC, it doesn't work (see below snippet).
>>>>>
>>>>> The problem is that you start as a regular VFIO guest, and when the guest boot
>>>>> is when new mappings get established/invalidated and propagated into listeners
>>>>> (vfio_listener_region_add) and they morph into having a giommu. And that's when
>>>>> you can figure out in higher layers that 'you have a vIOMMU' as that's when the
>>>>> address space gets changed? That is without being specific to a particular IOMMU
>>>>> model. Maybe region_add is where to add, but then it then depends on the guest.  
>>>>
>>>> This doesn't seem right to me, look for instance at
>>>> pci_device_iommu_address_space() which returns address_space_memory
>>>> when there is no vIOMMU.  If devices share an address space, they can
>>>> share a container.  When a vIOMMU is present (not even enabled), each
>>>> device gets it's own container due to the fact that it's in its own
>>>> address space (modulo devices within the same address space due to
>>>> aliasing).  
>>>
>>> You're obviously right, I was reading this whole thing wrong. This works as far
>>> as I tested with an iommu=pt guest (and without an vIOMMU).
>>>
>>> I am gonna shape this up, and hopefully submit v3 during over night.
>>>
>>> @@ -416,9 +416,26 @@ void vfio_unblock_multiple_devices_migration(void)
>>>      multiple_devices_migration_blocker = NULL;
>>>  }
>>>
>>> -static bool vfio_have_giommu(VFIOContainer *container)
>>> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>>> +
>>> +int vfio_block_giommu_migration(VFIODevice *vbasedev, Error **errp)
>>>  {
>>> -    return !QLIST_EMPTY(&container->giommu_list);
>>> +    int ret;
>>> +
>>> +    if (vbasedev->type == VFIO_DEVICE_TYPE_PCI &&
>>> +       !vfio_has_iommu(vbasedev)) {
>>> +       return 0;
>>> +    }
>>> +
>>> +    error_setg(&giommu_migration_blocker,
>>> +               "Migration is currently not supported with vIOMMU enabled");
>>> +    ret = migrate_add_blocker(giommu_migration_blocker, errp);
>>> +    if (ret < 0) {
>>> +        error_free(giommu_migration_blocker);
>>> +        giommu_migration_blocker = NULL;
>>> +    }
>>> +
>>> +    return ret;
>>>  }
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 939dcc3d4a9e..f4cf0b41a157 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -2843,6 +2843,15 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>      vdev->req_enabled = false;
>>>  }
>>>
>>> +bool vfio_has_iommu(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>> +    PCIDevice *pdev = &vdev->pdev;
>>> +    AddressSpace *as = &address_space_memory;
>>> +
>>> +    return !(pci_device_iommu_address_space(pdev) == as);
>>> +}
>>
>>
>> Shouldn't this be something non-PCI specific like:
>>
>>     return vbasedev->group->container->space != &address_space_memory;
>>
> 
> Yes, much better, I've applied the following (partial diff below):
> 
I've also structured this similarly to the other blocker wrt multiple vfio devices.

	Joao


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-03-02 14:52       ` Cédric Le Goater
  2023-03-02 16:30         ` Joao Martins
@ 2023-03-04  0:23         ` Joao Martins
  1 sibling, 0 replies; 93+ messages in thread
From: Joao Martins @ 2023-03-04  0:23 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon
  Cc: Alex Williamson, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta

On 02/03/2023 14:52, Cédric Le Goater wrote:
> Hello Joao,
> 
> On 3/2/23 14:24, Joao Martins wrote:
>> On 27/02/2023 14:09, Cédric Le Goater wrote:
>>> On 2/22/23 18:49, Avihai Horon wrote:
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -320,6 +320,41 @@ const MemoryRegionOps vfio_region_ops = {
>>>>     * Device state interfaces
>>>>     */
>>>>    +typedef struct {
>>>> +    unsigned long *bitmap;
>>>> +    hwaddr size;
>>>> +    hwaddr pages;
>>>> +} VFIOBitmap;
>>>> +
>>>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>>>> +{
>>>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
>>>
>>> I think using g_malloc0() for the VFIOBitmap should be fine. If QEMU can
>>> not allocate a couple of bytes, we are in trouble anyway.
>>>
>>
>> OOM situations are rather unpredictable, and switching to g_malloc0 means we
>> will exit ungracefully in the middle of fetching dirty bitmaps. And this
>> function (vfio_bitmap_alloc) overall will be allocating megabytes for terabyte
>> guests.
>>
>> It would be ok if we are initializing, but this is at runtime when we do
>> migration. I think we should stick with g_try_new0. exit on failure should be
>> reserved to failure to switch the kernel migration state whereby we are likely
>> to be dealing with a hardware failure and thus requires something more drastic.
> 
> I agree for large allocation :
> 
>     vbmap->bitmap = g_try_malloc0(vbmap->size);
> 
> but not for the smaller ones, like VFIOBitmap. You would have to
> convert some other g_malloc0() calls, like the one allocating 'unmap'
> in vfio_dma_unmap_bitmap(), to be consistent.
> 
> Given the size of VFIOBitmap, I think it could live on the stack in
> routine vfio_dma_unmap_bitmap() and routine vfio_get_dirty_bitmap()
> since the reference is not kept.
> 
> The 'vbmap' attribute of vfio_giommu_dirty_notifier does not need
> to be a pointer either.
> 
> vfio_bitmap_alloc(hwaddr size) could then become
> vfio_bitmap_init(VFIOBitmap *vbmap, hwaddr size).
> 
> Anyhow, this is minor. It would simplify a bit the exit path
> and error handling.
>

FWIW, I've addressed this in v3, following your suggestion.
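
For reference, a minimal sketch of what such an init helper could look
like (the pages/size arithmetic below is only illustrative, borrowed
from the pre-existing dirty bitmap handling, and is not necessarily
what v3 ends up with):

/* Sketch only: the VFIOBitmap lives on the caller's stack and only the
 * (potentially large) bitmap itself uses a fallible allocation. */
static int vfio_bitmap_init(VFIOBitmap *vbmap, hwaddr size)
{
    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
                  BITS_PER_BYTE;
    vbmap->bitmap = g_try_malloc0(vbmap->size);
    if (!vbmap->bitmap) {
        return -ENOMEM;
    }

    return 0;
}

Callers such as vfio_dma_unmap_bitmap() and vfio_get_dirty_bitmap() can
then declare a VFIOBitmap on the stack, check the return value, and free
only vbmap.bitmap on the exit path, which is what simplifies the error
handling.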

	Joao


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
  2023-03-01 22:39                       ` Alex Williamson
@ 2023-03-06 19:01                         ` Jason Gunthorpe
  0 siblings, 0 replies; 93+ messages in thread
From: Jason Gunthorpe @ 2023-03-06 19:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Cédric Le Goater, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, Mar 01, 2023 at 03:39:17PM -0700, Alex Williamson wrote:
> On Wed, 1 Mar 2023 17:12:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:
> > 
> > > So it seems like what we need here is both a preface buffer size and a
> > > target device latency.  The QEMU pre-copy algorithm should factor both
> > > the remaining data size and the device latency into deciding when to
> > > transition to stop-copy, thereby allowing the device to feed actually
> > > relevant data into the algorithm rather than dictate its behavior.  
> > 
> > I don't know that we can realistically estimate startup latency,
> > especially have the sender estimate latency on the receiver..
> 
> Knowing that the target device is compatible with the source is a point
> towards making an educated guess.
> 
> > I feel like trying to overlap the device start up with the STOP phase
> > is an unnecessary optimization? How do you see it benifits?
> 
> If we can't guarantee that there's some time difference between sending
> initial bytes immediately at the end of pre-copy vs immediately at the
> beginning of stop-copy, does that mean any handling of initial bytes is
> an unnecessary optimization?

Sure, if the device doesn't implement an initial_bytes startup phase
then it is all pointless, but those devices should probably return 0
for initial_bytes. If we see initial_bytes and assume it indicates a
startup phase, why not do it?

> I'm imagining that completing initial bytes triggers some
> initialization sequence in the target host driver which runs in
> parallel to the remaining data stream, so in practice, even if sent at
> the beginning of stop-copy, the target device gets a head start.

It isn't parallel in mlx5. The load operation of the initial bytes on
the receiver will execute the load command, and that command takes an
amount of time roughly proportional to how much data is in the device.
IIRC the mlx5 VFIO driver will block the read until this finishes.

It is convoluted, but ultimately it is allocating (potentially a lot
of) pages in the hypervisor kernel, so the time predictability is not
very good.

Other device types we are looking at might make network connections at
this step, e.g. a storage device might open a network connection to its
back end. This could be unpredictably long in degenerate cases.

> > I've been thinking of this from the perspective that we should always
> > ensure device startup is completed, it is time that has to be paid,
> > why pay it during STOP?
> 
> Creating a policy for QEMU to send initial bytes in a given phase
> doesn't ensure startup is complete.  There's no guaranteed time
> difference between sending that data and the beginning of stop-copy.

As I've said, to really do a good job here we want the sender to wait
until the receiver completes startup, and not just treat it as a
unidirectional byte stream. That isn't this patch.

> QEMU is trying to achieve a downtime goal, where it estimates network
> bandwidth to get a data size threshold, and then polls devices for
> remaining data.  That downtime goal might exceed the startup latency of
> the target device anyway, where it's then the operators choice to pay
> that time in stop-copy, or stalled on the target.

If you are saying there should be a policy flag ('optimize for total
migration time' vs 'optimize for minimum downtime'), that seems
reasonable, though I wonder who would pick the first option.
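
For context, the downtime-goal check being discussed boils down to
something like the following (an illustrative sketch with made-up
names, not the actual QEMU migration code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Keep pre-copying until the estimated remaining data (including the
 * VFIO device state) fits into the downtime budget at the estimated
 * bandwidth; only then transition to stop-copy.
 */
static bool should_enter_stop_copy(uint64_t remaining_bytes,
                                   uint64_t bandwidth_bytes_per_ms,
                                   uint64_t downtime_limit_ms)
{
    uint64_t threshold_bytes = bandwidth_bytes_per_ms * downtime_limit_ms;

    return remaining_bytes <= threshold_bytes;
}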
 
> But if we actually want to ensure startup of the target is complete,
> then drivers should be able to return both data size and estimated time
> for the target device to initialize.  That time estimate should be
> updated by the driver based on if/when initial_bytes is drained.  The
> decision whether to continue iterating pre-copy would then be based on
> both the maximum remaining device startup time and the calculated time
> based on remaining data size.

That seems complicated. Why not just wait for the other side to
acknowledge it has started the device? Then we aren't trying to guess.

AFAIK this sort of happens implicitly in this patch: once initial_bytes
is pushed, the next data that follows it will block on the pending load,
and the single socket will backpressure until the load is done.
Horrible, yes, but it is where QEMU is at. multi-fd is really
important :)

Jason


^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2023-03-06 19:07 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-22 17:48 [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
2023-02-22 17:48 ` [PATCH v2 01/20] migration: Pass threshold_size to .state_pending_{estimate, exact}() Avihai Horon via
2023-02-22 17:48 ` [PATCH v2 02/20] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
2023-02-27 14:10   ` Cédric Le Goater
2023-02-22 17:48 ` [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-02-22 20:58   ` Alex Williamson
2023-02-23 15:25     ` Avihai Horon
2023-02-23 21:16       ` Alex Williamson
2023-02-26 16:43         ` Avihai Horon
2023-02-27 16:14           ` Alex Williamson
2023-02-27 17:26             ` Jason Gunthorpe
2023-02-27 17:43               ` Alex Williamson
2023-03-01 18:49                 ` Avihai Horon
2023-03-01 19:55                   ` Alex Williamson
2023-03-01 21:12                     ` Jason Gunthorpe
2023-03-01 22:39                       ` Alex Williamson
2023-03-06 19:01                         ` Jason Gunthorpe
2023-02-22 17:48 ` [PATCH v2 04/20] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
2023-02-22 17:49 ` [PATCH v2 05/20] vfio/common: Fix wrong %m usages Avihai Horon
2023-02-22 17:49 ` [PATCH v2 06/20] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
2023-02-22 17:49 ` [PATCH v2 07/20] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
2023-02-22 21:40   ` Alex Williamson
2023-02-23 15:27     ` Avihai Horon
2023-02-27 14:09   ` Cédric Le Goater
2023-03-01 18:56     ` Avihai Horon
2023-03-02 13:24     ` Joao Martins
2023-03-02 14:52       ` Cédric Le Goater
2023-03-02 16:30         ` Joao Martins
2023-03-04  0:23         ` Joao Martins
2023-02-22 17:49 ` [PATCH v2 08/20] util: Add iova_tree_nnodes() Avihai Horon
2023-02-22 17:49 ` [PATCH v2 09/20] util: Extend iova_tree_foreach() to take data argument Avihai Horon
2023-02-22 17:49 ` [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
2023-02-22 22:10   ` Alex Williamson
2023-02-23 10:37     ` Joao Martins
2023-02-23 21:05       ` Alex Williamson
2023-02-23 21:19         ` Joao Martins
2023-02-23 21:50           ` Alex Williamson
2023-02-23 21:54             ` Joao Martins
2023-02-28 12:11             ` Joao Martins
2023-02-28 20:36               ` Alex Williamson
2023-03-02  0:07                 ` Joao Martins
2023-03-02  0:13                   ` Joao Martins
2023-03-02 18:42                   ` Alex Williamson
2023-03-03  0:19                     ` Joao Martins
2023-03-03 16:58                       ` Joao Martins
2023-03-03 17:05                         ` Alex Williamson
2023-03-03 19:14                           ` Joao Martins
2023-03-03 19:40                             ` Alex Williamson
2023-03-03 20:16                               ` Joao Martins
2023-03-03 23:47                                 ` Alex Williamson
2023-03-03 23:57                                   ` Joao Martins
2023-03-04  0:21                                     ` Joao Martins
2023-02-22 17:49 ` [PATCH v2 11/20] vfio/common: Add device dirty page tracking start/stop Avihai Horon
2023-02-22 22:40   ` Alex Williamson
2023-02-23  2:02     ` Jason Gunthorpe
2023-02-23 19:27       ` Alex Williamson
2023-02-23 19:30         ` Jason Gunthorpe
2023-02-23 20:16           ` Alex Williamson
2023-02-23 20:54             ` Jason Gunthorpe
2023-02-26 16:54               ` Avihai Horon
2023-02-23 15:36     ` Avihai Horon
2023-02-22 17:49 ` [PATCH v2 12/20] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function Avihai Horon
2023-02-22 17:49 ` [PATCH v2 13/20] vfio/common: Add device dirty page bitmap sync Avihai Horon
2023-02-22 17:49 ` [PATCH v2 14/20] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Avihai Horon
2023-02-22 17:49 ` [PATCH v2 15/20] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
2023-02-22 17:49 ` [PATCH v2 16/20] intel-iommu: Implement get_attr() method Avihai Horon
2023-02-22 17:49 ` [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
2023-02-22 23:34   ` Alex Williamson
2023-02-23  2:08     ` Jason Gunthorpe
2023-02-23 20:06       ` Alex Williamson
2023-02-23 20:55         ` Jason Gunthorpe
2023-02-23 21:30           ` Joao Martins
2023-02-23 22:33           ` Alex Williamson
2023-02-23 23:26             ` Jason Gunthorpe
2023-02-24 11:25               ` Joao Martins
2023-02-24 12:53                 ` Joao Martins
2023-02-24 15:47                   ` Jason Gunthorpe
2023-02-24 15:56                   ` Alex Williamson
2023-02-24 19:16                     ` Joao Martins
2023-02-22 17:49 ` [PATCH v2 18/20] vfio/common: Optimize " Avihai Horon
2023-02-22 17:49 ` [PATCH v2 19/20] vfio/migration: Query device dirty page tracking support Avihai Horon
2023-02-22 17:49 ` [PATCH v2 20/20] docs/devel: Document VFIO device dirty page tracking Avihai Horon
2023-02-27 14:29   ` Cédric Le Goater
2023-02-22 18:00 ` [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
2023-02-22 20:55 ` Alex Williamson
2023-02-23 10:05   ` Cédric Le Goater
2023-02-23 15:07     ` Avihai Horon
2023-02-27 10:24       ` Cédric Le Goater
2023-02-23 14:56   ` Avihai Horon
2023-02-24 19:26     ` Joao Martins
2023-02-26 17:00       ` Avihai Horon
2023-02-27 13:50         ` Cédric Le Goater
2023-03-01 19:04           ` Avihai Horon
