* [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking
@ 2023-01-26 18:49 Avihai Horon
  2023-01-26 18:49 ` [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Hello,

This series is based on the previous one that added the basic VFIO
migration protocol v2 implementation [1].

The first patch in the series adds pre-copy support for VFIO migration
protocol v2. Pre-copy support allows the VFIO device data to be
transferred while the VM is running. This can improve performance and
reduce migration downtime. A full description of it can be found here [2].

The series then moves on to implement device dirty page tracking.
Device dirty page tracking allows the VFIO device to record its DMAs and
report them back when needed. This is part of VFIO migration and is used
during the pre-copy phase of migration to track the RAM pages that the
device has written to and mark those pages dirty, so they can later be
re-sent to the destination.

Device dirty page tracking uses the DMA logging uAPI to discover device
capabilities, to start and stop tracking, and to retrieve dirty page
bitmap reports. Extra details and the uAPI definition can be found here [3].
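
For reference, below is a minimal sketch of how a single logging range is
started through this uAPI (illustrative only; the function name and the
single-range simplification are mine and not part of this series, and a
kernel providing the DMA logging uAPI is assumed, see [3]):

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /*
     * Sketch: start device DMA logging on one IOVA range via the
     * VFIO_DEVICE_FEATURE ioctl. 'device_fd' is an open VFIO device fd.
     */
    static int dma_logging_start_one_range(int device_fd, uint64_t iova,
                                           uint64_t length, uint64_t page_size)
    {
        struct vfio_device_feature_dma_logging_range range = {
            .iova = iova,
            .length = length,
        };
        struct vfio_device_feature_dma_logging_control control = {
            .page_size = page_size,
            .num_ranges = 1,
            .ranges = (uintptr_t)&range,
        };
        /* The control payload follows the vfio_device_feature header. */
        uint64_t buf[(sizeof(struct vfio_device_feature) +
                      sizeof(control) + 7) / 8] = {0};
        struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_SET |
                         VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
        memcpy(feature->data, &control, sizeof(control));

        return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
    }

Stopping tracking is the same call with VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP
and no payload, and the dirty bitmap is retrieved with
VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT.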

Device dirty page tracking operates at VFIOContainer scope: when dirty
tracking is started or stopped, or a dirty page report is queried, all
devices within the VFIOContainer are iterated and the corresponding
operation is performed on each of them.

Device dirty page tracking is used only if all devices within a
VFIOContainer support it. Otherwise, VFIO IOMMU dirty page tracking is
used, and if that is not supported either, memory is perpetually marked
dirty by QEMU. Note that since VFIO IOMMU dirty page tracking has no HW
support, the last two options usually have the same effect of perpetually
marking all pages dirty.
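
In rough terms, the selection described above looks like this (a sketch
only; vfio_devices_all_device_dirty_tracking() and
vfio_set_dirty_page_tracking() are the helpers from this series and the
existing code, while vfio_devices_dma_logging_start() stands in for the
device-tracking start path added later in the series):

    /* Sketch of the dirty tracking selection, not the exact QEMU code. */
    if (vfio_devices_all_device_dirty_tracking(container)) {
        /* All devices support DMA logging: use device dirty tracking. */
        vfio_devices_dma_logging_start(container);
    } else if (container->dirty_pages_supported) {
        /* Fall back to VFIO IOMMU dirty tracking (no HW support today). */
        vfio_set_dirty_page_tracking(container, true);
    } else {
        /* Neither is supported: QEMU reports all memory as perpetually
         * dirty for the duration of the migration. */
    }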

Normally, when asked to start dirty tracking, all the currently DMA
mapped ranges are tracked by device dirty page tracking. However, when a
vIOMMU is enabled, IOVA ranges are DMA mapped/unmapped on the fly as the
vIOMMU maps/unmaps them, and these ranges can potentially be mapped
anywhere in the vIOMMU IOVA space. Due to this dynamic nature of vIOMMU
mapping/unmapping, tracking only the currently DMA mapped IOVA ranges is
not sufficient.

Thus, when a vIOMMU is enabled, we try to track the entire vIOMMU IOVA
space. If that fails (the IOVA space can be rather big and we might hit a
HW limitation), we fall back to tracking a smaller range while marking the
untracked ranges dirty.
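
Conceptually (an illustrative sketch with hypothetical helper names; the
real logic is in the vIOMMU patches of this series and uses the new
IOMMU_ATTR_MAX_IOVA attribute to size the range):

    /* Hypothetical sketch of the vIOMMU strategy described above. */
    max_iova = viommu_get_max_iova(container);     /* IOMMU_ATTR_MAX_IOVA */

    /* First, try to log the whole vIOMMU IOVA space. */
    if (device_dma_logging_start(container, 0, max_iova + 1) == 0) {
        return;
    }

    /* HW could not cover the full space: log a smaller range and treat
     * everything outside it as perpetually dirty. */
    device_dma_logging_start(container, 0, smaller_range(container));
    mark_untracked_ranges_dirty(container);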

Patch breakdown:
- Patch 1 adds VFIO migration pre-copy support.
- Patches 2-8 fix bugs and do some preparatory work required prior to
  adding device dirty page tracking.
- Patches 9-11 implement device dirty page tracking.
- Patches 12-16 add vIOMMU support to device dirty page tracking.
- Patches 17-18 enable device dirty page tracking and document it.

Thanks.

[1]
https://lore.kernel.org/qemu-devel/20230116141135.12021-1-avihaih@nvidia.com/

[2]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

[3]
https://lore.kernel.org/netdev/20220908183448.195262-4-yishaih@nvidia.com/

Avihai Horon (12):
  vfio/migration: Add VFIO migration pre-copy support
  vfio/common: Fix error reporting in vfio_get_dirty_bitmap()
  vfio/common: Fix wrong %m usages
  vfio/common: Abort migration if dirty log start/stop/sync fails
  vfio/common: Add VFIOBitmap and (de)alloc functions
  vfio/common: Extract code from vfio_get_dirty_bitmap() to new function
  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  intel-iommu: Implement get_attr() method
  vfio/common: Support device dirty page tracking with vIOMMU
  vfio/common: Optimize device dirty page tracking with vIOMMU
  docs/devel: Document VFIO device dirty page tracking

Joao Martins (6):
  util: Add iova_tree_nnodes()
  util: Extend iova_tree_foreach() to take data argument
  vfio/common: Record DMA mapped IOVA ranges
  vfio/common: Add device dirty page tracking start/stop
  vfio/common: Add device dirty page bitmap sync
  vfio/migration: Query device dirty page tracking support

 docs/devel/vfio-migration.rst |  79 ++--
 include/exec/memory.h         |   3 +-
 include/hw/vfio/vfio-common.h |  10 +
 include/qemu/iova-tree.h      |  19 +-
 hw/i386/intel_iommu.c         |  18 +
 hw/vfio/common.c              | 866 ++++++++++++++++++++++++++++++----
 hw/vfio/migration.c           | 127 ++++-
 util/iova-tree.c              |  23 +-
 hw/vfio/trace-events          |   5 +-
 9 files changed, 1006 insertions(+), 144 deletions(-)

-- 
2.26.3




* [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 23:52   ` Alex Williamson
  2023-01-26 18:49 ` [PATCH 02/18] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy for VFIO migration protocol v2 and use it for devices
that support it. A full description of it can be found here [1].

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
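
As a side note, a userspace sketch of how the capability is discovered
(not code from this patch; it assumes a kernel exposing
VFIO_DEVICE_FEATURE_MIGRATION with the VFIO_MIGRATION_PRE_COPY flag):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Sketch: return non-zero if the device reports pre-copy support. */
    static int vfio_device_supports_precopy(int device_fd)
    {
        uint64_t buf[(sizeof(struct vfio_device_feature) +
                      sizeof(struct vfio_device_feature_migration) + 7) / 8] = {0};
        struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
        struct vfio_device_feature_migration *mig =
            (struct vfio_device_feature_migration *)feature->data;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

        if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature)) {
            return 0;   /* migration not supported at all */
        }

        return !!(mig->flags & VFIO_MIGRATION_PRE_COPY);
    }

In QEMU this maps to the new mig_flags field checked in vfio_save_setup()
below.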

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  29 ++++++---
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   8 ++-
 hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   5 +-
 5 files changed, 140 insertions(+), 17 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1d50c2fe5f..51f5e1a537 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices can choose to skip
+the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
+flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 A detailed description of the UAPI for VFIO device migration can be found in
 the comment for the ``vfio_device_mig_state`` structure in the header file
@@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach as follows:
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only if the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -91,8 +99,10 @@ Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -104,11 +114,12 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
              If device is active, get pending_bytes by .save_live_pending()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5f8e7a02fe..88c2194fb9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -67,7 +67,10 @@ typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t mig_flags;
     uint64_t stop_copy_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9a0dbee6b4..93b18c5e3d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 
             if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
                 return false;
             }
         }
@@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
             }
 
             if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 760f667e04..2a0a663023 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -69,6 +69,10 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "RESUMING";
     case VFIO_DEVICE_STATE_RUNNING_P2P:
         return "RUNNING_P2P";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
+    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
+        return "PRE_COPY_P2P";
     default:
         return "UNKNOWN STATE";
     }
@@ -237,6 +241,11 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 1;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -260,6 +269,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
+    int ret;
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
@@ -273,6 +283,23 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -287,6 +314,12 @@ static void vfio_save_cleanup(void *opaque)
 
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
+
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+    }
+
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -301,9 +334,55 @@ static void vfio_save_pending(void *opaque, uint64_t threshold_size,
 
     *res_precopy_only += migration->stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+        migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
+        if (migration->precopy_init_size) {
+            /*
+             * Initial size should be transferred during pre-copy phase so
+             * stop-copy phase will not be slowed down. Report threshold_size
+             * to force another pre-copy iteration.
+             */
+            *res_precopy_only += threshold_size;
+        } else {
+            *res_precopy_only += migration->precopy_dirty_size;
+        }
+    }
+
     trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
                             *res_postcopy_only, *res_compatible,
-                            migration->stop_copy_size);
+                            migration->stop_copy_size,
+                            migration->precopy_init_size,
+                            migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_save_block(f, migration);
+    if (ret < 0) {
+        return ret;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -312,7 +391,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     enum vfio_device_mig_state recover_state;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     recover_state = VFIO_DEVICE_STATE_STOP;
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    recover_state);
@@ -430,6 +509,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -442,13 +523,19 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+              migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     ret = vfio_migration_set_state(vbasedev, new_state,
@@ -496,6 +583,9 @@ static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
 {
     VFIOMigration *migration = container_of(n, VFIOMigration, migration_data);
     VFIODevice *vbasedev = migration->vbasedev;
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
     PrecopyNotifyData *pnd = data;
 
     if (pnd->reason != PRECOPY_NOTIFY_AFTER_BITMAP_SYNC) {
@@ -515,8 +605,21 @@ static int vfio_migration_data_notifier(NotifierWithReturn *n, void *data)
         migration->stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
     }
 
+    if ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+         migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+        if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+            migration->precopy_init_size = 0;
+            migration->precopy_dirty_size = 0;
+        } else {
+            migration->precopy_init_size = precopy.initial_bytes;
+            migration->precopy_dirty_size = precopy.dirty_bytes;
+        }
+    }
+
     trace_vfio_migration_data_notifier(vbasedev->name,
-                                       migration->stop_copy_size);
+                                       migration->stop_copy_size,
+                                       migration->precopy_init_size,
+                                       migration->precopy_dirty_size);
 
     return 0;
 }
@@ -588,6 +691,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index db9cb94952..37724579e3 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,7 +154,7 @@ vfio_load_cleanup(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
-vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size) " (%s) stopcopy size 0x%"PRIx64
+vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
@@ -162,6 +162,7 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
-vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_save_iterate(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3




* [PATCH 02/18] vfio/common: Fix error reporting in vfio_get_dirty_bitmap()
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
  2023-01-26 18:49 ` [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-15  9:21   ` Cédric Le Goater
  2023-01-26 18:49 ` [PATCH 03/18] vfio/common: Fix wrong %m usages Avihai Horon
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Return -errno instead of -1 if VFIO_IOMMU_DIRTY_PAGES ioctl fails in
vfio_get_dirty_bitmap().

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 93b18c5e3d..d892609cf1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1288,6 +1288,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
     if (ret) {
+        ret = -errno;
         error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
                 " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
                 (uint64_t)range->size, errno);
-- 
2.26.3




* [PATCH 03/18] vfio/common: Fix wrong %m usages
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
  2023-01-26 18:49 ` [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
  2023-01-26 18:49 ` [PATCH 02/18] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-15  9:21   ` Cédric Le Goater
  2023-01-26 18:49 ` [PATCH 04/18] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

There are several places where the %m conversion is used if one of
vfio_dma_map(), vfio_dma_unmap() or vfio_get_dirty_bitmap() fails.

The %m usage in these places is wrong, since %m relies on the errno value
while the above functions don't report errors via errno.

Fix it by using strerror() with the returned error code instead.
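
For illustration, the before/after pattern (a fragment mirroring the diff
below, not standalone code):

    /* Wrong: %m expands strerror(errno), but these functions return a
     * negative error code and do not necessarily set errno. */
    error_report("vfio_dma_map(...) = %d (%m)", ret);

    /* Right: decode the returned error code explicitly. */
    error_report("vfio_dma_map(...) = %d (%s)", ret, strerror(-ret));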

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d892609cf1..643418f6f1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -656,17 +656,17 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                            read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         "0x%"HWADDR_PRIx", %p) = %d (%s)",
                          container, iova,
-                         iotlb->addr_mask + 1, vaddr, ret);
+                         iotlb->addr_mask + 1, vaddr, ret, strerror(-ret));
         }
     } else {
         ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         "0x%"HWADDR_PRIx") = %d (%s)",
                          container, iova,
-                         iotlb->addr_mask + 1, ret);
+                         iotlb->addr_mask + 1, ret, strerror(-ret));
         }
     }
 out:
@@ -1048,8 +1048,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
                        vaddr, section->readonly);
     if (ret) {
         error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
+                   "0x%"HWADDR_PRIx", %p) = %d (%s)",
+                   container, iova, int128_get64(llsize), vaddr, ret,
+                   strerror(-ret));
         if (memory_region_is_ram_device(section->mr)) {
             /* Allow unexpected mappings not to be fatal for RAM devices */
             error_report_err(err);
@@ -1181,16 +1182,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
             ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
             if (ret) {
                 error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                             "0x%"HWADDR_PRIx") = %d (%m)",
-                             container, iova, int128_get64(llsize), ret);
+                             "0x%"HWADDR_PRIx") = %d (%s)",
+                             container, iova, int128_get64(llsize), ret,
+                             strerror(-ret));
             }
             iova += int128_get64(llsize);
         }
         ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova, int128_get64(llsize), ret);
+                         "0x%"HWADDR_PRIx") = %d (%s)",
+                         container, iova, int128_get64(llsize), ret,
+                         strerror(-ret));
         }
     }
 
@@ -1337,9 +1340,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                                     translated_addr);
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova,
-                         iotlb->addr_mask + 1, ret);
+                         "0x%"HWADDR_PRIx") = %d (%s)",
+                         container, iova, iotlb->addr_mask + 1, ret,
+                         strerror(-ret));
         }
     }
     rcu_read_unlock();
-- 
2.26.3




* [PATCH 04/18] vfio/common: Abort migration if dirty log start/stop/sync fails
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (2 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 03/18] vfio/common: Fix wrong %m usages Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-15  9:41   ` Cédric Le Goater
  2023-01-26 18:49 ` [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

If VFIO dirty pages log start/stop/sync fails during migration, the
migration should be aborted, as pages dirtied by VFIO devices might not
be reported properly.

This is not the case today: in such a scenario only an error message is
printed.

Fix it by aborting the migration in the above scenario.

Fixes: 758b96b61d5c ("vfio/migrate: Move switch of dirty tracking into vfio_memory_listener")
Fixes: b6dd6504e303 ("vfio: Add vfio_listener_log_sync to mark dirty pages")
Fixes: 9e7b0442f23a ("vfio: Add ioctl to get dirty pages bitmap during dma unmap")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 53 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 45 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 643418f6f1..8e8ffbc046 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -41,6 +41,7 @@
 #include "qapi/error.h"
 #include "migration/migration.h"
 #include "migration/misc.h"
+#include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 
 VFIOGroupList vfio_group_list =
@@ -337,6 +338,19 @@ bool vfio_mig_active(void)
     return true;
 }
 
+static void vfio_set_migration_error(int err)
+{
+    MigrationState *ms = migrate_get_current();
+
+    if (migration_is_setup_or_active(ms->state)) {
+        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
+            if (ms->to_dst_file) {
+                qemu_file_set_error(ms->to_dst_file, err);
+            }
+        }
+    }
+}
+
 static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 {
     VFIOGroup *group;
@@ -633,6 +647,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
                      iotlb->target_as->name ? iotlb->target_as->name : "none");
+        vfio_set_migration_error(-EINVAL);
         return;
     }
 
@@ -667,6 +682,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          "0x%"HWADDR_PRIx") = %d (%s)",
                          container, iova,
                          iotlb->addr_mask + 1, ret, strerror(-ret));
+            vfio_set_migration_error(ret);
         }
     }
 out:
@@ -1212,7 +1228,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
-static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
 {
     int ret;
     struct vfio_iommu_type1_dirty_bitmap dirty = {
@@ -1220,7 +1236,7 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     };
 
     if (!container->dirty_pages_supported) {
-        return;
+        return 0;
     }
 
     if (start) {
@@ -1231,23 +1247,34 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
     if (ret) {
+        ret = -errno;
         error_report("Failed to set dirty tracking flag 0x%x errno: %d",
                      dirty.flags, errno);
     }
+
+    return ret;
 }
 
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    int ret;
 
-    vfio_set_dirty_page_tracking(container, true);
+    ret = vfio_set_dirty_page_tracking(container, true);
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    int ret;
 
-    vfio_set_dirty_page_tracking(container, false);
+    ret = vfio_set_dirty_page_tracking(container, false);
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
 }
 
 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
@@ -1323,19 +1350,18 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     VFIOContainer *container = giommu->container;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
     ram_addr_t translated_addr;
+    int ret = -EINVAL;
 
     trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
                      iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
+        goto out;
     }
 
     rcu_read_lock();
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        int ret;
-
         ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
                                     translated_addr);
         if (ret) {
@@ -1346,6 +1372,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         }
     }
     rcu_read_unlock();
+
+out:
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
 }
 
 static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
@@ -1438,13 +1469,19 @@ static void vfio_listener_log_sync(MemoryListener *listener,
         MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    int ret;
 
     if (vfio_listener_skipped_section(section)) {
         return;
     }
 
     if (vfio_devices_all_dirty_tracking(container)) {
-        vfio_sync_dirty_bitmap(container, section);
+        ret = vfio_sync_dirty_bitmap(container, section);
+        if (ret) {
+            error_report("vfio: Failed to sync dirty bitmap, err: %d (%s)", ret,
+                         strerror(-ret));
+            vfio_set_migration_error(ret);
+        }
     }
 }
 
-- 
2.26.3




* [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (3 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 04/18] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-27 21:11   ` Alex Williamson
  2023-01-26 18:49 ` [PATCH 06/18] util: Add iova_tree_nnodes() Avihai Horon
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

There are already two places where dirty page bitmap allocation and
calculations are done in open code. With device dirty page tracking
being added in the next patches, there are going to be even more places.

To avoid code duplication, introduce a VFIOBitmap struct and corresponding
alloc and dealloc functions, and use them where applicable.
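
The resulting usage pattern in the converted callers (condensed from the
diff below, not standalone code) is:

    VFIOBitmap *vbmap = vfio_bitmap_alloc(size);
    if (!vbmap) {
        return -errno;      /* vfio_bitmap_alloc() sets errno = ENOMEM */
    }

    /* Hand the kernel the bitmap, sized and page-aligned by the helper. */
    range->bitmap.size = vbmap->size;
    range->bitmap.data = (__u64 *)vbmap->bitmap;
    ...
    cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
                                           vbmap->pages);
    vfio_bitmap_dealloc(vbmap);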

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 29 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8e8ffbc046..e554573eb5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -319,6 +319,41 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+typedef struct {
+    unsigned long *bitmap;
+    hwaddr size;
+    hwaddr pages;
+} VFIOBitmap;
+
+static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
+{
+    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
+    if (!vbmap) {
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    vbmap->size =  ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                          BITS_PER_BYTE;
+    vbmap->bitmap = g_try_malloc0(vbmap->size);
+    if (!vbmap->bitmap) {
+        g_free(vbmap);
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    return vbmap;
+}
+
+static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
+{
+    g_free(vbmap->bitmap);
+    g_free(vbmap);
+}
+
 bool vfio_mig_active(void)
 {
     VFIOGroup *group;
@@ -421,9 +456,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
 {
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
-    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    VFIOBitmap *vbmap;
     int ret;
 
+    vbmap = vfio_bitmap_alloc(size);
+    if (!vbmap) {
+        return -errno;
+    }
+
     unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
 
     unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
@@ -437,35 +477,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
      * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
      * to qemu_real_host_page_size.
      */
-
     bitmap->pgsize = qemu_real_host_page_size();
-    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                   BITS_PER_BYTE;
+    bitmap->size = vbmap->size;
+    bitmap->data = (__u64 *)vbmap->bitmap;
 
-    if (bitmap->size > container->max_dirty_bitmap_size) {
-        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
-                     (uint64_t)bitmap->size);
+    if (vbmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);
         ret = -E2BIG;
         goto unmap_exit;
     }
 
-    bitmap->data = g_try_malloc0(bitmap->size);
-    if (!bitmap->data) {
-        ret = -ENOMEM;
-        goto unmap_exit;
-    }
-
     ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
     if (!ret) {
-        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-                iotlb->translated_addr, pages);
+        cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap,
+                iotlb->translated_addr, vbmap->pages);
     } else {
         error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
     }
 
-    g_free(bitmap->data);
 unmap_exit:
     g_free(unmap);
+    vfio_bitmap_dealloc(vbmap);
+
     return ret;
 }
 
@@ -1282,7 +1315,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 {
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
-    uint64_t pages;
+    VFIOBitmap *vbmap;
     int ret;
 
     if (!container->dirty_pages_supported) {
@@ -1292,6 +1325,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         return 0;
     }
 
+    vbmap = vfio_bitmap_alloc(size);
+    if (!vbmap) {
+        return -errno;
+    }
+
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1306,15 +1344,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
      * to qemu_real_host_page_size.
      */
     range->bitmap.pgsize = qemu_real_host_page_size();
-
-    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size();
-    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                                         BITS_PER_BYTE;
-    range->bitmap.data = g_try_malloc0(range->bitmap.size);
-    if (!range->bitmap.data) {
-        ret = -ENOMEM;
-        goto err_out;
-    }
+    range->bitmap.size = vbmap->size;
+    range->bitmap.data = (__u64 *)vbmap->bitmap;
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
     if (ret) {
@@ -1325,14 +1356,14 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         goto err_out;
     }
 
-    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
-                                            ram_addr, pages);
+    cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
+                                           vbmap->pages);
 
     trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
                                 range->bitmap.size, ram_addr);
 err_out:
-    g_free(range->bitmap.data);
     g_free(dbitmap);
+    vfio_bitmap_dealloc(vbmap);
 
     return ret;
 }
-- 
2.26.3




* [PATCH 06/18] util: Add iova_tree_nnodes()
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (4 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-09 22:21   ` Peter Xu
  2023-01-26 18:49 ` [PATCH 07/18] util: Extend iova_tree_foreach() to take data argument Avihai Horon
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Add iova_tree_nnodes(), which returns the number of nodes in the IOVA
tree.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/qemu/iova-tree.h | 11 +++++++++++
 util/iova-tree.c         |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 8528e5c98f..7bb80783ce 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -164,4 +164,15 @@ int iova_tree_alloc_map(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
  */
 void iova_tree_destroy(IOVATree *tree);
 
+/**
+ * iova_tree_nnodes:
+ *
+ * @tree: the iova tree to consult
+ *
+ * Returns the number of nodes in the iova tree
+ *
+ * Return: >=0 for the number of nodes.
+ */
+gint iova_tree_nnodes(IOVATree *tree);
+
 #endif
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..6141a6229b 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -280,3 +280,8 @@ void iova_tree_destroy(IOVATree *tree)
     g_tree_destroy(tree->tree);
     g_free(tree);
 }
+
+gint iova_tree_nnodes(IOVATree *tree)
+{
+    return g_tree_nnodes(tree->tree);
+}
-- 
2.26.3




* [PATCH 07/18] util: Extend iova_tree_foreach() to take data argument
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (5 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 06/18] util: Add iova_tree_nnodes() Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-09 22:21   ` Peter Xu
  2023-01-26 18:49 ` [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Extend iova_tree_foreach() to take a data argument that is passed to and
used by the iterator.

While at it, fix a documentation error:
the documentation says iova_tree_foreach() returns a value even though
it is a void function.
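
For example, with the new signature a caller can thread state through the
walk without globals (an illustrative snippet, not code from this series):

    /* Count how many recorded mappings are writable. */
    static gboolean count_rw_mapping(DMAMap *map, gpointer data)
    {
        unsigned int *count = data;

        if (map->perm & IOMMU_WO) {
            (*count)++;
        }

        return false;   /* keep iterating */
    }

    ...
        unsigned int rw_mappings = 0;

        iova_tree_foreach(tree, count_rw_mapping, &rw_mappings);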

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/qemu/iova-tree.h |  8 +++++---
 util/iova-tree.c         | 18 ++++++++++++++----
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 7bb80783ce..1332dce014 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -38,7 +38,7 @@ typedef struct DMAMap {
     hwaddr size;                /* Inclusive */
     IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
-typedef gboolean (*iova_tree_iterator)(DMAMap *map);
+typedef gboolean (*iova_tree_iterator)(DMAMap *map, gpointer data);
 
 /**
  * iova_tree_new:
@@ -129,12 +129,14 @@ const DMAMap *iova_tree_find_address(const IOVATree *tree, hwaddr iova);
  *
  * @tree: the iova tree to iterate on
  * @iterator: the interator for the mappings, return true to stop
+ * @data: data to be passed to the iterator
  *
  * Iterate over the iova tree.
  *
- * Return: 1 if found any overlap, 0 if not, <0 if error.
+ * Return: None.
  */
-void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
+void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator,
+                       gpointer data);
 
 /**
  * iova_tree_alloc_map:
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 6141a6229b..9845427b86 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -42,6 +42,11 @@ typedef struct IOVATreeFindIOVAArgs {
     const DMAMap *result;
 } IOVATreeFindIOVAArgs;
 
+typedef struct IOVATreeIterator {
+    iova_tree_iterator fn;
+    gpointer data;
+} IOVATreeIterator;
+
 /**
  * Iterate args to the next hole
  *
@@ -151,17 +156,22 @@ int iova_tree_insert(IOVATree *tree, const DMAMap *map)
 static gboolean iova_tree_traverse(gpointer key, gpointer value,
                                 gpointer data)
 {
-    iova_tree_iterator iterator = data;
+    IOVATreeIterator *iterator = data;
     DMAMap *map = key;
 
     g_assert(key == value);
 
-    return iterator(map);
+    return iterator->fn(map, iterator->data);
 }
 
-void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator)
+void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator,
+                       gpointer data)
 {
-    g_tree_foreach(tree->tree, iova_tree_traverse, iterator);
+    IOVATreeIterator arg = {
+        .fn = iterator,
+        .data = data,
+    };
+    g_tree_foreach(tree->tree, iova_tree_traverse, &arg);
 }
 
 void iova_tree_remove(IOVATree *tree, DMAMap map)
-- 
2.26.3




* [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (6 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 07/18] util: Extend iova_tree_foreach() to take data argument Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-27 21:42   ` Alex Williamson
  2023-01-26 18:49 ` [PATCH 09/18] vfio/common: Add device dirty page tracking start/stop Avihai Horon
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

According to the device DMA logging uAPI, IOVA ranges to be logged by
the device must be provided all at once upon DMA logging start.

As preparation for the following patches, which will add device dirty
page tracking, keep a record of all DMA mapped IOVA ranges so they can
later be used when DMA logging is started.

Note that when a vIOMMU is enabled, DMA mapped IOVA ranges are not tracked.
This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
Following patches will address the vIOMMU case specifically.
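
One detail worth noting (also called out in comments in the diff): IOVATree
ranges are inclusive, so a mapping of 'size' bytes at 'iova' is stored as
[iova, iova + size - 1]. A condensed sketch of the record/erase pair:

    DMAMap map = {
        .iova = iova,
        .size = size - 1,   /* IOVATree is inclusive, subtract 1 from size */
        .perm = readonly ? IOMMU_RO : IOMMU_RW,
    };

    iova_tree_insert(container->mappings, &map);    /* on vfio_dma_map() */
    ...
    iova_tree_remove(container->mappings, map);     /* on vfio_dma_unmap() */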

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 86 +++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 88c2194fb9..d54000d7ae 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -23,6 +23,7 @@
 
 #include "exec/memory.h"
 #include "qemu/queue.h"
+#include "qemu/iova-tree.h"
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
@@ -94,6 +95,8 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    IOVATree *mappings;
+    QemuMutex mappings_mutex;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e554573eb5..fafc361cea 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -43,6 +43,7 @@
 #include "migration/misc.h"
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
+#include "qemu/iova-tree.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -373,6 +374,11 @@ bool vfio_mig_active(void)
     return true;
 }
 
+static bool vfio_have_giommu(VFIOContainer *container)
+{
+    return !QLIST_EMPTY(&container->giommu_list);
+}
+
 static void vfio_set_migration_error(int err)
 {
     MigrationState *ms = migrate_get_current();
@@ -450,6 +456,51 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
     return true;
 }
 
+static int vfio_record_mapping(VFIOContainer *container, hwaddr iova,
+                               hwaddr size, bool readonly)
+{
+    DMAMap map = {
+        .iova = iova,
+        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
+        .perm = readonly ? IOMMU_RO : IOMMU_RW,
+    };
+    int ret;
+
+    if (vfio_have_giommu(container)) {
+        return 0;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        ret = iova_tree_insert(container->mappings, &map);
+        if (ret) {
+            if (ret == IOVA_ERR_INVALID) {
+                ret = -EINVAL;
+            } else if (ret == IOVA_ERR_OVERLAP) {
+                ret = -EEXIST;
+            }
+        }
+    }
+
+    return ret;
+}
+
+static void vfio_erase_mapping(VFIOContainer *container, hwaddr iova,
+                                hwaddr size)
+{
+    DMAMap map = {
+        .iova = iova,
+        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
+    };
+
+    if (vfio_have_giommu(container)) {
+        return;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        iova_tree_remove(container->mappings, map);
+    }
+}
+
 static int vfio_dma_unmap_bitmap(VFIOContainer *container,
                                  hwaddr iova, ram_addr_t size,
                                  IOMMUTLBEntry *iotlb)
@@ -550,6 +601,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
                                             DIRTY_CLIENTS_NOCODE);
     }
 
+    vfio_erase_mapping(container, iova, size);
+
     return 0;
 }
 
@@ -563,6 +616,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .iova = iova,
         .size = size,
     };
+    int ret;
+
+    ret = vfio_record_mapping(container, iova, size, readonly);
+    if (ret) {
+        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
+                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
+                     iova, size, ret, strerror(-ret));
+
+        return ret;
+    }
 
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -579,8 +642,12 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }
 
+    ret = -errno;
     error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
+
+    vfio_erase_mapping(container, iova, size);
+
+    return ret;
 }
 
 static void vfio_host_win_add(VFIOContainer *container,
@@ -2134,16 +2201,23 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
     QLIST_INIT(&container->vrdl_list);
+    container->mappings = iova_tree_new();
+    if (!container->mappings) {
+        error_setg(errp, "Cannot allocate DMA mappings tree");
+        ret = -ENOMEM;
+        goto free_container_exit;
+    }
+    qemu_mutex_init(&container->mappings_mutex);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }
 
     ret = vfio_ram_block_discard_disable(container, true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }
 
     switch (container->iommu_type) {
@@ -2279,6 +2353,10 @@ listener_release_exit:
 enable_discards_exit:
     vfio_ram_block_discard_disable(container, false);
 
+destroy_mappings_exit:
+    qemu_mutex_destroy(&container->mappings_mutex);
+    iova_tree_destroy(container->mappings);
+
 free_container_exit:
     g_free(container);
 
@@ -2333,6 +2411,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }
 
         trace_vfio_disconnect_container(container->fd);
+        qemu_mutex_destroy(&container->mappings_mutex);
+        iova_tree_destroy(container->mappings);
         close(container->fd);
         g_free(container);
 
-- 
2.26.3




* [PATCH 09/18] vfio/common: Add device dirty page tracking start/stop
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (7 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 18:49 ` [PATCH 10/18] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function Avihai Horon
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Add device dirty page tracking start/stop functionality. This uses the
device DMA logging uAPI to start and stop dirty page tracking by the
device.

Device dirty page tracking is used only if all devices within a
container support it.
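
Since the uAPI requires all ranges to be provided in one call, the start
path assembles a single control payload from the mappings recorded in the
previous patches. A condensed sketch (error handling and the vIOMMU case
omitted; see the diff below and its continuation for the real code):

    unsigned int nr = iova_tree_nnodes(container->mappings);
    struct vfio_device_feature_dma_logging_range *ranges =
        g_new0(struct vfio_device_feature_dma_logging_range, nr);
    struct vfio_device_feature_dma_logging_range *cursor = ranges;
    struct vfio_device_feature_dma_logging_control control = {
        .page_size = qemu_real_host_page_size(),
        .num_ranges = nr,
        .ranges = (uintptr_t)ranges,
    };

    /* One entry per recorded mapping. */
    iova_tree_foreach(container->mappings, vfio_device_dma_logging_range_add,
                      &cursor);

    /* Wrap 'control' in a vfio_device_feature with VFIO_DEVICE_FEATURE_SET |
     * VFIO_DEVICE_FEATURE_DMA_LOGGING_START and issue VFIO_DEVICE_FEATURE on
     * every device, as vfio_devices_dma_logging_set() does. */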

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/common.c              | 211 +++++++++++++++++++++++++++++++++-
 2 files changed, 211 insertions(+), 2 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d54000d7ae..cde6ffb9d6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -151,6 +151,8 @@ typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    bool dirty_pages_supported;
+    bool dirty_tracking;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fafc361cea..005c060c67 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -422,6 +422,22 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
+static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (!vbasedev->dirty_pages_supported) {
+                return false;
+            }
+        }
+    }
+
+    return true;
+}
+
 /*
  * Check if all VFIO devices are running and migration is active, which is
  * essentially equivalent to the migration being in pre-copy phase.
@@ -1355,13 +1371,192 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     return ret;
 }
 
+static int vfio_devices_dma_logging_set(VFIOContainer *container,
+                                        struct vfio_device_feature *feature)
+{
+    bool status = (feature->flags & VFIO_DEVICE_FEATURE_MASK) ==
+                  VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
+    VFIODevice *vbasedev;
+    VFIOGroup *group;
+    int ret = 0;
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->dirty_tracking == status) {
+                continue;
+            }
+
+            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+            if (ret) {
+                ret = -errno;
+                error_report("%s: Failed to set DMA logging %s, err %d (%s)",
+                             vbasedev->name, status ? "start" : "stop", ret,
+                             strerror(errno));
+                goto out;
+            }
+            vbasedev->dirty_tracking = status;
+        }
+    }
+
+out:
+    return ret;
+}
+
+static int vfio_devices_dma_logging_stop(VFIOContainer *container)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+    feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_SET;
+    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
+
+    return vfio_devices_dma_logging_set(container, feature);
+}
+
+static gboolean vfio_device_dma_logging_range_add(DMAMap *map, gpointer data)
+{
+    struct vfio_device_feature_dma_logging_range **out = data;
+    struct vfio_device_feature_dma_logging_range *range = *out;
+
+    range->iova = map->iova;
+    /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
+    range->length = map->size + 1;
+
+    *out = ++range;
+
+    return false;
+}
+
+static gboolean vfio_iova_tree_get_first(DMAMap *map, gpointer data)
+{
+    DMAMap *first = data;
+
+    first->iova = map->iova;
+    first->size = map->size;
+
+    return true;
+}
+
+static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
+{
+    DMAMap *last = data;
+
+    last->iova = map->iova;
+    last->size = map->size;
+
+    return false;
+}
+
+static struct vfio_device_feature *
+vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
+{
+    struct vfio_device_feature *feature;
+    size_t feature_size;
+    struct vfio_device_feature_dma_logging_control *control;
+    struct vfio_device_feature_dma_logging_range *ranges;
+    unsigned int max_ranges;
+    unsigned int cur_ranges;
+
+    feature_size = sizeof(struct vfio_device_feature) +
+                   sizeof(struct vfio_device_feature_dma_logging_control);
+    feature = g_malloc0(feature_size);
+    feature->argsz = feature_size;
+    feature->flags = VFIO_DEVICE_FEATURE_SET;
+    feature->flags |= VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
+
+    control = (struct vfio_device_feature_dma_logging_control *)feature->data;
+    control->page_size = qemu_real_host_page_size();
+
+    QEMU_LOCK_GUARD(&container->mappings_mutex);
+
+    /*
+     * The DMA logging uAPI guarantees to support at least a number of ranges
+     * that fits into a single host kernel page. To be on the safe side, use
+     * this as a limit above which the mappings are merged into a single range.
+     */
+    max_ranges = qemu_real_host_page_size() / sizeof(*ranges);
+    cur_ranges = iova_tree_nnodes(container->mappings);
+    control->num_ranges = (cur_ranges <= max_ranges) ? cur_ranges : 1;
+    ranges = g_try_new0(struct vfio_device_feature_dma_logging_range,
+                        control->num_ranges);
+    if (!ranges) {
+        g_free(feature);
+        errno = ENOMEM;
+
+        return NULL;
+    }
+
+    control->ranges = (uint64_t)ranges;
+    if (cur_ranges <= max_ranges) {
+        iova_tree_foreach(container->mappings,
+                          vfio_device_dma_logging_range_add, &ranges);
+    } else {
+        DMAMap first, last;
+
+        iova_tree_foreach(container->mappings, vfio_iova_tree_get_first,
+                          &first);
+        iova_tree_foreach(container->mappings, vfio_iova_tree_get_last, &last);
+        ranges->iova = first.iova;
+        /* IOVATree is inclusive, DMA logging uAPI isn't, so add 1 to length */
+        ranges->length = (last.iova - first.iova) + last.size + 1;
+    }
+
+    return feature;
+}
+
+static void vfio_device_feature_dma_logging_start_destroy(
+    struct vfio_device_feature *feature)
+{
+    struct vfio_device_feature_dma_logging_control *control =
+        (struct vfio_device_feature_dma_logging_control *)feature->data;
+    struct vfio_device_feature_dma_logging_range *ranges =
+        (struct vfio_device_feature_dma_logging_range *)control->ranges;
+
+    g_free(ranges);
+    g_free(feature);
+}
+
+static int vfio_devices_dma_logging_start(VFIOContainer *container)
+{
+    struct vfio_device_feature *feature;
+    int ret;
+
+    feature = vfio_device_feature_dma_logging_start_create(container);
+    if (!feature) {
+        return -errno;
+    }
+
+    ret = vfio_devices_dma_logging_set(container, feature);
+    if (ret) {
+        vfio_devices_dma_logging_stop(container);
+    }
+
+    vfio_device_feature_dma_logging_start_destroy(feature);
+
+    return ret;
+}
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;
 
-    ret = vfio_set_dirty_page_tracking(container, true);
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        if (vfio_have_giommu(container)) {
+            /* Device dirty page tracking currently doesn't support vIOMMU */
+            return;
+        }
+
+        ret = vfio_devices_dma_logging_start(container);
+    } else {
+        ret = vfio_set_dirty_page_tracking(container, true);
+    }
+
     if (ret) {
+        error_report("vfio: Could not start dirty page tracking, err: %d (%s)",
+                     ret, strerror(-ret));
         vfio_set_migration_error(ret);
     }
 }
@@ -1371,8 +1566,20 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;
 
-    ret = vfio_set_dirty_page_tracking(container, false);
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        if (vfio_have_giommu(container)) {
+            /* Device dirty page tracking currently doesn't support vIOMMU */
+            return;
+        }
+
+        ret = vfio_devices_dma_logging_stop(container);
+    } else {
+        ret = vfio_set_dirty_page_tracking(container, false);
+    }
+
     if (ret) {
+        error_report("vfio: Could not stop dirty page tracking, err: %d (%s)",
+                     ret, strerror(-ret));
         vfio_set_migration_error(ret);
     }
 }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/18] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (8 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 09/18] vfio/common: Add device dirty page tracking start/stop Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 18:49 ` [PATCH 11/18] vfio/common: Add device dirty page bitmap sync Avihai Horon
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Extract the VFIO_IOMMU_DIRTY_PAGES ioctl code in vfio_get_dirty_bitmap()
to its own function.

This will help make the code more readable after the next patch adds
device dirty page bitmap sync functionality.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 53 ++++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 005c060c67..3caa73d6f7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1584,26 +1584,13 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     }
 }
 
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr)
+static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
+                                   hwaddr iova, hwaddr size)
 {
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
-    VFIOBitmap *vbmap;
     int ret;
 
-    if (!container->dirty_pages_supported) {
-        cpu_physical_memory_set_dirty_range(ram_addr, size,
-                                            tcg_enabled() ? DIRTY_CLIENTS_ALL :
-                                            DIRTY_CLIENTS_NOCODE);
-        return 0;
-    }
-
-    vbmap = vfio_bitmap_alloc(size);
-    if (!vbmap) {
-        return -errno;
-    }
-
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1627,16 +1614,42 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
                 " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
                 (uint64_t)range->size, errno);
-        goto err_out;
+    }
+
+    g_free(dbitmap);
+
+    return ret;
+}
+
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr)
+{
+    VFIOBitmap *vbmap;
+    int ret;
+
+    if (!container->dirty_pages_supported) {
+        cpu_physical_memory_set_dirty_range(ram_addr, size,
+                                            tcg_enabled() ? DIRTY_CLIENTS_ALL :
+                                            DIRTY_CLIENTS_NOCODE);
+        return 0;
+    }
+
+    vbmap = vfio_bitmap_alloc(size);
+    if (!vbmap) {
+        return -errno;
+    }
+
+    ret = vfio_query_dirty_bitmap(container, vbmap, iova, size);
+    if (ret) {
+        goto out;
     }
 
     cpu_physical_memory_set_dirty_lebitmap(vbmap->bitmap, ram_addr,
                                            vbmap->pages);
 
-    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
-                                range->bitmap.size, ram_addr);
-err_out:
-    g_free(dbitmap);
+    trace_vfio_get_dirty_bitmap(container->fd, iova, size, vbmap->size,
+                                ram_addr);
+out:
     vfio_bitmap_dealloc(vbmap);
 
     return ret;
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/18] vfio/common: Add device dirty page bitmap sync
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (9 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 10/18] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-27 23:37   ` Alex Williamson
  2023-01-26 18:49 ` [PATCH 12/18] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Avihai Horon
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Add device dirty page bitmap sync functionality. This uses the device
DMA logging uAPI to sync the dirty page bitmap from the device.

Device dirty page bitmap sync is used only if all devices within a
container support device dirty page tracking.
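
For reference, the consumer side of the report uAPI is roughly as sketched
below (illustrative name, single device fd, single range; QEMU's DIV_ROUND_UP
and glib helpers assumed). The caller allocates a bitmap with one bit per
page_size page, rounded up to 64-bit words, and passes its address to the
VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT feature:

    /*
     * Caller allocates the bitmap, e.g.:
     *   pages  = DIV_ROUND_UP(length, page_size);
     *   bitmap = g_malloc0(DIV_ROUND_UP(pages, 64) * sizeof(uint64_t));
     */
    static int dma_logging_report_one_range(int device_fd, uint64_t iova,
                                            uint64_t length, uint64_t page_size,
                                            uint64_t *bitmap)
    {
        uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
                          sizeof(struct vfio_device_feature_dma_logging_report),
                          sizeof(uint64_t))] = {};
        struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
        struct vfio_device_feature_dma_logging_report *report =
            (struct vfio_device_feature_dma_logging_report *)feature->data;

        report->iova = iova;
        report->length = length;
        report->page_size = page_size;
        report->bitmap = (uintptr_t)bitmap;

        feature->argsz = sizeof(buf);
        feature->flags = VFIO_DEVICE_FEATURE_GET |
                         VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;

        /* On success, bitmap has one set bit per page the device wrote to */
        return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature) ? -errno : 0;
    }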

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 93 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 82 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3caa73d6f7..0003f2421d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,6 +355,9 @@ static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
     g_free(vbmap);
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr);
+
 bool vfio_mig_active(void)
 {
     VFIOGroup *group;
@@ -582,10 +585,19 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .iova = iova,
         .size = size,
     };
+    int ret;
 
-    if (iotlb && container->dirty_pages_supported &&
-        vfio_devices_all_running_and_mig_active(container)) {
-        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
+        if (!vfio_devices_all_device_dirty_tracking(container) &&
+            container->dirty_pages_supported) {
+            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+        }
+
+        ret = vfio_get_dirty_bitmap(container, iova, size,
+                                    iotlb->translated_addr);
+        if (ret) {
+            return ret;
+        }
     }
 
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
@@ -611,12 +623,6 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return -errno;
     }
 
-    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
-        cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
-                                            tcg_enabled() ? DIRTY_CLIENTS_ALL :
-                                            DIRTY_CLIENTS_NOCODE);
-    }
-
     vfio_erase_mapping(container, iova, size);
 
     return 0;
@@ -1584,6 +1590,65 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     }
 }
 
+static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
+                                          hwaddr size, void *bitmap)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                        sizeof(struct vfio_device_feature_dma_logging_report),
+                        sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+    struct vfio_device_feature_dma_logging_report *report =
+        (struct vfio_device_feature_dma_logging_report *)feature->data;
+
+    report->iova = iova;
+    report->length = size;
+    report->page_size = qemu_real_host_page_size();
+    report->bitmap = (uint64_t)bitmap;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        return -errno;
+    }
+
+    return 0;
+}
+
+static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
+                                           VFIOBitmap *vbmap, hwaddr iova,
+                                           hwaddr size)
+{
+    VFIODevice *vbasedev;
+    VFIOGroup *group;
+    int ret;
+
+    if (vfio_have_giommu(container)) {
+        /* Device dirty page tracking currently doesn't support vIOMMU */
+        bitmap_set(vbmap->bitmap, 0, vbmap->pages);
+
+        return 0;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
+                                                 vbmap->bitmap);
+            if (ret) {
+                error_report("%s: Failed to get DMA logging report, iova: "
+                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
+                             ", err: %d (%s)",
+                             vbasedev->name, iova, size, ret, strerror(-ret));
+
+                return ret;
+            }
+        }
+    }
+
+    return 0;
+}
+
 static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
                                    hwaddr iova, hwaddr size)
 {
@@ -1627,7 +1692,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
     VFIOBitmap *vbmap;
     int ret;
 
-    if (!container->dirty_pages_supported) {
+    if (!container->dirty_pages_supported &&
+        !vfio_devices_all_device_dirty_tracking(container)) {
         cpu_physical_memory_set_dirty_range(ram_addr, size,
                                             tcg_enabled() ? DIRTY_CLIENTS_ALL :
                                             DIRTY_CLIENTS_NOCODE);
@@ -1639,7 +1705,12 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         return -errno;
     }
 
-    ret = vfio_query_dirty_bitmap(container, vbmap, iova, size);
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        ret = vfio_devices_query_dirty_bitmap(container, vbmap, iova, size);
+    } else {
+        ret = vfio_query_dirty_bitmap(container, vbmap, iova, size);
+    }
+
     if (ret) {
         goto out;
     }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 12/18] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (10 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 11/18] vfio/common: Add device dirty page bitmap sync Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 18:49 ` [PATCH 13/18] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Extract vIOMMU code from vfio_sync_dirty_bitmap() to a new function and
restructure the code.

This is done as preparation for the following patches which will add
vIOMMU support to device dirty page tracking. No functional changes
intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 63 +++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0003f2421d..9792c2c935 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1811,37 +1811,50 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
                                                 &vrdl);
 }
 
+static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
+                                        MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu;
+    bool found = false;
+    Int128 llend;
+    vfio_giommu_dirty_notifier gdn;
+    int idx;
+
+    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+        if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
+            giommu->n.start == section->offset_within_region) {
+            found = true;
+            break;
+        }
+    }
+
+    if (!found) {
+        return 0;
+    }
+
+    gdn.giommu = giommu;
+    idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
+                                             MEMTXATTRS_UNSPECIFIED);
+
+    llend = int128_add(int128_make64(section->offset_within_region),
+                       section->size);
+    llend = int128_sub(llend, int128_one());
+
+    iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
+                        section->offset_within_region, int128_get64(llend),
+                        idx);
+    memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+
+    return 0;
+}
+
 static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                                   MemoryRegionSection *section)
 {
     ram_addr_t ram_addr;
 
     if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
-                giommu->n.start == section->offset_within_region) {
-                Int128 llend;
-                vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
-                                                       MEMTXATTRS_UNSPECIFIED);
-
-                llend = int128_add(int128_make64(section->offset_within_region),
-                                   section->size);
-                llend = int128_sub(llend, int128_one());
-
-                iommu_notifier_init(&gdn.n,
-                                    vfio_iommu_map_dirty_notify,
-                                    IOMMU_NOTIFIER_MAP,
-                                    section->offset_within_region,
-                                    int128_get64(llend),
-                                    idx);
-                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
-                break;
-            }
-        }
-        return 0;
+        return vfio_sync_iommu_dirty_bitmap(container, section);
     } else if (memory_region_has_ram_discard_manager(section->mr)) {
         return vfio_sync_ram_discard_listener_dirty_bitmap(container, section);
     }
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 13/18] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (11 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 12/18] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-09 22:16   ` Peter Xu
  2023-01-26 18:49 ` [PATCH 14/18] intel-iommu: Implement get_attr() method Avihai Horon
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Add a new IOMMU attribute IOMMU_ATTR_MAX_IOVA which indicates the
maximal IOVA that an IOMMU can use.

This attribute will be used by VFIO device dirty page tracking so it can
track the entire IOVA space when needed (i.e. when vIOMMU is enabled).
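
For illustration, a consumer can query the attribute through the existing
memory_region_iommu_get_attr() wrapper, roughly as follows (variable names are
illustrative; the actual VFIO caller is added in a later patch):

    hwaddr max_iova;

    if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MAX_IOVA,
                                      &max_iova)) {
        /* max_iova is the highest IOVA the vIOMMU can map, inclusive */
    }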

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/exec/memory.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c37ffdbcd1..910067a3a5 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -290,7 +290,8 @@ typedef struct MemoryRegionClass {
 
 
 enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_MAX_IOVA,
 };
 
 /*
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 14/18] intel-iommu: Implement get_attr() method
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (12 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 13/18] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-02-09 22:18   ` Peter Xu
  2023-01-26 18:49 ` [PATCH 15/18] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Implement the get_attr() method and use the address width property to report
the IOMMU_ATTR_MAX_IOVA attribute.
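
For reference, with the formula used below this works out to, e.g.:

    aw_bits = 39  ->  max_iova = (1ULL << 39) - 1 = 0x7fffffffff    (512 GiB - 1)
    aw_bits = 48  ->  max_iova = (1ULL << 48) - 1 = 0xffffffffffff  (256 TiB - 1)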

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/i386/intel_iommu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 98a5c304a7..b0068b0df4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3841,6 +3841,23 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     return;
 }
 
+static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
+                              enum IOMMUMemoryRegionAttr attr, void *data)
+{
+    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+
+    if (attr == IOMMU_ATTR_MAX_IOVA) {
+        hwaddr *max_iova = data;
+
+        *max_iova = (1ULL << s->aw_bits) - 1;
+
+        return 0;
+    }
+
+    return -EINVAL;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -4173,6 +4190,7 @@ static void vtd_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = vtd_iommu_translate;
     imrc->notify_flag_changed = vtd_iommu_notify_flag_changed;
     imrc->replay = vtd_iommu_replay;
+    imrc->get_attr = vtd_iommu_get_attr;
 }
 
 static const TypeInfo vtd_iommu_memory_region_info = {
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 15/18] vfio/common: Support device dirty page tracking with vIOMMU
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (13 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 14/18] intel-iommu: Implement get_attr() method Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 18:49 ` [PATCH 16/18] vfio/common: Optimize " Avihai Horon
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Currently, device dirty page tracking with vIOMMU is not supported - RAM
pages are perpetually marked dirty in this case.

When vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly as
the vIOMMU maps/unmaps them. These IOVA ranges can potentially be mapped
anywhere in the vIOMMU IOVA space.

Due to this dynamic nature of vIOMMU mapping/unmapping, tracking only
the currently mapped IOVA ranges, as done in the non-vIOMMU case,
doesn't work very well.

Instead, to support device dirty tracking when vIOMMU is enabled, track
the entire vIOMMU IOVA space. If that fails (IOVA space can be rather
big and we might hit a HW limitation), try tracking a smaller range while
marking untracked ranges dirty.
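
As a worked example of the retry scheme (assuming a 39-bit vIOMMU, i.e. max
IOVA 0x7fffffffff, 16 GiB of guest RAM and 4 KiB host pages):

    attempt 1: entire vIOMMU IOVA space                -> 0x8000000000 (512 GiB)
    attempt 2: min(VFIO_GIOMMU_RETRY_IOVA, space / 2)  -> 0x4000000000 (256 GiB)
    attempt 3: min(RAM size, attempt 2 size / 2)       -> 0x0400000000 (16 GiB)

If all three attempts fail, starting dirty tracking fails and the error is
propagated to the caller.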

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/common.c              | 153 ++++++++++++++++++++++++++++++----
 2 files changed, 138 insertions(+), 17 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index cde6ffb9d6..15109c311d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -97,6 +97,8 @@ typedef struct VFIOContainer {
     unsigned int dma_max_mappings;
     IOVATree *mappings;
     QemuMutex mappings_mutex;
+    /* The vIOMMU IOVA range tracked by the device: [0, giommu_tracked_range) */
+    hwaddr giommu_tracked_range;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9792c2c935..c3a27cbbd5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -44,6 +44,8 @@
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 #include "qemu/iova-tree.h"
+#include "hw/boards.h"
+#include "hw/mem/memory-device.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -377,6 +379,38 @@ bool vfio_mig_active(void)
     return true;
 }
 
+static uint64_t vfio_get_ram_size(void)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    uint64_t plugged_size;
+
+    plugged_size = get_plugged_memory_size();
+    if (plugged_size == (uint64_t)-1) {
+        plugged_size = 0;
+    }
+
+    return ms->ram_size + plugged_size;
+}
+
+static int vfio_giommu_get_max_iova(VFIOContainer *container, hwaddr *max_iova)
+{
+    VFIOGuestIOMMU *giommu;
+    int ret;
+
+    giommu = QLIST_FIRST(&container->giommu_list);
+    if (!giommu) {
+        return -ENOENT;
+    }
+
+    ret = memory_region_iommu_get_attr(giommu->iommu_mr, IOMMU_ATTR_MAX_IOVA,
+                                       max_iova);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
 static bool vfio_have_giommu(VFIOContainer *container)
 {
     return !QLIST_EMPTY(&container->giommu_list);
@@ -1456,7 +1490,8 @@ static gboolean vfio_iova_tree_get_last(DMAMap *map, gpointer data)
 }
 
 static struct vfio_device_feature *
-vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
+vfio_device_feature_dma_logging_start_create(VFIOContainer *container,
+                                             bool giommu)
 {
     struct vfio_device_feature *feature;
     size_t feature_size;
@@ -1475,6 +1510,16 @@ vfio_device_feature_dma_logging_start_create(VFIOContainer *container)
     control = (struct vfio_device_feature_dma_logging_control *)feature->data;
     control->page_size = qemu_real_host_page_size();
 
+    if (giommu) {
+        ranges = g_malloc0(sizeof(*ranges));
+        ranges->iova = 0;
+        ranges->length = container->giommu_tracked_range;
+        control->num_ranges = 1;
+        control->ranges = (uint64_t)ranges;
+
+        return feature;
+    }
+
     QEMU_LOCK_GUARD(&container->mappings_mutex);
 
     /*
@@ -1524,12 +1569,12 @@ static void vfio_device_feature_dma_logging_start_destroy(
     g_free(feature);
 }
 
-static int vfio_devices_dma_logging_start(VFIOContainer *container)
+static int vfio_devices_dma_logging_start(VFIOContainer *container, bool giommu)
 {
     struct vfio_device_feature *feature;
     int ret;
 
-    feature = vfio_device_feature_dma_logging_start_create(container);
+    feature = vfio_device_feature_dma_logging_start_create(container, giommu);
     if (!feature) {
         return -errno;
     }
@@ -1544,18 +1589,85 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
     return ret;
 }
 
+/*
+ * This value is used in the second attempt to start device dirty tracking with
+ * vIOMMU, if the first attempt fails. It should be in the middle, not too big
+ * and not too small, allowing devices with HW limitations to do device dirty
+ * tracking while covering a fair amount of the IOVA space.
+ *
+ * This arbitrary value was chosen because it is the minimum Intel IOMMU max
+ * IOVA and the mlx5 device supports tracking a range of this size.
+ */
+#define VFIO_GIOMMU_RETRY_IOVA ((1ULL << 39) - 1)
+
+#define VFIO_GIOMMU_RETRY_COUNT 3
+static int vfio_devices_start_dirty_page_tracking(VFIOContainer *container)
+{
+    hwaddr giommu_max_iova, iova_size, iova_retry_size, ram_size;
+    hwaddr iova_to_track[VFIO_GIOMMU_RETRY_COUNT] = {};
+    int ret;
+    int i;
+
+    if (!vfio_have_giommu(container)) {
+        return vfio_devices_dma_logging_start(container, false);
+    }
+
+    /*
+     * With vIOMMU we try to track the entire IOVA space. As the IOVA space can
+     * be rather big, devices might not be able to track it due to HW
+     * limitations. Therefore, retry tracking smaller ranges as follows:
+     * (1) Retry tracking a smaller part of the IOVA space.
+     * (2) Retry tracking a range the size of the physical memory.
+     * (3) If all fail, give up.
+     */
+    ret = vfio_giommu_get_max_iova(container, &giommu_max_iova);
+    if (!ret && !REAL_HOST_PAGE_ALIGN(giommu_max_iova)) {
+        giommu_max_iova -= qemu_real_host_page_size();
+    }
+
+    iova_size = ret ? 0 : giommu_max_iova;
+    iova_retry_size = iova_size ? MIN(VFIO_GIOMMU_RETRY_IOVA, iova_size / 2) :
+                                  VFIO_GIOMMU_RETRY_IOVA;
+    ram_size = vfio_get_ram_size();
+
+    iova_to_track[0] = REAL_HOST_PAGE_ALIGN(iova_size);
+    iova_to_track[1] = REAL_HOST_PAGE_ALIGN(iova_retry_size);
+    iova_to_track[2] = REAL_HOST_PAGE_ALIGN(MIN(ram_size, iova_retry_size / 2));
+
+    for (i = 0; i < VFIO_GIOMMU_RETRY_COUNT; i++) {
+        if (!iova_to_track[i]) {
+            continue;
+        }
+
+        container->giommu_tracked_range = iova_to_track[i];
+        ret = vfio_devices_dma_logging_start(container, true);
+        if (!ret) {
+            break;
+        }
+
+        if (i < VFIO_GIOMMU_RETRY_COUNT - 1) {
+            warn_report("Failed to start device dirty tracking with vIOMMU "
+                        "with range of size 0x%" HWADDR_PRIx
+                        ", err: %d. Retrying with range "
+                        "of size 0x%" HWADDR_PRIx,
+                        iova_to_track[i], ret, iova_to_track[i + 1]);
+        } else {
+            error_report("Failed to start device dirty tracking with vIOMMU "
+                         "with range of size 0x%" HWADDR_PRIx ", err: %d",
+                         iova_to_track[i], ret);
+        }
+    }
+
+    return ret;
+}
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;
 
     if (vfio_devices_all_device_dirty_tracking(container)) {
-        if (vfio_have_giommu(container)) {
-            /* Device dirty page tracking currently doesn't support vIOMMU */
-            return;
-        }
-
-        ret = vfio_devices_dma_logging_start(container);
+        ret = vfio_devices_start_dirty_page_tracking(container);
     } else {
         ret = vfio_set_dirty_page_tracking(container, true);
     }
@@ -1573,11 +1685,6 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     int ret;
 
     if (vfio_devices_all_device_dirty_tracking(container)) {
-        if (vfio_have_giommu(container)) {
-            /* Device dirty page tracking currently doesn't support vIOMMU */
-            return;
-        }
-
         ret = vfio_devices_dma_logging_stop(container);
     } else {
         ret = vfio_set_dirty_page_tracking(container, false);
@@ -1616,6 +1723,17 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
     return 0;
 }
 
+static bool vfio_iommu_range_is_device_tracked(VFIOContainer *container,
+                                               hwaddr iova, hwaddr size)
+{
+    /* Check overflow */
+    if (iova + size < iova) {
+        return false;
+    }
+
+    return iova + size <= container->giommu_tracked_range;
+}
+
 static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
                                            VFIOBitmap *vbmap, hwaddr iova,
                                            hwaddr size)
@@ -1625,10 +1743,11 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
     int ret;
 
     if (vfio_have_giommu(container)) {
-        /* Device dirty page tracking currently doesn't support vIOMMU */
-        bitmap_set(vbmap->bitmap, 0, vbmap->pages);
+        if (!vfio_iommu_range_is_device_tracked(container, iova, size)) {
+            bitmap_set(vbmap->bitmap, 0, vbmap->pages);
 
-        return 0;
+            return 0;
+        }
     }
 
     QLIST_FOREACH(group, &container->group_list, container_next) {
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 16/18] vfio/common: Optimize device dirty page tracking with vIOMMU
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (14 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 15/18] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 18:49 ` [PATCH 17/18] vfio/migration: Query device dirty page tracking support Avihai Horon
  2023-01-26 18:49 ` [PATCH 18/18] docs/devel: Document VFIO device dirty page tracking Avihai Horon
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

When vIOMMU is enabled, syncing dirty page bitmaps is done by replaying
the vIOMMU mappings and querying the dirty bitmap for each mapping.

With device dirty tracking this causes a lot of overhead, since the HW
is queried many times (even with a small idle guest this can end up with
thousands of calls to the HW).

Optimize this by de-coupling dirty bitmap query from vIOMMU replay.
Now a single dirty bitmap is queried per vIOMMU MR section, which is
then used for all corresponding vIOMMU mappings within that MR section.
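
The per-mapping slice is then taken out of the section-wide bitmap with a
plain offset copy, roughly (illustrative names; the real code is
vfio_iommu_set_dirty_bitmap() in the diff below):

    /* Bit offset of this mapping within the section-wide bitmap */
    copy_offset = (mapping_iova - tracked_start_iova) /
                  qemu_real_host_page_size();

    /* Copy just this mapping's bits and mark the matching RAM pages dirty */
    bitmap_copy_with_src_offset(mapping_bitmap, section_bitmap, copy_offset,
                                mapping_pages);
    cpu_physical_memory_set_dirty_lebitmap(mapping_bitmap, ram_addr,
                                           mapping_pages);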

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c3a27cbbd5..4f27cd669f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1848,8 +1848,42 @@ out:
 typedef struct {
     IOMMUNotifier n;
     VFIOGuestIOMMU *giommu;
+    VFIOBitmap *vbmap;
 } vfio_giommu_dirty_notifier;
 
+static int vfio_iommu_set_dirty_bitmap(VFIOContainer *container,
+                                       vfio_giommu_dirty_notifier *gdn,
+                                       hwaddr iova, hwaddr size,
+                                       ram_addr_t ram_addr)
+{
+    VFIOBitmap *vbmap = gdn->vbmap;
+    VFIOBitmap *dst_vbmap;
+    hwaddr start_iova = REAL_HOST_PAGE_ALIGN(gdn->n.start);
+    hwaddr copy_offset;
+
+    dst_vbmap = vfio_bitmap_alloc(size);
+    if (!dst_vbmap) {
+        return -errno;
+    }
+
+    if (!vfio_iommu_range_is_device_tracked(container, iova, size)) {
+        bitmap_set(dst_vbmap->bitmap, 0, dst_vbmap->pages);
+
+        goto out;
+    }
+
+    copy_offset = (iova - start_iova) / qemu_real_host_page_size();
+    bitmap_copy_with_src_offset(dst_vbmap->bitmap, vbmap->bitmap, copy_offset,
+                                dst_vbmap->pages);
+
+out:
+    cpu_physical_memory_set_dirty_lebitmap(dst_vbmap->bitmap, ram_addr,
+                                           dst_vbmap->pages);
+    vfio_bitmap_dealloc(dst_vbmap);
+
+    return 0;
+}
+
 static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     vfio_giommu_dirty_notifier *gdn = container_of(n,
@@ -1870,8 +1904,15 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 
     rcu_read_lock();
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
+        if (gdn->vbmap) {
+            ret = vfio_iommu_set_dirty_bitmap(container, gdn, iova,
+                                              iotlb->addr_mask + 1,
+                                              translated_addr);
+        } else {
+            ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+                                        translated_addr);
+        }
+
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%s)",
@@ -1935,6 +1976,7 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
 {
     VFIOGuestIOMMU *giommu;
     bool found = false;
+    VFIOBitmap *vbmap = NULL;
     Int128 llend;
     vfio_giommu_dirty_notifier gdn;
     int idx;
@@ -1952,6 +1994,7 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
     }
 
     gdn.giommu = giommu;
+    gdn.vbmap = NULL;
     idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
                                              MEMTXATTRS_UNSPECIFIED);
 
@@ -1959,11 +2002,49 @@ static int vfio_sync_iommu_dirty_bitmap(VFIOContainer *container,
                        section->size);
     llend = int128_sub(llend, int128_one());
 
+    /*
+     * Optimize device dirty tracking if the MR section is at least partially
+     * tracked. Optimization is done by querying a single dirty bitmap for the
+     * entire range instead of querying dirty bitmap for each vIOMMU mapping.
+     */
+    if (vfio_devices_all_device_dirty_tracking(container)) {
+        hwaddr start = REAL_HOST_PAGE_ALIGN(section->offset_within_region);
+        hwaddr end = int128_get64(llend);
+        hwaddr size;
+        int ret;
+
+        if (start >= container->giommu_tracked_range) {
+            goto notifier_init;
+        }
+
+        size = REAL_HOST_PAGE_ALIGN(
+            MIN(container->giommu_tracked_range - 1, end) - start);
+
+        vbmap = vfio_bitmap_alloc(size);
+        if (!vbmap) {
+            return -errno;
+        }
+
+        ret = vfio_devices_query_dirty_bitmap(container, vbmap, start, size);
+        if (ret) {
+            vfio_bitmap_dealloc(vbmap);
+
+            return ret;
+        }
+
+        gdn.vbmap = vbmap;
+    }
+
+notifier_init:
     iommu_notifier_init(&gdn.n, vfio_iommu_map_dirty_notify, IOMMU_NOTIFIER_MAP,
                         section->offset_within_region, int128_get64(llend),
                         idx);
     memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
 
+    if (vbmap) {
+        vfio_bitmap_dealloc(vbmap);
+    }
+
     return 0;
 }
 
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 17/18] vfio/migration: Query device dirty page tracking support
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (15 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 16/18] vfio/common: Optimize " Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  2023-01-26 18:49 ` [PATCH 18/18] docs/devel: Document VFIO device dirty page tracking Avihai Horon
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

From: Joao Martins <joao.m.martins@oracle.com>

Now that everything has been set up for device dirty page tracking,
query the device for device dirty page tracking support.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2a0a663023..5aeda47345 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -658,6 +658,19 @@ static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
     return 0;
 }
 
+static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_PROBE | VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
+
+    return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev)
 {
     int ret;
@@ -693,6 +706,8 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->data_fd = -1;
     migration->mig_flags = mig_flags;
 
+    vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
+
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
         path = g_strdup_printf("%s/vfio", oid);
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 18/18] docs/devel: Document VFIO device dirty page tracking
  2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
                   ` (16 preceding siblings ...)
  2023-01-26 18:49 ` [PATCH 17/18] vfio/migration: Query device dirty page tracking support Avihai Horon
@ 2023-01-26 18:49 ` Avihai Horon
  17 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-01-26 18:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Adjust the VFIO dirty page tracking documentation and add a section to
describe device dirty page tracking.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 50 ++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 51f5e1a537..c02f62e534 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -63,22 +63,37 @@ System memory dirty pages tracking
 ----------------------------------
 
 A ``log_global_start`` and ``log_global_stop`` memory listener callback informs
-the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync``
-memory listener callback marks those system memory pages as dirty which are
-used for DMA by the VFIO device. The dirty pages bitmap is queried per
-container. All pages pinned by the vendor driver through external APIs have to
-be marked as dirty during migration. When there are CPU writes, CPU dirty page
-tracking can identify dirtied pages, but any page pinned by the vendor driver
-can also be written by the device. There is currently no device or IOMMU
-support for dirty page tracking in hardware.
+the VFIO dirty tracking module to start and stop dirty page tracking. A
+``log_sync`` memory listener callback queries the dirty page bitmap from the
+dirty tracking module and marks system memory pages which were DMA-ed by the
+VFIO device as dirty. The dirty page bitmap is queried per container.
+
+Currently there are two ways dirty page tracking can be done:
+(1) Device dirty tracking:
+In this method, the device is responsible for logging and reporting its DMAs.
+This method can be used only if the device is capable of tracking its DMAs.
+Discovering device capability, starting and stopping dirty tracking, and
+syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
+More info about the uAPI can be found in the comments of the
+``vfio_device_feature_dma_logging_control`` and
+``vfio_device_feature_dma_logging_report`` structures in the header file
+linux-headers/linux/vfio.h.
+
+(2) VFIO IOMMU module:
+In this method, dirty tracking is done by the IOMMU. However, there is currently no
+IOMMU support for dirty page tracking. For this reason, all pages are
+perpetually marked dirty, unless the device driver pins pages through external
+APIs in which case only those pinned pages are perpetually marked dirty.
+
+If the above two methods are not supported, all pages are perpetually marked
+dirty by QEMU.
 
 By default, dirty pages are tracked during pre-copy as well as stop-and-copy
-phase. So, a page pinned by the vendor driver will be copied to the destination
-in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
-it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
-finding dirty pages continuously, then it understands that even in stop-and-copy
-phase, it is likely to find dirty pages and can predict the downtime
-accordingly.
+phase. So, a page marked as dirty will be copied to the destination in both
+phases. Copying dirty pages in pre-copy phase helps QEMU to predict if it can
+achieve its downtime tolerances. If QEMU during pre-copy phase keeps finding
+dirty pages continuously, then it understands that even in stop-and-copy phase,
+it is likely to find dirty pages and can predict the downtime accordingly.
 
 QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
 which disables querying the dirty bitmap during pre-copy phase. If it is set to
@@ -89,10 +104,9 @@ System memory dirty pages tracking when vIOMMU is enabled
 ---------------------------------------------------------
 
 With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
-phase of migration. In that case, the unmap ioctl returns any dirty pages in
-that range and QEMU reports corresponding guest physical pages dirty. During
-stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
-pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
+phase of migration. In that case, the dirty page bitmap for this range is queried
+and synced with QEMU. During stop-and-copy phase, an IOMMU notifier is used to
+get a callback for mapped pages and then the dirty page bitmap is fetched for those
 mapped ranges.
 
 Flow of state changes during Live migration
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-01-26 18:49 ` [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-01-26 23:52   ` Alex Williamson
  2023-01-31 12:44     ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-01-26 23:52 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, 26 Jan 2023 20:49:31 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
> 
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
> 
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  docs/devel/vfio-migration.rst |  29 ++++++---
>  include/hw/vfio/vfio-common.h |   3 +
>  hw/vfio/common.c              |   8 ++-
>  hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events          |   5 +-
>  5 files changed, 140 insertions(+), 17 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index 1d50c2fe5f..51f5e1a537 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>  destination host. This document details how saving and restoring of VFIO
>  devices is done in QEMU.
>  
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, this
> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>  
>  A detailed description of the UAPI for VFIO device migration can be found in
>  the comment for the ``vfio_device_mig_state`` structure in the header file
> @@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach as follows:
>    driver, which indicates the amount of data that the vendor driver has yet to
>    save for the VFIO device.
>  
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> +  active only if the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver during iterative phase.
> +
>  * A ``save_state`` function to save the device config space if it is present.
>  
>  * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -91,8 +99,10 @@ Flow of state changes during Live migration
>  ===========================================
>  
>  Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
>  the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>  
>  Live migration save path
>  ------------------------
> @@ -104,11 +114,12 @@ Live migration save path
>                                    |
>                       migrate_init spawns migration_thread
>                  Migration thread then calls each device's .save_setup()
> -                       (RUNNING, _SETUP, _RUNNING)
> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>                                    |
> -                      (RUNNING, _ACTIVE, _RUNNING)
> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>               If device is active, get pending_bytes by .save_live_pending()
>            If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  [Data of VFIO device for pre-copy phase is copied]
>          Iterate till total pending bytes converge and are less than threshold
>                                    |
>    On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5f8e7a02fe..88c2194fb9 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -67,7 +67,10 @@ typedef struct VFIOMigration {
>      int data_fd;
>      void *data_buffer;
>      size_t data_buffer_size;
> +    uint64_t mig_flags;
>      uint64_t stop_copy_size;
> +    uint64_t precopy_init_size;
> +    uint64_t precopy_dirty_size;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9a0dbee6b4..93b18c5e3d 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  
>              if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>                  (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {

Should this just turn into a test that we're not in STOP_COPY?

>                  return false;
>              }
>          }
> @@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>              }
>  
>              if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
>                  continue;
>              } else {
>                  return false;

Hmm, this only seems to highlight that between this series and the
previous, we're adding tests for states that we never actually use, ie.
these _P2P states.

IIRC, the reason we have these _P2P states is so that we can transition
a set of devices, which may have active P2P DMA between them, to STOP,
STOP_COPY, and even RUNNING states safely without lost data given that
we cannot simultaneously transition all devices.  That suggests that
missing from both these series is support for bringing all devices to
these _P2P states before we move any device to one of STOP, STOP_COPY,
or RUNNING states (in the case of RESUMING).

Also, I recall discussions that we need to enforce configuration
restrictions when not all devices support the _P2P states?  For example
adding a migration blocker when there are multiple vfio devices and at
least one of them does not support _P2P migration states.  Or perhaps
initially, requiring support for _P2P states.

I think what's implemented here, where we don't make use of the _P2P
states would require adding a migration blocker whenever there are
multiple vfio devices, regardless of the device support for _P2P.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-01-26 18:49 ` [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
@ 2023-01-27 21:11   ` Alex Williamson
  2023-02-12 15:36     ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-01-27 21:11 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, 26 Jan 2023 20:49:35 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> There are already two places where dirty page bitmap allocation and
> calculations are done in open code. With device dirty page tracking
> being added in next patches, there are going to be even more places.
> 
> To avoid code duplication, introduce VFIOBitmap struct and corresponding
> alloc and dealloc functions and use them where applicable.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 60 insertions(+), 29 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 8e8ffbc046..e554573eb5 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -319,6 +319,41 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +typedef struct {
> +    unsigned long *bitmap;
> +    hwaddr size;
> +    hwaddr pages;
> +} VFIOBitmap;
> +
> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
> +{
> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
> +    if (!vbmap) {
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> +    vbmap->size =  ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                                          BITS_PER_BYTE;
> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
> +    if (!vbmap->bitmap) {
> +        g_free(vbmap);
> +        errno = ENOMEM;
> +
> +        return NULL;
> +    }
> +
> +    return vbmap;
> +}
> +
> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
> +{
> +    g_free(vbmap->bitmap);
> +    g_free(vbmap);
> +}
> +
>  bool vfio_mig_active(void)
>  {
>      VFIOGroup *group;
> @@ -421,9 +456,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>  {
>      struct vfio_iommu_type1_dma_unmap *unmap;
>      struct vfio_bitmap *bitmap;
> -    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> +    VFIOBitmap *vbmap;
>      int ret;
>  
> +    vbmap = vfio_bitmap_alloc(size);
> +    if (!vbmap) {
> +        return -errno;
> +    }
> +
>      unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
>  
>      unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> @@ -437,35 +477,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>       * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
>       * to qemu_real_host_page_size.
>       */
> -
>      bitmap->pgsize = qemu_real_host_page_size();
> -    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> -                   BITS_PER_BYTE;
> +    bitmap->size = vbmap->size;
> +    bitmap->data = (__u64 *)vbmap->bitmap;
>  
> -    if (bitmap->size > container->max_dirty_bitmap_size) {
> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
> -                     (uint64_t)bitmap->size);
> +    if (vbmap->size > container->max_dirty_bitmap_size) {
> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);

Why not pass the container to the alloc function so we can test this
consistently for each bitmap we allocate?  Thanks,

Alex
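
For readers following along, a minimal sketch of the container-aware
allocator suggested above (hypothetical signature; it assumes the
VFIOBitmap struct and the container->max_dirty_bitmap_size field shown
in the quoted patch, and the errno choice is illustrative only):

static VFIOBitmap *vfio_bitmap_alloc(VFIOContainer *container, hwaddr size)
{
    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);

    if (!vbmap) {
        errno = ENOMEM;
        return NULL;
    }

    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
                  BITS_PER_BYTE;

    /* One central place for the size check instead of open coding it */
    if (vbmap->size > container->max_dirty_bitmap_size) {
        g_free(vbmap);
        errno = E2BIG;
        return NULL;
    }

    vbmap->bitmap = g_try_malloc0(vbmap->size);
    if (!vbmap->bitmap) {
        g_free(vbmap);
        errno = ENOMEM;
        return NULL;
    }

    return vbmap;
}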



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges
  2023-01-26 18:49 ` [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
@ 2023-01-27 21:42   ` Alex Williamson
  2023-02-12 15:40     ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-01-27 21:42 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, 26 Jan 2023 20:49:38 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> According to the device DMA logging uAPI, IOVA ranges to be logged by
> the device must be provided all at once upon DMA logging start.
> 
> As preparation for the following patches which will add device dirty
> page tracking, keep a record of all DMA mapped IOVA ranges so later they
> can be used for DMA logging start.
> 
> Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
> This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
> Following patches will address the vIOMMU case specifically.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/hw/vfio/vfio-common.h |  3 ++
>  hw/vfio/common.c              | 86 +++++++++++++++++++++++++++++++++--
>  2 files changed, 86 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 88c2194fb9..d54000d7ae 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -23,6 +23,7 @@
>  
>  #include "exec/memory.h"
>  #include "qemu/queue.h"
> +#include "qemu/iova-tree.h"
>  #include "qemu/notify.h"
>  #include "ui/console.h"
>  #include "hw/display/ramfb.h"
> @@ -94,6 +95,8 @@ typedef struct VFIOContainer {
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
>      unsigned int dma_max_mappings;
> +    IOVATree *mappings;
> +    QemuMutex mappings_mutex;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index e554573eb5..fafc361cea 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -43,6 +43,7 @@
>  #include "migration/misc.h"
>  #include "migration/qemu-file.h"
>  #include "sysemu/tpm.h"
> +#include "qemu/iova-tree.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -373,6 +374,11 @@ bool vfio_mig_active(void)
>      return true;
>  }
>  
> +static bool vfio_have_giommu(VFIOContainer *container)
> +{
> +    return !QLIST_EMPTY(&container->giommu_list);
> +}
> +
>  static void vfio_set_migration_error(int err)
>  {
>      MigrationState *ms = migrate_get_current();
> @@ -450,6 +456,51 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>      return true;
>  }
>  
> +static int vfio_record_mapping(VFIOContainer *container, hwaddr iova,
> +                               hwaddr size, bool readonly)
> +{
> +    DMAMap map = {
> +        .iova = iova,
> +        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */

<facepalm>



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 11/18] vfio/common: Add device dirty page bitmap sync
  2023-01-26 18:49 ` [PATCH 11/18] vfio/common: Add device dirty page bitmap sync Avihai Horon
@ 2023-01-27 23:37   ` Alex Williamson
  2023-02-12 15:49     ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-01-27 23:37 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, 26 Jan 2023 20:49:41 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Add device dirty page bitmap sync functionality. This uses the device
> DMA logging uAPI to sync dirty page bitmap from the device.
> 
> Device dirty page bitmap sync is used only if all devices within a
> container support device dirty page tracking.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c | 93 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 82 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3caa73d6f7..0003f2421d 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,6 +355,9 @@ static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
>      g_free(vbmap);
>  }
>  
> +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                 uint64_t size, ram_addr_t ram_addr);
> +
>  bool vfio_mig_active(void)
>  {
>      VFIOGroup *group;
> @@ -582,10 +585,19 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .iova = iova,
>          .size = size,
>      };
> +    int ret;
>  
> -    if (iotlb && container->dirty_pages_supported &&
> -        vfio_devices_all_running_and_mig_active(container)) {
> -        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> +    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
> +        if (!vfio_devices_all_device_dirty_tracking(container) &&
> +            container->dirty_pages_supported) {
> +            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> +        }
> +
> +        ret = vfio_get_dirty_bitmap(container, iova, size,
> +                                    iotlb->translated_addr);
> +        if (ret) {
> +            return ret;
> +        }

Isn't the ordering backwards here?  Only after the range is unmapped
can we know that this container can no longer dirty pages within the
range.  Thanks,

Alex
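
To make the ordering concern concrete, a rough sketch of the reordered
flow (based on the helpers and structures in the quoted patch, not the
posted code):

/* Unmap first, so this container can no longer dirty pages in the range */
ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
if (ret) {
    error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
    return -errno;
}

/* ...and only then fetch the dirty bitmap covering that range */
if (iotlb && vfio_devices_all_device_dirty_tracking(container)) {
    ret = vfio_get_dirty_bitmap(container, iova, size,
                                iotlb->translated_addr);
    if (ret) {
        return ret;
    }
}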



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-01-26 23:52   ` Alex Williamson
@ 2023-01-31 12:44     ` Avihai Horon
  2023-01-31 22:43       ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-01-31 12:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 27/01/2023 1:52, Alex Williamson wrote:
>
>
> On Thu, 26 Jan 2023 20:49:31 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Pre-copy support allows the VFIO device data to be transferred while the
>> VM is running. This helps to accommodate VFIO devices that have a large
>> amount of data that needs to be transferred, and it can reduce migration
>> downtime.
>>
>> Pre-copy support is optional in VFIO migration protocol v2.
>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>> that support it. Full description of it can be found here [1].
>>
>> [1]
>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst |  29 ++++++---
>>   include/hw/vfio/vfio-common.h |   3 +
>>   hw/vfio/common.c              |   8 ++-
>>   hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
>>   hw/vfio/trace-events          |   5 +-
>>   5 files changed, 140 insertions(+), 17 deletions(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> index 1d50c2fe5f..51f5e1a537 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>>   destination host. This document details how saving and restoring of VFIO
>>   devices is done in QEMU.
>>
>> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
>> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
>> -data is transferred to the destination.
>> -
>> -The pre-copy phase of migration is currently not supported for VFIO devices.
>> -Support for VFIO pre-copy will be added later on.
>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
>> +accommodate VFIO devices that have a large amount of data that needs to be
>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>> +continue whilst the VFIO device state is transferred to the destination, this
>> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
>> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
>> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>
>>   A detailed description of the UAPI for VFIO device migration can be found in
>>   the comment for the ``vfio_device_mig_state`` structure in the header file
>> @@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach as follows:
>>     driver, which indicates the amount of data that the vendor driver has yet to
>>     save for the VFIO device.
>>
>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>> +  active only if the VFIO device is in pre-copy states.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver during iterative phase.
>> +
>>   * A ``save_state`` function to save the device config space if it is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>> @@ -91,8 +99,10 @@ Flow of state changes during Live migration
>>   ===========================================
>>
>>   Below is the flow of state change during live migration.
>> -The values in the brackets represent the VM state, the migration state, and
>> +The values in the parentheses represent the VM state, the migration state, and
>>   the VFIO device state, respectively.
>> +The text in the square brackets represents the flow if the VFIO device supports
>> +pre-copy.
>>
>>   Live migration save path
>>   ------------------------
>> @@ -104,11 +114,12 @@ Live migration save path
>>                                     |
>>                        migrate_init spawns migration_thread
>>                   Migration thread then calls each device's .save_setup()
>> -                       (RUNNING, _SETUP, _RUNNING)
>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>                                     |
>> -                      (RUNNING, _ACTIVE, _RUNNING)
>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>>                If device is active, get pending_bytes by .save_live_pending()
>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  [Data of VFIO device for pre-copy phase is copied]
>>           Iterate till total pending bytes converge and are less than threshold
>>                                     |
>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 5f8e7a02fe..88c2194fb9 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -67,7 +67,10 @@ typedef struct VFIOMigration {
>>       int data_fd;
>>       void *data_buffer;
>>       size_t data_buffer_size;
>> +    uint64_t mig_flags;
>>       uint64_t stop_copy_size;
>> +    uint64_t precopy_init_size;
>> +    uint64_t precopy_dirty_size;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 9a0dbee6b4..93b18c5e3d 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>
>>               if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>>                   (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> -                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
> Should this just turn into a test that we're not in STOP_COPY?

Then we would need to check we are not in STOP_COPY and not in STOP.
The STOP check is for the case where PRE_COPY is not supported, since 
RAM will ask for dirty page sync when the device is in STOP state.
Without the STOP check, the device will skip the final dirty page sync.
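
For illustration, a minimal sketch of the combined check being discussed,
assuming the state names from the quoted hunk (this is the inverted
formulation, not the posted code):

/*
 * Sketch only: instead of enumerating every running/pre-copy state,
 * test that the device has not yet reached a stopped state. STOP must
 * stay excluded so the final dirty page sync done in STOP is not
 * skipped when PRE_COPY is unsupported.
 */
if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
    migration->device_state != VFIO_DEVICE_STATE_STOP_COPY &&
    migration->device_state != VFIO_DEVICE_STATE_STOP) {
    return false;
}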

>
>>                   return false;
>>               }
>>           }
>> @@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>               }
>>
>>               if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
>> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
>>                   continue;
>>               } else {
>>                   return false;
> Hmm, this only seems to highlight that between this series and the
> previous, we're adding tests for states that we never actually use, ie.
> these _P2P states.
>
> IIRC, the reason we have these _P2P states is so that we can transition
> a set of devices, which may have active P2P DMA between them, to STOP,
> STOP_COPY, and even RUNNING states safely without lost data given that
> we cannot simultaneously transition all devices.  That suggest that
> missing from both these series is support for bringing all devices to
> these _P2P states before we move any device to one of STOP, STOP_COPY,
> or RUNNING states (in the case of RESUMING).
>
> Also, I recall discussions that we need to enforce configuration
> restrictions when not all devices support the _P2P states?  For example
> adding a migration blocker when there are multiple vfio devices and at
> least one of them does not support _P2P migration states.  Or perhaps
> initially, requiring support for _P2P states.
>
> I think what's implemented here, where we don't make use of the _P2P
> states would require adding a migration blocker whenever there are
> multiple vfio devices, regardless of the device support for _P2P.

Yes, I think you are right.

I will add a migration blocker if there are multiple devices as part of 
v9 of the basic series.
When P2P support is added, I will block migration of multiple devices if 
one or more of them doesn't support P2P.

Thanks.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-01-31 12:44     ` Avihai Horon
@ 2023-01-31 22:43       ` Alex Williamson
  2023-01-31 23:29         ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-01-31 22:43 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Tue, 31 Jan 2023 14:44:54 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/01/2023 1:52, Alex Williamson wrote:
> >
> >
> > On Thu, 26 Jan 2023 20:49:31 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> Pre-copy support allows the VFIO device data to be transferred while the
> >> VM is running. This helps to accommodate VFIO devices that have a large
> >> amount of data that needs to be transferred, and it can reduce migration
> >> downtime.
> >>
> >> Pre-copy support is optional in VFIO migration protocol v2.
> >> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> >> that support it. Full description of it can be found here [1].
> >>
> >> [1]
> >> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
> >>
> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >> ---
> >>   docs/devel/vfio-migration.rst |  29 ++++++---
> >>   include/hw/vfio/vfio-common.h |   3 +
> >>   hw/vfio/common.c              |   8 ++-
> >>   hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
> >>   hw/vfio/trace-events          |   5 +-
> >>   5 files changed, 140 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> >> index 1d50c2fe5f..51f5e1a537 100644
> >> --- a/docs/devel/vfio-migration.rst
> >> +++ b/docs/devel/vfio-migration.rst
> >> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
> >>   destination host. This document details how saving and restoring of VFIO
> >>   devices is done in QEMU.
> >>
> >> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> >> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> >> -data is transferred to the destination.
> >> -
> >> -The pre-copy phase of migration is currently not supported for VFIO devices.
> >> -Support for VFIO pre-copy will be added later on.
> >> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> >> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> >> +accommodate VFIO devices that have a large amount of data that needs to be
> >> +transferred. The iterative pre-copy phase of migration allows for the guest to
> >> +continue whilst the VFIO device state is transferred to the destination, this
> >> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
> >> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
> >> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
> >>
> >>   A detailed description of the UAPI for VFIO device migration can be found in
> >>   the comment for the ``vfio_device_mig_state`` structure in the header file
> >> @@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach as follows:
> >>     driver, which indicates the amount of data that the vendor driver has yet to
> >>     save for the VFIO device.
> >>
> >> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> >> +  active only if the VFIO device is in pre-copy states.
> >> +
> >> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> >> +  vendor driver during iterative phase.
> >> +
> >>   * A ``save_state`` function to save the device config space if it is present.
> >>
> >>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
> >> @@ -91,8 +99,10 @@ Flow of state changes during Live migration
> >>   ===========================================
> >>
> >>   Below is the flow of state change during live migration.
> >> -The values in the brackets represent the VM state, the migration state, and
> >> +The values in the parentheses represent the VM state, the migration state, and
> >>   the VFIO device state, respectively.
> >> +The text in the square brackets represents the flow if the VFIO device supports
> >> +pre-copy.
> >>
> >>   Live migration save path
> >>   ------------------------
> >> @@ -104,11 +114,12 @@ Live migration save path
> >>                                     |
> >>                        migrate_init spawns migration_thread
> >>                   Migration thread then calls each device's .save_setup()
> >> -                       (RUNNING, _SETUP, _RUNNING)
> >> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
> >>                                     |
> >> -                      (RUNNING, _ACTIVE, _RUNNING)
> >> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> >>                If device is active, get pending_bytes by .save_live_pending()
> >>             If total pending_bytes >= threshold_size, call .save_live_iterate()
> >> +                  [Data of VFIO device for pre-copy phase is copied]
> >>           Iterate till total pending bytes converge and are less than threshold
> >>                                     |
> >>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 5f8e7a02fe..88c2194fb9 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -67,7 +67,10 @@ typedef struct VFIOMigration {
> >>       int data_fd;
> >>       void *data_buffer;
> >>       size_t data_buffer_size;
> >> +    uint64_t mig_flags;
> >>       uint64_t stop_copy_size;
> >> +    uint64_t precopy_init_size;
> >> +    uint64_t precopy_dirty_size;
> >>   } VFIOMigration;
> >>
> >>   typedef struct VFIOAddressSpace {
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 9a0dbee6b4..93b18c5e3d 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> >>
> >>               if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
> >>                   (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> >> -                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> >> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> >> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> >> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {  
> > Should this just turn into a test that we're not in STOP_COPY?  
> 
> Then we would need to check we are not in STOP_COPY and not in STOP.
> The STOP check is for the case where PRE_COPY is not supported, since 
> RAM will ask for dirty page sync when the device is in STOP state.
> Without the STOP check, the device will skip the final dirty page sync.
> 
> >  
> >>                   return false;
> >>               }
> >>           }
> >> @@ -387,7 +389,9 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> >>               }
> >>
> >>               if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> >> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
> >> +                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
> >> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> >> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
> >>                   continue;
> >>               } else {
> >>                   return false;  
> > Hmm, this only seems to highlight that between this series and the
> > previous, we're adding tests for states that we never actually use, ie.
> > these _P2P states.
> >
> > IIRC, the reason we have these _P2P states is so that we can transition
> > a set of devices, which may have active P2P DMA between them, to STOP,
> > STOP_COPY, and even RUNNING states safely without lost data given that
> > we cannot simultaneously transition all devices.  That suggest that
> > missing from both these series is support for bringing all devices to
> > these _P2P states before we move any device to one of STOP, STOP_COPY,
> > or RUNNING states (in the case of RESUMING).
> >
> > Also, I recall discussions that we need to enforce configuration
> > restrictions when not all devices support the _P2P states?  For example
> > adding a migration blocker when there are multiple vfio devices and at
> > least one of them does not support _P2P migration states.  Or perhaps
> > initially, requiring support for _P2P states.
> >
> > I think what's implemented here, where we don't make use of the _P2P
> > states would require adding a migration blocker whenever there are
> > multiple vfio devices, regardless of the device support for _P2P.  
> 
> Yes, I think you are right.
> 
> I will add a migration blocker if there are multiple devices as part of 
> v9 of the basic series.
> When P2P support is added, I will block migration of multiple devices if 
> one or more of them doesn't support P2P.

How does this affect our path towards supported migration?  I'm
thinking about a user experience where QEMU supports migration if
device A OR device B are attached, but not devices A and B attached to
the same VM.  We might have a device C where QEMU supports migration
with B AND C, but not A AND C, nor A AND B AND C.  This would be the
case if device B and device C both supported P2P states, but device A
did not. The user has no observability of this feature, so all of this
looks effectively random to the user.

Even in the single device case, we need to make an assumption that a
device that does not support P2P migration states (or when QEMU doesn't
make use of P2P states) cannot be a DMA target, or otherwise have its
MMIO space accessed while in a STOP state.  Can we guarantee that when
other devices have not yet transitioned to STOP?

We could disable the direct map MemoryRegions when we move to a STOP
state, which would give QEMU visibility to those accesses, but besides
pulling an abort should such an access occur, could we queue them in
software, add them to the migration stream, and replay them after the
device moves to the RUNNING state?  We'd need to account for the lack of
RESUMING_P2P states as well, trapping and queueing accesses from devices
already RUNNING to those still in RESUMING (not _P2P).

This all looks complicated.  Is it better to start with requiring P2P
state support?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-01-31 22:43       ` Alex Williamson
@ 2023-01-31 23:29         ` Jason Gunthorpe
  2023-02-01  4:15           ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-31 23:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Michael S. Tsirkin, Peter Xu,
	Jason Wang, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Tue, Jan 31, 2023 at 03:43:01PM -0700, Alex Williamson wrote:

> How does this affect our path towards supported migration?  I'm
> thinking about a user experience where QEMU supports migration if
> device A OR device B are attached, but not devices A and B attached to
> the same VM.  We might have a device C where QEMU supports migration
> with B AND C, but not A AND C, nor A AND B AND C.  This would be the
> case if device B and device C both supported P2P states, but device A
> did not. The user has no observability of this feature, so all of this
> looks effectively random to the user.

I think qemu should just log if it encounters a device without P2P
support.

> Even in the single device case, we need to make an assumption that a
> device that does not support P2P migration states (or when QEMU doesn't
> make use of P2P states) cannot be a DMA target, or otherwise have its
> MMIO space accessed while in a STOP state.  Can we guarantee that when
> other devices have not yet transitioned to STOP?

You mean the software devices created by qemu?

> We could disable the direct map MemoryRegions when we move to a STOP
> state, which would give QEMU visibility to those accesses, but besides
> pulling an abort should such an access occur, could we queue them in
> software, add them to the migration stream, and replay them after the
> device moves to the RUNNING state?  We'd need to account for the lack of
> RESUMING_P2P states as well, trapping and queue accesses from devices
> already RUNNING to those still in RESUMING (not _P2P).

I think any internal SW devices should just fail all accesses to the
P2P space, all the time.

qemu simply acts like a real system that doesn't support P2P.

IMHO this is generally the way forward to do multi-device as well,
remove the MMIO from all the address maps: VFIO, SW access, all of
them. Nothing can touch MMIO except for the vCPU.

> This all looks complicated.  Is it better to start with requiring P2P
> state support?  Thanks,

People have built HW without it, so I don't see this as so good..

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-01-31 23:29         ` Jason Gunthorpe
@ 2023-02-01  4:15           ` Alex Williamson
  2023-02-01 17:28             ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-02-01  4:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Michael S. Tsirkin, Peter Xu,
	Jason Wang, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Tue, 31 Jan 2023 19:29:48 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 31, 2023 at 03:43:01PM -0700, Alex Williamson wrote:
> 
> > How does this affect our path towards supported migration?  I'm
> > thinking about a user experience where QEMU supports migration if
> > device A OR device B are attached, but not devices A and B attached to
> > the same VM.  We might have a device C where QEMU supports migration
> > with B AND C, but not A AND C, nor A AND B AND C.  This would be the
> > case if device B and device C both supported P2P states, but device A
> > did not. The user has no observability of this feature, so all of this
> > looks effectively random to the user.  
> 
> I think qemu should just log if it encounters a device without P2P
> support.

Better for debugging, but still poor from a VM management perspective.

> > Even in the single device case, we need to make an assumption that a
> > device that does not support P2P migration states (or when QEMU doesn't
> > make use of P2P states) cannot be a DMA target, or otherwise have its
> > MMIO space accessed while in a STOP state.  Can we guarantee that when
> > other devices have not yet transitioned to STOP?  
> 
> You mean the software devices created by qemu?

Other devices, software or otherwise, yes.

> > We could disable the direct map MemoryRegions when we move to a STOP
> > state, which would give QEMU visibility to those accesses, but besides
> > pulling an abort should such an access occur, could we queue them in
> > software, add them to the migration stream, and replay them after the
> > device moves to the RUNNING state?  We'd need to account for the lack of
> > RESUMING_P2P states as well, trapping and queue accesses from devices
> > already RUNNING to those still in RESUMING (not _P2P).  
> 
> I think any internal SW devices should just fail all accesses to the
> P2P space, all the time.
> 
> qemu simply acts like a real system that doesn't support P2P.
> 
> IMHO this is generally the way forward to do multi-device as well,
> remove the MMIO from all the address maps: VFIO, SW access, all of
> them. Nothing can touch MMIO except for the vCPU.

Are you suggesting this relative to migration or in general?  P2P is a
feature with tangible benefits and real use cases.  Real systems seem
to be moving towards making P2P work better, so it would seem short
sighted to move to and enforce only configurations w/o P2P in QEMU
generally.  Besides, this would require some sort of deprecation, so are
we intending to make users choose between migration and P2P?
 
> > This all looks complicated.  Is it better to start with requiring P2P
> > state support?  Thanks,  
> 
> People have built HW without it, so I don't see this as so good..

Are we obliged to start with that hardware?  I'm just trying to think
about whether a single device restriction is sufficient to prevent any
possible P2P or whether there might be an easier starting point for
more capable hardware.  There's no shortage of hardware that could
support migration given sufficient effort.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-02-01  4:15           ` Alex Williamson
@ 2023-02-01 17:28             ` Jason Gunthorpe
  2023-02-01 18:42               ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2023-02-01 17:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Michael S. Tsirkin, Peter Xu,
	Jason Wang, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Tue, Jan 31, 2023 at 09:15:03PM -0700, Alex Williamson wrote:

> > IMHO this is generally the way forward to do multi-device as well,
> > remove the MMIO from all the address maps: VFIO, SW access, all of
> > them. Nothing can touch MMIO except for the vCPU.
> 
> Are you suggesting this relative to migration or in general?  

I would suggest a general qemu p2p on/off option.

> P2P is a feature with tangible benefits and real use cases.  Real
> systems seem to be moving towards making P2P work better, so it
> would seem short sighted to move to and enforce only configurations
> w/o P2P in QEMU generally.  

qemu needs to support it, but it should be a user option.

Every system I've been involved with for enabling P2P into a VM has
been a total nightmare. This is not something you just turn on and it
works great :\ The whole thing was carefully engineered right down to
the BIOS to be able to work safely.

P2P in baremetal is much easier compared to P2P inside a VM.

> Besides, this would require some sort of deprecation, so are we
> intending to make users choose between migration and P2P?

Give qemu an option 'p2p on/p2p off' and default it to on for
backwards compatibility.

If p2p is on and migration devices don't support P2P states, then
migration is disabled. The user made this choice when they bought
un-capable HW.

Log warnings to make it more discoverable. I think with the cdev
patches we can make it so libvirt can query the device FD for
capabilities to be even cleaner.

If user sets 'p2p off' then migration works with all HW.

p2p on/off is a global switch. With p2p off nothing, no HW or SW or
hybrid device, can touch the MMIO memory.

'p2p off' is a valuable option in its own right because this stuff
doesn't work reliably and is actively dangerous. Did you know you can
hard crash the bare metal from a guest on some platforms with P2P
operations? Yikes. If you don't need to use it turn it off and don't
take the risk.

Arguably for this reason 'p2p off' should trend toward the default as
it is much safer.

> Are we obliged to start with that hardware?  I'm just trying to think
> about whether a single device restriction is sufficient to prevent any
> possible P2P or whether there might be an easier starting point for
> more capable hardware.  There's no shortage of hardware that could
> support migration given sufficient effort.  Thanks,

I think multi-device will likely have some use cases, so I'd like to
see a path to have support for them. For this series I think it is
probably fine since it is already 18 patches.

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-02-01 17:28             ` Jason Gunthorpe
@ 2023-02-01 18:42               ` Alex Williamson
  2023-02-01 20:10                 ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2023-02-01 18:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Michael S. Tsirkin, Peter Xu,
	Jason Wang, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, 1 Feb 2023 13:28:40 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 31, 2023 at 09:15:03PM -0700, Alex Williamson wrote:
> 
> > > IMHO this is generally the way forward to do multi-device as well,
> > > remove the MMIO from all the address maps: VFIO, SW access, all of
> > > them. Nothing can touch MMIO except for the vCPU.  
> > 
> > Are you suggesting this relative to migration or in general?    
> 
> I would suggest a general qemu p2p on/off option.
> 
> > P2P is a feature with tangible benefits and real use cases.  Real
> > systems seem to be moving towards making P2P work better, so it
> > would seem short sighted to move to and enforce only configurations
> > w/o P2P in QEMU generally.    
> 
> qemu needs to support it, but it should be a user option option.
> 
> Every system I've been involved with for enabling P2P into a VM has
> been a total nightmare. This is not something you just turn on and it
> works great :\ The whole thing was carefully engineered right down to
> the BIOS to be able to work safely.
> 
> P2P in baremetal is much easier compared to P2P inside a VM.
> 
> > Besides, this would require some sort of deprecation, so are we
> > intending to make users choose between migration and P2P?  
> 
> Give qemu an option 'p2p on/p2p off' and default it to on for
> backwards compatability.
> 
> If p2p on and migration devices don't support P2P states then
> migration is disabled. The user made this choice when they bought
> un-capable HW.
> 
> Log warnings to make it more discoverable. I think with the cdev
> patches we can make it so libvirt can query the device FD for
> capabilities to be even cleaner.
> 
> If user sets 'p2p off' then migration works with all HW.
> 
> p2p on/off is a global switch. With p2p off nothing, no HW or SW or
> hybrid device, can touch the MMIO memory.
> 
> 'p2p off' is a valuable option in its own right because this stuff
> doesn't work reliably and is actively dangerous. Did you know you can
> hard crash the bare metal from a guest on some platforms with P2P
> operations? Yikes. If you don't need to use it turn it off and don't
> take the risk.

If we're honest, there are a number of cases of non-exceptional faults
that an assigned device can generate that the platform might escalate
to fatal errors.

> Arguably for this reason 'p2p off' should trend toward the default as
> it is much safer.

Safety in the hands of the userspace to protect the host though?
Shouldn't the opt-in be at the kernel level whether to allow p2p
mappings?  I don't have an issue if QEMU were to mirror this by
creating a RAM-only AddressSpace for devices which would be used when
p2p is disabled (it'd save us some headaches for various unaligned
devices as well), but we shouldn't pretend that actually protects the
host.  OTOH, QEMU could feel confident supporting migration of devices
w/o support of the migration P2P states with that restriction.

> > Are we obliged to start with that hardware?  I'm just trying to think
> > about whether a single device restriction is sufficient to prevent any
> > possible P2P or whether there might be an easier starting point for
> > more capable hardware.  There's no shortage of hardware that could
> > support migration given sufficient effort.  Thanks,  
> 
> I think multi-device will likely have some use cases, so I'd like to
> see a path to have support for them. For this series I think it is
> probably fine since it is already 18 patches.

It might be fine for this series because it hasn't yet proposed to make
migration non-experimental, but it's unclear where the goal post is
that we can actually make that transition.

If we restrict migration to a single vfio device, is that enough?
Theoretically it's not, but perhaps in practice... maybe?

Therefore, do we depend on QEMU implementing a new RAM-only AddressSpace
for devices?  What's the path to making it the default?  Maybe there
are other aspects of the VM from which we can infer a preference
towards migration support, ex. 'host' CPU type.

Another option, as previously mentioned, is to start with requiring P2P
migration support both at the device and QEMU, where we only restrict
the set of devices that could initially support migration without
modifying existing behavior of the VM.

In any case, it seems we're a bit further from being able to declare
this as supported than some had hoped.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
  2023-02-01 18:42               ` Alex Williamson
@ 2023-02-01 20:10                 ` Jason Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2023-02-01 20:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Michael S. Tsirkin, Peter Xu,
	Jason Wang, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, Feb 01, 2023 at 11:42:46AM -0700, Alex Williamson wrote:

> > 'p2p off' is a valuable option in its own right because this stuff
> > doesn't work reliably and is actively dangerous. Did you know you can
> > hard crash the bare metal from a guest on some platforms with P2P
> > operations? Yikes. If you don't need to use it turn it off and don't
> > take the risk.
> 
> If we're honest, there are a number of cases of non-exceptional faults
> that an assigned device can generate that the platform might escalate
> to fatal errors.

My understanding is that this is true on some commodity hardware, but
systems engineered to run as cloud hypervisors have these problems
solved and VFIO is made safe.

Unfortunately there is no way to know if you have a safe or unsafe
system from the OS.

> > Arguably for this reason 'p2p off' should trend toward the default as
> > it is much safer.
> 
> Safety in the hands of the userspace to protect the host though?
> Shouldn't the opt-in be at the kernel level whether to allow p2p
> mappings?  

I haven't seen anyone interested in doing this kind of work. The
expectation seems to be that places seriously concerned about security
either don't include VFIO at all in their environments or have
engineered their platforms to make it safe.

Where this leaves the enterprise space, I don't know. I think they end
up with systems that functionally work but possibly have DOS problems.

So, given this landscape I think a user option in qemu is the best we
can do at the moment.

> I don't have an issue if QEMU were to mirror this by
> creating a RAM-only AddressSpace for devices which would be used when
> p2p is disable (it'd save us some headaches for various unaligned
> devices as well), but we shouldn't pretend that actually protects the
> host.  OTOH, QEMU could feel confident supporting migration of devices
> w/o support of the migration P2P states with that restriction.

It protects the host from a hostile VM; it does not fully protect the
host from a compromised qemu. That is still an improvement.

> > I think multi-device will likely have some use cases, so I'd like to
> > see a path to have support for them. For this series I think it is
> > probably fine since it is already 18 patches.
> 
> It might be fine for this series because it hasn't yet proposed to make
> migration non-experimental, but it's unclear where the goal post is
> that we can actually make that transition.

IMHO non-experimental just means the solution works with whatever
configuration limitations it comes with. It shouldn't mean every
device or every configuration combination works.

So if you want to do single device, or just hard require P2P for now,
those are both reasonable temporary choices IMHO.

But they are temporary and we should come with a remedy to allow
non-P2P migration devices to work as well.

Given we merged a non-P2P kernel driver I prefer the single device
option as it is more logically consistent with the kernel situation.

Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 13/18] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  2023-01-26 18:49 ` [PATCH 13/18] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
@ 2023-02-09 22:16   ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2023-02-09 22:16 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Michael S. Tsirkin, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, Jan 26, 2023 at 08:49:43PM +0200, Avihai Horon wrote:
> Add a new IOMMU attribute IOMMU_ATTR_MAX_IOVA which indicates the
> maximal IOVA that an IOMMU can use.
> 
> This attribute will be used by VFIO device dirty page tracking so it can
> track the entire IOVA space when needed (i.e. when vIOMMU is enabled).
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu
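
For context on how the new attribute is meant to be consumed, a short
sketch (assuming the existing memory_region_iommu_get_attr() helper and
an iommu_mr handle to the vIOMMU region):

hwaddr max_iova;

/* Ask the vIOMMU how large its IOVA space is; fall back if unsupported */
if (memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MAX_IOVA, &max_iova)) {
    max_iova = (hwaddr)-1; /* attribute not implemented by this vIOMMU */
}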



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 14/18] intel-iommu: Implement get_attr() method
  2023-01-26 18:49 ` [PATCH 14/18] intel-iommu: Implement get_attr() method Avihai Horon
@ 2023-02-09 22:18   ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2023-02-09 22:18 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Michael S. Tsirkin, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, Jan 26, 2023 at 08:49:44PM +0200, Avihai Horon wrote:
> Implement get_attr() method and use the address width property to report
> the IOMMU_ATTR_MAX_IOVA attribute.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu
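
A minimal sketch of what such a get_attr() implementation could look like
(inferred from the commit message only; aw_bits is assumed to be the
address width property it refers to):

static int vtd_iommu_get_attr(IOMMUMemoryRegion *iommu_mr,
                              enum IOMMUMemoryRegionAttr attr, void *data)
{
    VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
    IntelIOMMUState *s = vtd_as->iommu_state;

    if (attr == IOMMU_ATTR_MAX_IOVA) {
        hwaddr *max_iova = data;

        /* Highest IOVA the guest can program, given the address width */
        *max_iova = (1ULL << s->aw_bits) - 1;
        return 0;
    }

    return -EINVAL;
}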



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/18] util: Extend iova_tree_foreach() to take data argument
  2023-01-26 18:49 ` [PATCH 07/18] util: Extend iova_tree_foreach() to take data argument Avihai Horon
@ 2023-02-09 22:21   ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2023-02-09 22:21 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Michael S. Tsirkin, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, Jan 26, 2023 at 08:49:37PM +0200, Avihai Horon wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Extend iova_tree_foreach() to take data argument to be passed and used
> by the iterator.
> 
> While at it, fix a documentation error:
> The documentation says iova_tree_foreach() returns a value even though
> it is a void function.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu
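
For reference, a sketch of what the extended interface could look like
(hypothetical, inferred from the commit message; the pre-existing
prototype takes no data argument):

typedef gboolean (*iova_tree_iterator)(DMAMap *map, gpointer data);

/* Walk the tree in ascending IOVA order, passing 'data' to each call */
void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator,
                       gpointer data);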



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/18] util: Add iova_tree_nnodes()
  2023-01-26 18:49 ` [PATCH 06/18] util: Add iova_tree_nnodes() Avihai Horon
@ 2023-02-09 22:21   ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2023-02-09 22:21 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Michael S. Tsirkin, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, Jan 26, 2023 at 08:49:36PM +0200, Avihai Horon wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Add iova_tree_nnodes() which returns the number of nodes in the IOVA
> tree.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu
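
A sketch of the helper being added (assuming IOVATree wraps a GTree
internally, as util/iova-tree.c currently does):

gint iova_tree_nnodes(IOVATree *tree)
{
    /* Number of DMA mappings currently recorded in the tree */
    return g_tree_nnodes(tree->tree);
}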



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-01-27 21:11   ` Alex Williamson
@ 2023-02-12 15:36     ` Avihai Horon
  2023-02-14 21:28       ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-02-12 15:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 27/01/2023 23:11, Alex Williamson wrote:
> On Thu, 26 Jan 2023 20:49:35 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> There are already two places where dirty page bitmap allocation and
>> calculations are done in open code. With device dirty page tracking
>> being added in next patches, there are going to be even more places.
>>
>> To avoid code duplication, introduce VFIOBitmap struct and corresponding
>> alloc and dealloc functions and use them where applicable.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 60 insertions(+), 29 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 8e8ffbc046..e554573eb5 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -319,6 +319,41 @@ const MemoryRegionOps vfio_region_ops = {
>>    * Device state interfaces
>>    */
>>
>> +typedef struct {
>> +    unsigned long *bitmap;
>> +    hwaddr size;
>> +    hwaddr pages;
>> +} VFIOBitmap;
>> +
>> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
>> +{
>> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
>> +    if (!vbmap) {
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
>> +    vbmap->size =  ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
>> +                                          BITS_PER_BYTE;
>> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
>> +    if (!vbmap->bitmap) {
>> +        g_free(vbmap);
>> +        errno = ENOMEM;
>> +
>> +        return NULL;
>> +    }
>> +
>> +    return vbmap;
>> +}
>> +
>> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
>> +{
>> +    g_free(vbmap->bitmap);
>> +    g_free(vbmap);
>> +}
>> +
>>   bool vfio_mig_active(void)
>>   {
>>       VFIOGroup *group;
>> @@ -421,9 +456,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>>   {
>>       struct vfio_iommu_type1_dma_unmap *unmap;
>>       struct vfio_bitmap *bitmap;
>> -    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
>> +    VFIOBitmap *vbmap;
>>       int ret;
>>
>> +    vbmap = vfio_bitmap_alloc(size);
>> +    if (!vbmap) {
>> +        return -errno;
>> +    }
>> +
>>       unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
>>
>>       unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
>> @@ -437,35 +477,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>>        * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
>>        * to qemu_real_host_page_size.
>>        */
>> -
>>       bitmap->pgsize = qemu_real_host_page_size();
>> -    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>> -                   BITS_PER_BYTE;
>> +    bitmap->size = vbmap->size;
>> +    bitmap->data = (__u64 *)vbmap->bitmap;
>>
>> -    if (bitmap->size > container->max_dirty_bitmap_size) {
>> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>> -                     (uint64_t)bitmap->size);
>> +    if (vbmap->size > container->max_dirty_bitmap_size) {
>> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);
> Why not pass the container to the alloc function so we can test this
> consistently for each bitmap we allocate?

Hi, sorry for the delay.

This test is relevant only for VFIO IOMMU dirty tracking. With device 
dirty tracking it should be skipped.
Do you think we should still move it to the alloc function?

Thanks.




* Re: [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges
  2023-01-27 21:42   ` Alex Williamson
@ 2023-02-12 15:40     ` Avihai Horon
  2023-02-13 15:25       ` Alex Williamson
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-02-12 15:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 27/01/2023 23:42, Alex Williamson wrote:
> On Thu, 26 Jan 2023 20:49:38 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> According to the device DMA logging uAPI, IOVA ranges to be logged by
>> the device must be provided all at once upon DMA logging start.
>>
>> As preparation for the following patches which will add device dirty
>> page tracking, keep a record of all DMA mapped IOVA ranges so later they
>> can be used for DMA logging start.
>>
>> Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
>> This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
>> Following patches will address the vIOMMU case specifically.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   include/hw/vfio/vfio-common.h |  3 ++
>>   hw/vfio/common.c              | 86 +++++++++++++++++++++++++++++++++--
>>   2 files changed, 86 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 88c2194fb9..d54000d7ae 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -23,6 +23,7 @@
>>
>>   #include "exec/memory.h"
>>   #include "qemu/queue.h"
>> +#include "qemu/iova-tree.h"
>>   #include "qemu/notify.h"
>>   #include "ui/console.h"
>>   #include "hw/display/ramfb.h"
>> @@ -94,6 +95,8 @@ typedef struct VFIOContainer {
>>       uint64_t max_dirty_bitmap_size;
>>       unsigned long pgsizes;
>>       unsigned int dma_max_mappings;
>> +    IOVATree *mappings;
>> +    QemuMutex mappings_mutex;
>>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>       QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index e554573eb5..fafc361cea 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -43,6 +43,7 @@
>>   #include "migration/misc.h"
>>   #include "migration/qemu-file.h"
>>   #include "sysemu/tpm.h"
>> +#include "qemu/iova-tree.h"
>>
>>   VFIOGroupList vfio_group_list =
>>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>> @@ -373,6 +374,11 @@ bool vfio_mig_active(void)
>>       return true;
>>   }
>>
>> +static bool vfio_have_giommu(VFIOContainer *container)
>> +{
>> +    return !QLIST_EMPTY(&container->giommu_list);
>> +}
>> +
>>   static void vfio_set_migration_error(int err)
>>   {
>>       MigrationState *ms = migrate_get_current();
>> @@ -450,6 +456,51 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>       return true;
>>   }
>>
>> +static int vfio_record_mapping(VFIOContainer *container, hwaddr iova,
>> +                               hwaddr size, bool readonly)
>> +{
>> +    DMAMap map = {
>> +        .iova = iova,
>> +        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */
> <facepalm>

I am not sure what error you are referring to here.
Is it because DMAMap.size is not actually the size?
Or something else?

Thanks.




* Re: [PATCH 11/18] vfio/common: Add device dirty page bitmap sync
  2023-01-27 23:37   ` Alex Williamson
@ 2023-02-12 15:49     ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-02-12 15:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 28/01/2023 1:37, Alex Williamson wrote:
> On Thu, 26 Jan 2023 20:49:41 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> Add device dirty page bitmap sync functionality. This uses the device
>> DMA logging uAPI to sync the dirty page bitmap from the device.
>>
>> Device dirty page bitmap sync is used only if all devices within a
>> container support device dirty page tracking.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c | 93 ++++++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 82 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 3caa73d6f7..0003f2421d 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -355,6 +355,9 @@ static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
>>       g_free(vbmap);
>>   }
>>
>> +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> +                                 uint64_t size, ram_addr_t ram_addr);
>> +
>>   bool vfio_mig_active(void)
>>   {
>>       VFIOGroup *group;
>> @@ -582,10 +585,19 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>           .iova = iova,
>>           .size = size,
>>       };
>> +    int ret;
>>
>> -    if (iotlb && container->dirty_pages_supported &&
>> -        vfio_devices_all_running_and_mig_active(container)) {
>> -        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> +    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
>> +        if (!vfio_devices_all_device_dirty_tracking(container) &&
>> +            container->dirty_pages_supported) {
>> +            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> +        }
>> +
>> +        ret = vfio_get_dirty_bitmap(container, iova, size,
>> +                                    iotlb->translated_addr);
>> +        if (ret) {
>> +            return ret;
>> +        }
> Isn't the ordering backwards here?  Only after the range is unmapped
> can we know that this container can no longer dirty pages within the
> range.

Oops, I thought it was OK to query the dirty bitmap when we get the
vIOMMU unmap notification.
I will reverse the order.

Thanks.
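
For readers following the thread, the reordered flow discussed above would
look roughly like the sketch below: perform the unmap first, and only then
collect the dirty bitmap for the range, since only after the unmap is it
guaranteed the device can no longer dirty pages in it. The helper name and
structure here are illustrative and are not taken from the actual respin.

    /* Sketch: with device dirty tracking, unmap before the bitmap query. */
    static int vfio_dma_unmap_then_sync(VFIOContainer *container, hwaddr iova,
                                        hwaddr size, IOMMUTLBEntry *iotlb)
    {
        int ret;

        /* Plain unmap first (no type1 unmap-bitmap involved). */
        ret = vfio_dma_unmap(container, iova, size, NULL);
        if (ret) {
            return ret;
        }

        /* The range can no longer be dirtied by the device; read the log. */
        return vfio_get_dirty_bitmap(container, iova, size,
                                     iotlb->translated_addr);
    }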




* Re: [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges
  2023-02-12 15:40     ` Avihai Horon
@ 2023-02-13 15:25       ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2023-02-13 15:25 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Sun, 12 Feb 2023 17:40:06 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/01/2023 23:42, Alex Williamson wrote:
> > On Thu, 26 Jan 2023 20:49:38 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >>
> >> According to the device DMA logging uAPI, IOVA ranges to be logged by
> >> the device must be provided all at once upon DMA logging start.
> >>
> >> As preparation for the following patches which will add device dirty
> >> page tracking, keep a record of all DMA mapped IOVA ranges so later they
> >> can be used for DMA logging start.
> >>
> >> Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
> >> This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.
> >> Following patches will address the vIOMMU case specifically.
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >> ---
> >>   include/hw/vfio/vfio-common.h |  3 ++
> >>   hw/vfio/common.c              | 86 +++++++++++++++++++++++++++++++++--
> >>   2 files changed, 86 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 88c2194fb9..d54000d7ae 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -23,6 +23,7 @@
> >>
> >>   #include "exec/memory.h"
> >>   #include "qemu/queue.h"
> >> +#include "qemu/iova-tree.h"
> >>   #include "qemu/notify.h"
> >>   #include "ui/console.h"
> >>   #include "hw/display/ramfb.h"
> >> @@ -94,6 +95,8 @@ typedef struct VFIOContainer {
> >>       uint64_t max_dirty_bitmap_size;
> >>       unsigned long pgsizes;
> >>       unsigned int dma_max_mappings;
> >> +    IOVATree *mappings;
> >> +    QemuMutex mappings_mutex;
> >>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>       QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> >>       QLIST_HEAD(, VFIOGroup) group_list;
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index e554573eb5..fafc361cea 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -43,6 +43,7 @@
> >>   #include "migration/misc.h"
> >>   #include "migration/qemu-file.h"
> >>   #include "sysemu/tpm.h"
> >> +#include "qemu/iova-tree.h"
> >>
> >>   VFIOGroupList vfio_group_list =
> >>       QLIST_HEAD_INITIALIZER(vfio_group_list);
> >> @@ -373,6 +374,11 @@ bool vfio_mig_active(void)
> >>       return true;
> >>   }
> >>
> >> +static bool vfio_have_giommu(VFIOContainer *container)
> >> +{
> >> +    return !QLIST_EMPTY(&container->giommu_list);
> >> +}
> >> +
> >>   static void vfio_set_migration_error(int err)
> >>   {
> >>       MigrationState *ms = migrate_get_current();
> >> @@ -450,6 +456,51 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> >>       return true;
> >>   }
> >>
> >> +static int vfio_record_mapping(VFIOContainer *container, hwaddr iova,
> >> +                               hwaddr size, bool readonly)
> >> +{
> >> +    DMAMap map = {
> >> +        .iova = iova,
> >> +        .size = size - 1, /* IOVATree is inclusive, so subtract 1 from size */  
> > <facepalm>  
> 
> I am not sure what error you are referring to here.
> Is it because DMAMap.size is not actually the size?
> Or something else?

Sorry, I'm just lamenting what a terrible interface IOVATree provides
with this inclusive range nonsense.  Thanks,

Alex
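
For readers unfamiliar with the helper: IOVATree stores inclusive ranges,
so DMAMap.size is the last offset covered rather than the byte count. A
minimal sketch of recording a byte-count-sized range is shown below; the
helper name is illustrative and assumes qemu/iova-tree.h is included.

    /* Sketch: insert a mapping of `size` bytes at `iova` into an IOVATree. */
    static int record_range(IOVATree *tree, hwaddr iova, hwaddr size)
    {
        DMAMap map = {
            .iova = iova,
            .size = size - 1,   /* inclusive: the last IOVA covered */
        };

        /* Returns IOVA_OK, or IOVA_ERR_OVERLAP/IOVA_ERR_INVALID on failure. */
        return iova_tree_insert(tree, &map);
    }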




* Re: [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions
  2023-02-12 15:36     ` Avihai Horon
@ 2023-02-14 21:28       ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2023-02-14 21:28 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Sun, 12 Feb 2023 17:36:49 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/01/2023 23:11, Alex Williamson wrote:
> > On Thu, 26 Jan 2023 20:49:35 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> There are already two places where dirty page bitmap allocation and
> >> calculations are done in open code. With device dirty page tracking
> >> being added in next patches, there are going to be even more places.
> >>
> >> To avoid code duplication, introduce VFIOBitmap struct and corresponding
> >> alloc and dealloc functions and use them where applicable.
> >>
> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >> ---
> >>   hw/vfio/common.c | 89 ++++++++++++++++++++++++++++++++----------------
> >>   1 file changed, 60 insertions(+), 29 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 8e8ffbc046..e554573eb5 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -319,6 +319,41 @@ const MemoryRegionOps vfio_region_ops = {
> >>    * Device state interfaces
> >>    */
> >>
> >> +typedef struct {
> >> +    unsigned long *bitmap;
> >> +    hwaddr size;
> >> +    hwaddr pages;
> >> +} VFIOBitmap;
> >> +
> >> +static VFIOBitmap *vfio_bitmap_alloc(hwaddr size)
> >> +{
> >> +    VFIOBitmap *vbmap = g_try_new0(VFIOBitmap, 1);
> >> +    if (!vbmap) {
> >> +        errno = ENOMEM;
> >> +
> >> +        return NULL;
> >> +    }
> >> +
> >> +    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> >> +    vbmap->size =  ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
> >> +                                          BITS_PER_BYTE;
> >> +    vbmap->bitmap = g_try_malloc0(vbmap->size);
> >> +    if (!vbmap->bitmap) {
> >> +        g_free(vbmap);
> >> +        errno = ENOMEM;
> >> +
> >> +        return NULL;
> >> +    }
> >> +
> >> +    return vbmap;
> >> +}
> >> +
> >> +static void vfio_bitmap_dealloc(VFIOBitmap *vbmap)
> >> +{
> >> +    g_free(vbmap->bitmap);
> >> +    g_free(vbmap);
> >> +}
> >> +
> >>   bool vfio_mig_active(void)
> >>   {
> >>       VFIOGroup *group;
> >> @@ -421,9 +456,14 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> >>   {
> >>       struct vfio_iommu_type1_dma_unmap *unmap;
> >>       struct vfio_bitmap *bitmap;
> >> -    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> >> +    VFIOBitmap *vbmap;
> >>       int ret;
> >>
> >> +    vbmap = vfio_bitmap_alloc(size);
> >> +    if (!vbmap) {
> >> +        return -errno;
> >> +    }
> >> +
> >>       unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> >>
> >>       unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> >> @@ -437,35 +477,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> >>        * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
> >>        * to qemu_real_host_page_size.
> >>        */
> >> -
> >>       bitmap->pgsize = qemu_real_host_page_size();
> >> -    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> >> -                   BITS_PER_BYTE;
> >> +    bitmap->size = vbmap->size;
> >> +    bitmap->data = (__u64 *)vbmap->bitmap;
> >>
> >> -    if (bitmap->size > container->max_dirty_bitmap_size) {
> >> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
> >> -                     (uint64_t)bitmap->size);
> >> +    if (vbmap->size > container->max_dirty_bitmap_size) {
> >> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);  
> > Why not pass the container to the alloc function so we can test this
> > consistently for each bitmap we allocate?  
> 
> Hi, sorry for the delay.
> 
> This test is relevant only for VFIO IOMMU dirty tracking. With device 
> dirty tracking it should be skipped.
> Do you think we should still move it to the alloc function?

Ah, ok.  Sounds like we'll have to live with a separate test for the
container path.  Thanks,

Alex
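
So, presumably, vfio_bitmap_alloc() stays generic and the type1-only limit
check remains at the vfio_dma_unmap_bitmap() call site, roughly along the
lines of the sketch below; the -E2BIG return is an assumption carried over
from the pre-patch code.

    vbmap = vfio_bitmap_alloc(size);
    if (!vbmap) {
        return -errno;
    }

    /* Only the type1 IOMMU path has a kernel-imposed bitmap size limit. */
    if (vbmap->size > container->max_dirty_bitmap_size) {
        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap->size);
        vfio_bitmap_dealloc(vbmap);
        return -E2BIG;
    }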




* Re: [PATCH 02/18] vfio/common: Fix error reporting in vfio_get_dirty_bitmap()
  2023-01-26 18:49 ` [PATCH 02/18] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
@ 2023-02-15  9:21   ` Cédric Le Goater
  0 siblings, 0 replies; 42+ messages in thread
From: Cédric Le Goater @ 2023-02-15  9:21 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 1/26/23 19:49, Avihai Horon wrote:
> Return -errno instead of -1 if VFIO_IOMMU_DIRTY_PAGES ioctl fails in
> vfio_get_dirty_bitmap().
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/common.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 93b18c5e3d..d892609cf1 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1288,6 +1288,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>   
>       ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
>       if (ret) {
> +        ret = -errno;
>           error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
>                   " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
>                   (uint64_t)range->size, errno);




* Re: [PATCH 03/18] vfio/common: Fix wrong %m usages
  2023-01-26 18:49 ` [PATCH 03/18] vfio/common: Fix wrong %m usages Avihai Horon
@ 2023-02-15  9:21   ` Cédric Le Goater
  0 siblings, 0 replies; 42+ messages in thread
From: Cédric Le Goater @ 2023-02-15  9:21 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 1/26/23 19:49, Avihai Horon wrote:
> There are several places where the %m conversion is used if one of
> vfio_dma_map(), vfio_dma_unmap() or vfio_get_dirty_bitmap() fails.
> 
> The %m usage in these places is wrong since %m relies on the errno value,
> while the above functions don't report errors via errno.
> 
> Fix it by using strerror() with the returned value instead.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/common.c | 29 ++++++++++++++++-------------
>   1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index d892609cf1..643418f6f1 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -656,17 +656,17 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                              read_only);
>           if (ret) {
>               error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> -                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
> +                         "0x%"HWADDR_PRIx", %p) = %d (%s)",
>                            container, iova,
> -                         iotlb->addr_mask + 1, vaddr, ret);
> +                         iotlb->addr_mask + 1, vaddr, ret, strerror(-ret));
>           }
>       } else {
>           ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
>           if (ret) {
>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                         "0x%"HWADDR_PRIx") = %d (%m)",
> +                         "0x%"HWADDR_PRIx") = %d (%s)",
>                            container, iova,
> -                         iotlb->addr_mask + 1, ret);
> +                         iotlb->addr_mask + 1, ret, strerror(-ret));
>           }
>       }
>   out:
> @@ -1048,8 +1048,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                          vaddr, section->readonly);
>       if (ret) {
>           error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> -                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                   container, iova, int128_get64(llsize), vaddr, ret);
> +                   "0x%"HWADDR_PRIx", %p) = %d (%s)",
> +                   container, iova, int128_get64(llsize), vaddr, ret,
> +                   strerror(-ret));
>           if (memory_region_is_ram_device(section->mr)) {
>               /* Allow unexpected mappings not to be fatal for RAM devices */
>               error_report_err(err);
> @@ -1181,16 +1182,18 @@ static void vfio_listener_region_del(MemoryListener *listener,
>               ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>               if (ret) {
>                   error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                             "0x%"HWADDR_PRIx") = %d (%m)",
> -                             container, iova, int128_get64(llsize), ret);
> +                             "0x%"HWADDR_PRIx") = %d (%s)",
> +                             container, iova, int128_get64(llsize), ret,
> +                             strerror(-ret));
>               }
>               iova += int128_get64(llsize);
>           }
>           ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>           if (ret) {
>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                         "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iova, int128_get64(llsize), ret);
> +                         "0x%"HWADDR_PRIx") = %d (%s)",
> +                         container, iova, int128_get64(llsize), ret,
> +                         strerror(-ret));
>           }
>       }
>   
> @@ -1337,9 +1340,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                                       translated_addr);
>           if (ret) {
>               error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
> -                         "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iova,
> -                         iotlb->addr_mask + 1, ret);
> +                         "0x%"HWADDR_PRIx") = %d (%s)",
> +                         container, iova, iotlb->addr_mask + 1, ret,
> +                         strerror(-ret));
>           }
>       }
>       rcu_read_unlock();
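
As background for the %m issue the patch fixes: %m is a glibc printf
extension that expands to strerror(errno), which is only meaningful right
after a failing libc call. The wrappers above return a negative errno value
instead of setting errno, so the message must decode the return value with
strerror(-ret). A standalone illustration follows (sketch, not QEMU code).

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in for vfio_dma_map() & co.: errors come back as -errno. */
    static int do_map(void)
    {
        return -EINVAL;
    }

    int main(void)
    {
        int ret = do_map();

        if (ret) {
            /* Wrong: %m prints strerror(errno), which do_map() never set. */
            printf("map failed: %d (%m)\n", ret);
            /* Right: decode the returned -errno value explicitly. */
            printf("map failed: %d (%s)\n", ret, strerror(-ret));
        }
        return 0;
    }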




* Re: [PATCH 04/18] vfio/common: Abort migration if dirty log start/stop/sync fails
  2023-01-26 18:49 ` [PATCH 04/18] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
@ 2023-02-15  9:41   ` Cédric Le Goater
  0 siblings, 0 replies; 42+ messages in thread
From: Cédric Le Goater @ 2023-02-15  9:41 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Michael S. Tsirkin, Peter Xu, Jason Wang,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, David Hildenbrand, Philippe Mathieu-Daudé,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 1/26/23 19:49, Avihai Horon wrote:
> If VFIO dirty pages log start/stop/sync fails during migration,
> migration should be aborted as pages dirtied by VFIO devices might not
> be reported properly.
> 
> This is not the case today, where in such scenario only an error is
> printed.
> 
> Fix it by aborting migration in the above scenario.
> 
> Fixes: 758b96b61d5c ("vfio/migrate: Move switch of dirty tracking into vfio_memory_listener")
> Fixes: b6dd6504e303 ("vfio: Add vfio_listener_log_sync to mark dirty pages")
> Fixes: 9e7b0442f23a ("vfio: Add ioctl to get dirty pages bitmap during dma unmap")
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>




Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.

> ---
>   hw/vfio/common.c | 53 ++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 643418f6f1..8e8ffbc046 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -41,6 +41,7 @@
>   #include "qapi/error.h"
>   #include "migration/migration.h"
>   #include "migration/misc.h"
> +#include "migration/qemu-file.h"
>   #include "sysemu/tpm.h"
>   
>   VFIOGroupList vfio_group_list =
> @@ -337,6 +338,19 @@ bool vfio_mig_active(void)
>       return true;
>   }
>   
> +static void vfio_set_migration_error(int err)
> +{
> +    MigrationState *ms = migrate_get_current();
> +
> +    if (migration_is_setup_or_active(ms->state)) {
> +        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
> +            if (ms->to_dst_file) {
> +                qemu_file_set_error(ms->to_dst_file, err);
> +            }
> +        }
> +    }
> +}
> +
>   static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>   {
>       VFIOGroup *group;
> @@ -633,6 +647,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       if (iotlb->target_as != &address_space_memory) {
>           error_report("Wrong target AS \"%s\", only system memory is allowed",
>                        iotlb->target_as->name ? iotlb->target_as->name : "none");
> +        vfio_set_migration_error(-EINVAL);
>           return;
>       }
>   
> @@ -667,6 +682,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                            "0x%"HWADDR_PRIx") = %d (%s)",
>                            container, iova,
>                            iotlb->addr_mask + 1, ret, strerror(-ret));
> +            vfio_set_migration_error(ret);
>           }
>       }
>   out:
> @@ -1212,7 +1228,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>       }
>   }
>   
> -static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
> +static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>   {
>       int ret;
>       struct vfio_iommu_type1_dirty_bitmap dirty = {
> @@ -1220,7 +1236,7 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>       };
>   
>       if (!container->dirty_pages_supported) {
> -        return;
> +        return 0;
>       }
>   
>       if (start) {
> @@ -1231,23 +1247,34 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>   
>       ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>       if (ret) {
> +        ret = -errno;
>           error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>                        dirty.flags, errno);
>       }
> +
> +    return ret;
>   }
>   
>   static void vfio_listener_log_global_start(MemoryListener *listener)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    int ret;
>   
> -    vfio_set_dirty_page_tracking(container, true);
> +    ret = vfio_set_dirty_page_tracking(container, true);
> +    if (ret) {
> +        vfio_set_migration_error(ret);
> +    }
>   }
>   
>   static void vfio_listener_log_global_stop(MemoryListener *listener)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    int ret;
>   
> -    vfio_set_dirty_page_tracking(container, false);
> +    ret = vfio_set_dirty_page_tracking(container, false);
> +    if (ret) {
> +        vfio_set_migration_error(ret);
> +    }
>   }
>   
>   static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> @@ -1323,19 +1350,18 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       VFIOContainer *container = giommu->container;
>       hwaddr iova = iotlb->iova + giommu->iommu_offset;
>       ram_addr_t translated_addr;
> +    int ret = -EINVAL;
>   
>       trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>   
>       if (iotlb->target_as != &address_space_memory) {
>           error_report("Wrong target AS \"%s\", only system memory is allowed",
>                        iotlb->target_as->name ? iotlb->target_as->name : "none");
> -        return;
> +        goto out;
>       }
>   
>       rcu_read_lock();
>       if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
> -        int ret;
> -
>           ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
>                                       translated_addr);
>           if (ret) {
> @@ -1346,6 +1372,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>           }
>       }
>       rcu_read_unlock();
> +
> +out:
> +    if (ret) {
> +        vfio_set_migration_error(ret);
> +    }
>   }
>   
>   static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
> @@ -1438,13 +1469,19 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>           MemoryRegionSection *section)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    int ret;
>   
>       if (vfio_listener_skipped_section(section)) {
>           return;
>       }
>   
>       if (vfio_devices_all_dirty_tracking(container)) {
> -        vfio_sync_dirty_bitmap(container, section);
> +        ret = vfio_sync_dirty_bitmap(container, section);
> +        if (ret) {
> +            error_report("vfio: Failed to sync dirty bitmap, err: %d (%s)", ret,
> +                         strerror(-ret));
> +            vfio_set_migration_error(ret);
> +        }
>       }
>   }
>   




Thread overview: 42+ messages
2023-01-26 18:49 [PATCH 00/18] vfio: Add migration pre-copy support and device dirty tracking Avihai Horon
2023-01-26 18:49 ` [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-01-26 23:52   ` Alex Williamson
2023-01-31 12:44     ` Avihai Horon
2023-01-31 22:43       ` Alex Williamson
2023-01-31 23:29         ` Jason Gunthorpe
2023-02-01  4:15           ` Alex Williamson
2023-02-01 17:28             ` Jason Gunthorpe
2023-02-01 18:42               ` Alex Williamson
2023-02-01 20:10                 ` Jason Gunthorpe
2023-01-26 18:49 ` [PATCH 02/18] vfio/common: Fix error reporting in vfio_get_dirty_bitmap() Avihai Horon
2023-02-15  9:21   ` Cédric Le Goater
2023-01-26 18:49 ` [PATCH 03/18] vfio/common: Fix wrong %m usages Avihai Horon
2023-02-15  9:21   ` Cédric Le Goater
2023-01-26 18:49 ` [PATCH 04/18] vfio/common: Abort migration if dirty log start/stop/sync fails Avihai Horon
2023-02-15  9:41   ` Cédric Le Goater
2023-01-26 18:49 ` [PATCH 05/18] vfio/common: Add VFIOBitmap and (de)alloc functions Avihai Horon
2023-01-27 21:11   ` Alex Williamson
2023-02-12 15:36     ` Avihai Horon
2023-02-14 21:28       ` Alex Williamson
2023-01-26 18:49 ` [PATCH 06/18] util: Add iova_tree_nnodes() Avihai Horon
2023-02-09 22:21   ` Peter Xu
2023-01-26 18:49 ` [PATCH 07/18] util: Extend iova_tree_foreach() to take data argument Avihai Horon
2023-02-09 22:21   ` Peter Xu
2023-01-26 18:49 ` [PATCH 08/18] vfio/common: Record DMA mapped IOVA ranges Avihai Horon
2023-01-27 21:42   ` Alex Williamson
2023-02-12 15:40     ` Avihai Horon
2023-02-13 15:25       ` Alex Williamson
2023-01-26 18:49 ` [PATCH 09/18] vfio/common: Add device dirty page tracking start/stop Avihai Horon
2023-01-26 18:49 ` [PATCH 10/18] vfio/common: Extract code from vfio_get_dirty_bitmap() to new function Avihai Horon
2023-01-26 18:49 ` [PATCH 11/18] vfio/common: Add device dirty page bitmap sync Avihai Horon
2023-01-27 23:37   ` Alex Williamson
2023-02-12 15:49     ` Avihai Horon
2023-01-26 18:49 ` [PATCH 12/18] vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap() Avihai Horon
2023-01-26 18:49 ` [PATCH 13/18] memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute Avihai Horon
2023-02-09 22:16   ` Peter Xu
2023-01-26 18:49 ` [PATCH 14/18] intel-iommu: Implement get_attr() method Avihai Horon
2023-02-09 22:18   ` Peter Xu
2023-01-26 18:49 ` [PATCH 15/18] vfio/common: Support device dirty page tracking with vIOMMU Avihai Horon
2023-01-26 18:49 ` [PATCH 16/18] vfio/common: Optimize " Avihai Horon
2023-01-26 18:49 ` [PATCH 17/18] vfio/migration: Query device dirty page tracking support Avihai Horon
2023-01-26 18:49 ` [PATCH 18/18] docs/devel: Document VFIO device dirty page tracking Avihai Horon
