All of lore.kernel.org
* [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2
@ 2022-11-03 16:16 Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 01/17] migration: Remove res_compatible parameter Avihai Horon
                   ` (16 more replies)
  0 siblings, 17 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins


Hello,

A long time has passed since v2 of this series was posted. During this
time we had several KVM calls to discuss the problems that needed to be
solved in order to move forward.

This version of the series includes quite a few changes, and I believe
it addresses all the major problems we have discussed. Please see the
updated patch list and the change log from v2 below.

Following the acceptance of VFIO migration protocol v2 in the kernel,
this series implements VFIO migration according to the new v2 protocol
and replaces the now-deprecated v1 implementation.

The main differences between v1 and v2 migration protocols are:
1. VFIO device state is represented as a finite state machine instead of
   a bitmap.

2. The migration interface with the kernel is done using the
   VFIO_DEVICE_FEATURE ioctl and normal read() and write() instead of
   the migration region used in v1.

3. Migration protocol v2 currently doesn't support the pre-copy phase of
   migration.

Full description of the v2 protocol and the differences from v1 can be
found here [1].

Patch list:

Patches 1-4 are taken from Juan's RFC [2].
As discussed in the KVM call, since we are going to add a new ioctl to
get the device data size while the device is RUNNING, we don't need the
stop-and-resume VM functionality from the RFC.

Patches 5-11 are prep patches that fix bugs, add a QEMUFile function
that will be used later, and refactor the v1 protocol code to make it
easier to add the v2 protocol.

Patches 12-16 implement v2 protocol and remove v1 protocol.

Patch 17 is a preview patch (not for merging yet) that demonstrates how
the new ioctl to get the device state size will work once it is added.

Thanks.

Changes from v2 [3]:
- Rebased on top of latest master branch.

- Added relevant patches from Juan's RFC [2] with minor changes:
  1. Added Reviewed-by tag to patch #3 in the RFC.
  2. Adjusted patch #6 to work without patch #4 in the RFC.

- Added a new patch "vfio/migration: Fix wrong enum usage" that fixes a
  small bug in the v1 code. This patch was sent a few weeks ago [4] but
  has not been picked up yet.

- Patch #2 (vfio/migration: Skip pre-copy if dirty page tracking is not supported):
  1. Dropped this patch and replaced it with
     "vfio/migration: Allow migration without VFIO IOMMU dirty tracking support".
     The new patch takes a different approach: instead of skipping the
     pre-copy phase completely, the QEMU VFIO code will mark RAM dirty
     (instead of the kernel doing so). This ensures that the current
     migration behavior is not changed and the SLA is taken into account.

- Patch #4 (vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one):
  1. Improved commit message to better explain the change.

- Patch #7 (vfio/migration: Implement VFIO migration protocol v2):
  1. Enhanced vfio_migration_set_state() error reporting.
  2. In vfio_save_complete_precopy() of v2 protocol - when changing
     device state to STOP, set recover state to ERROR instead of STOP as
     suggested by Joao.
  3. Constify SaveVMHandlers of v2 protocol.
  4. Modified trace_vfio_vmstate_change and trace_vfio_migration_set_state
     to print device state string instead of enum.
  5. Replaced qemu_put_buffer_async() with qemu_put_buffer() in
     vfio_save_block(), as requested by Juan.
  6. Implemented the v2 protocol version of vfio_save_pending() as
     requested by Juan. Until the ioctl to get the device state size is
     added, we just report a big hard-coded value, as agreed in the KVM
     call.

- Patch #9 (vfio/migration: Reset device if setting recover state fails):
  1. Enhanced error reporting.
  2. Set VFIOMigration->device_state to RUNNING after device reset.

- Patch #11 (docs/devel: Align vfio-migration docs to VFIO migration v2):
  1. Adjusted the VFIO migration documentation for the added
     vfio_save_pending().

- Added the last patch (which is not for merging yet) that demonstrates
  how the new ioctl to get device state size will work once added.

Changes from v1 [5]:
- Split the big patch that replaced v1 with v2 into several patches, as
  suggested by Joao, to make review easier.
- Changed warn_report to warn_report_once when the container doesn't
  support dirty tracking.
- Added a Reviewed-by tag.

[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/

[2]
https://lore.kernel.org/qemu-devel/20221003031600.20084-1-quintela@redhat.com/T/

[3]
https://lore.kernel.org/all/20220530170739.19072-1-avihaih@nvidia.com/

[4]
https://lore.kernel.org/all/20221016085752.32740-1-avihaih@nvidia.com/

[5]
https://lore.kernel.org/all/20220512154320.19697-1-avihaih@nvidia.com/

Avihai Horon (13):
  vfio/migration: Fix wrong enum usage
  vfio/migration: Fix NULL pointer dereference bug
  vfio/migration: Allow migration without VFIO IOMMU dirty tracking
    support
  migration/qemu-file: Add qemu_file_get_to_fd()
  vfio/common: Change vfio_devices_all_running_and_saving() logic to
    equivalent one
  vfio/migration: Move migration v1 logic to vfio_migration_init()
  vfio/migration: Rename functions/structs related to v1 protocol
  vfio/migration: Implement VFIO migration protocol v2
  vfio/migration: Remove VFIO migration protocol v1
  vfio/migration: Reset device if setting recover state fails
  vfio: Alphabetize migration section of VFIO trace-events file
  docs/devel: Align vfio-migration docs to VFIO migration v2
  vfio/migration: Query device data size in vfio_save_pending()

Juan Quintela (4):
  migration: Remove res_compatible parameter
  migration: No save_live_pending() method uses the QEMUFile parameter
  migration: Block migration comment or code is wrong
  migration: Simplify migration_iteration_run()

 docs/devel/vfio-migration.rst  |  68 ++--
 hw/s390x/s390-stattrib.c       |   8 +-
 hw/vfio/common.c               | 103 +++--
 hw/vfio/migration.c            | 669 +++++++++------------------------
 hw/vfio/trace-events           |  28 +-
 include/hw/vfio/vfio-common.h  |   8 +-
 include/migration/register.h   |  22 +-
 linux-headers/linux/vfio.h     |  13 +
 migration/block-dirty-bitmap.c |  10 +-
 migration/block.c              |  13 +-
 migration/migration.c          |  35 +-
 migration/qemu-file.c          |  34 ++
 migration/qemu-file.h          |   1 +
 migration/ram.c                |  10 +-
 migration/savevm.c             |  17 +-
 migration/savevm.h             |   6 +-
 migration/trace-events         |   2 +-
 17 files changed, 402 insertions(+), 645 deletions(-)

-- 
2.21.3




* [PATCH v3 01/17] migration: Remove res_compatible parameter
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 17:52   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter Avihai Horon
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

From: Juan Quintela <quintela@redhat.com>

It was only used for RAM, and in that case it meant that this amount
of data was sent for memory.  Just delete the field in all callers.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/s390x/s390-stattrib.c       |  6 ++----
 hw/vfio/migration.c            | 10 ++++------
 hw/vfio/trace-events           |  2 +-
 include/migration/register.h   | 20 ++++++++++----------
 migration/block-dirty-bitmap.c |  7 +++----
 migration/block.c              |  7 +++----
 migration/migration.c          |  9 ++++-----
 migration/ram.c                |  8 +++-----
 migration/savevm.c             | 14 +++++---------
 migration/savevm.h             |  4 +---
 migration/trace-events         |  2 +-
 11 files changed, 37 insertions(+), 52 deletions(-)

diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index 9eda1c3b2a..ee60b53da4 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -183,16 +183,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque)
 }
 
 static void cmma_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
-                              uint64_t *res_precopy_only,
-                              uint64_t *res_compatible,
-                              uint64_t *res_postcopy_only)
+                              uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     S390StAttribState *sas = S390_STATTRIB(opaque);
     S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
     long long res = sac->get_dirtycount(sas);
 
     if (res >= 0) {
-        *res_precopy_only += res;
+        *res_precopy += res;
     }
 }
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3de4252111..3423f113f0 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -458,9 +458,8 @@ static void vfio_save_cleanup(void *opaque)
 
 static void vfio_save_pending(QEMUFile *f, void *opaque,
                               uint64_t threshold_size,
-                              uint64_t *res_precopy_only,
-                              uint64_t *res_compatible,
-                              uint64_t *res_postcopy_only)
+                              uint64_t *res_precopy,
+                              uint64_t *res_postcopy)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
@@ -471,10 +470,9 @@ static void vfio_save_pending(QEMUFile *f, void *opaque,
         return;
     }
 
-    *res_precopy_only += migration->pending_bytes;
+    *res_precopy += migration->pending_bytes;
 
-    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
-                            *res_postcopy_only, *res_compatible);
+    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
 }
 
 static int vfio_save_iterate(QEMUFile *f, void *opaque)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 73dffe9e00..a21cbd2a56 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,7 +157,7 @@ vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
 vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
 vfio_save_device_config_state(const char *name) " (%s)"
-vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
 vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
diff --git a/include/migration/register.h b/include/migration/register.h
index c1dcff0f90..1950fee6a8 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
     int (*save_setup)(QEMUFile *f, void *opaque);
     void (*save_live_pending)(QEMUFile *f, void *opaque,
                               uint64_t threshold_size,
-                              uint64_t *res_precopy_only,
-                              uint64_t *res_compatible,
-                              uint64_t *res_postcopy_only);
+                              uint64_t *rest_precopy,
+                              uint64_t *rest_postcopy);
     /* Note for save_live_pending:
-     * - res_precopy_only is for data which must be migrated in precopy phase
-     *     or in stopped state, in other words - before target vm start
-     * - res_compatible is for data which may be migrated in any phase
-     * - res_postcopy_only is for data which must be migrated in postcopy phase
-     *     or in stopped state, in other words - after source vm stop
+     * - res_precopy is for data which must be migrated in precopy
+     *     phase or in stopped state, in other words - before target
+     *     vm start
+     * - res_postcopy is for data which must be migrated in postcopy
+     *     phase or in stopped state, in other words - after source vm
+     *     stop
      *
-     * Sum of res_postcopy_only, res_compatible and res_postcopy_only is the
-     * whole amount of pending data.
+     * Sum of res_precopy and res_postcopy is the whole amount of
+     * pending data.
      */
 
 
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 9aba7d9c22..dfea546330 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -763,9 +763,8 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
 
 static void dirty_bitmap_save_pending(QEMUFile *f, void *opaque,
                                       uint64_t max_size,
-                                      uint64_t *res_precopy_only,
-                                      uint64_t *res_compatible,
-                                      uint64_t *res_postcopy_only)
+                                      uint64_t *res_precopy,
+                                      uint64_t *res_postcopy)
 {
     DBMSaveState *s = &((DBMState *)opaque)->save;
     SaveBitmapState *dbms;
@@ -785,7 +784,7 @@ static void dirty_bitmap_save_pending(QEMUFile *f, void *opaque,
 
     trace_dirty_bitmap_save_pending(pending, max_size);
 
-    *res_postcopy_only += pending;
+    *res_postcopy += pending;
 }
 
 /* First occurrence of this bitmap. It should be created if doesn't exist */
diff --git a/migration/block.c b/migration/block.c
index 3577c815a9..4ae8f837b0 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -863,9 +863,8 @@ static int block_save_complete(QEMUFile *f, void *opaque)
 }
 
 static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
-                               uint64_t *res_precopy_only,
-                               uint64_t *res_compatible,
-                               uint64_t *res_postcopy_only)
+                               uint64_t *res_precopy,
+                               uint64_t *res_postcopy)
 {
     /* Estimate pending number of bytes to send */
     uint64_t pending;
@@ -886,7 +885,7 @@ static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
 
     trace_migration_block_save_pending(pending);
     /* We don't do postcopy */
-    *res_precopy_only += pending;
+    *res_precopy += pending;
 }
 
 static int block_load(QEMUFile *f, void *opaque, int version_id)
diff --git a/migration/migration.c b/migration/migration.c
index 739bb683f3..a4a18228c6 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3735,15 +3735,14 @@ typedef enum {
  */
 static MigIterateState migration_iteration_run(MigrationState *s)
 {
-    uint64_t pending_size, pend_pre, pend_compat, pend_post;
+    uint64_t pending_size, pend_pre, pend_post;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
     qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
-                              &pend_compat, &pend_post);
-    pending_size = pend_pre + pend_compat + pend_post;
+                              &pend_post);
+    pending_size = pend_pre + pend_post;
 
-    trace_migrate_pending(pending_size, s->threshold_size,
-                          pend_pre, pend_compat, pend_post);
+    trace_migrate_pending(pending_size, s->threshold_size, pend_pre, pend_post);
 
     if (pending_size && pending_size >= s->threshold_size) {
         /* Still a significant amount to transfer */
diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..20167e1102 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 }
 
 static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
-                             uint64_t *res_precopy_only,
-                             uint64_t *res_compatible,
-                             uint64_t *res_postcopy_only)
+                             uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     RAMState **temp = opaque;
     RAMState *rs = *temp;
@@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
 
     if (migrate_postcopy_ram()) {
         /* We can do postcopy, and all the data is postcopiable */
-        *res_compatible += remaining_size;
+        *res_postcopy += remaining_size;
     } else {
-        *res_precopy_only += remaining_size;
+        *res_precopy += remaining_size;
     }
 }
 
diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..4d02887f25 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1472,16 +1472,13 @@ flush:
  * for units that can't do postcopy.
  */
 void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
-                               uint64_t *res_precopy_only,
-                               uint64_t *res_compatible,
-                               uint64_t *res_postcopy_only)
+                               uint64_t *res_precopy,
+                               uint64_t *res_postcopy)
 {
     SaveStateEntry *se;
 
-    *res_precopy_only = 0;
-    *res_compatible = 0;
-    *res_postcopy_only = 0;
-
+    *res_precopy = 0;
+    *res_postcopy = 0;
 
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
         if (!se->ops || !se->ops->save_live_pending) {
@@ -1493,8 +1490,7 @@ void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
             }
         }
         se->ops->save_live_pending(f, se->opaque, threshold_size,
-                                   res_precopy_only, res_compatible,
-                                   res_postcopy_only);
+                                   res_precopy, res_postcopy);
     }
 }
 
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342cb4..9bd55c336c 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -41,9 +41,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
                                        bool inactivate_disks);
 void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
-                               uint64_t *res_precopy_only,
-                               uint64_t *res_compatible,
-                               uint64_t *res_postcopy_only);
+                               uint64_t *res_precopy, uint64_t *res_postcopy);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
 void qemu_savevm_send_open_return_path(QEMUFile *f);
 int qemu_savevm_send_packaged(QEMUFile *f, const uint8_t *buf, size_t len);
diff --git a/migration/trace-events b/migration/trace-events
index 57003edcbd..f2a873fd6c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -150,7 +150,7 @@ migrate_fd_cleanup(void) ""
 migrate_fd_error(const char *error_desc) "error=%s"
 migrate_fd_cancel(void) ""
 migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at 0x%zx len 0x%zx"
-migrate_pending(uint64_t size, uint64_t max, uint64_t pre, uint64_t compat, uint64_t post) "pending size %" PRIu64 " max %" PRIu64 " (pre = %" PRIu64 " compat=%" PRIu64 " post=%" PRIu64 ")"
+migrate_pending(uint64_t size, uint64_t max, uint64_t pre, uint64_t post) "pending size %" PRIu64 " max %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
 migrate_send_rp_message(int msg_type, uint16_t len) "%d: len %d"
 migrate_send_rp_recv_bitmap(char *name, int64_t size) "block '%s' size 0x%"PRIi64
 migration_completion_file_err(void) ""
-- 
2.21.3




* [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 01/17] migration: Remove res_compatible parameter Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 17:57   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 03/17] migration: Block migration comment or code is wrong Avihai Horon
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

From: Juan Quintela <quintela@redhat.com>

No save_live_pending() method uses the QEMUFile parameter, so remove it
everywhere.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/s390x/s390-stattrib.c       | 2 +-
 hw/vfio/migration.c            | 6 ++----
 include/migration/register.h   | 6 ++----
 migration/block-dirty-bitmap.c | 3 +--
 migration/block.c              | 2 +-
 migration/migration.c          | 3 +--
 migration/ram.c                | 2 +-
 migration/savevm.c             | 5 ++---
 migration/savevm.h             | 2 +-
 9 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index ee60b53da4..9b74eeadf3 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -182,7 +182,7 @@ static int cmma_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void cmma_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+static void cmma_save_pending(void *opaque, uint64_t max_size,
                               uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     S390StAttribState *sas = S390_STATTRIB(opaque);
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3423f113f0..760d5f3c5c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -456,10 +456,8 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
-static void vfio_save_pending(QEMUFile *f, void *opaque,
-                              uint64_t threshold_size,
-                              uint64_t *res_precopy,
-                              uint64_t *res_postcopy)
+static void vfio_save_pending(void *opaque,  uint64_t threshold_size,
+                              uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
diff --git a/include/migration/register.h b/include/migration/register.h
index 1950fee6a8..5b5424ed8f 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -46,10 +46,8 @@ typedef struct SaveVMHandlers {
 
     /* This runs outside the iothread lock!  */
     int (*save_setup)(QEMUFile *f, void *opaque);
-    void (*save_live_pending)(QEMUFile *f, void *opaque,
-                              uint64_t threshold_size,
-                              uint64_t *rest_precopy,
-                              uint64_t *rest_postcopy);
+    void (*save_live_pending)(void *opaque,  uint64_t threshold_size,
+                              uint64_t *rest_precopy, uint64_t *rest_postcopy);
     /* Note for save_live_pending:
      * - res_precopy is for data which must be migrated in precopy
      *     phase or in stopped state, in other words - before target
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index dfea546330..9d4f56693f 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -761,8 +761,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void dirty_bitmap_save_pending(QEMUFile *f, void *opaque,
-                                      uint64_t max_size,
+static void dirty_bitmap_save_pending(void *opaque, uint64_t max_size,
                                       uint64_t *res_precopy,
                                       uint64_t *res_postcopy)
 {
diff --git a/migration/block.c b/migration/block.c
index 4ae8f837b0..b3d680af75 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -862,7 +862,7 @@ static int block_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void block_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+static void block_save_pending(void *opaque, uint64_t max_size,
                                uint64_t *res_precopy,
                                uint64_t *res_postcopy)
 {
diff --git a/migration/migration.c b/migration/migration.c
index a4a18228c6..ffe868b86f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3738,8 +3738,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     uint64_t pending_size, pend_pre, pend_post;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
 
-    qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
-                              &pend_post);
+    qemu_savevm_state_pending(s->threshold_size, &pend_pre, &pend_post);
     pending_size = pend_pre + pend_post;
 
     trace_migrate_pending(pending_size, s->threshold_size, pend_pre, pend_post);
diff --git a/migration/ram.c b/migration/ram.c
index 20167e1102..48a31b87c8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3434,7 +3434,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
+static void ram_save_pending(void *opaque, uint64_t max_size,
                              uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     RAMState **temp = opaque;
diff --git a/migration/savevm.c b/migration/savevm.c
index 4d02887f25..9ddcbba4e3 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1471,8 +1471,7 @@ flush:
  * the result is split into the amount for units that can and
  * for units that can't do postcopy.
  */
-void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
-                               uint64_t *res_precopy,
+void qemu_savevm_state_pending(uint64_t threshold_size, uint64_t *res_precopy,
                                uint64_t *res_postcopy)
 {
     SaveStateEntry *se;
@@ -1489,7 +1488,7 @@ void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
                 continue;
             }
         }
-        se->ops->save_live_pending(f, se->opaque, threshold_size,
+        se->ops->save_live_pending(se->opaque, threshold_size,
                                    res_precopy, res_postcopy);
     }
 }
diff --git a/migration/savevm.h b/migration/savevm.h
index 9bd55c336c..98fae6f9b3 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -40,7 +40,7 @@ void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
                                        bool inactivate_disks);
-void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
+void qemu_savevm_state_pending(uint64_t max_size,
                                uint64_t *res_precopy, uint64_t *res_postcopy);
 void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
 void qemu_savevm_send_open_return_path(QEMUFile *f);
-- 
2.21.3




* [PATCH v3 03/17] migration: Block migration comment or code is wrong
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 01/17] migration: Remove res_compatible parameter Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 18:36   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 04/17] migration: Simplify migration_iteration_run() Avihai Horon
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

From: Juan Quintela <quintela@redhat.com>

It appears that what is wrong is the code. During the bulk stage we
need to make sure that at least one block is reported as pending, with
no games played with max_size at all.

Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 migration/block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index b3d680af75..39ce4003c6 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -879,8 +879,8 @@ static void block_save_pending(void *opaque, uint64_t max_size,
     blk_mig_unlock();
 
     /* Report at least one block pending during bulk phase */
-    if (pending <= max_size && !block_mig_state.bulk_completed) {
-        pending = max_size + BLK_MIG_BLOCK_SIZE;
+    if (!pending && !block_mig_state.bulk_completed) {
+        pending = BLK_MIG_BLOCK_SIZE;
     }
 
     trace_migration_block_save_pending(pending);
-- 
2.21.3




* [PATCH v3 04/17] migration: Simplify migration_iteration_run()
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (2 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 03/17] migration: Block migration comment or code is wrong Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 18:56   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 05/17] vfio/migration: Fix wrong enum usage Avihai Horon
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

From: Juan Quintela <quintela@redhat.com>

Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/migration.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index ffe868b86f..59cc3c309b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3743,23 +3743,24 @@ static MigIterateState migration_iteration_run(MigrationState *s)
 
     trace_migrate_pending(pending_size, s->threshold_size, pend_pre, pend_post);
 
-    if (pending_size && pending_size >= s->threshold_size) {
-        /* Still a significant amount to transfer */
-        if (!in_postcopy && pend_pre <= s->threshold_size &&
-            qatomic_read(&s->start_postcopy)) {
-            if (postcopy_start(s)) {
-                error_report("%s: postcopy failed to start", __func__);
-            }
-            return MIG_ITERATE_SKIP;
-        }
-        /* Just another iteration step */
-        qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
-    } else {
+
+    if (pending_size < s->threshold_size) {
         trace_migration_thread_low_pending(pending_size);
         migration_completion(s);
         return MIG_ITERATE_BREAK;
     }
 
+    /* Still a significant amount to transfer */
+    if (!in_postcopy && pend_pre <= s->threshold_size &&
+        qatomic_read(&s->start_postcopy)) {
+        if (postcopy_start(s)) {
+            error_report("%s: postcopy failed to start", __func__);
+        }
+        return MIG_ITERATE_SKIP;
+    }
+
+    /* Just another iteration step */
+    qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
     return MIG_ITERATE_RESUME;
 }
 
-- 
2.21.3




* [PATCH v3 05/17] vfio/migration: Fix wrong enum usage
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (3 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 04/17] migration: Simplify migration_iteration_run() Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 19:05   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

vfio_migration_init() initializes VFIOMigration->device_state using an
enum of VFIO migration protocol v2. The currently implemented protocol
is v1, so the v1 enum should be used. Fix it.

Fixes: 429c72800654 ("vfio/migration: Fix incorrect initialization value for parameters in VFIOMigration")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
---
 hw/vfio/migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 760d5f3c5c..8ae1bd31a8 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -802,7 +802,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     }
 
     vbasedev->migration = g_new0(VFIOMigration, 1);
-    vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+    vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
     vbasedev->migration->vm_running = runstate_is_running();
 
     ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
-- 
2.21.3




* [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (4 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 05/17] vfio/migration: Fix wrong enum usage Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 19:08   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

As part of its error flow, vfio_vmstate_change() accesses
MigrationState->to_dst_file without any checks. This can cause a NULL
pointer dereference if the error flow is taken and
MigrationState->to_dst_file is not set.

For example, this can happen if the VM is started or stopped while no
migration is in progress and the vfio_vmstate_change() error flow is
taken, as MigrationState->to_dst_file is not set at that time.

Fix it by checking that MigrationState->to_dst_file is set before using
it.

Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 hw/vfio/migration.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8ae1bd31a8..f5e72c7ac1 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -740,7 +740,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
          */
         error_report("%s: Failed to set device state 0x%x", vbasedev->name,
                      (migration->device_state & mask) | value);
-        qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+        if (migrate_get_current()->to_dst_file) {
+            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+        }
     }
     vbasedev->migration->vm_running = running;
     trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
-- 
2.21.3




* [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (5 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-15 23:36   ` Alex Williamson
  2022-11-03 16:16 ` [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Currently, if IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked. This is because a DMA-able VFIO device
can dirty RAM pages without updating QEMU about it, thus breaking the
migration.

However, this doesn't mean that migration can't be done at all.
In that case, allow migration and let the QEMU VFIO code mark the
entire bitmap dirty.

This ensures that all pages that might have gotten dirty are reported
back, and thus guarantees a valid migration even without VFIO IOMMU
dirty tracking support.

The motivation for this patch is the future introduction of iommufd [1].
iommufd will directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into its internal ops, allowing the usage of these IOCTLs
over iommufd. However, VFIO IOMMU dirty tracking will not be supported
by this VFIO compatibility API.

This patch will allow migration by hosts that use the VFIO compatibility
API and prevent migration regressions caused by the lack of VFIO IOMMU
dirty tracking support.

[1] https://lore.kernel.org/kvm/0-v2-f9436d0bde78+4bb-iommufd_jgg@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c    | 84 +++++++++++++++++++++++++++++++++++++--------
 hw/vfio/migration.c |  3 +-
 2 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b5d8c0bf6..5470dbcb04 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -392,6 +392,41 @@ static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
     return true;
 }
 
+static int vfio_dma_unmap_mark_dirty(VFIOContainer *container, hwaddr iova,
+                                     hwaddr size, IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+    unsigned long *bitmap;
+    uint64_t pages;
+    int ret;
+
+    pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    bitmap = g_try_malloc0(ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                           BITS_PER_BYTE);
+    if (!bitmap) {
+        return -ENOMEM;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
+    if (ret) {
+        error_report("VFIO_UNMAP_DMA failed : %m");
+        g_free(bitmap);
+        return ret;
+    }
+
+    bitmap_set(bitmap, 0, pages);
+    cpu_physical_memory_set_dirty_lebitmap(bitmap, iotlb->translated_addr,
+                                           pages);
+    g_free(bitmap);
+
+    return 0;
+}
+
 static int vfio_dma_unmap_bitmap(VFIOContainer *container,
                                  hwaddr iova, ram_addr_t size,
                                  IOMMUTLBEntry *iotlb)
@@ -401,6 +436,10 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
     uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
     int ret;
 
+    if (!container->dirty_pages_supported) {
+        return vfio_dma_unmap_mark_dirty(container, iova, size, iotlb);
+    }
+
     unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
 
     unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
@@ -460,8 +499,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
-    if (iotlb && container->dirty_pages_supported &&
-        vfio_devices_all_running_and_saving(container)) {
+    if (iotlb && vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
@@ -1274,14 +1312,18 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    vfio_set_dirty_page_tracking(container, true);
+    if (container->dirty_pages_supported) {
+        vfio_set_dirty_page_tracking(container, true);
+    }
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    vfio_set_dirty_page_tracking(container, false);
+    if (container->dirty_pages_supported) {
+        vfio_set_dirty_page_tracking(container, false);
+    }
 }
 
 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
@@ -1289,9 +1331,29 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 {
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
+    unsigned long *bitmap;
+    uint64_t bitmap_size;
     uint64_t pages;
     int ret;
 
+    pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                  BITS_PER_BYTE;
+    bitmap = g_try_malloc0(bitmap_size);
+    if (!bitmap) {
+        return -ENOMEM;
+    }
+
+    if (!container->dirty_pages_supported) {
+        bitmap_set(bitmap, 0, pages);
+        cpu_physical_memory_set_dirty_lebitmap(bitmap, ram_addr, pages);
+        trace_vfio_get_dirty_bitmap(container->fd, iova, size, bitmap_size,
+                                    ram_addr);
+        g_free(bitmap);
+
+        return 0;
+    }
+
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1306,15 +1368,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
      * to qemu_real_host_page_size.
      */
     range->bitmap.pgsize = qemu_real_host_page_size();
-
-    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size();
-    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                                         BITS_PER_BYTE;
-    range->bitmap.data = g_try_malloc0(range->bitmap.size);
-    if (!range->bitmap.data) {
-        ret = -ENOMEM;
-        goto err_out;
-    }
+    range->bitmap.size = bitmap_size;
+    range->bitmap.data = (void *)bitmap;
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
     if (ret) {
@@ -1465,8 +1520,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    if (vfio_listener_skipped_section(section) ||
-        !container->dirty_pages_supported) {
+    if (vfio_listener_skipped_section(section)) {
         return;
     }
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index f5e72c7ac1..99ffb75782 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -857,11 +857,10 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOContainer *container = vbasedev->group->container;
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration) {
         goto add_blocker;
     }
 
-- 
2.21.3




* [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (6 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-08 20:26   ` Vladimir Sementsov-Ogievskiy
  2022-11-03 16:16 ` [PATCH v3 09/17] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Add a new function, qemu_file_get_to_fd(), that allows reading data
from a QEMUFile and writing it straight into a given fd.

This will be used later in VFIO migration code.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/qemu-file.c | 34 ++++++++++++++++++++++++++++++++++
 migration/qemu-file.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4f400c2e52..58cb0cd608 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -919,3 +919,37 @@ QIOChannel *qemu_file_get_ioc(QEMUFile *file)
 {
     return file->ioc;
 }
+
+/*
+ * Read size bytes from QEMUFile f and write them to fd.
+ */
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
+{
+    while (size) {
+        size_t pending = f->buf_size - f->buf_index;
+        ssize_t rc;
+
+        if (!pending) {
+            rc = qemu_fill_buffer(f);
+            if (rc < 0) {
+                return rc;
+            }
+            if (rc == 0) {
+                return -1;
+            }
+            continue;
+        }
+
+        rc = write(fd, f->buf + f->buf_index, MIN(pending, size));
+        if (rc < 0) {
+            return rc;
+        }
+        if (rc == 0) {
+            return -1;
+        }
+        f->buf_index += rc;
+        size -= rc;
+    }
+
+    return 0;
+}
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index fa13d04d78..9d0155a2a1 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -148,6 +148,7 @@ int qemu_file_shutdown(QEMUFile *f);
 QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 void qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
+int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
 
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
-- 
2.21.3




* [PATCH v3 09/17] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (7 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

vfio_devices_all_running_and_saving() is used to check if migration is
in pre-copy phase. This is done by checking if migration is in setup or
active states and if all VFIO devices are in pre-copy state, i.e.
_SAVING | _RUNNING.

VFIO migration v2 protocol currently doesn't support pre-copy, so it
doesn't have an equivalent VFIO pre-copy state like v1 protocol.

As preparation for adding v2 protocol, change
vfio_devices_all_running_and_saving() logic such that it doesn't use the
VFIO pre-copy state.

The new equivalent logic checks if migration is in active state and if
all VFIO devices are in running state [1]. No functional changes
intended.

[1] Note that checking if migration is in setup or active states and if
all VFIO devices are in running state doesn't guarantee that we are in
pre-copy phase, thus we check only if migration is in active state.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5470dbcb04..47116ba668 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "migration/misc.h"
 #include "sysemu/tpm.h"
 
 VFIOGroupList vfio_group_list =
@@ -363,13 +364,16 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
-static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+/*
+ * Check if all VFIO devices are running and migration is active, which is
+ * essentially equivalent to the migration being in pre-copy phase.
+ */
+static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
     VFIOGroup *group;
     VFIODevice *vbasedev;
-    MigrationState *ms = migrate_get_current();
 
-    if (!migration_is_setup_or_active(ms->state)) {
+    if (!migration_is_active(migrate_get_current())) {
         return false;
     }
 
@@ -381,8 +385,7 @@ static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
                 return false;
             }
 
-            if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
-                (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+            if (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING) {
                 continue;
             } else {
                 return false;
@@ -499,7 +502,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
-    if (iotlb && vfio_devices_all_running_and_saving(container)) {
+    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
-- 
2.21.3




* [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init()
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (8 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 09/17] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-15 23:56   ` Alex Williamson
  2022-11-03 16:16 ` [PATCH v3 11/17] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
vfio_migration_init(). This logic is specific to v1 protocol and moving
it will make it easier to add the v2 protocol implementation later.
No functional changes intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c  | 30 +++++++++++++++---------------
 hw/vfio/trace-events |  2 +-
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 99ffb75782..0e3a950746 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -785,14 +785,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
     vbasedev->migration = NULL;
 }
 
-static int vfio_migration_init(VFIODevice *vbasedev,
-                               struct vfio_region_info *info)
+static int vfio_migration_init(VFIODevice *vbasedev)
 {
     int ret;
     Object *obj;
     VFIOMigration *migration;
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
+    struct vfio_region_info *info = NULL;
 
     if (!vbasedev->ops->vfio_get_object) {
         return -EINVAL;
@@ -803,6 +803,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return -EINVAL;
     }
 
+    ret = vfio_get_dev_region_info(vbasedev,
+                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+                                   &info);
+    if (ret) {
+        return ret;
+    }
+
     vbasedev->migration = g_new0(VFIOMigration, 1);
     vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
     vbasedev->migration->vm_running = runstate_is_running();
@@ -822,6 +830,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         goto err;
     }
 
+    g_free(info);
+
     migration = vbasedev->migration;
     migration->vbasedev = vbasedev;
 
@@ -844,6 +854,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
     return 0;
 
 err:
+    g_free(info);
     vfio_migration_exit(vbasedev);
     return ret;
 }
@@ -857,34 +868,23 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-    struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
     if (!vbasedev->enable_migration) {
         goto add_blocker;
     }
 
-    ret = vfio_get_dev_region_info(vbasedev,
-                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-                                   &info);
+    ret = vfio_migration_init(vbasedev);
     if (ret) {
         goto add_blocker;
     }
 
-    ret = vfio_migration_init(vbasedev, info);
-    if (ret) {
-        goto add_blocker;
-    }
-
-    trace_vfio_migration_probe(vbasedev->name, info->index);
-    g_free(info);
+    trace_vfio_migration_probe(vbasedev->name);
     return 0;
 
 add_blocker:
     error_setg(&vbasedev->migration_blocker,
                "VFIO device doesn't support migration");
-    g_free(info);
 
     ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
     if (ret < 0) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a21cbd2a56..27c059f96e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
 
 # migration.c
-vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
-- 
2.21.3




* [PATCH v3 11/17] vfio/migration: Rename functions/structs related to v1 protocol
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (9 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

To avoid name collisions, rename functions and structs related to VFIO
migration protocol v1. This will allow the two protocols to co-exist
when v2 protocol is added, until v1 is removed. No functional changes
intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              |  6 +--
 hw/vfio/migration.c           | 94 +++++++++++++++++------------------
 hw/vfio/trace-events          |  6 +--
 include/hw/vfio/vfio-common.h |  2 +-
 4 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 47116ba668..617e6cd901 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,8 +355,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
                 return false;
             }
 
-            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-                && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+                (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
                 return false;
             }
         }
@@ -385,7 +385,7 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING) {
+            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 0e3a950746..e784374453 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -107,8 +107,8 @@ static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
  * an error is returned.
  */
 
-static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
-                                    uint32_t value)
+static int vfio_migration_v1_set_state(VFIODevice *vbasedev, uint32_t mask,
+                                       uint32_t value)
 {
     VFIOMigration *migration = vbasedev->migration;
     VFIORegion *region = &migration->region;
@@ -145,8 +145,8 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
         return ret;
     }
 
-    migration->device_state = device_state;
-    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    migration->device_state_v1 = device_state;
+    trace_vfio_migration_v1_set_state(vbasedev->name, device_state);
     return 0;
 }
 
@@ -260,8 +260,8 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
     return ret;
 }
 
-static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
-                            uint64_t data_size)
+static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+                               uint64_t data_size)
 {
     VFIORegion *region = &vbasedev->migration->region;
     uint64_t data_offset = 0, size, report_size;
@@ -288,7 +288,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
             data_size = 0;
         }
 
-        trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+        trace_vfio_v1_load_state_device_data(vbasedev->name, data_offset, size);
 
         while (size) {
             void *buf;
@@ -394,7 +394,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
-static void vfio_migration_cleanup(VFIODevice *vbasedev)
+static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
@@ -405,7 +405,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
 
 /* ---------------------------------------------------------------------- */
 
-static int vfio_save_setup(QEMUFile *f, void *opaque)
+static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
@@ -431,8 +431,8 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         }
     }
 
-    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
-                                   VFIO_DEVICE_STATE_V1_SAVING);
+    ret = vfio_migration_v1_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+                                      VFIO_DEVICE_STATE_V1_SAVING);
     if (ret) {
         error_report("%s: Failed to set state SAVING", vbasedev->name);
         return ret;
@@ -448,16 +448,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void vfio_save_cleanup(void *opaque)
+static void vfio_v1_save_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
 
-    vfio_migration_cleanup(vbasedev);
+    vfio_migration_v1_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
-static void vfio_save_pending(void *opaque,  uint64_t threshold_size,
-                              uint64_t *res_precopy, uint64_t *res_postcopy)
+static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
+                                 uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
@@ -520,15 +520,15 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t data_size;
     int ret;
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
-                                   VFIO_DEVICE_STATE_V1_SAVING);
+    ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
+                                      VFIO_DEVICE_STATE_V1_SAVING);
     if (ret) {
         error_report("%s: Failed to set state STOP and SAVING",
                      vbasedev->name);
@@ -565,7 +565,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 0);
+    ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING,
+                                      0);
     if (ret) {
         error_report("%s: Failed to set state STOPPED", vbasedev->name);
         return ret;
@@ -588,7 +589,7 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
     }
 }
 
-static int vfio_load_setup(QEMUFile *f, void *opaque)
+static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
@@ -604,8 +605,8 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
         }
     }
 
-    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
-                                   VFIO_DEVICE_STATE_V1_RESUMING);
+    ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
+                                      VFIO_DEVICE_STATE_V1_RESUMING);
     if (ret) {
         error_report("%s: Failed to set state RESUMING", vbasedev->name);
         if (migration->region.mmaps) {
@@ -615,11 +616,11 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
     return ret;
 }
 
-static int vfio_load_cleanup(void *opaque)
+static int vfio_v1_load_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
 
-    vfio_migration_cleanup(vbasedev);
+    vfio_migration_v1_cleanup(vbasedev);
     trace_vfio_load_cleanup(vbasedev->name);
     return 0;
 }
@@ -657,7 +658,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             uint64_t data_size = qemu_get_be64(f);
 
             if (data_size) {
-                ret = vfio_load_buffer(f, vbasedev, data_size);
+                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
                 if (ret < 0) {
                     return ret;
                 }
@@ -678,21 +679,21 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
-static SaveVMHandlers savevm_vfio_handlers = {
-    .save_setup = vfio_save_setup,
-    .save_cleanup = vfio_save_cleanup,
-    .save_live_pending = vfio_save_pending,
+static SaveVMHandlers savevm_vfio_v1_handlers = {
+    .save_setup = vfio_v1_save_setup,
+    .save_cleanup = vfio_v1_save_cleanup,
+    .save_live_pending = vfio_v1_save_pending,
     .save_live_iterate = vfio_save_iterate,
-    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_live_complete_precopy = vfio_v1_save_complete_precopy,
     .save_state = vfio_save_state,
-    .load_setup = vfio_load_setup,
-    .load_cleanup = vfio_load_cleanup,
+    .load_setup = vfio_v1_load_setup,
+    .load_cleanup = vfio_v1_load_cleanup,
     .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
 
-static void vfio_vmstate_change(void *opaque, bool running, RunState state)
+static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
@@ -732,21 +733,21 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
         }
     }
 
-    ret = vfio_migration_set_state(vbasedev, mask, value);
+    ret = vfio_migration_v1_set_state(vbasedev, mask, value);
     if (ret) {
         /*
          * Migration should be aborted in this case, but vm_state_notify()
          * currently does not support reporting failures.
          */
         error_report("%s: Failed to set device state 0x%x", vbasedev->name,
-                     (migration->device_state & mask) | value);
+                     (migration->device_state_v1 & mask) | value);
         if (migrate_get_current()->to_dst_file) {
             qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
         }
     }
     vbasedev->migration->vm_running = running;
-    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
-            (migration->device_state & mask) | value);
+    trace_vfio_v1_vmstate_change(vbasedev->name, running, RunState_str(state),
+            (migration->device_state_v1 & mask) | value);
 }
 
 static void vfio_migration_state_notifier(Notifier *notifier, void *data)
@@ -765,10 +766,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_FAILED:
         bytes_transferred = 0;
-        ret = vfio_migration_set_state(vbasedev,
-                                       ~(VFIO_DEVICE_STATE_V1_SAVING |
-                                         VFIO_DEVICE_STATE_V1_RESUMING),
-                                       VFIO_DEVICE_STATE_V1_RUNNING);
+        ret = vfio_migration_v1_set_state(vbasedev,
+                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
+                                            VFIO_DEVICE_STATE_V1_RESUMING),
+                                          VFIO_DEVICE_STATE_V1_RUNNING);
         if (ret) {
             error_report("%s: Failed to set state RUNNING", vbasedev->name);
         }
@@ -812,7 +813,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     }
 
     vbasedev->migration = g_new0(VFIOMigration, 1);
-    vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
+    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
     vbasedev->migration->vm_running = runstate_is_running();
 
     ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
@@ -843,12 +844,11 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     }
     strpadcpy(id, sizeof(id), path, '\0');
 
-    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
-                         vbasedev);
+    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+                         &savevm_vfio_v1_handlers, vbasedev);
 
-    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
-                                                           vfio_vmstate_change,
-                                                           vbasedev);
+    migration->vm_state = qdev_add_vm_change_state_handler(
+        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
     migration->migration_state.notify = vfio_migration_state_notifier;
     add_migration_state_change_notifier(&migration->migration_state);
     return 0;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 27c059f96e..d88d2b4053 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,8 +149,8 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name) " (%s)"
-vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
-vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
@@ -162,7 +162,7 @@ vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f1..bbaf72ba00 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -62,7 +62,7 @@ typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
     VMChangeStateEntry *vm_state;
     VFIORegion region;
-    uint32_t device_state;
+    uint32_t device_state_v1;
     int vm_running;
     Notifier migration_state;
     uint64_t pending_bytes;
-- 
2.21.3




* [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (10 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 11/17] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-16 18:29   ` Alex Williamson
  2022-11-23 18:59   ` Dr. David Alan Gilbert
  2022-11-03 16:16 ` [PATCH v3 13/17] vfio/migration: Remove VFIO migration protocol v1 Avihai Horon
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Add an implementation of VFIO migration protocol v2. The two protocols,
v1 and v2, will co-exist for now; the next patch removes the v1 protocol.

There are several main differences between the v1 and v2 protocols:
- VFIO device state is now represented as a finite state machine instead
  of a bitmap.

- Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
  ioctl and normal read() and write() instead of the migration region.

- VFIO migration protocol v2 currently doesn't support the pre-copy
  phase of migration.

Detailed information about VFIO migration protocol v2 and its
differences from v1 can be found here [1].

[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              |  19 +-
 hw/vfio/migration.c           | 386 ++++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |   4 +
 include/hw/vfio/vfio-common.h |   5 +
 4 files changed, 375 insertions(+), 39 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 617e6cd901..0bdbd1586b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,10 +355,18 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
                 return false;
             }
 
-            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+            if (!migration->v2 &&
+                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
                 return false;
             }
+
+            if (migration->v2 &&
+                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+                return false;
+            }
         }
     }
     return true;
@@ -385,7 +393,14 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+            if (!migration->v2 &&
+                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+                continue;
+            }
+
+            if (migration->v2 &&
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e784374453..62afc23a8c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -44,8 +44,84 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)
+
 static int64_t bytes_transferred;
 
+static const char *mig_state_to_str(enum vfio_device_mig_state state)
+{
+    switch (state) {
+    case VFIO_DEVICE_STATE_ERROR:
+        return "ERROR";
+    case VFIO_DEVICE_STATE_STOP:
+        return "STOP";
+    case VFIO_DEVICE_STATE_RUNNING:
+        return "RUNNING";
+    case VFIO_DEVICE_STATE_STOP_COPY:
+        return "STOP_COPY";
+    case VFIO_DEVICE_STATE_RESUMING:
+        return "RESUMING";
+    case VFIO_DEVICE_STATE_RUNNING_P2P:
+        return "RUNNING_P2P";
+    default:
+        return "UNKNOWN STATE";
+    }
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev,
+                                    enum vfio_device_mig_state new_state,
+                                    enum vfio_device_mig_state recover_state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                              sizeof(struct vfio_device_feature_mig_state),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (void *)buf;
+    struct vfio_device_feature_mig_state *mig_state = (void *)feature->data;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
+    mig_state->device_state = new_state;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        /* Try to set the device in some good state */
+        error_report(
+            "%s: Failed setting device state to %s, err: %s. Setting device in recover state %s",
+                     vbasedev->name, mig_state_to_str(new_state),
+                     strerror(errno), mig_state_to_str(recover_state));
+
+        mig_state->device_state = recover_state;
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+            hw_error("%s: Failed setting device in recover state, err: %s",
+                     vbasedev->name, strerror(errno));
+        }
+
+        migration->device_state = recover_state;
+
+        return -1;
+    }
+
+    if (mig_state->data_fd != -1) {
+        if (migration->data_fd != -1) {
+            /*
+             * This can happen if the device is asynchronously reset and
+             * terminates a data transfer.
+             */
+            error_report("%s: data_fd out of sync", vbasedev->name);
+            close(mig_state->data_fd);
+
+            return -1;
+        }
+
+        migration->data_fd = mig_state->data_fd;
+    }
+    migration->device_state = new_state;
+
+    trace_vfio_migration_set_state(vbasedev->name, mig_state_to_str(new_state));
+
+    return 0;
+}
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
                                   off_t off, bool iswrite)
 {
@@ -260,6 +336,20 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
     return ret;
 }
 
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+                            uint64_t data_size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
+    if (!ret) {
+        trace_vfio_load_state_device_data(vbasedev->name, data_size);
+    }
+
+    return ret;
+}
+
 static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
                                uint64_t data_size)
 {
@@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static void vfio_migration_cleanup(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    close(migration->data_fd);
+    migration->data_fd = -1;
+}
+
 static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
 
 /* ---------------------------------------------------------------------- */
 
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    trace_vfio_save_setup(vbasedev->name);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    return qemu_file_get_error(f);
+}
+
 static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_migration_cleanup(vbasedev);
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
 static void vfio_v1_save_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
+static void vfio_save_pending(void *opaque, uint64_t threshold_size,
+                              uint64_t *res_precopy, uint64_t *res_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+
+    /*
+     * VFIO migration protocol v2 currently doesn't have an API to get pending
+     * device state size. Until such API is introduced, report some big
+     * arbitrary pending size so the device will be taken into account for
+     * downtime limit calculations.
+     */
+    *res_postcopy += VFIO_MIG_PENDING_SIZE;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
+}
+
 static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
                                  uint64_t *res_precopy, uint64_t *res_postcopy)
 {
@@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
     return 0;
 }
 
+/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
+static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+{
+    ssize_t data_size;
+
+    data_size = read(migration->data_fd, migration->data_buffer,
+                     migration->data_buffer_size);
+    if (data_size < 0) {
+        return -1;
+    }
+    if (data_size == 0) {
+        return 1;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+    qemu_put_be64(f, data_size);
+    qemu_put_buffer(f, migration->data_buffer, data_size);
+    bytes_transferred += data_size;
+
+    trace_vfio_save_block(migration->vbasedev->name, data_size);
+
+    return qemu_file_get_error(f);
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    enum vfio_device_mig_state recover_state;
+    int ret;
+
+    /* We reach here with device state STOP only */
+    recover_state = VFIO_DEVICE_STATE_STOP;
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+                                   recover_state);
+    if (ret) {
+        return ret;
+    }
+
+    do {
+        ret = vfio_save_block(f, vbasedev->migration);
+        if (ret < 0) {
+            return ret;
+        }
+    } while (!ret);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    recover_state = VFIO_DEVICE_STATE_ERROR;
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
+                                   recover_state);
+    if (!ret) {
+        trace_vfio_save_complete_precopy(vbasedev->name);
+    }
+
+    return ret;
+}
+
 static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
     }
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+                                   vbasedev->migration->device_state);
+}
+
 static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_migration_cleanup(vbasedev);
+    trace_vfio_load_cleanup(vbasedev->name);
+
+    return 0;
+}
+
 static int vfio_v1_load_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             uint64_t data_size = qemu_get_be64(f);
 
             if (data_size) {
-                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
+                if (vbasedev->migration->v2) {
+                    ret = vfio_load_buffer(f, vbasedev, data_size);
+                } else {
+                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
+                }
                 if (ret < 0) {
                     return ret;
                 }
@@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+static const SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_state = vfio_save_state,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
+};
+
 static SaveVMHandlers savevm_vfio_v1_handlers = {
     .save_setup = vfio_v1_save_setup,
     .save_cleanup = vfio_v1_save_cleanup,
@@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
 
 /* ---------------------------------------------------------------------- */
 
+static void vfio_vmstate_change(void *opaque, bool running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+    enum vfio_device_mig_state new_state;
+    int ret;
+
+    if (running) {
+        new_state = VFIO_DEVICE_STATE_RUNNING;
+    } else {
+        new_state = VFIO_DEVICE_STATE_STOP;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, new_state,
+                                   VFIO_DEVICE_STATE_ERROR);
+    if (ret) {
+        /*
+         * Migration should be aborted in this case, but vm_state_notify()
+         * currently does not support reporting failures.
+         */
+        if (migrate_get_current()->to_dst_file) {
+            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+        }
+    }
+
+    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                              mig_state_to_str(new_state));
+}
+
 static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_FAILED:
         bytes_transferred = 0;
-        ret = vfio_migration_v1_set_state(vbasedev,
-                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
-                                            VFIO_DEVICE_STATE_V1_RESUMING),
-                                          VFIO_DEVICE_STATE_V1_RUNNING);
-        if (ret) {
-            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        if (migration->v2) {
+            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+                                     VFIO_DEVICE_STATE_ERROR);
+        } else {
+            ret = vfio_migration_v1_set_state(vbasedev,
+                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
+                                                VFIO_DEVICE_STATE_V1_RESUMING),
+                                              VFIO_DEVICE_STATE_V1_RUNNING);
+            if (ret) {
+                error_report("%s: Failed to set state RUNNING", vbasedev->name);
+            }
         }
     }
 }
@@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
-    vfio_region_exit(&migration->region);
-    vfio_region_finalize(&migration->region);
+    if (migration->v2) {
+        g_free(migration->data_buffer);
+    } else {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
     g_free(vbasedev->migration);
     vbasedev->migration = NULL;
 }
 
+static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                                  sizeof(struct vfio_device_feature_migration),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (void *)buf;
+    struct vfio_device_feature_migration *mig = (void *)feature->data;
+
+    feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        return -EOPNOTSUPP;
+    }
+
+    *mig_flags = mig->flags;
+
+    return 0;
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev)
 {
     int ret;
@@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
     struct vfio_region_info *info = NULL;
+    uint64_t mig_flags;
 
     if (!vbasedev->ops->vfio_get_object) {
         return -EINVAL;
@@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
         return -EINVAL;
     }
 
-    ret = vfio_get_dev_region_info(vbasedev,
-                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-                                   &info);
-    if (ret) {
-        return ret;
-    }
+    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
+    if (!ret) {
+        /* Migration v2 */
+        /* Basic migration functionality must be supported */
+        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+            return -EOPNOTSUPP;
+        }
+        vbasedev->migration = g_new0(VFIOMigration, 1);
+        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
+        vbasedev->migration->data_buffer =
+            g_malloc0(vbasedev->migration->data_buffer_size);
+        vbasedev->migration->data_fd = -1;
+        vbasedev->migration->v2 = true;
+    } else {
+        /* Migration v1 */
+        ret = vfio_get_dev_region_info(vbasedev,
+                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+                                       &info);
+        if (ret) {
+            return ret;
+        }
 
-    vbasedev->migration = g_new0(VFIOMigration, 1);
-    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
-    vbasedev->migration->vm_running = runstate_is_running();
+        vbasedev->migration = g_new0(VFIOMigration, 1);
+        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
+        vbasedev->migration->vm_running = runstate_is_running();
 
-    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
-                            info->index, "migration");
-    if (ret) {
-        error_report("%s: Failed to setup VFIO migration region %d: %s",
-                     vbasedev->name, info->index, strerror(-ret));
-        goto err;
-    }
+        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
+                                info->index, "migration");
+        if (ret) {
+            error_report("%s: Failed to setup VFIO migration region %d: %s",
+                         vbasedev->name, info->index, strerror(-ret));
+            goto err;
+        }
 
-    if (!vbasedev->migration->region.size) {
-        error_report("%s: Invalid zero-sized VFIO migration region %d",
-                     vbasedev->name, info->index);
-        ret = -EINVAL;
-        goto err;
-    }
+        if (!vbasedev->migration->region.size) {
+            error_report("%s: Invalid zero-sized VFIO migration region %d",
+                         vbasedev->name, info->index);
+            ret = -EINVAL;
+            goto err;
+        }
 
-    g_free(info);
+        g_free(info);
+    }
 
     migration = vbasedev->migration;
     migration->vbasedev = vbasedev;
@@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     }
     strpadcpy(id, sizeof(id), path, '\0');
 
-    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
-                         &savevm_vfio_v1_handlers, vbasedev);
+    if (migration->v2) {
+        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+                             &savevm_vfio_handlers, vbasedev);
+
+        migration->vm_state = qdev_add_vm_change_state_handler(
+            vbasedev->dev, vfio_vmstate_change, vbasedev);
+    } else {
+        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+                             &savevm_vfio_v1_handlers, vbasedev);
+
+        migration->vm_state = qdev_add_vm_change_state_handler(
+            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
+    }
 
-    migration->vm_state = qdev_add_vm_change_state_handler(
-        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
     migration->migration_state.notify = vfio_migration_state_notifier;
     add_migration_state_change_notifier(&migration->migration_state);
     return 0;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index d88d2b4053..9ef84e24b2 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name) " (%s)"
+vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
@@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bbaf72ba00..2ec3346fea 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,11 @@ typedef struct VFIOMigration {
     int vm_running;
     Notifier migration_state;
     uint64_t pending_bytes;
+    enum vfio_device_mig_state device_state;
+    int data_fd;
+    void *data_buffer;
+    size_t data_buffer_size;
+    bool v2;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.21.3
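As the cover letter notes, the core v1→v2 change in the patch above is replacing the `device_state` bitmap with a single finite-state value. A minimal sketch of that model, using a local enum that mirrors `enum vfio_device_mig_state` from `<linux/vfio.h>` (values reproduced here only for illustration; the helper names are hypothetical, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Local, illustrative mirror of enum vfio_device_mig_state from
 * <linux/vfio.h>. Only the states this series touches are listed.
 */
enum mig_state {
    MIG_STATE_ERROR       = 0,
    MIG_STATE_STOP        = 1,
    MIG_STATE_RUNNING     = 2,
    MIG_STATE_STOP_COPY   = 3,
    MIG_STATE_RESUMING    = 4,
    MIG_STATE_RUNNING_P2P = 5,
};

/*
 * In v2, the VM-state change handler picks one target state, instead of
 * v1's read-modify-write of mask/value bits in a device_state register.
 */
static enum mig_state vmstate_target(bool vm_running)
{
    return vm_running ? MIG_STATE_RUNNING : MIG_STATE_STOP;
}

/*
 * The dirty-tracking checks in hw/vfio/common.c treat both RUNNING and
 * RUNNING_P2P as "device is running".
 */
static bool state_is_running(enum mig_state s)
{
    return s == MIG_STATE_RUNNING || s == MIG_STATE_RUNNING_P2P;
}
```

This is why patch 13 can delete all of the mask/value bookkeeping: a transition is just one enum value written through the `VFIO_DEVICE_FEATURE` ioctl.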




* [PATCH v3 13/17] vfio/migration: Remove VFIO migration protocol v1
@ 2022-11-03 16:16 ` Avihai Horon
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel

Now that the v2 protocol implementation has been added, remove the
deprecated v1 implementation.
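With v1 gone, the remaining save path is the v2 one: drain `data_fd` into `data_buffer` in `data_buffer_size` chunks until `read()` returns 0 (the shape of `vfio_save_block()` retained below). A self-contained sketch of that loop, with hypothetical helper names and a pipe standing in for the device's `data_fd`:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define MIG_DATA_BUFFER_SIZE 8  /* deliberately tiny; QEMU uses a much larger buffer */

/*
 * Read one chunk of device migration data from data_fd into buf.
 * Returns 1 when the device reports end-of-stream (read() == 0),
 * 0 when a chunk of *len bytes was produced, or -errno on error.
 */
static int save_block(int data_fd, uint8_t *buf, size_t buf_size, size_t *len)
{
    ssize_t n = read(data_fd, buf, buf_size);

    if (n < 0) {
        return -errno;
    }
    if (n == 0) {
        return 1;               /* end of device state stream */
    }
    *len = (size_t)n;
    return 0;
}

/*
 * Copy the whole stream in buffer-sized chunks (in QEMU each chunk would
 * go to the migration QEMUFile). Returns total bytes moved, or -1.
 */
static ssize_t drain_stream(int data_fd, uint8_t *out, size_t out_cap)
{
    uint8_t buf[MIG_DATA_BUFFER_SIZE];
    size_t total = 0, len = 0;
    int ret;

    while ((ret = save_block(data_fd, buf, sizeof(buf), &len)) == 0) {
        if (total + len > out_cap) {
            return -1;
        }
        memcpy(out + total, buf, len);
        total += len;
    }
    return ret == 1 ? (ssize_t)total : -1;
}
```

Contrast this with the deleted v1 code below, which had to compute `data_offset`/`data_size` against a sparse-mmap'd migration region instead of a plain file descriptor.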

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              |  19 +-
 hw/vfio/migration.c           | 697 +---------------------------------
 hw/vfio/trace-events          |   6 -
 include/hw/vfio/vfio-common.h |   5 -
 4 files changed, 24 insertions(+), 703 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0bdbd1586b..9ff57d4b27 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -355,14 +355,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
                 return false;
             }
 
-            if (!migration->v2 &&
-                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
-                (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
-                return false;
-            }
-
-            if (migration->v2 &&
-                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
                  migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
                 return false;
@@ -393,14 +386,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (!migration->v2 &&
-                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
-                continue;
-            }
-
-            if (migration->v2 &&
-                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 62afc23a8c..f8c3228314 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -122,220 +122,6 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
     return 0;
 }
 
-static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
-                                  off_t off, bool iswrite)
-{
-    int ret;
-
-    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
-                    pread(vbasedev->fd, val, count, off);
-    if (ret < count) {
-        error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
-                     HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,
-                     vbasedev->name, off, strerror(errno));
-        return (ret < 0) ? ret : -EINVAL;
-    }
-    return 0;
-}
-
-static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
-                       off_t off, bool iswrite)
-{
-    int ret, done = 0;
-    __u8 *tbuf = buf;
-
-    while (count) {
-        int bytes = 0;
-
-        if (count >= 8 && !(off % 8)) {
-            bytes = 8;
-        } else if (count >= 4 && !(off % 4)) {
-            bytes = 4;
-        } else if (count >= 2 && !(off % 2)) {
-            bytes = 2;
-        } else {
-            bytes = 1;
-        }
-
-        ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
-        if (ret) {
-            return ret;
-        }
-
-        count -= bytes;
-        done += bytes;
-        off += bytes;
-        tbuf += bytes;
-    }
-    return done;
-}
-
-#define vfio_mig_read(f, v, c, o)       vfio_mig_rw(f, (__u8 *)v, c, o, false)
-#define vfio_mig_write(f, v, c, o)      vfio_mig_rw(f, (__u8 *)v, c, o, true)
-
-#define VFIO_MIG_STRUCT_OFFSET(f)       \
-                                 offsetof(struct vfio_device_migration_info, f)
-/*
- * Change the device_state register for device @vbasedev. Bits set in @mask
- * are preserved, bits set in @value are set, and bits not set in either @mask
- * or @value are cleared in device_state. If the register cannot be accessed,
- * the resulting state would be invalid, or the device enters an error state,
- * an error is returned.
- */
-
-static int vfio_migration_v1_set_state(VFIODevice *vbasedev, uint32_t mask,
-                                       uint32_t value)
-{
-    VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    off_t dev_state_off = region->fd_offset +
-                          VFIO_MIG_STRUCT_OFFSET(device_state);
-    uint32_t device_state;
-    int ret;
-
-    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
-                        dev_state_off);
-    if (ret < 0) {
-        return ret;
-    }
-
-    device_state = (device_state & mask) | value;
-
-    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
-        return -EINVAL;
-    }
-
-    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
-                         dev_state_off);
-    if (ret < 0) {
-        int rret;
-
-        rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
-                             dev_state_off);
-
-        if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
-            hw_error("%s: Device in error state 0x%x", vbasedev->name,
-                     device_state);
-            return rret ? rret : -EIO;
-        }
-        return ret;
-    }
-
-    migration->device_state_v1 = device_state;
-    trace_vfio_migration_v1_set_state(vbasedev->name, device_state);
-    return 0;
-}
-
-static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
-                                   uint64_t data_size, uint64_t *size)
-{
-    void *ptr = NULL;
-    uint64_t limit = 0;
-    int i;
-
-    if (!region->mmaps) {
-        if (size) {
-            *size = MIN(data_size, region->size - data_offset);
-        }
-        return ptr;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        VFIOMmap *map = region->mmaps + i;
-
-        if ((data_offset >= map->offset) &&
-            (data_offset < map->offset + map->size)) {
-
-            /* check if data_offset is within sparse mmap areas */
-            ptr = map->mmap + data_offset - map->offset;
-            if (size) {
-                *size = MIN(data_size, map->offset + map->size - data_offset);
-            }
-            break;
-        } else if ((data_offset < map->offset) &&
-                   (!limit || limit > map->offset)) {
-            /*
-             * data_offset is not within sparse mmap areas, find size of
-             * non-mapped area. Check through all list since region->mmaps list
-             * is not sorted.
-             */
-            limit = map->offset;
-        }
-    }
-
-    if (!ptr && size) {
-        *size = limit ? MIN(data_size, limit - data_offset) : data_size;
-    }
-    return ptr;
-}
-
-static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
-{
-    VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    uint64_t data_offset = 0, data_size = 0, sz;
-    int ret;
-
-    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
-    if (ret < 0) {
-        return ret;
-    }
-
-    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
-    if (ret < 0) {
-        return ret;
-    }
-
-    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
-                           migration->pending_bytes);
-
-    qemu_put_be64(f, data_size);
-    sz = data_size;
-
-    while (sz) {
-        void *buf;
-        uint64_t sec_size;
-        bool buf_allocated = false;
-
-        buf = get_data_section_size(region, data_offset, sz, &sec_size);
-
-        if (!buf) {
-            buf = g_try_malloc(sec_size);
-            if (!buf) {
-                error_report("%s: Error allocating buffer ", __func__);
-                return -ENOMEM;
-            }
-            buf_allocated = true;
-
-            ret = vfio_mig_read(vbasedev, buf, sec_size,
-                                region->fd_offset + data_offset);
-            if (ret < 0) {
-                g_free(buf);
-                return ret;
-            }
-        }
-
-        qemu_put_buffer(f, buf, sec_size);
-
-        if (buf_allocated) {
-            g_free(buf);
-        }
-        sz -= sec_size;
-        data_offset += sec_size;
-    }
-
-    ret = qemu_file_get_error(f);
-
-    if (!ret && size) {
-        *size = data_size;
-    }
-
-    bytes_transferred += data_size;
-    return ret;
-}
-
 static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
                             uint64_t data_size)
 {
@@ -350,96 +136,6 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
     return ret;
 }
 
-static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
-                               uint64_t data_size)
-{
-    VFIORegion *region = &vbasedev->migration->region;
-    uint64_t data_offset = 0, size, report_size;
-    int ret;
-
-    do {
-        ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
-                      region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
-        if (ret < 0) {
-            return ret;
-        }
-
-        if (data_offset + data_size > region->size) {
-            /*
-             * If data_size is greater than the data section of migration region
-             * then iterate the write buffer operation. This case can occur if
-             * size of migration region at destination is smaller than size of
-             * migration region at source.
-             */
-            report_size = size = region->size - data_offset;
-            data_size -= size;
-        } else {
-            report_size = size = data_size;
-            data_size = 0;
-        }
-
-        trace_vfio_v1_load_state_device_data(vbasedev->name, data_offset, size);
-
-        while (size) {
-            void *buf;
-            uint64_t sec_size;
-            bool buf_alloc = false;
-
-            buf = get_data_section_size(region, data_offset, size, &sec_size);
-
-            if (!buf) {
-                buf = g_try_malloc(sec_size);
-                if (!buf) {
-                    error_report("%s: Error allocating buffer ", __func__);
-                    return -ENOMEM;
-                }
-                buf_alloc = true;
-            }
-
-            qemu_get_buffer(f, buf, sec_size);
-
-            if (buf_alloc) {
-                ret = vfio_mig_write(vbasedev, buf, sec_size,
-                        region->fd_offset + data_offset);
-                g_free(buf);
-
-                if (ret < 0) {
-                    return ret;
-                }
-            }
-            size -= sec_size;
-            data_offset += sec_size;
-        }
-
-        ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
-                        region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
-        if (ret < 0) {
-            return ret;
-        }
-    } while (data_size);
-
-    return 0;
-}
-
-static int vfio_update_pending(VFIODevice *vbasedev)
-{
-    VFIOMigration *migration = vbasedev->migration;
-    VFIORegion *region = &migration->region;
-    uint64_t pending_bytes = 0;
-    int ret;
-
-    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
-                    region->fd_offset + VFIO_MIG_STRUCT_OFFSET(pending_bytes));
-    if (ret < 0) {
-        migration->pending_bytes = 0;
-        return ret;
-    }
-
-    migration->pending_bytes = pending_bytes;
-    trace_vfio_update_pending(vbasedev->name, pending_bytes);
-    return 0;
-}
-
 static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -492,15 +188,6 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
     migration->data_fd = -1;
 }
 
-static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
-{
-    VFIOMigration *migration = vbasedev->migration;
-
-    if (migration->region.mmaps) {
-        vfio_region_unmap(&migration->region);
-    }
-}
-
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -515,49 +202,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
-static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    int ret;
-
-    trace_vfio_save_setup(vbasedev->name);
-
-    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
-
-    if (migration->region.mmaps) {
-        /*
-         * Calling vfio_region_mmap() from migration thread. Memory API called
-         * from this function require locking the iothread when called from
-         * outside the main loop thread.
-         */
-        qemu_mutex_lock_iothread();
-        ret = vfio_region_mmap(&migration->region);
-        qemu_mutex_unlock_iothread();
-        if (ret) {
-            error_report("%s: Failed to mmap VFIO migration region: %s",
-                         vbasedev->name, strerror(-ret));
-            error_report("%s: Falling back to slow path", vbasedev->name);
-        }
-    }
-
-    ret = vfio_migration_v1_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
-                                      VFIO_DEVICE_STATE_V1_SAVING);
-    if (ret) {
-        error_report("%s: Failed to set state SAVING", vbasedev->name);
-        return ret;
-    }
-
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
-    ret = qemu_file_get_error(f);
-    if (ret) {
-        return ret;
-    }
-
-    return 0;
-}
-
 static void vfio_save_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -566,14 +210,6 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
-static void vfio_v1_save_cleanup(void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-
-    vfio_migration_v1_cleanup(vbasedev);
-    trace_vfio_save_cleanup(vbasedev->name);
-}
-
 #define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
 static void vfio_save_pending(void *opaque, uint64_t threshold_size,
                               uint64_t *res_precopy, uint64_t *res_postcopy)
@@ -591,70 +227,6 @@ static void vfio_save_pending(void *opaque, uint64_t threshold_size,
     trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
 }
 
-static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
-                                 uint64_t *res_precopy, uint64_t *res_postcopy)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    int ret;
-
-    ret = vfio_update_pending(vbasedev);
-    if (ret) {
-        return;
-    }
-
-    *res_precopy += migration->pending_bytes;
-
-    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
-}
-
-static int vfio_save_iterate(QEMUFile *f, void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    uint64_t data_size;
-    int ret;
-
-    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
-
-    if (migration->pending_bytes == 0) {
-        ret = vfio_update_pending(vbasedev);
-        if (ret) {
-            return ret;
-        }
-
-        if (migration->pending_bytes == 0) {
-            qemu_put_be64(f, 0);
-            qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-            /* indicates data finished, goto complete phase */
-            return 1;
-        }
-    }
-
-    ret = vfio_save_buffer(f, vbasedev, &data_size);
-    if (ret) {
-        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
-                     strerror(errno));
-        return ret;
-    }
-
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
-    ret = qemu_file_get_error(f);
-    if (ret) {
-        return ret;
-    }
-
-    /*
-     * Reset pending_bytes as .save_live_pending is not called during savevm or
-     * snapshot case, in such case vfio_update_pending() at the start of this
-     * function updates pending_bytes.
-     */
-    migration->pending_bytes = 0;
-    trace_vfio_save_iterate(vbasedev->name, data_size);
-    return 0;
-}
-
 /* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
 static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
@@ -716,62 +288,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
-static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    uint64_t data_size;
-    int ret;
-
-    ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
-                                      VFIO_DEVICE_STATE_V1_SAVING);
-    if (ret) {
-        error_report("%s: Failed to set state STOP and SAVING",
-                     vbasedev->name);
-        return ret;
-    }
-
-    ret = vfio_update_pending(vbasedev);
-    if (ret) {
-        return ret;
-    }
-
-    while (migration->pending_bytes > 0) {
-        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
-        ret = vfio_save_buffer(f, vbasedev, &data_size);
-        if (ret < 0) {
-            error_report("%s: Failed to save buffer", vbasedev->name);
-            return ret;
-        }
-
-        if (data_size == 0) {
-            break;
-        }
-
-        ret = vfio_update_pending(vbasedev);
-        if (ret) {
-            return ret;
-        }
-    }
-
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
-    ret = qemu_file_get_error(f);
-    if (ret) {
-        return ret;
-    }
-
-    ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING,
-                                      0);
-    if (ret) {
-        error_report("%s: Failed to set state STOPPED", vbasedev->name);
-        return ret;
-    }
-
-    trace_vfio_save_complete_precopy(vbasedev->name);
-    return ret;
-}
-
 static void vfio_save_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -793,33 +309,6 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
                                    vbasedev->migration->device_state);
 }
 
-static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    int ret = 0;
-
-    if (migration->region.mmaps) {
-        ret = vfio_region_mmap(&migration->region);
-        if (ret) {
-            error_report("%s: Failed to mmap VFIO migration region %d: %s",
-                         vbasedev->name, migration->region.nr,
-                         strerror(-ret));
-            error_report("%s: Falling back to slow path", vbasedev->name);
-        }
-    }
-
-    ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
-                                      VFIO_DEVICE_STATE_V1_RESUMING);
-    if (ret) {
-        error_report("%s: Failed to set state RESUMING", vbasedev->name);
-        if (migration->region.mmaps) {
-            vfio_region_unmap(&migration->region);
-        }
-    }
-    return ret;
-}
-
 static int vfio_load_cleanup(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -830,15 +319,6 @@ static int vfio_load_cleanup(void *opaque)
     return 0;
 }
 
-static int vfio_v1_load_cleanup(void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-
-    vfio_migration_v1_cleanup(vbasedev);
-    trace_vfio_load_cleanup(vbasedev->name);
-    return 0;
-}
-
 static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 {
     VFIODevice *vbasedev = opaque;
@@ -872,11 +352,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             uint64_t data_size = qemu_get_be64(f);
 
             if (data_size) {
-                if (vbasedev->migration->v2) {
-                    ret = vfio_load_buffer(f, vbasedev, data_size);
-                } else {
-                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
-                }
+                ret = vfio_load_buffer(f, vbasedev, data_size);
                 if (ret < 0) {
                     return ret;
                 }
@@ -908,18 +384,6 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_state = vfio_load_state,
 };
 
-static SaveVMHandlers savevm_vfio_v1_handlers = {
-    .save_setup = vfio_v1_save_setup,
-    .save_cleanup = vfio_v1_save_cleanup,
-    .save_live_pending = vfio_v1_save_pending,
-    .save_live_iterate = vfio_save_iterate,
-    .save_live_complete_precopy = vfio_v1_save_complete_precopy,
-    .save_state = vfio_save_state,
-    .load_setup = vfio_v1_load_setup,
-    .load_cleanup = vfio_v1_load_cleanup,
-    .load_state = vfio_load_state,
-};
-
 /* ---------------------------------------------------------------------- */
 
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
@@ -950,70 +414,12 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
                               mig_state_to_str(new_state));
 }
 
-static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
-{
-    VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
-    uint32_t value, mask;
-    int ret;
-
-    if (vbasedev->migration->vm_running == running) {
-        return;
-    }
-
-    if (running) {
-        /*
-         * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
-         * Transition from _SAVING to _RUNNING can happen if there is migration
-         * failure, in that case clear _SAVING bit.
-         * Transition from _RESUMING to _RUNNING occurs during resuming
-         * phase, in that case clear _RESUMING bit.
-         * In both the above cases, set _RUNNING bit.
-         */
-        mask = ~VFIO_DEVICE_STATE_MASK;
-        value = VFIO_DEVICE_STATE_V1_RUNNING;
-    } else {
-        /*
-         * Here device state could be either _RUNNING or _SAVING|_RUNNING. Reset
-         * _RUNNING bit
-         */
-        mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
-
-        /*
-         * When VM state transition to stop for savevm command, device should
-         * start saving data.
-         */
-        if (state == RUN_STATE_SAVE_VM) {
-            value = VFIO_DEVICE_STATE_V1_SAVING;
-        } else {
-            value = 0;
-        }
-    }
-
-    ret = vfio_migration_v1_set_state(vbasedev, mask, value);
-    if (ret) {
-        /*
-         * Migration should be aborted in this case, but vm_state_notify()
-         * currently does not support reporting failures.
-         */
-        error_report("%s: Failed to set device state 0x%x", vbasedev->name,
-                     (migration->device_state_v1 & mask) | value);
-        if (migrate_get_current()->to_dst_file) {
-            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
-        }
-    }
-    vbasedev->migration->vm_running = running;
-    trace_vfio_v1_vmstate_change(vbasedev->name, running, RunState_str(state),
-            (migration->device_state_v1 & mask) | value);
-}
-
 static void vfio_migration_state_notifier(Notifier *notifier, void *data)
 {
     MigrationState *s = data;
     VFIOMigration *migration = container_of(notifier, VFIOMigration,
                                             migration_state);
     VFIODevice *vbasedev = migration->vbasedev;
-    int ret;
 
     trace_vfio_migration_state_notifier(vbasedev->name,
                                         MigrationStatus_str(s->state));
@@ -1023,31 +429,14 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_FAILED:
         bytes_transferred = 0;
-        if (migration->v2) {
-            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
-                                     VFIO_DEVICE_STATE_ERROR);
-        } else {
-            ret = vfio_migration_v1_set_state(vbasedev,
-                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
-                                                VFIO_DEVICE_STATE_V1_RESUMING),
-                                              VFIO_DEVICE_STATE_V1_RUNNING);
-            if (ret) {
-                error_report("%s: Failed to set state RUNNING", vbasedev->name);
-            }
-        }
+        vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+                                 VFIO_DEVICE_STATE_ERROR);
     }
 }
 
 static void vfio_migration_exit(VFIODevice *vbasedev)
 {
-    VFIOMigration *migration = vbasedev->migration;
-
-    if (migration->v2) {
-        g_free(migration->data_buffer);
-    } else {
-        vfio_region_exit(&migration->region);
-        vfio_region_finalize(&migration->region);
-    }
+    g_free(vbasedev->migration->data_buffer);
     g_free(vbasedev->migration);
     vbasedev->migration = NULL;
 }
@@ -1078,7 +467,6 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     VFIOMigration *migration;
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
-    struct vfio_region_info *info = NULL;
     uint64_t mig_flags;
 
     if (!vbasedev->ops->vfio_get_object) {
@@ -1091,53 +479,22 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     }
 
     ret = vfio_migration_query_flags(vbasedev, &mig_flags);
-    if (!ret) {
-        /* Migration v2 */
-        /* Basic migration functionality must be supported */
-        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
-            return -EOPNOTSUPP;
-        }
-        vbasedev->migration = g_new0(VFIOMigration, 1);
-        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
-        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
-        vbasedev->migration->data_buffer =
-            g_malloc0(vbasedev->migration->data_buffer_size);
-        vbasedev->migration->data_fd = -1;
-        vbasedev->migration->v2 = true;
-    } else {
-        /* Migration v1 */
-        ret = vfio_get_dev_region_info(vbasedev,
-                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
-                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
-                                       &info);
-        if (ret) {
-            return ret;
-        }
-
-        vbasedev->migration = g_new0(VFIOMigration, 1);
-        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
-        vbasedev->migration->vm_running = runstate_is_running();
-
-        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
-                                info->index, "migration");
-        if (ret) {
-            error_report("%s: Failed to setup VFIO migration region %d: %s",
-                         vbasedev->name, info->index, strerror(-ret));
-            goto err;
-        }
-
-        if (!vbasedev->migration->region.size) {
-            error_report("%s: Invalid zero-sized VFIO migration region %d",
-                         vbasedev->name, info->index);
-            ret = -EINVAL;
-            goto err;
-        }
+    if (ret) {
+        return ret;
+    }
 
-        g_free(info);
+    /* Basic migration functionality must be supported */
+    if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+        return -EOPNOTSUPP;
     }
 
+    vbasedev->migration = g_new0(VFIOMigration, 1);
     migration = vbasedev->migration;
     migration->vbasedev = vbasedev;
+    migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+    migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
+    migration->data_buffer = g_malloc0(migration->data_buffer_size);
+    migration->data_fd = -1;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
@@ -1147,28 +504,16 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     }
     strpadcpy(id, sizeof(id), path, '\0');
 
-    if (migration->v2) {
-        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
-                             &savevm_vfio_handlers, vbasedev);
-
-        migration->vm_state = qdev_add_vm_change_state_handler(
-            vbasedev->dev, vfio_vmstate_change, vbasedev);
-    } else {
-        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
-                             &savevm_vfio_v1_handlers, vbasedev);
-
-        migration->vm_state = qdev_add_vm_change_state_handler(
-            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
-    }
+    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
+                         vbasedev);
 
+    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
+                                                           vfio_vmstate_change,
+                                                           vbasedev);
     migration->migration_state.notify = vfio_migration_state_notifier;
     add_migration_state_change_notifier(&migration->migration_state);
-    return 0;
 
-err:
-    g_free(info);
-    vfio_migration_exit(vbasedev);
-    return ret;
+    return 0;
 }
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9ef84e24b2..cbd6590dd9 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -150,21 +150,15 @@ vfio_display_edid_write_error(void) ""
 # migration.c
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
-vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
-vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
-vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
-vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2ec3346fea..76d470178f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,16 +61,11 @@ typedef struct VFIORegion {
 typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
     VMChangeStateEntry *vm_state;
-    VFIORegion region;
-    uint32_t device_state_v1;
-    int vm_running;
     Notifier migration_state;
-    uint64_t pending_bytes;
     enum vfio_device_mig_state device_state;
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
-    bool v2;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.21.3



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (12 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 13/17] vfio/migration: Remove VFIO migration protocol v1 Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-16 18:36   ` Alex Williamson
  2022-11-03 16:16 ` [PATCH v3 15/17] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

If vfio_migration_set_state() fails to set the device to the requested
state, it tries to put the device in a recover state. If setting the
recover state fails as well, hw_error() is triggered and the VM is
aborted.

To improve user experience and avoid VM data loss, reset the device with
VFIO_RESET_DEVICE instead of aborting the VM.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index f8c3228314..e8068b9147 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
 
         mig_state->device_state = recover_state;
         if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
-            hw_error("%s: Failed setting device in recover state, err: %s",
-                     vbasedev->name, strerror(errno));
+            error_report(
+                "%s: Failed setting device in recover state, err: %s. Resetting device",
+                         vbasedev->name, strerror(errno));
+
+            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
+                hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
+                         strerror(errno));
+            }
+
+            migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+
+            return -1;
         }
 
         migration->device_state = recover_state;
-- 
2.21.3




* [PATCH v3 15/17] vfio: Alphabetize migration section of VFIO trace-events file
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (13 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 16/17] docs/devel: Align vfio-migration docs to VFIO migration v2 Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 17/17] vfio/migration: Query device data size in vfio_save_pending() Avihai Horon
  16 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Sort the migration section of the VFIO trace-events file alphabetically
and move two misplaced trace events to the common.c section.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/trace-events | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cbd6590dd9..59f48af4ee 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,6 +119,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
+vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
@@ -148,19 +150,17 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
 
 # migration.c
+vfio_load_cleanup(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
-vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
-vfio_save_setup(const char *name) " (%s)"
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
-vfio_save_complete_precopy(const char *name) " (%s)"
-vfio_load_device_config_state(const char *name) " (%s)"
-vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
-vfio_load_cleanup(const char *name) " (%s)"
-vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
-vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_setup(const char *name) " (%s)"
+vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.21.3




* [PATCH v3 16/17] docs/devel: Align vfio-migration docs to VFIO migration v2
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (14 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 15/17] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  2022-11-03 16:16 ` [PATCH v3 17/17] vfio/migration: Query device data size in vfio_save_pending() Avihai Horon
  16 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Align the vfio-migration documentation to VFIO migration protocol v2.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 68 ++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 38 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 9ff6163c88..ad991b7eeb 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,46 +7,39 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices consists of two phases: the optional pre-copy phase,
-and the stop-and-copy phase. The pre-copy phase is iterative and allows to
-accommodate VFIO devices that have a large amount of data that needs to be
-transferred. The iterative pre-copy phase of migration allows for the guest to
-continue whilst the VFIO device state is transferred to the destination, this
-helps to reduce the total downtime of the VM. VFIO devices can choose to skip
-the pre-copy phase of migration by returning pending_bytes as zero during the
-pre-copy phase.
+Migration of VFIO devices currently consists of a single stop-and-copy phase.
+During the stop-and-copy phase the guest is stopped and the entire VFIO device
+data is transferred to the destination.
+
+The pre-copy phase of migration is currently not supported for VFIO devices,
+so VFIO device data is not transferred during pre-copy phase.
 
 A detailed description of the UAPI for VFIO device migration can be found in
-the comment for the ``vfio_device_migration_info`` structure in the header
-file linux-headers/linux/vfio.h.
+the comment for the ``vfio_device_mig_state`` structure in the header file
+linux-headers/linux/vfio.h.
 
 VFIO implements the device hooks for the iterative approach as follows:
 
-* A ``save_setup`` function that sets up the migration region and sets _SAVING
-  flag in the VFIO device state.
+* A ``save_setup`` function that sets up migration on the source.
 
-* A ``load_setup`` function that sets up the migration region on the
-  destination and sets _RESUMING flag in the VFIO device state.
+* A ``load_setup`` function that sets the VFIO device on the destination in
+  _RESUMING state.
 
 * A ``save_live_pending`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
-* A ``save_live_iterate`` function that reads the VFIO device's data from the
-  vendor driver through the migration region during iterative phase.
-
 * A ``save_state`` function to save the device config space if it is present.
 
-* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
-  VFIO device state and iteratively copies the remaining data for the VFIO
-  device until the vendor driver indicates that no data remains (pending bytes
-  is zero).
+* A ``save_live_complete_precopy`` function that sets the VFIO device in
+  _STOP_COPY state and iteratively copies the data for the VFIO device until
+  the vendor driver indicates that no data remains.
 
 * A ``load_state`` function that loads the config section and the data
-  sections that are generated by the save functions above
+  sections that are generated by the save functions above.
 
 * ``cleanup`` functions for both save and load that perform any migration
-  related cleanup, including unmapping the migration region
+  related cleanup.
 
 
 The VFIO migration code uses a VM state change handler to change the VFIO
@@ -71,13 +64,13 @@ tracking can identify dirtied pages, but any page pinned by the vendor driver
 can also be written by the device. There is currently no device or IOMMU
 support for dirty page tracking in hardware.
 
-By default, dirty pages are tracked when the device is in pre-copy as well as
-stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
-the destination in both phases. Copying dirty pages in pre-copy phase helps
-QEMU to predict if it can achieve its downtime tolerances. If QEMU during
-pre-copy phase keeps finding dirty pages continuously, then it understands
-that even in stop-and-copy phase, it is likely to find dirty pages and can
-predict the downtime accordingly.
+By default, dirty pages are tracked during pre-copy as well as stop-and-copy
+phase. So, a page pinned by the vendor driver will be copied to the destination
+in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
+it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
+finding dirty pages continuously, then it understands that even in stop-and-copy
+phase, it is likely to find dirty pages and can predict the downtime
+accordingly.
 
 QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
 which disables querying the dirty bitmap during pre-copy phase. If it is set to
@@ -111,23 +104,22 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                    (RUNNING, _SETUP, _RUNNING|_SAVING)
+                       (RUNNING, _SETUP, _RUNNING)
                                   |
-                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+                      (RUNNING, _ACTIVE, _RUNNING)
              If device is active, get pending_bytes by .save_live_pending()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
-                  Data of VFIO device for pre-copy phase is copied
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
-   each active device. The VFIO device is then transitioned into _SAVING state
-                   (FINISH_MIGRATE, _DEVICE, _SAVING)
+  each active device. The VFIO device is then transitioned into _STOP_COPY state
+                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
                                   |
      For the VFIO device, iterate in .save_live_complete_precopy until
                          pending data is 0
-                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                   (FINISH_MIGRATE, _DEVICE, _STOP)
                                   |
-                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+                 (FINISH_MIGRATE, _COMPLETED, _STOP)
              Migraton thread schedules cleanup bottom half and exits
 
 Live migration resume path
@@ -136,7 +128,7 @@ Live migration resume path
 ::
 
               Incoming migration calls .load_setup for each device
-                       (RESTORE_VM, _ACTIVE, _STOPPED)
+                       (RESTORE_VM, _ACTIVE, _STOP)
                                  |
        For each device, .load_state is called for that device section data
                        (RESTORE_VM, _ACTIVE, _RESUMING)
-- 
2.21.3




* [PATCH v3 17/17] vfio/migration: Query device data size in vfio_save_pending()
  2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
                   ` (15 preceding siblings ...)
  2022-11-03 16:16 ` [PATCH v3 16/17] docs/devel: Align vfio-migration docs to VFIO migration v2 Avihai Horon
@ 2022-11-03 16:16 ` Avihai Horon
  16 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-03 16:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Avihai Horon, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Use the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE uAPI to query the device data
size and report it in vfio_save_pending(), instead of the hardcoded
value that is currently used.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c        | 27 ++++++++++++++++++++-------
 linux-headers/linux/vfio.h | 13 +++++++++++++
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e8068b9147..8ade901383 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -225,14 +225,27 @@ static void vfio_save_pending(void *opaque, uint64_t threshold_size,
                               uint64_t *res_precopy, uint64_t *res_postcopy)
 {
     VFIODevice *vbasedev = opaque;
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+                              sizeof(struct vfio_device_feature_mig_data_size),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (void *)buf;
+    struct vfio_device_feature_mig_data_size *mig_data_size =
+        (void *)feature->data;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
 
-    /*
-     * VFIO migration protocol v2 currently doesn't have an API to get pending
-     * device state size. Until such API is introduced, report some big
-     * arbitrary pending size so the device will be taken into account for
-     * downtime limit calculations.
-     */
-    *res_postcopy += VFIO_MIG_PENDING_SIZE;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        if (errno != ENOTTY) {
+            return;
+        }
+
+        /* Kernel doesn't support VFIO_DEVICE_FEATURE_MIG_DATA_SIZE */
+        *res_postcopy += VFIO_MIG_PENDING_SIZE;
+    } else {
+        *res_postcopy += mig_data_size->stop_copy_length;
+    }
 
     trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
 }
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index ede44b5572..5c4ddf424f 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -986,6 +986,19 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
+ * be required to complete stop copy.
+ *
+ * Note: Can be called on each device state.
+ */
+
+struct vfio_device_feature_mig_data_size {
+	__aligned_u64 stop_copy_length;
+};
+
+#define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.21.3




* Re: [PATCH v3 01/17] migration: Remove res_compatible parameter
  2022-11-03 16:16 ` [PATCH v3 01/17] migration: Remove res_compatible parameter Avihai Horon
@ 2022-11-08 17:52   ` Vladimir Sementsov-Ogievskiy
  2022-11-10 13:36     ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 17:52 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> From: Juan Quintela <quintela@redhat.com>
> 
> It was only used for RAM, and in that case, it means that this amount
> of data was sent for memory. 

It's not clear to me what "this amount of data was sent for memory" means... Actually, that amount of data has not been sent yet.

> Just delete the field in all callers.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>   hw/s390x/s390-stattrib.c       |  6 ++----
>   hw/vfio/migration.c            | 10 ++++------
>   hw/vfio/trace-events           |  2 +-
>   include/migration/register.h   | 20 ++++++++++----------
>   migration/block-dirty-bitmap.c |  7 +++----
>   migration/block.c              |  7 +++----
>   migration/migration.c          |  9 ++++-----
>   migration/ram.c                |  8 +++-----
>   migration/savevm.c             | 14 +++++---------
>   migration/savevm.h             |  4 +---
>   migration/trace-events         |  2 +-
>   11 files changed, 37 insertions(+), 52 deletions(-)
> 

[..]

> diff --git a/include/migration/register.h b/include/migration/register.h
> index c1dcff0f90..1950fee6a8 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
>       int (*save_setup)(QEMUFile *f, void *opaque);
>       void (*save_live_pending)(QEMUFile *f, void *opaque,
>                                 uint64_t threshold_size,
> -                              uint64_t *res_precopy_only,
> -                              uint64_t *res_compatible,
> -                              uint64_t *res_postcopy_only);
> +                              uint64_t *rest_precopy,
> +                              uint64_t *rest_postcopy);
>       /* Note for save_live_pending:
> -     * - res_precopy_only is for data which must be migrated in precopy phase
> -     *     or in stopped state, in other words - before target vm start
> -     * - res_compatible is for data which may be migrated in any phase
> -     * - res_postcopy_only is for data which must be migrated in postcopy phase
> -     *     or in stopped state, in other words - after source vm stop
> +     * - res_precopy is for data which must be migrated in precopy
> +     *     phase or in stopped state, in other words - before target
> +     *     vm start
> +     * - res_postcopy is for data which must be migrated in postcopy
> +     *     phase or in stopped state, in other words - after source vm
> +     *     stop
>        *
> -     * Sum of res_postcopy_only, res_compatible and res_postcopy_only is the
> -     * whole amount of pending data.
> +     * Sum of res_precopy and res_postcopy is the whole amount of
> +     * pending data.
>        */
>   
>   

[..]

> diff --git a/migration/ram.c b/migration/ram.c
> index dc1de9ddbc..20167e1102 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>   }
>   
>   static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
> -                             uint64_t *res_precopy_only,
> -                             uint64_t *res_compatible,
> -                             uint64_t *res_postcopy_only)
> +                             uint64_t *res_precopy, uint64_t *res_postcopy)
>   {
>       RAMState **temp = opaque;
>       RAMState *rs = *temp;
> @@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
>   
>       if (migrate_postcopy_ram()) {
>           /* We can do postcopy, and all the data is postcopiable */
> -        *res_compatible += remaining_size;
> +        *res_postcopy += remaining_size;

That seems not quite correct.

res_postcopy is defined as "data which must be migrated in postcopy", but that's not true here, as RAM can be migrated both in precopy and postcopy.

Still, we really can fold "compat" into "postcopy", because the logic of migration_iteration_run() doesn't actually distinguish "compat" from "post". The logic depends only on "total" and "pre".

So, if we want to combine "compat" into "post", we should redefine "post" in the comment in include/migration/register.h, something like this:

- res_precopy is for data which MUST be migrated in precopy
   phase or in stopped state, in other words - before target
   vm start

- res_postcopy is for all data except for declared in res_precopy.
   res_postcopy data CAN be migrated in postcopy, i.e. after target
   vm start.


>       } else {
> -        *res_precopy_only += remaining_size;
> +        *res_precopy += remaining_size;
>       }
>   }
>   


-- 
Best regards,
Vladimir




* Re: [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter
  2022-11-03 16:16 ` [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter Avihai Horon
@ 2022-11-08 17:57   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 17:57 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> From: Juan Quintela<quintela@redhat.com>
> 
> So remove it everywhere.
> 
> Signed-off-by: Juan Quintela<quintela@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>


-- 
Best regards,
Vladimir




* Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong
  2022-11-03 16:16 ` [PATCH v3 03/17] migration: Block migration comment or code is wrong Avihai Horon
@ 2022-11-08 18:36   ` Vladimir Sementsov-Ogievskiy
  2022-11-08 18:38     ` Vladimir Sementsov-Ogievskiy
  2022-11-10 13:38     ` Avihai Horon
  0 siblings, 2 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 18:36 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> From: Juan Quintela <quintela@redhat.com>
> 
> And it appears that what is wrong is the code. During bulk stage we
> need to make sure that some block is dirty, but no games with
> max_size at all.

:) That made me curious about why we need this one block, so I decided to search through the history.

And what do I see? Haha, that was my commit 04636dc410b163c "migration/block: fix pending() return value" [1], which you actually revert with this patch.

So, at least we should note that it's a revert of [1].

Still, this will reintroduce the bug fixed by [1].

As I understand the problem is (was) that in block_save_complete() we finalize only dirty blocks, but don't finalize the bulk phase if it's not finalized yet. So, we can fix block_save_complete() to finalize the bulk phase, instead of hacking with pending in [1].

Interesting, why we need this one block, described in the comment you refer to? Was it an incomplete workaround for the same problem, described in [1]? If so, we can fix block_save_complete() and remove this if() together with the comment from block_save_pending().
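For illustration, the two pending() behaviors under discussion can be modeled as follows (an editor's sketch whose names mirror migration/block.c; this is not QEMU code). With the 04636dc410b163c logic, an unfinished bulk phase always keeps the reported pending above max_size, so migration cannot converge early; with the reverted logic, a small non-zero pending can slip under the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLK_MIG_BLOCK_SIZE (1ULL << 20)

/* Pre-04636dc logic, which this patch restores: pad only when nothing
 * at all is pending during the bulk phase. */
static uint64_t pending_reverted(uint64_t pending, uint64_t max_size,
                                 bool bulk_completed)
{
    (void)max_size; /* "no games with max_size at all" */
    if (!pending && !bulk_completed) {
        pending = BLK_MIG_BLOCK_SIZE;
    }
    return pending;
}

/* Logic from commit 04636dc410b163c: while the bulk phase is unfinished,
 * always report more than max_size so the migration loop cannot decide
 * to complete before the bulk transfer is done. */
static uint64_t pending_04636dc(uint64_t pending, uint64_t max_size,
                                bool bulk_completed)
{
    if (pending <= max_size && !bulk_completed) {
        pending = max_size + BLK_MIG_BLOCK_SIZE;
    }
    return pending;
}
```

With a small non-zero pending (say 50 bytes) and max_size of 100 during the bulk phase, the reverted logic reports 50 and lets migration try to converge, while the 04636dc logic reports 100 + BLK_MIG_BLOCK_SIZE; that difference is exactly the bug [1] fixed and the revert would reintroduce.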

> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   migration/block.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/block.c b/migration/block.c
> index b3d680af75..39ce4003c6 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -879,8 +879,8 @@ static void block_save_pending(void *opaque, uint64_t max_size,
>       blk_mig_unlock();
>   
>       /* Report at least one block pending during bulk phase */
> -    if (pending <= max_size && !block_mig_state.bulk_completed) {
> -        pending = max_size + BLK_MIG_BLOCK_SIZE;
> +    if (!pending && !block_mig_state.bulk_completed) {
> +        pending = BLK_MIG_BLOCK_SIZE;
>       }
>   
>       trace_migration_block_save_pending(pending);

-- 
Best regards,
Vladimir




* Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong
  2022-11-08 18:36   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-08 18:38     ` Vladimir Sementsov-Ogievskiy
  2022-11-10 13:38     ` Avihai Horon
  1 sibling, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 18:38 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/8/22 21:36, Vladimir Sementsov-Ogievskiy wrote:
> On 11/3/22 19:16, Avihai Horon wrote:
>> From: Juan Quintela <quintela@redhat.com>
>>
>> And it appears that what is wrong is the code. During the bulk stage we
>> need to make sure that some block is dirty, but there should be no games
>> with max_size at all.
> 
> :) That made me curious why we need this one block, so I decided to search through the history.
> 
> And what did I see? Haha, it was my own commit 04636dc410b163c "migration/block: fix pending() return value" [1], which this patch actually reverts.
> 
> So, at the very least, we should note that it's a revert of [1].
> 
> Still, this will reintroduce the bug fixed by [1].
> 
> As I understand it, the problem is (was) that block_save_complete() finalizes only the dirty blocks but doesn't finalize the bulk phase if it isn't finished yet. So, we could fix block_save_complete() to finalize the bulk phase instead of hacking with pending in [1].
> 
> It's interesting why we need this one block, described in the comment you refer to. Was it an incomplete workaround for the same problem described in [1]? If so, we can fix block_save_complete() and remove this if(), together with the comment, from block_save_pending().
> 

PS: Don't we want to deprecate block migration? Is it really used in production? block-mirror is the recommended way to migrate block devices.

-- 
Best regards,
Vladimir




* Re: [PATCH v3 04/17] migration: Simplify migration_iteration_run()
  2022-11-03 16:16 ` [PATCH v3 04/17] migration: Simplify migration_iteration_run() Avihai Horon
@ 2022-11-08 18:56   ` Vladimir Sementsov-Ogievskiy
  2022-11-10 13:42     ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 18:56 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> From: Juan Quintela <quintela@redhat.com>
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   migration/migration.c | 25 +++++++++++++------------
>   1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index ffe868b86f..59cc3c309b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3743,23 +3743,24 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>   
>       trace_migrate_pending(pending_size, s->threshold_size, pend_pre, pend_post);
>   
> -    if (pending_size && pending_size >= s->threshold_size) {
> -        /* Still a significant amount to transfer */
> -        if (!in_postcopy && pend_pre <= s->threshold_size &&
> -            qatomic_read(&s->start_postcopy)) {
> -            if (postcopy_start(s)) {
> -                error_report("%s: postcopy failed to start", __func__);
> -            }
> -            return MIG_ITERATE_SKIP;
> -        }
> -        /* Just another iteration step */
> -        qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
> -    } else {
> +
> +    if (pending_size < s->threshold_size) {

Is the corner case "pending_size == s->threshold_size == 0" theoretically possible here? In that case, pre-patch we go to completion; after-patch we go to the next iteration.

>           trace_migration_thread_low_pending(pending_size);
>           migration_completion(s);
>           return MIG_ITERATE_BREAK;
>       }
>   
> +    /* Still a significant amount to transfer */
> +    if (!in_postcopy && pend_pre <= s->threshold_size &&
> +        qatomic_read(&s->start_postcopy)) {
> +        if (postcopy_start(s)) {
> +            error_report("%s: postcopy failed to start", __func__);
> +        }
> +        return MIG_ITERATE_SKIP;
> +    }
> +
> +    /* Just another iteration step */
> +    qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
>       return MIG_ITERATE_RESUME;
>   }
>   

-- 
Best regards,
Vladimir




* Re: [PATCH v3 05/17] vfio/migration: Fix wrong enum usage
  2022-11-03 16:16 ` [PATCH v3 05/17] vfio/migration: Fix wrong enum usage Avihai Horon
@ 2022-11-08 19:05   ` Vladimir Sementsov-Ogievskiy
  2022-11-10 13:47     ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 19:05 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> vfio_migration_init() initializes VFIOMigration->device_state using an enum
> of VFIO migration protocol v2. The currently implemented protocol is v1, so
> the v1 enum should be used. Fix it.
> 
> Fixes: 429c72800654 ("vfio/migration: Fix incorrect initialization value for parameters in VFIOMigration")
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Zhang Chen <chen.zhang@intel.com>

The commit is already in the master branch.

-- 
Best regards,
Vladimir




* Re: [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug
  2022-11-03 16:16 ` [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
@ 2022-11-08 19:08   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 19:08 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> As part of its error flow, vfio_vmstate_change() accesses
> MigrationState->to_dst_file without any checks. This can cause a NULL
> pointer dereference if the error flow is taken and
> MigrationState->to_dst_file is not set.
> 
> For example, this can happen if VM is started or stopped not during
> migration and vfio_vmstate_change() error flow is taken, as
> MigrationState->to_dst_file is not set at that time.
> 
> Fix it by checking that MigrationState->to_dst_file is set before using
> it.
> 
> Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
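The guard described in the commit message can be sketched as follows (an editor's illustration only; the struct is pared down and set_error_on_stream() is a hypothetical stand-in for the error path in vfio_vmstate_change()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct QEMUFile QEMUFile; /* opaque, as in QEMU */

typedef struct {
    QEMUFile *to_dst_file; /* NULL unless a migration is in progress */
} MigrationState;

/* Only poison the migration stream when it actually exists; a VM
 * start/stop outside of migration leaves to_dst_file NULL, and
 * dereferencing it unconditionally is the bug being fixed. */
static bool set_error_on_stream(MigrationState *ms)
{
    if (ms->to_dst_file == NULL) {
        return false; /* not migrating: nothing to mark as failed */
    }
    /* the real code would call qemu_file_set_error(ms->to_dst_file, ...) */
    return true;
}
```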


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd()
  2022-11-03 16:16 ` [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
@ 2022-11-08 20:26   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 59+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-11-08 20:26 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 11/3/22 19:16, Avihai Horon wrote:
> Add new function qemu_file_get_to_fd() that allows reading data from
> QEMUFile and writing it straight into a given fd.
> 
> This will be used later in VFIO migration code.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
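As an editor's sketch of the idea, a plain-fd analogue of such a helper could look like this (copy_to_fd() is hypothetical; the real qemu_file_get_to_fd() reads from a QEMUFile's internal buffer rather than from a file descriptor):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Stream exactly 'size' bytes from src_fd to dst_fd through a small
 * bounce buffer, handling short writes; returns 0 on success, -1 on
 * error or premature EOF. */
static int copy_to_fd(int src_fd, int dst_fd, size_t size)
{
    char buf[4096];

    while (size > 0) {
        size_t chunk = size < sizeof(buf) ? size : sizeof(buf);
        ssize_t n = read(src_fd, buf, chunk);
        if (n <= 0) {
            return -1;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(dst_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                return -1;
            }
            off += w;
        }
        size -= (size_t)n;
    }
    return 0;
}
```

Copying straight into the target fd avoids an intermediate user buffer per caller, which is the point of the helper for VFIO's device-state fd.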

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

-- 
Best regards,
Vladimir




* Re: [PATCH v3 01/17] migration: Remove res_compatible parameter
  2022-11-08 17:52   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 13:36     ` Avihai Horon
  2022-11-21  7:20       ` Avihai Horon
  2022-11-23 18:23       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-10 13:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 08/11/2022 19:52, Vladimir Sementsov-Ogievskiy wrote:
> External email: Use caution opening links or attachments
>
>
> On 11/3/22 19:16, Avihai Horon wrote:
>> From: Juan Quintela <quintela@redhat.com>
>>
>> It was only used for RAM, and in that case, it means that this amount
>> of data was sent for memory.
>
> It's not clear to me what "this amount of data was sent for memory"
> means... That amount of data was not actually sent yet.
>
Yes, this should be changed to something like:

"It was only used for RAM, and in that case, it means that this amount
of data still needs to be sent for memory, and can be sent in any phase
of migration. The same functionality can be achieved without res_compatible,
so just delete the field in all callers and change the definition of 
res_postcopy accordingly."
>> Just delete the field in all callers.
>>
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> ---
>>   hw/s390x/s390-stattrib.c       |  6 ++----
>>   hw/vfio/migration.c            | 10 ++++------
>>   hw/vfio/trace-events           |  2 +-
>>   include/migration/register.h   | 20 ++++++++++----------
>>   migration/block-dirty-bitmap.c |  7 +++----
>>   migration/block.c              |  7 +++----
>>   migration/migration.c          |  9 ++++-----
>>   migration/ram.c                |  8 +++-----
>>   migration/savevm.c             | 14 +++++---------
>>   migration/savevm.h             |  4 +---
>>   migration/trace-events         |  2 +-
>>   11 files changed, 37 insertions(+), 52 deletions(-)
>>
>
> [..]
>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index c1dcff0f90..1950fee6a8 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
>>       int (*save_setup)(QEMUFile *f, void *opaque);
>>       void (*save_live_pending)(QEMUFile *f, void *opaque,
>>                                 uint64_t threshold_size,
>> -                              uint64_t *res_precopy_only,
>> -                              uint64_t *res_compatible,
>> -                              uint64_t *res_postcopy_only);
>> +                              uint64_t *rest_precopy,
>> +                              uint64_t *rest_postcopy);
>>       /* Note for save_live_pending:
>> -     * - res_precopy_only is for data which must be migrated in 
>> precopy phase
>> -     *     or in stopped state, in other words - before target vm start
>> -     * - res_compatible is for data which may be migrated in any phase
>> -     * - res_postcopy_only is for data which must be migrated in 
>> postcopy phase
>> -     *     or in stopped state, in other words - after source vm stop
>> +     * - res_precopy is for data which must be migrated in precopy
>> +     *     phase or in stopped state, in other words - before target
>> +     *     vm start
>> +     * - res_postcopy is for data which must be migrated in postcopy
>> +     *     phase or in stopped state, in other words - after source vm
>> +     *     stop
>>        *
>> -     * Sum of res_postcopy_only, res_compatible and 
>> res_postcopy_only is the
>> -     * whole amount of pending data.
>> +     * Sum of res_precopy and res_postcopy is the whole amount of
>> +     * pending data.
>>        */
>>
>>
>
> [..]
>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index dc1de9ddbc..20167e1102 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void 
>> *opaque)
>>   }
>>
>>   static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t 
>> max_size,
>> -                             uint64_t *res_precopy_only,
>> -                             uint64_t *res_compatible,
>> -                             uint64_t *res_postcopy_only)
>> +                             uint64_t *res_precopy, uint64_t 
>> *res_postcopy)
>>   {
>>       RAMState **temp = opaque;
>>       RAMState *rs = *temp;
>> @@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void 
>> *opaque, uint64_t max_size,
>>
>>       if (migrate_postcopy_ram()) {
>>           /* We can do postcopy, and all the data is postcopiable */
>> -        *res_compatible += remaining_size;
>> +        *res_postcopy += remaining_size;
>
> That seems not quite correct.
>
> res_postcopy is defined as "data which must be migrated in postcopy", 
> but that's not true here, as RAM can be migrated both in precopy and 
> postcopy.
>
> Still, we really can include "compat" in "postcopy", just because the
> logic of migration_iteration_run() doesn't actually distinguish
> "compat" from "post". The logic depends only on "total" and "pre".
>
> So, if we want to combine "compat" into "post", we should redefine 
> "post" in the comment in include/migration/register.h, something like 
> this:
>
> - res_precopy is for data which MUST be migrated in precopy
>   phase or in stopped state, in other words - before target
>   vm start
>
> - res_postcopy is for all data except for declared in res_precopy.
>   res_postcopy data CAN be migrated in postcopy, i.e. after target
>   vm start.
>
>
You are right, the definition of res_postcopy should be changed.

Yet, I am not sure whether this patch really makes things clearer or simpler.
Juan, what do you think?

Thanks!
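The observation that the iteration logic depends only on the total and on the precopy share can be made concrete with an editor's model (decide() and its names are illustrative, not QEMU's API; res_postcopy here already absorbs the old res_compatible):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { MIG_ITERATE, MIG_START_POSTCOPY, MIG_COMPLETE } MigStep;

/* Model of migration_iteration_run() under the two-counter scheme:
 * only pending = pre + post and the precopy share influence the step. */
static MigStep decide(uint64_t res_precopy, uint64_t res_postcopy,
                      uint64_t threshold, bool postcopy_requested,
                      bool in_postcopy)
{
    uint64_t pending = res_precopy + res_postcopy;

    if (pending == 0 || pending < threshold) {
        return MIG_COMPLETE; /* low pending: finish the migration */
    }
    if (!in_postcopy && res_precopy <= threshold && postcopy_requested) {
        return MIG_START_POSTCOPY; /* mostly postcopiable data remains */
    }
    return MIG_ITERATE; /* still a significant amount to transfer */
}
```

The explicit pending == 0 clause also covers the pending_size == threshold_size == 0 corner case raised elsewhere in the thread.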
>>       } else {
>> -        *res_precopy_only += remaining_size;
>> +        *res_precopy += remaining_size;
>>       }
>>   }
>>
>
>
> -- 
> Best regards,
> Vladimir
>



* Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong
  2022-11-08 18:36   ` Vladimir Sementsov-Ogievskiy
  2022-11-08 18:38     ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 13:38     ` Avihai Horon
  2022-11-21  7:21       ` Avihai Horon
  1 sibling, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-10 13:38 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 08/11/2022 20:36, Vladimir Sementsov-Ogievskiy wrote:
>
>
> On 11/3/22 19:16, Avihai Horon wrote:
>> From: Juan Quintela <quintela@redhat.com>
>>
>> And it appears that what is wrong is the code. During the bulk stage we
>> need to make sure that some block is dirty, but there should be no games
>> with max_size at all.
>
> :) That made me curious why we need this one block, so I decided to
> search through the history.
>
> And what did I see? Haha, it was my own commit 04636dc410b163c
> "migration/block: fix pending() return value" [1], which this patch
> actually reverts.
>
> So, at the very least, we should note that it's a revert of [1].
>
> Still, this will reintroduce the bug fixed by [1].
>
> As I understand it, the problem is (was) that block_save_complete()
> finalizes only the dirty blocks but doesn't finalize the bulk phase if
> it isn't finished yet. So, we could fix block_save_complete() to
> finalize the bulk phase instead of hacking with pending in [1].
>
> It's interesting why we need this one block, described in the comment
> you refer to. Was it an incomplete workaround for the same problem
> described in [1]? If so, we can fix block_save_complete() and remove
> this if(), together with the comment, from block_save_pending().
>
I am not familiar with block migration.
I can drop this patch in the next version.

Juan/Stefan, could you help here?

>>
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   migration/block.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index b3d680af75..39ce4003c6 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -879,8 +879,8 @@ static void block_save_pending(void *opaque, 
>> uint64_t max_size,
>>       blk_mig_unlock();
>>
>>       /* Report at least one block pending during bulk phase */
>> -    if (pending <= max_size && !block_mig_state.bulk_completed) {
>> -        pending = max_size + BLK_MIG_BLOCK_SIZE;
>> +    if (!pending && !block_mig_state.bulk_completed) {
>> +        pending = BLK_MIG_BLOCK_SIZE;
>>       }
>>
>>       trace_migration_block_save_pending(pending);
>
> -- 
> Best regards,
> Vladimir
>



* Re: [PATCH v3 04/17] migration: Simplify migration_iteration_run()
  2022-11-08 18:56   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 13:42     ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-10 13:42 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 08/11/2022 20:56, Vladimir Sementsov-Ogievskiy wrote:
>
>
> On 11/3/22 19:16, Avihai Horon wrote:
>> From: Juan Quintela <quintela@redhat.com>
>>
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   migration/migration.c | 25 +++++++++++++------------
>>   1 file changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index ffe868b86f..59cc3c309b 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3743,23 +3743,24 @@ static MigIterateState 
>> migration_iteration_run(MigrationState *s)
>>
>>       trace_migrate_pending(pending_size, s->threshold_size, 
>> pend_pre, pend_post);
>>
>> -    if (pending_size && pending_size >= s->threshold_size) {
>> -        /* Still a significant amount to transfer */
>> -        if (!in_postcopy && pend_pre <= s->threshold_size &&
>> -            qatomic_read(&s->start_postcopy)) {
>> -            if (postcopy_start(s)) {
>> -                error_report("%s: postcopy failed to start", __func__);
>> -            }
>> -            return MIG_ITERATE_SKIP;
>> -        }
>> -        /* Just another iteration step */
>> -        qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
>> -    } else {
>> +
>> +    if (pending_size < s->threshold_size) {
>
> Is the corner case "pending_size == s->threshold_size == 0" theoretically
> possible here? In that case, pre-patch we go to completion; after-patch
> we go to the next iteration.
>
I guess it's theoretically possible.
Let's address this corner case and keep the functional behavior exactly 
the same.

Thanks!

>> trace_migration_thread_low_pending(pending_size);
>>           migration_completion(s);
>>           return MIG_ITERATE_BREAK;
>>       }
>>
>> +    /* Still a significant amount to transfer */
>> +    if (!in_postcopy && pend_pre <= s->threshold_size &&
>> +        qatomic_read(&s->start_postcopy)) {
>> +        if (postcopy_start(s)) {
>> +            error_report("%s: postcopy failed to start", __func__);
>> +        }
>> +        return MIG_ITERATE_SKIP;
>> +    }
>> +
>> +    /* Just another iteration step */
>> +    qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
>>       return MIG_ITERATE_RESUME;
>>   }
>>
>
> -- 
> Best regards,
> Vladimir
>



* Re: [PATCH v3 05/17] vfio/migration: Fix wrong enum usage
  2022-11-08 19:05   ` Vladimir Sementsov-Ogievskiy
@ 2022-11-10 13:47     ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-10 13:47 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 08/11/2022 21:05, Vladimir Sementsov-Ogievskiy wrote:
>
>
> On 11/3/22 19:16, Avihai Horon wrote:
>> vfio_migration_init() initializes VFIOMigration->device_state using an
>> enum of VFIO migration protocol v2. The currently implemented protocol
>> is v1, so the v1 enum should be used. Fix it.
>>
>> Fixes: 429c72800654 ("vfio/migration: Fix incorrect initialization 
>> value for parameters in VFIOMigration")
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Reviewed-by: Zhang Chen <chen.zhang@intel.com>
>
> The commit is already in the master branch.
>
Yes, I will drop it in the next version.

Thanks!

> -- 
> Best regards,
> Vladimir
>



* Re: [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support
  2022-11-03 16:16 ` [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
@ 2022-11-15 23:36   ` Alex Williamson
  2022-11-16 13:29     ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-15 23:36 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 3 Nov 2022 18:16:10 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Currently, if IOMMU of a VFIO container doesn't support dirty page
> tracking, migration is blocked. This is because a DMA-able VFIO device
> can dirty RAM pages without updating QEMU about it, thus breaking the
> migration.
> 
> However, this doesn't mean that migration can't be done at all.
> In such a case, allow migration and let the QEMU VFIO code mark the entire
> bitmap dirty.
> 
> This guarantees that all pages that might have gotten dirty are reported
> back, and thus guarantees a valid migration even without VFIO IOMMU
> dirty tracking support.
> 
> The motivation for this patch is the future introduction of iommufd [1].
> iommufd will directly implement the /dev/vfio/vfio container IOCTLs by
> mapping them into its internal ops, allowing the usage of these IOCTLs
> over iommufd. However, VFIO IOMMU dirty tracking will not be supported
> by this VFIO compatibility API.
> 
> This patch will allow migration by hosts that use the VFIO compatibility
> API and prevent migration regressions caused by the lack of VFIO IOMMU
> dirty tracking support.
> 
> [1] https://lore.kernel.org/kvm/0-v2-f9436d0bde78+4bb-iommufd_jgg@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c    | 84 +++++++++++++++++++++++++++++++++++++--------
>  hw/vfio/migration.c |  3 +-
>  2 files changed, 70 insertions(+), 17 deletions(-)
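An editor's sketch of the fallback the commit message describes (helper names are illustrative; QEMU's real code uses bitmap_set() and ROUND_UP()): without IOMMU dirty tracking, the bitmap covering the unmapped or synced range is simply fully set, so every page is treated as dirty.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8
#define BITS_PER_LONG (sizeof(unsigned long) * BITS_PER_BYTE)

/* Bitmap size in bytes for 'pages' bits, rounded up to whole 64-bit
 * words, mirroring ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
 * BITS_PER_BYTE in the patch below. */
static uint64_t dirty_bitmap_bytes(uint64_t pages)
{
    return ((pages + 63) / 64) * 64 / BITS_PER_BYTE;
}

/* Fallback when dirty tracking is unsupported: report every page as
 * dirty so no guest write can ever be missed (at the cost of copying
 * all memory again). */
static void bitmap_set_all(unsigned long *bitmap, uint64_t nbits)
{
    uint64_t i;

    for (i = 0; i < nbits; i++) {
        bitmap[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
    }
}
```

This is pessimistic but correct, which is exactly the trade-off the commit message argues for under the iommufd compatibility API.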

This duplicates quite a bit of code; I think we can integrate this into
a common flow quite a bit more. See below, only compile tested. Thanks,

Alex

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b5d8c0bf694..4117b40fd9b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -397,17 +397,33 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
                                  IOMMUTLBEntry *iotlb)
 {
     struct vfio_iommu_type1_dma_unmap *unmap;
-    struct vfio_bitmap *bitmap;
+    struct vfio_bitmap *vbitmap;
+    unsigned long *bitmap;
+    uint64_t bitmap_size;
     uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
     int ret;
 
-    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*vbitmap));
 
-    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->argsz = sizeof(*unmap);
     unmap->iova = iova;
     unmap->size = size;
-    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                  BITS_PER_BYTE;
+    bitmap = g_try_malloc0(bitmap_size);
+    if (!bitmap) {
+        ret = -ENOMEM;
+        goto unmap_exit;
+    }
+
+    if (!container->dirty_pages_supported) {
+        bitmap_set(bitmap, 0, pages);
+        goto do_unmap;
+    }
+
+    unmap->argsz += sizeof(*vbitmap);
+    unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
 
     /*
      * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
@@ -415,33 +431,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
      * to qemu_real_host_page_size.
      */
 
-    bitmap->pgsize = qemu_real_host_page_size();
-    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                   BITS_PER_BYTE;
+    vbitmap = (struct vfio_bitmap *)&unmap->data;
+    vbitmap->data = (__u64 *)bitmap;
+    vbitmap->pgsize = qemu_real_host_page_size();
+    vbitmap->size = bitmap_size;
 
-    if (bitmap->size > container->max_dirty_bitmap_size) {
-        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
-                     (uint64_t)bitmap->size);
+    if (bitmap_size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, bitmap_size);
         ret = -E2BIG;
         goto unmap_exit;
     }
 
-    bitmap->data = g_try_malloc0(bitmap->size);
-    if (!bitmap->data) {
-        ret = -ENOMEM;
-        goto unmap_exit;
-    }
-
+do_unmap:
     ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
     if (!ret) {
-        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-                iotlb->translated_addr, pages);
+        cpu_physical_memory_set_dirty_lebitmap(bitmap, iotlb->translated_addr,
+                                               pages);
     } else {
         error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
     }
 
-    g_free(bitmap->data);
 unmap_exit:
+    g_free(bitmap);
     g_free(unmap);
     return ret;
 }
@@ -460,8 +471,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
-    if (iotlb && container->dirty_pages_supported &&
-        vfio_devices_all_running_and_saving(container)) {
+    if (iotlb && vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
@@ -1257,6 +1267,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
         .argsz = sizeof(dirty),
     };
 
+    if (!container->dirty_pages_supported) {
+        return;
+    }
+
     if (start) {
         dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
     } else {
@@ -1287,11 +1301,26 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
                                  uint64_t size, ram_addr_t ram_addr)
 {
-    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+    struct vfio_iommu_type1_dirty_bitmap *dbitmap = NULL;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
+    unsigned long *bitmap;
+    uint64_t bitmap_size;
     uint64_t pages;
     int ret;
 
+    pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                           BITS_PER_BYTE;
+    bitmap = g_try_malloc0(bitmap_size);
+    if (!bitmap) {
+        return -ENOMEM;
+    }
+
+    if (!container->dirty_pages_supported) {
+        bitmap_set(bitmap, 0, pages);
+        goto set_dirty;
+    }
+
     dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
 
     dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1306,15 +1335,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
      * to qemu_real_host_page_size.
      */
     range->bitmap.pgsize = qemu_real_host_page_size();
-
-    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size();
-    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                                         BITS_PER_BYTE;
-    range->bitmap.data = g_try_malloc0(range->bitmap.size);
-    if (!range->bitmap.data) {
-        ret = -ENOMEM;
-        goto err_out;
-    }
+    range->bitmap.size = bitmap_size;
+    range->bitmap.data = (__u64 *)bitmap;
 
     ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
     if (ret) {
@@ -1324,13 +1346,13 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
         goto err_out;
     }
 
-    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
-                                            ram_addr, pages);
+set_dirty:
+    cpu_physical_memory_set_dirty_lebitmap(bitmap, ram_addr, pages);
 
-    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
-                                range->bitmap.size, ram_addr);
+    trace_vfio_get_dirty_bitmap(container->fd, iova, size,
+                                bitmap_size, ram_addr);
 err_out:
-    g_free(range->bitmap.data);
+    g_free(bitmap);
     g_free(dbitmap);
 
     return ret;
@@ -1465,8 +1487,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    if (vfio_listener_skipped_section(section) ||
-        !container->dirty_pages_supported) {
+    if (vfio_listener_skipped_section(section)) {
         return;
     }
 
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index f5e72c7ac198..99ffb7578290 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -857,11 +857,10 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOContainer *container = vbasedev->group->container;
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration) {
         goto add_blocker;
     }
 



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init()
  2022-11-03 16:16 ` [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
@ 2022-11-15 23:56   ` Alex Williamson
  2022-11-16 13:39     ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-15 23:56 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 3 Nov 2022 18:16:13 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
> vfio_migration_init(). This logic is specific to v1 protocol and moving
> it will make it easier to add the v2 protocol implementation later.
> No functional changes intended.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/migration.c  | 30 +++++++++++++++---------------
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 99ffb75782..0e3a950746 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -785,14 +785,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>      vbasedev->migration = NULL;
>  }
>  
> -static int vfio_migration_init(VFIODevice *vbasedev,
> -                               struct vfio_region_info *info)
> +static int vfio_migration_init(VFIODevice *vbasedev)
>  {
>      int ret;
>      Object *obj;
>      VFIOMigration *migration;
>      char id[256] = "";
>      g_autofree char *path = NULL, *oid = NULL;
> +    struct vfio_region_info *info = NULL;

Nit, I'm not spotting any cases where we need this initialization.  The
same is not true in the code the info handling was extracted from.
Thanks,

Alex

>  
>      if (!vbasedev->ops->vfio_get_object) {
>          return -EINVAL;
> @@ -803,6 +803,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return -EINVAL;
>      }
>  
> +    ret = vfio_get_dev_region_info(vbasedev,
> +                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> +                                   &info);
> +    if (ret) {
> +        return ret;
> +    }
> +
>      vbasedev->migration = g_new0(VFIOMigration, 1);
>      vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
>      vbasedev->migration->vm_running = runstate_is_running();
> @@ -822,6 +830,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          goto err;
>      }
>  
> +    g_free(info);
> +
>      migration = vbasedev->migration;
>      migration->vbasedev = vbasedev;
>  
> @@ -844,6 +854,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>      return 0;
>  
>  err:
> +    g_free(info);
>      vfio_migration_exit(vbasedev);
>      return ret;
>  }
> @@ -857,34 +868,23 @@ int64_t vfio_mig_bytes_transferred(void)
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
> -    struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
>      if (!vbasedev->enable_migration) {
>          goto add_blocker;
>      }
>  
> -    ret = vfio_get_dev_region_info(vbasedev,
> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> -                                   &info);
> +    ret = vfio_migration_init(vbasedev);
>      if (ret) {
>          goto add_blocker;
>      }
>  
> -    ret = vfio_migration_init(vbasedev, info);
> -    if (ret) {
> -        goto add_blocker;
> -    }
> -
> -    trace_vfio_migration_probe(vbasedev->name, info->index);
> -    g_free(info);
> +    trace_vfio_migration_probe(vbasedev->name);
>      return 0;
>  
>  add_blocker:
>      error_setg(&vbasedev->migration_blocker,
>                 "VFIO device doesn't support migration");
> -    g_free(info);
>  
>      ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
>      if (ret < 0) {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index a21cbd2a56..27c059f96e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>  vfio_display_edid_write_error(void) ""
>  
>  # migration.c
> -vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_probe(const char *name) " (%s)"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"




* Re: [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support
  2022-11-15 23:36   ` Alex Williamson
@ 2022-11-16 13:29     ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-16 13:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 16/11/2022 1:36, Alex Williamson wrote:
> On Thu, 3 Nov 2022 18:16:10 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Currently, if IOMMU of a VFIO container doesn't support dirty page
>> tracking, migration is blocked. This is because a DMA-able VFIO device
>> can dirty RAM pages without updating QEMU about it, thus breaking the
>> migration.
>>
>> However, this doesn't mean that migration can't be done at all.
>> In such case, allow migration and let QEMU VFIO code mark the entire
>> bitmap dirty.
>>
>> This guarantees that all pages that might have gotten dirty are reported
>> back, and thus guarantees a valid migration even without VFIO IOMMU
>> dirty tracking support.
>>
>> The motivation for this patch is the future introduction of iommufd [1].
>> iommufd will directly implement the /dev/vfio/vfio container IOCTLs by
>> mapping them into its internal ops, allowing the usage of these IOCTLs
>> over iommufd. However, VFIO IOMMU dirty tracking will not be supported
>> by this VFIO compatibility API.
>>
>> This patch will allow migration by hosts that use the VFIO compatibility
>> API and prevent migration regressions caused by the lack of VFIO IOMMU
>> dirty tracking support.
>>
>> [1] https://lore.kernel.org/kvm/0-v2-f9436d0bde78+4bb-iommufd_jgg@nvidia.com/
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c    | 84 +++++++++++++++++++++++++++++++++++++--------
>>   hw/vfio/migration.c |  3 +-
>>   2 files changed, 70 insertions(+), 17 deletions(-)
> This duplicates quite a bit of code; I think we can integrate this into
> a common flow quite a bit more.  See below, only compile tested. Thanks,

Oh, great, thanks!
I will test it and add it as part of v4.

> Alex
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6b5d8c0bf694..4117b40fd9b0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -397,17 +397,33 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>                                    IOMMUTLBEntry *iotlb)
>   {
>       struct vfio_iommu_type1_dma_unmap *unmap;
> -    struct vfio_bitmap *bitmap;
> +    struct vfio_bitmap *vbitmap;
> +    unsigned long *bitmap;
> +    uint64_t bitmap_size;
>       uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
>       int ret;
>
> -    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*vbitmap));
>
> -    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> +    unmap->argsz = sizeof(*unmap);
>       unmap->iova = iova;
>       unmap->size = size;
> -    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> -    bitmap = (struct vfio_bitmap *)&unmap->data;
> +
> +    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                  BITS_PER_BYTE;
> +    bitmap = g_try_malloc0(bitmap_size);
> +    if (!bitmap) {
> +        ret = -ENOMEM;
> +        goto unmap_exit;
> +    }
> +
> +    if (!container->dirty_pages_supported) {
> +        bitmap_set(bitmap, 0, pages);
> +        goto do_unmap;
> +    }
> +
> +    unmap->argsz += sizeof(*vbitmap);
> +    unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
>
>       /*
>        * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> @@ -415,33 +431,28 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>        * to qemu_real_host_page_size.
>        */
>
> -    bitmap->pgsize = qemu_real_host_page_size();
> -    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> -                   BITS_PER_BYTE;
> +    vbitmap = (struct vfio_bitmap *)&unmap->data;
> +    vbitmap->data = (__u64 *)bitmap;
> +    vbitmap->pgsize = qemu_real_host_page_size();
> +    vbitmap->size = bitmap_size;
>
> -    if (bitmap->size > container->max_dirty_bitmap_size) {
> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
> -                     (uint64_t)bitmap->size);
> +    if (bitmap_size > container->max_dirty_bitmap_size) {
> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, bitmap_size);
>           ret = -E2BIG;
>           goto unmap_exit;
>       }
>
> -    bitmap->data = g_try_malloc0(bitmap->size);
> -    if (!bitmap->data) {
> -        ret = -ENOMEM;
> -        goto unmap_exit;
> -    }
> -
> +do_unmap:
>       ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
>       if (!ret) {
> -        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
> -                iotlb->translated_addr, pages);
> +        cpu_physical_memory_set_dirty_lebitmap(bitmap, iotlb->translated_addr,
> +                                               pages);
>       } else {
>           error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
>       }
>
> -    g_free(bitmap->data);
>   unmap_exit:
> +    g_free(bitmap);
>       g_free(unmap);
>       return ret;
>   }
> @@ -460,8 +471,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
>           .size = size,
>       };
>
> -    if (iotlb && container->dirty_pages_supported &&
> -        vfio_devices_all_running_and_saving(container)) {
> +    if (iotlb && vfio_devices_all_running_and_saving(container)) {
>           return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>       }
>
> @@ -1257,6 +1267,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>           .argsz = sizeof(dirty),
>       };
>
> +    if (!container->dirty_pages_supported) {
> +        return;
> +    }
> +
>       if (start) {
>           dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>       } else {
> @@ -1287,11 +1301,26 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
>   static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>                                    uint64_t size, ram_addr_t ram_addr)
>   {
> -    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +    struct vfio_iommu_type1_dirty_bitmap *dbitmap = NULL;
>       struct vfio_iommu_type1_dirty_bitmap_get *range;
> +    unsigned long *bitmap;
> +    uint64_t bitmap_size;
>       uint64_t pages;
>       int ret;
>
> +    pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
> +    bitmap_size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                           BITS_PER_BYTE;
> +    bitmap = g_try_malloc0(bitmap_size);
> +    if (!bitmap) {
> +        return -ENOMEM;
> +    }
> +
> +    if (!container->dirty_pages_supported) {
> +        bitmap_set(bitmap, 0, pages);
> +        goto set_dirty;
> +    }
> +
>       dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
>
>       dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> @@ -1306,15 +1335,8 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>        * to qemu_real_host_page_size.
>        */
>       range->bitmap.pgsize = qemu_real_host_page_size();
> -
> -    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size();
> -    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> -                                         BITS_PER_BYTE;
> -    range->bitmap.data = g_try_malloc0(range->bitmap.size);
> -    if (!range->bitmap.data) {
> -        ret = -ENOMEM;
> -        goto err_out;
> -    }
> +    range->bitmap.size = bitmap_size;
> +    range->bitmap.data = (__u64 *)bitmap;
>
>       ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
>       if (ret) {
> @@ -1324,13 +1346,13 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>           goto err_out;
>       }
>
> -    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
> -                                            ram_addr, pages);
> +set_dirty:
> +    cpu_physical_memory_set_dirty_lebitmap(bitmap, ram_addr, pages);
>
> -    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> -                                range->bitmap.size, ram_addr);
> +    trace_vfio_get_dirty_bitmap(container->fd, iova, size,
> +                                bitmap_size, ram_addr);
>   err_out:
> -    g_free(range->bitmap.data);
> +    g_free(bitmap);
>       g_free(dbitmap);
>
>       return ret;
> @@ -1465,8 +1487,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>
> -    if (vfio_listener_skipped_section(section) ||
> -        !container->dirty_pages_supported) {
> +    if (vfio_listener_skipped_section(section)) {
>           return;
>       }
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f5e72c7ac198..99ffb7578290 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -857,11 +857,10 @@ int64_t vfio_mig_bytes_transferred(void)
>
>   int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>   {
> -    VFIOContainer *container = vbasedev->group->container;
>       struct vfio_region_info *info = NULL;
>       int ret = -ENOTSUP;
>
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration) {
>           goto add_blocker;
>       }
>
>



* Re: [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init()
  2022-11-15 23:56   ` Alex Williamson
@ 2022-11-16 13:39     ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-16 13:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 16/11/2022 1:56, Alex Williamson wrote:
> On Thu, 3 Nov 2022 18:16:13 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
>> vfio_migration_init(). This logic is specific to v1 protocol and moving
>> it will make it easier to add the v2 protocol implementation later.
>> No functional changes intended.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 30 +++++++++++++++---------------
>>   hw/vfio/trace-events |  2 +-
>>   2 files changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 99ffb75782..0e3a950746 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -785,14 +785,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>>       vbasedev->migration = NULL;
>>   }
>>
>> -static int vfio_migration_init(VFIODevice *vbasedev,
>> -                               struct vfio_region_info *info)
>> +static int vfio_migration_init(VFIODevice *vbasedev)
>>   {
>>       int ret;
>>       Object *obj;
>>       VFIOMigration *migration;
>>       char id[256] = "";
>>       g_autofree char *path = NULL, *oid = NULL;
>> +    struct vfio_region_info *info = NULL;
> Nit, I'm not spotting any cases where we need this initialization.  The
> same is not true in the code the info handling was extracted from.
> Thanks,

You are right. I will drop the initialization in v4.
Thanks!

> Alex
>
>>       if (!vbasedev->ops->vfio_get_object) {
>>           return -EINVAL;
>> @@ -803,6 +803,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return -EINVAL;
>>       }
>>
>> +    ret = vfio_get_dev_region_info(vbasedev,
>> +                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>> +                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>> +                                   &info);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>>       vbasedev->migration = g_new0(VFIOMigration, 1);
>>       vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
>>       vbasedev->migration->vm_running = runstate_is_running();
>> @@ -822,6 +830,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           goto err;
>>       }
>>
>> +    g_free(info);
>> +
>>       migration = vbasedev->migration;
>>       migration->vbasedev = vbasedev;
>>
>> @@ -844,6 +854,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>       return 0;
>>
>>   err:
>> +    g_free(info);
>>       vfio_migration_exit(vbasedev);
>>       return ret;
>>   }
>> @@ -857,34 +868,23 @@ int64_t vfio_mig_bytes_transferred(void)
>>
>>   int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>   {
>> -    struct vfio_region_info *info = NULL;
>>       int ret = -ENOTSUP;
>>
>>       if (!vbasedev->enable_migration) {
>>           goto add_blocker;
>>       }
>>
>> -    ret = vfio_get_dev_region_info(vbasedev,
>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>> -                                   &info);
>> +    ret = vfio_migration_init(vbasedev);
>>       if (ret) {
>>           goto add_blocker;
>>       }
>>
>> -    ret = vfio_migration_init(vbasedev, info);
>> -    if (ret) {
>> -        goto add_blocker;
>> -    }
>> -
>> -    trace_vfio_migration_probe(vbasedev->name, info->index);
>> -    g_free(info);
>> +    trace_vfio_migration_probe(vbasedev->name);
>>       return 0;
>>
>>   add_blocker:
>>       error_setg(&vbasedev->migration_blocker,
>>                  "VFIO device doesn't support migration");
>> -    g_free(info);
>>
>>       ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
>>       if (ret < 0) {
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index a21cbd2a56..27c059f96e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>>   vfio_display_edid_write_error(void) ""
>>
>>   # migration.c
>> -vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>> +vfio_migration_probe(const char *name) " (%s)"
>>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"



* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-03 16:16 ` [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2022-11-16 18:29   ` Alex Williamson
  2022-11-17 17:07     ` Avihai Horon
  2022-11-23 18:59   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-16 18:29 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 3 Nov 2022 18:16:15 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Add implementation of VFIO migration protocol v2. The two protocols, v1
> and v2, will co-exist and in next patch v1 protocol will be removed.
> 
> There are several main differences between v1 and v2 protocols:
> - VFIO device state is now represented as a finite state machine instead
>   of a bitmap.
> 
> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>   ioctl and normal read() and write() instead of the migration region.
> 
> - VFIO migration protocol v2 currently doesn't support the pre-copy
>   phase of migration.
> 
> Detailed information about VFIO migration protocol v2 and difference
> compared to v1 can be found here [1].
> 
> [1]
> https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/common.c              |  19 +-
>  hw/vfio/migration.c           | 386 ++++++++++++++++++++++++++++++----
>  hw/vfio/trace-events          |   4 +
>  include/hw/vfio/vfio-common.h |   5 +
>  4 files changed, 375 insertions(+), 39 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 617e6cd901..0bdbd1586b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -355,10 +355,18 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>                  return false;
>              }
>  
> -            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
> +            if (!migration->v2 &&
> +                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>                  (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
>                  return false;
>              }
> +
> +            if (migration->v2 &&
> +                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
> +                return false;
> +            }
>          }
>      }
>      return true;
> @@ -385,7 +393,14 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                  return false;
>              }
>  
> -            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +            if (!migration->v2 &&
> +                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> +                continue;
> +            }
> +
> +            if (migration->v2 &&
> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>                  continue;
>              } else {
>                  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e784374453..62afc23a8c 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -44,8 +44,84 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>  
> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)

Add comment explaining heuristic of this size.

> +
>  static int64_t bytes_transferred;
>  
> +static const char *mig_state_to_str(enum vfio_device_mig_state state)
> +{
> +    switch (state) {
> +    case VFIO_DEVICE_STATE_ERROR:
> +        return "ERROR";
> +    case VFIO_DEVICE_STATE_STOP:
> +        return "STOP";
> +    case VFIO_DEVICE_STATE_RUNNING:
> +        return "RUNNING";
> +    case VFIO_DEVICE_STATE_STOP_COPY:
> +        return "STOP_COPY";
> +    case VFIO_DEVICE_STATE_RESUMING:
> +        return "RESUMING";
> +    case VFIO_DEVICE_STATE_RUNNING_P2P:
> +        return "RUNNING_P2P";
> +    default:
> +        return "UNKNOWN STATE";
> +    }
> +}
> +
> +static int vfio_migration_set_state(VFIODevice *vbasedev,
> +                                    enum vfio_device_mig_state new_state,
> +                                    enum vfio_device_mig_state recover_state)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> +                              sizeof(struct vfio_device_feature_mig_state),
> +                              sizeof(uint64_t))] = {};
> +    struct vfio_device_feature *feature = (void *)buf;
> +    struct vfio_device_feature_mig_state *mig_state = (void *)feature->data;

We can cast to the actual types rather than void* here.

> +
> +    feature->argsz = sizeof(buf);
> +    feature->flags =
> +        VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
> +    mig_state->device_state = new_state;
> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +        /* Try to set the device in some good state */
> +        error_report(
> +            "%s: Failed setting device state to %s, err: %s. Setting device in recover state %s",
> +                     vbasedev->name, mig_state_to_str(new_state),
> +                     strerror(errno), mig_state_to_str(recover_state));
> +
> +        mig_state->device_state = recover_state;
> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +            hw_error("%s: Failed setting device in recover state, err: %s",
> +                     vbasedev->name, strerror(errno));
> +        }
> +
> +        migration->device_state = recover_state;
> +
> +        return -1;

We could preserve -errno to return here.

> +    }
> +
> +    if (mig_state->data_fd != -1) {
> +        if (migration->data_fd != -1) {
> +            /*
> +             * This can happen if the device is asynchronously reset and
> +             * terminates a data transfer.
> +             */
> +            error_report("%s: data_fd out of sync", vbasedev->name);
> +            close(mig_state->data_fd);
> +
> +            return -1;

Should we go to recover_state here?  Is migration->device_state
invalid?  -EBADF?

> +        }
> +
> +        migration->data_fd = mig_state->data_fd;
> +    }
> +    migration->device_state = new_state;
> +
> +    trace_vfio_migration_set_state(vbasedev->name, mig_state_to_str(new_state));
> +
> +    return 0;
> +}
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -260,6 +336,20 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>      return ret;
>  }
>  
> +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> +                            uint64_t data_size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
> +    if (!ret) {
> +        trace_vfio_load_state_device_data(vbasedev->name, data_size);
> +    }
> +
> +    return ret;
> +}
> +
>  static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>                                 uint64_t data_size)
>  {
> @@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static void vfio_migration_cleanup(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    close(migration->data_fd);
> +    migration->data_fd = -1;
> +}
> +
>  static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>  
>  /* ---------------------------------------------------------------------- */
>  
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_cleanup(vbasedev);
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
>  static void vfio_v1_save_cleanup(void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)

There's a comment below, but that gets deleted in a later patch while
we still use this as a fallback size.  Some explanation of how this
size is derived would be useful.  Is this an estimate for mlx5?  It
seems much too small for a GPU.  For a fallback, should we set something
here so large that we don't risk failing any SLA, ex. 100G?

> +static void vfio_save_pending(void *opaque, uint64_t threshold_size,
> +                              uint64_t *res_precopy, uint64_t *res_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    /*
> +     * VFIO migration protocol v2 currently doesn't have an API to get pending
> +     * device state size. Until such API is introduced, report some big
> +     * arbitrary pending size so the device will be taken into account for
> +     * downtime limit calculations.
> +     */
> +    *res_postcopy += VFIO_MIG_PENDING_SIZE;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
> +}
> +
>  static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
>                                   uint64_t *res_precopy, uint64_t *res_postcopy)
>  {
> @@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
> +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> +{
> +    ssize_t data_size;
> +
> +    data_size = read(migration->data_fd, migration->data_buffer,
> +                     migration->data_buffer_size);
> +    if (data_size < 0) {
> +        return -1;

Appears this could return -errno, granted it'll get swallowed in
qemu_savevm_state_complete_precopy_iterable(), but it seems a bit
cleaner here.

> +    }
> +    if (data_size == 0) {
> +        return 1;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +    qemu_put_be64(f, data_size);
> +    qemu_put_buffer(f, migration->data_buffer, data_size);
> +    bytes_transferred += data_size;
> +
> +    trace_vfio_save_block(migration->vbasedev->name, data_size);
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    enum vfio_device_mig_state recover_state;
> +    int ret;
> +
> +    /* We reach here with device state STOP only */
> +    recover_state = VFIO_DEVICE_STATE_STOP;
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +                                   recover_state);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    do {
> +        ret = vfio_save_block(f, vbasedev->migration);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    } while (!ret);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    recover_state = VFIO_DEVICE_STATE_ERROR;
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> +                                   recover_state);
> +    if (!ret) {
> +        trace_vfio_save_complete_precopy(vbasedev->name);
> +    }
> +
> +    return ret;
> +}
> +
>  static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>      }
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> +                                   vbasedev->migration->device_state);
> +}
> +
>  static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_cleanup(vbasedev);
> +    trace_vfio_load_cleanup(vbasedev->name);
> +
> +    return 0;
> +}
> +
>  static int vfio_v1_load_cleanup(void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>              uint64_t data_size = qemu_get_be64(f);
>  
>              if (data_size) {
> -                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
> +                if (vbasedev->migration->v2) {
> +                    ret = vfio_load_buffer(f, vbasedev, data_size);
> +                } else {
> +                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
> +                }
>                  if (ret < 0) {
>                      return ret;
>                  }
> @@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>      return ret;
>  }
>  
> +static const SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_state = vfio_save_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
> +};
> +
>  static SaveVMHandlers savevm_vfio_v1_handlers = {
>      .save_setup = vfio_v1_save_setup,
>      .save_cleanup = vfio_v1_save_cleanup,
> @@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
>  
>  /* ---------------------------------------------------------------------- */
>  
> +static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    enum vfio_device_mig_state new_state;
> +    int ret;
> +
> +    if (running) {
> +        new_state = VFIO_DEVICE_STATE_RUNNING;
> +    } else {
> +        new_state = VFIO_DEVICE_STATE_STOP;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, new_state,
> +                                   VFIO_DEVICE_STATE_ERROR);
> +    if (ret) {
> +        /*
> +         * Migration should be aborted in this case, but vm_state_notify()
> +         * currently does not support reporting failures.
> +         */
> +        if (migrate_get_current()->to_dst_file) {
> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> +        }
> +    }
> +
> +    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                              mig_state_to_str(new_state));
> +}
> +
>  static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_FAILED:
>          bytes_transferred = 0;
> -        ret = vfio_migration_v1_set_state(vbasedev,
> -                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
> -                                            VFIO_DEVICE_STATE_V1_RESUMING),
> -                                          VFIO_DEVICE_STATE_V1_RUNNING);
> -        if (ret) {
> -            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        if (migration->v2) {
> +            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
> +                                     VFIO_DEVICE_STATE_ERROR);
> +        } else {
> +            ret = vfio_migration_v1_set_state(vbasedev,
> +                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
> +                                                VFIO_DEVICE_STATE_V1_RESUMING),
> +                                              VFIO_DEVICE_STATE_V1_RUNNING);
> +            if (ret) {
> +                error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +            }
>          }
>      }
>  }
> @@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
>  
> -    vfio_region_exit(&migration->region);
> -    vfio_region_finalize(&migration->region);
> +    if (migration->v2) {
> +        g_free(migration->data_buffer);
> +    } else {
> +        vfio_region_exit(&migration->region);
> +        vfio_region_finalize(&migration->region);
> +    }
>      g_free(vbasedev->migration);
>      vbasedev->migration = NULL;
>  }
>  
> +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
> +{
> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> +                                  sizeof(struct vfio_device_feature_migration),
> +                              sizeof(uint64_t))] = {};
> +    struct vfio_device_feature *feature = (void *)buf;
> +    struct vfio_device_feature_migration *mig = (void *)feature->data;
> +
> +    feature->argsz = sizeof(buf);
> +    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +        return -EOPNOTSUPP;
> +    }
> +
> +    *mig_flags = mig->flags;
> +
> +    return 0;
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev)
>  {
>      int ret;
> @@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>      char id[256] = "";
>      g_autofree char *path = NULL, *oid = NULL;
>      struct vfio_region_info *info = NULL;
> +    uint64_t mig_flags;
>  
>      if (!vbasedev->ops->vfio_get_object) {
>          return -EINVAL;
> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>          return -EINVAL;
>      }
>  
> -    ret = vfio_get_dev_region_info(vbasedev,
> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> -                                   &info);
> -    if (ret) {
> -        return ret;
> -    }
> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> +    if (!ret) {
> +        /* Migration v2 */
> +        /* Basic migration functionality must be supported */
> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> +            return -EOPNOTSUPP;
> +        }
> +        vbasedev->migration = g_new0(VFIOMigration, 1);
> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
> +        vbasedev->migration->data_buffer =
> +            g_malloc0(vbasedev->migration->data_buffer_size);

So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
later addition of estimated device data size make any changes here?
I'd think we'd want to scale the buffer to the minimum of the reported
data size and some well documented heuristic for an upper bound.
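Something like the following heuristic, as a sketch only — the helper name and the 16 MiB cap are purely illustrative, only the 1 MiB default comes from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizing helper: scale the chunk buffer to the device's
 * reported state size, clamped between the 1 MiB default and an
 * assumed upper bound.  VFIO_MIG_DATA_BUFFER_MAX is illustrative. */
#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)
#define VFIO_MIG_DATA_BUFFER_MAX  (16 * 1024 * 1024)

static size_t vfio_mig_buffer_size(uint64_t reported_size)
{
    if (reported_size < VFIO_MIG_DATA_BUFFER_SIZE) {
        return VFIO_MIG_DATA_BUFFER_SIZE;   /* includes "no estimate" (0) */
    }
    if (reported_size > VFIO_MIG_DATA_BUFFER_MAX) {
        return VFIO_MIG_DATA_BUFFER_MAX;
    }
    return (size_t)reported_size;
}
```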

> +        vbasedev->migration->data_fd = -1;
> +        vbasedev->migration->v2 = true;
> +    } else {
> +        /* Migration v1 */
> +        ret = vfio_get_dev_region_info(vbasedev,
> +                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> +                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> +                                       &info);
> +        if (ret) {
> +            return ret;
> +        }
>  
> -    vbasedev->migration = g_new0(VFIOMigration, 1);
> -    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
> -    vbasedev->migration->vm_running = runstate_is_running();
> +        vbasedev->migration = g_new0(VFIOMigration, 1);
> +        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
> +        vbasedev->migration->vm_running = runstate_is_running();
>  
> -    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
> -                            info->index, "migration");
> -    if (ret) {
> -        error_report("%s: Failed to setup VFIO migration region %d: %s",
> -                     vbasedev->name, info->index, strerror(-ret));
> -        goto err;
> -    }
> +        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
> +                                info->index, "migration");
> +        if (ret) {
> +            error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                         vbasedev->name, info->index, strerror(-ret));
> +            goto err;
> +        }
>  
> -    if (!vbasedev->migration->region.size) {
> -        error_report("%s: Invalid zero-sized VFIO migration region %d",
> -                     vbasedev->name, info->index);
> -        ret = -EINVAL;
> -        goto err;
> -    }
> +        if (!vbasedev->migration->region.size) {
> +            error_report("%s: Invalid zero-sized VFIO migration region %d",
> +                         vbasedev->name, info->index);
> +            ret = -EINVAL;
> +            goto err;
> +        }
>  
> -    g_free(info);
> +        g_free(info);


It would probably make sense to scope info within this branch, but it
goes away in the next patch anyway, so this is fine.  Thanks,

Alex

> +    }
>  
>      migration = vbasedev->migration;
>      migration->vbasedev = vbasedev;
> @@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>      }
>      strpadcpy(id, sizeof(id), path, '\0');
>  
> -    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> -                         &savevm_vfio_v1_handlers, vbasedev);
> +    if (migration->v2) {
> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> +                             &savevm_vfio_handlers, vbasedev);
> +
> +        migration->vm_state = qdev_add_vm_change_state_handler(
> +            vbasedev->dev, vfio_vmstate_change, vbasedev);
> +    } else {
> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> +                             &savevm_vfio_v1_handlers, vbasedev);
> +
> +        migration->vm_state = qdev_add_vm_change_state_handler(
> +            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
> +    }
>  
> -    migration->vm_state = qdev_add_vm_change_state_handler(
> -        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>      migration->migration_state.notify = vfio_migration_state_notifier;
>      add_migration_state_change_notifier(&migration->migration_state);
>      return 0;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index d88d2b4053..9ef84e24b2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name) " (%s)"
> +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
>  vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>  vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>  vfio_save_setup(const char *name) " (%s)"
> @@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>  vfio_load_device_config_state(const char *name) " (%s)"
>  vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
>  vfio_load_cleanup(const char *name) " (%s)"
>  vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
>  vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
> +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bbaf72ba00..2ec3346fea 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
>      int vm_running;
>      Notifier migration_state;
>      uint64_t pending_bytes;
> +    enum vfio_device_mig_state device_state;
> +    int data_fd;
> +    void *data_buffer;
> +    size_t data_buffer_size;
> +    bool v2;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails
  2022-11-03 16:16 ` [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails Avihai Horon
@ 2022-11-16 18:36   ` Alex Williamson
  2022-11-17 17:11     ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-16 18:36 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 3 Nov 2022 18:16:17 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> If vfio_migration_set_state() fails to set the device in the requested
> state it tries to put it in a recover state. If setting the device in
> the recover state fails as well, hw_error is triggered and the VM is
> aborted.
> 
> To improve user experience and avoid VM data loss, reset the device with
> VFIO_RESET_DEVICE instead of aborting the VM.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  hw/vfio/migration.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f8c3228314..e8068b9147 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>  
>          mig_state->device_state = recover_state;
>          if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -            hw_error("%s: Failed setting device in recover state, err: %s",
> -                     vbasedev->name, strerror(errno));
> +            error_report(
> +                "%s: Failed setting device in recover state, err: %s. Resetting device",
> +                         vbasedev->name, strerror(errno));
> +
> +            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> +                hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
> +                         strerror(errno));
> +            }
> +
> +            migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +
> +            return -1;
>          }
>  
>          migration->device_state = recover_state;

This addresses one of my comments on 12/ and should probably be rolled
in there.  Thanks,

Alex




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-16 18:29   ` Alex Williamson
@ 2022-11-17 17:07     ` Avihai Horon
  2022-11-17 17:24       ` Jason Gunthorpe
  2022-11-17 17:38       ` Alex Williamson
  0 siblings, 2 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-17 17:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 16/11/2022 20:29, Alex Williamson wrote:
>
> On Thu, 3 Nov 2022 18:16:15 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Add implementation of VFIO migration protocol v2. The two protocols, v1
>> and v2, will co-exist and in next patch v1 protocol will be removed.
>>
>> There are several main differences between v1 and v2 protocols:
>> - VFIO device state is now represented as a finite state machine instead
>>    of a bitmap.
>>
>> - Migration interface with kernel is now done using VFIO_DEVICE_FEATURE
>>    ioctl and normal read() and write() instead of the migration region.
>>
>> - VFIO migration protocol v2 currently doesn't support the pre-copy
>>    phase of migration.
>>
>> Detailed information about VFIO migration protocol v2 and difference
>> compared to v1 can be found here [1].
>>
>> [1]
>> https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/common.c              |  19 +-
>>   hw/vfio/migration.c           | 386 ++++++++++++++++++++++++++++++----
>>   hw/vfio/trace-events          |   4 +
>>   include/hw/vfio/vfio-common.h |   5 +
>>   4 files changed, 375 insertions(+), 39 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 617e6cd901..0bdbd1586b 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -355,10 +355,18 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>> +            if (!migration->v2 &&
>> +                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>>                   (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
>>                   return false;
>>               }
>> +
>> +            if (migration->v2 &&
>> +                (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>> +                return false;
>> +            }
>>           }
>>       }
>>       return true;
>> @@ -385,7 +393,14 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>> +            if (!migration->v2 &&
>> +                migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>> +                continue;
>> +            }
>> +
>> +            if (migration->v2 &&
>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
>>                   continue;
>>               } else {
>>                   return false;
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index e784374453..62afc23a8c 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -44,8 +44,84 @@
>>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>>   #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>>
>> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)
> Add comment explaining heuristic of this size.

This is an arbitrary size we picked with mlx5 state size in mind.
Increasing this size to higher values (128M, 1G) didn't improve 
performance in our testing.

How about this comment:
This is an initial value that doesn't consume much memory and provides 
good performance.

Do you have other suggestion?

>
>> +
>>   static int64_t bytes_transferred;
>>
>> +static const char *mig_state_to_str(enum vfio_device_mig_state state)
>> +{
>> +    switch (state) {
>> +    case VFIO_DEVICE_STATE_ERROR:
>> +        return "ERROR";
>> +    case VFIO_DEVICE_STATE_STOP:
>> +        return "STOP";
>> +    case VFIO_DEVICE_STATE_RUNNING:
>> +        return "RUNNING";
>> +    case VFIO_DEVICE_STATE_STOP_COPY:
>> +        return "STOP_COPY";
>> +    case VFIO_DEVICE_STATE_RESUMING:
>> +        return "RESUMING";
>> +    case VFIO_DEVICE_STATE_RUNNING_P2P:
>> +        return "RUNNING_P2P";
>> +    default:
>> +        return "UNKNOWN STATE";
>> +    }
>> +}
>> +
>> +static int vfio_migration_set_state(VFIODevice *vbasedev,
>> +                                    enum vfio_device_mig_state new_state,
>> +                                    enum vfio_device_mig_state recover_state)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>> +                              sizeof(struct vfio_device_feature_mig_state),
>> +                              sizeof(uint64_t))] = {};
>> +    struct vfio_device_feature *feature = (void *)buf;
>> +    struct vfio_device_feature_mig_state *mig_state = (void *)feature->data;
> We can cast to the actual types rather than void* here.

Sure, will do.
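For reference, a self-contained sketch of the feature-buffer pattern with the casts to concrete types; the structs here are simplified stand-ins with the same flexible-array shape, not the real <linux/vfio.h> definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the uapi structs (illustrative only). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct feature_hdr { uint32_t argsz; uint32_t flags; uint8_t data[]; };
struct mig_state   { uint32_t device_state; int32_t data_fd; };

/* The uint64_t backing array keeps feature->data suitably aligned for
 * the payload struct that lives right after the 8-byte header. */
static uint64_t buf[DIV_ROUND_UP(sizeof(struct feature_hdr) +
                                 sizeof(struct mig_state),
                                 sizeof(uint64_t))];

static size_t payload_offset(void)
{
    struct feature_hdr *feature = (struct feature_hdr *)buf;
    struct mig_state *mig = (struct mig_state *)feature->data;

    return (size_t)((uint8_t *)mig - (uint8_t *)feature);
}
```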

>
>> +
>> +    feature->argsz = sizeof(buf);
>> +    feature->flags =
>> +        VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
>> +    mig_state->device_state = new_state;
>> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> +        /* Try to set the device in some good state */
>> +        error_report(
>> +            "%s: Failed setting device state to %s, err: %s. Setting device in recover state %s",
>> +                     vbasedev->name, mig_state_to_str(new_state),
>> +                     strerror(errno), mig_state_to_str(recover_state));
>> +
>> +        mig_state->device_state = recover_state;
>> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> +            hw_error("%s: Failed setting device in recover state, err: %s",
>> +                     vbasedev->name, strerror(errno));
>> +        }
>> +
>> +        migration->device_state = recover_state;
>> +
>> +        return -1;
> We could preserve -errno to return here.

Sure, will do.

>
>> +    }
>> +
>> +    if (mig_state->data_fd != -1) {
>> +        if (migration->data_fd != -1) {
>> +            /*
>> +             * This can happen if the device is asynchronously reset and
>> +             * terminates a data transfer.
>> +             */
>> +            error_report("%s: data_fd out of sync", vbasedev->name);
>> +            close(mig_state->data_fd);
>> +
>> +            return -1;
> Should we go to recover_state here?  Is migration->device_state
> invalid?  -EBADF?

Yes, we should.
Although the VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE ioctl above succeeded, 
setting the device state didn't *really* succeed, as the data_fd went 
out of sync.
So we should go to recover_state and return -EBADF.

I will change it.
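A minimal sketch of the reworked branch being agreed on here — accept_data_fd() and set_recover are illustrative stand-ins, not the patch's names:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* A stale data_fd means the state change didn't really take effect:
 * close the incoming fd, run the recovery path, and return -EBADF. */
static int accept_data_fd(int *cur_fd, int new_fd, void (*set_recover)(void))
{
    if (new_fd != -1) {
        if (*cur_fd != -1) {
            /* Can happen if an async reset terminated a transfer */
            close(new_fd);
            set_recover();
            return -EBADF;
        }
        *cur_fd = new_fd;
    }
    return 0;
}

static void noop_recover(void)
{
}

/* First fd is accepted; a second one while the first is still active is
 * out of sync.  Large fake fd numbers keep close() harmless here. */
static int demo(void)
{
    int fd = -1;

    if (accept_data_fd(&fd, 1000000, noop_recover) != 0 || fd != 1000000) {
        return 1;
    }
    return accept_data_fd(&fd, 1000001, noop_recover);
}
```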

>> +        }
>> +
>> +        migration->data_fd = mig_state->data_fd;
>> +    }
>> +    migration->device_state = new_state;
>> +
>> +    trace_vfio_migration_set_state(vbasedev->name, mig_state_to_str(new_state));
>> +
>> +    return 0;
>> +}
>> +
>>   static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>>                                     off_t off, bool iswrite)
>>   {
>> @@ -260,6 +336,20 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>>       return ret;
>>   }
>>
>> +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>> +                            uint64_t data_size)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
>> +    if (!ret) {
>> +        trace_vfio_load_state_device_data(vbasedev->name, data_size);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>                                  uint64_t data_size)
>>   {
>> @@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>
>> +static void vfio_migration_cleanup(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    close(migration->data_fd);
>> +    migration->data_fd = -1;
>> +}
>> +
>>   static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>>
>>   /* ---------------------------------------------------------------------- */
>>
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>>       return 0;
>>   }
>>
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_migration_cleanup(vbasedev);
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>>   static void vfio_v1_save_cleanup(void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>
>> +#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
> There's a comment below, but that gets deleted in a later patch while
> we still use this as a fallback size.

I will keep some variant of the comment in the later patch.

>    Some explanation of how this
> size is derived would be useful.  Is this an estimate for mlx5?  It
> seems much too small for a GPU.  For a fallback, should we set something
> here so large that we don't risk failing any SLA, ex. 100G?

In the KVM call we talked about setting this to 1G. mlx5 state size, 
when heavily utilized, should be on the order of hundreds of MBs.
So eventually we decided to go for 500MB.

However, I think you are right, we should set it to 100G to cover the 
worst case.
Anyway, this fallback value is used only if the kernel doesn't support 
the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which should not be the 
usual case AFAIU.

I will also add a comment explaining the size choice.
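As a sketch of what that documented fallback could look like — the macro name and the 100 GiB value follow the discussion above but are hypothetical, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Deliberately huge fallback pending size, so that a kernel without a
 * state-size query never lets the device slip under the downtime limit
 * prematurely. */
#define VFIO_MIG_PENDING_FALLBACK (100ULL * 1024 * 1024 * 1024)

static void save_pending_sketch(uint64_t *res_precopy, uint64_t *res_postcopy)
{
    (void)res_precopy;              /* v2 has no pre-copy phase yet */
    *res_postcopy += VFIO_MIG_PENDING_FALLBACK;
}

static uint64_t demo_pending(void)
{
    uint64_t pre = 0, post = 0;

    save_pending_sketch(&pre, &post);
    return post;
}
```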

>> +static void vfio_save_pending(void *opaque, uint64_t threshold_size,
>> +                              uint64_t *res_precopy, uint64_t *res_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    /*
>> +     * VFIO migration protocol v2 currently doesn't have an API to get pending
>> +     * device state size. Until such API is introduced, report some big
>> +     * arbitrary pending size so the device will be taken into account for
>> +     * downtime limit calculations.
>> +     */
>> +    *res_postcopy += VFIO_MIG_PENDING_SIZE;
>> +
>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
>> +}
>> +
>>   static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
>>                                    uint64_t *res_precopy, uint64_t *res_postcopy)
>>   {
>> @@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>       return 0;
>>   }
>>
>> +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
>> +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>> +{
>> +    ssize_t data_size;
>> +
>> +    data_size = read(migration->data_fd, migration->data_buffer,
>> +                     migration->data_buffer_size);
>> +    if (data_size < 0) {
>> +        return -1;
> Appears this could return -errno, granted it'll get swallowed in
> qemu_savevm_state_complete_precopy_iterable(), but it seems a bit
> cleaner here.

Sure, I will change it.

>> +    }
>> +    if (data_size == 0) {
>> +        return 1;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +    qemu_put_be64(f, data_size);
>> +    qemu_put_buffer(f, migration->data_buffer, data_size);
>> +    bytes_transferred += data_size;
>> +
>> +    trace_vfio_save_block(migration->vbasedev->name, data_size);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    enum vfio_device_mig_state recover_state;
>> +    int ret;
>> +
>> +    /* We reach here with device state STOP only */
>> +    recover_state = VFIO_DEVICE_STATE_STOP;
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> +                                   recover_state);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    do {
>> +        ret = vfio_save_block(f, vbasedev->migration);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    } while (!ret);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    recover_state = VFIO_DEVICE_STATE_ERROR;
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> +                                   recover_state);
>> +    if (!ret) {
>> +        trace_vfio_save_complete_precopy(vbasedev->name);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>       }
>>   }
>>
>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> +                                   vbasedev->migration->device_state);
>> +}
>> +
>>   static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>>       return ret;
>>   }
>>
>> +static int vfio_load_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_migration_cleanup(vbasedev);
>> +    trace_vfio_load_cleanup(vbasedev->name);
>> +
>> +    return 0;
>> +}
>> +
>>   static int vfio_v1_load_cleanup(void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>               uint64_t data_size = qemu_get_be64(f);
>>
>>               if (data_size) {
>> -                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
>> +                if (vbasedev->migration->v2) {
>> +                    ret = vfio_load_buffer(f, vbasedev, data_size);
>> +                } else {
>> +                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
>> +                }
>>                   if (ret < 0) {
>>                       return ret;
>>                   }
>> @@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>       return ret;
>>   }
>>
>> +static const SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>> +    .save_state = vfio_save_state,
>> +    .load_setup = vfio_load_setup,
>> +    .load_cleanup = vfio_load_cleanup,
>> +    .load_state = vfio_load_state,
>> +};
>> +
>>   static SaveVMHandlers savevm_vfio_v1_handlers = {
>>       .save_setup = vfio_v1_save_setup,
>>       .save_cleanup = vfio_v1_save_cleanup,
>> @@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
>>
>>   /* ---------------------------------------------------------------------- */
>>
>> +static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    enum vfio_device_mig_state new_state;
>> +    int ret;
>> +
>> +    if (running) {
>> +        new_state = VFIO_DEVICE_STATE_RUNNING;
>> +    } else {
>> +        new_state = VFIO_DEVICE_STATE_STOP;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, new_state,
>> +                                   VFIO_DEVICE_STATE_ERROR);
>> +    if (ret) {
>> +        /*
>> +         * Migration should be aborted in this case, but vm_state_notify()
>> +         * currently does not support reporting failures.
>> +         */
>> +        if (migrate_get_current()->to_dst_file) {
>> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>> +        }
>> +    }
>> +
>> +    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>> +                              mig_state_to_str(new_state));
>> +}
>> +
>>   static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>       case MIGRATION_STATUS_CANCELLED:
>>       case MIGRATION_STATUS_FAILED:
>>           bytes_transferred = 0;
>> -        ret = vfio_migration_v1_set_state(vbasedev,
>> -                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
>> -                                            VFIO_DEVICE_STATE_V1_RESUMING),
>> -                                          VFIO_DEVICE_STATE_V1_RUNNING);
>> -        if (ret) {
>> -            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +        if (migration->v2) {
>> +            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
>> +                                     VFIO_DEVICE_STATE_ERROR);
>> +        } else {
>> +            ret = vfio_migration_v1_set_state(vbasedev,
>> +                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
>> +                                                VFIO_DEVICE_STATE_V1_RESUMING),
>> +                                              VFIO_DEVICE_STATE_V1_RUNNING);
>> +            if (ret) {
>> +                error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +            }
>>           }
>>       }
>>   }
>> @@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>>
>> -    vfio_region_exit(&migration->region);
>> -    vfio_region_finalize(&migration->region);
>> +    if (migration->v2) {
>> +        g_free(migration->data_buffer);
>> +    } else {
>> +        vfio_region_exit(&migration->region);
>> +        vfio_region_finalize(&migration->region);
>> +    }
>>       g_free(vbasedev->migration);
>>       vbasedev->migration = NULL;
>>   }
>>
>> +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
>> +{
>> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>> +                                  sizeof(struct vfio_device_feature_migration),
>> +                              sizeof(uint64_t))] = {};
>> +    struct vfio_device_feature *feature = (void *)buf;
>> +    struct vfio_device_feature_migration *mig = (void *)feature->data;
>> +
>> +    feature->argsz = sizeof(buf);
>> +    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
>> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    *mig_flags = mig->flags;
>> +
>> +    return 0;
>> +}
>> +
>>   static int vfio_migration_init(VFIODevice *vbasedev)
>>   {
>>       int ret;
>> @@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       char id[256] = "";
>>       g_autofree char *path = NULL, *oid = NULL;
>>       struct vfio_region_info *info = NULL;
>> +    uint64_t mig_flags;
>>
>>       if (!vbasedev->ops->vfio_get_object) {
>>           return -EINVAL;
>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>           return -EINVAL;
>>       }
>>
>> -    ret = vfio_get_dev_region_info(vbasedev,
>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>> -                                   &info);
>> -    if (ret) {
>> -        return ret;
>> -    }
>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>> +    if (!ret) {
>> +        /* Migration v2 */
>> +        /* Basic migration functionality must be supported */
>> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>> +            return -EOPNOTSUPP;
>> +        }
>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
>> +        vbasedev->migration->data_buffer =
>> +            g_malloc0(vbasedev->migration->data_buffer_size);
> So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
> later addition of estimated device data size make any changes here?
> I'd think we'd want to scale the buffer to the minimum of the reported
> data size and some well documented heuristic for an upper bound.

As I wrote above, increasing this size to higher values (128M, 1G) 
didn't improve performance in our testing.
We can always change it later on if some other heuristics are proven to 
improve performance.

Thanks!

>> +        vbasedev->migration->data_fd = -1;
>> +        vbasedev->migration->v2 = true;
>> +    } else {
>> +        /* Migration v1 */
>> +        ret = vfio_get_dev_region_info(vbasedev,
>> +                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>> +                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>> +                                       &info);
>> +        if (ret) {
>> +            return ret;
>> +        }
>>
>> -    vbasedev->migration = g_new0(VFIOMigration, 1);
>> -    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
>> -    vbasedev->migration->vm_running = runstate_is_running();
>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>> +        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
>> +        vbasedev->migration->vm_running = runstate_is_running();
>>
>> -    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
>> -                            info->index, "migration");
>> -    if (ret) {
>> -        error_report("%s: Failed to setup VFIO migration region %d: %s",
>> -                     vbasedev->name, info->index, strerror(-ret));
>> -        goto err;
>> -    }
>> +        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
>> +                                info->index, "migration");
>> +        if (ret) {
>> +            error_report("%s: Failed to setup VFIO migration region %d: %s",
>> +                         vbasedev->name, info->index, strerror(-ret));
>> +            goto err;
>> +        }
>>
>> -    if (!vbasedev->migration->region.size) {
>> -        error_report("%s: Invalid zero-sized VFIO migration region %d",
>> -                     vbasedev->name, info->index);
>> -        ret = -EINVAL;
>> -        goto err;
>> -    }
>> +        if (!vbasedev->migration->region.size) {
>> +            error_report("%s: Invalid zero-sized VFIO migration region %d",
>> +                         vbasedev->name, info->index);
>> +            ret = -EINVAL;
>> +            goto err;
>> +        }
>>
>> -    g_free(info);
>> +        g_free(info);
>
> It would probably make sense to scope info within this branch, but it
> goes away in the next patch anyway, so this is fine.  Thanks,
>
> Alex
>
>> +    }
>>
>>       migration = vbasedev->migration;
>>       migration->vbasedev = vbasedev;
>> @@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       }
>>       strpadcpy(id, sizeof(id), path, '\0');
>>
>> -    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>> -                         &savevm_vfio_v1_handlers, vbasedev);
>> +    if (migration->v2) {
>> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>> +                             &savevm_vfio_handlers, vbasedev);
>> +
>> +        migration->vm_state = qdev_add_vm_change_state_handler(
>> +            vbasedev->dev, vfio_vmstate_change, vbasedev);
>> +    } else {
>> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>> +                             &savevm_vfio_v1_handlers, vbasedev);
>> +
>> +        migration->vm_state = qdev_add_vm_change_state_handler(
>> +            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>> +    }
>>
>> -    migration->vm_state = qdev_add_vm_change_state_handler(
>> -        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>>       migration->migration_state.notify = vfio_migration_state_notifier;
>>       add_migration_state_change_notifier(&migration->migration_state);
>>       return 0;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index d88d2b4053..9ef84e24b2 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
>>
>>   # migration.c
>>   vfio_migration_probe(const char *name) " (%s)"
>> +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
>>   vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
>> +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>>   vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>>   vfio_save_setup(const char *name) " (%s)"
>> @@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>>   vfio_load_device_config_state(const char *name) " (%s)"
>>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>   vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>> +vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
>>   vfio_load_cleanup(const char *name) " (%s)"
>>   vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
>>   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>> +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index bbaf72ba00..2ec3346fea 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
>>       int vm_running;
>>       Notifier migration_state;
>>       uint64_t pending_bytes;
>> +    enum vfio_device_mig_state device_state;
>> +    int data_fd;
>> +    void *data_buffer;
>> +    size_t data_buffer_size;
>> +    bool v2;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails
  2022-11-16 18:36   ` Alex Williamson
@ 2022-11-17 17:11     ` Avihai Horon
  2022-11-17 18:18       ` Alex Williamson
  0 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-17 17:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 16/11/2022 20:36, Alex Williamson wrote:
>
> On Thu, 3 Nov 2022 18:16:17 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> If vfio_migration_set_state() fails to set the device in the requested
>> state it tries to put it in a recover state. If setting the device in
>> the recover state fails as well, hw_error is triggered and the VM is
>> aborted.
>>
>> To improve user experience and avoid VM data loss, reset the device with
>> VFIO_RESET_DEVICE instead of aborting the VM.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/migration.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index f8c3228314..e8068b9147 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>>
>>           mig_state->device_state = recover_state;
>>           if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> -            hw_error("%s: Failed setting device in recover state, err: %s",
>> -                     vbasedev->name, strerror(errno));
>> +            error_report(
>> +                "%s: Failed setting device in recover state, err: %s. Resetting device",
>> +                         vbasedev->name, strerror(errno));
>> +
>> +            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
>> +                hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
>> +                         strerror(errno));
>> +            }
>> +
>> +            migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>> +
>> +            return -1;
>>           }
>>
>>           migration->device_state = recover_state;
> This addresses one of my comments on 12/ and should probably be rolled
> in there.

Not sure which comment you are referring to. Could you elaborate?

Thanks!

>    Thanks,
>
> Alex
>



* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-17 17:07     ` Avihai Horon
@ 2022-11-17 17:24       ` Jason Gunthorpe
  2022-11-20  8:46         ` Avihai Horon
  2022-11-17 17:38       ` Alex Williamson
  1 sibling, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2022-11-17 17:24 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Alex Williamson, qemu-devel, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

On Thu, Nov 17, 2022 at 07:07:10PM +0200, Avihai Horon wrote:
> > > +    }
> > > +
> > > +    if (mig_state->data_fd != -1) {
> > > +        if (migration->data_fd != -1) {
> > > +            /*
> > > +             * This can happen if the device is asynchronously reset and
> > > +             * terminates a data transfer.
> > > +             */
> > > +            error_report("%s: data_fd out of sync", vbasedev->name);
> > > +            close(mig_state->data_fd);
> > > +
> > > +            return -1;
> > Should we go to recover_state here?  Is migration->device_state
> > invalid?  -EBADF?
> 
> Yes, we should.
> Although VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE ioctl above succeeded, setting
> the device state didn't *really* succeed, as the data_fd went out of sync.
> So we should go to recover_state and return -EBADF.

The state did succeed and it is now "new_state". Getting an
unexpected data_fd means it did something like RUNNING->PRE_COPY_P2P
when the code was expecting PRE_COPY->PRE_COPY_P2P.

It is actually in PRE_COPY_P2P but the in-progress migration must be
stopped and the kernel would have made the migration->data_fd
permanently return some error when it went async to RUNNING.

The recovery is to restart the migration (of this device?) from the
start.

Jason



* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-17 17:07     ` Avihai Horon
  2022-11-17 17:24       ` Jason Gunthorpe
@ 2022-11-17 17:38       ` Alex Williamson
  2022-11-20  9:34         ` Avihai Horon
  1 sibling, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-17 17:38 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 17 Nov 2022 19:07:10 +0200
Avihai Horon <avihaih@nvidia.com> wrote:
> On 16/11/2022 20:29, Alex Williamson wrote:
> > On Thu, 3 Nov 2022 18:16:15 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index e784374453..62afc23a8c 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -44,8 +44,84 @@
> >>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >>   #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> >>
> >> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)  
> > Add comment explaining heuristic of this size.  
> 
> This is an arbitrary size we picked with mlx5 state size in mind.
> Increasing this size to higher values (128M, 1G) didn't improve 
> performance in our testing.
> 
> How about this comment:
> This is an initial value that doesn't consume much memory and provides 
> good performance.
> 
> Do you have other suggestion?

I'd lean more towards your description above, ex:

/*
 * This is an arbitrary size based on migration of mlx5 devices, where
 * the worst case total device migration size is on the order of 100s
 * of MB.  Testing with larger values, ex. 128MB and 1GB, did not show
 * a performance improvement.
 */

I think that provides sufficient information for someone who might come
later to have an understanding of the basis if they want to try to
optimize further.

> >> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> >>           return -EINVAL;
> >>       }
> >>
> >> -    ret = vfio_get_dev_region_info(vbasedev,
> >> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> >> -                                   &info);
> >> -    if (ret) {
> >> -        return ret;
> >> -    }
> >> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> >> +    if (!ret) {
> >> +        /* Migration v2 */
> >> +        /* Basic migration functionality must be supported */
> >> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> >> +            return -EOPNOTSUPP;
> >> +        }
> >> +        vbasedev->migration = g_new0(VFIOMigration, 1);
> >> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> >> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
> >> +        vbasedev->migration->data_buffer =
> >> +            g_malloc0(vbasedev->migration->data_buffer_size);  
> > So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
> > later addition of estimated device data size make any changes here?
> > I'd think we'd want to scale the buffer to the minimum of the reported
> > data size and some well documented heuristic for an upper bound.  
> 
> As I wrote above, increasing this size to higher values (128M, 1G) 
> didn't improve performance in our testing.
> We can always change it later on if some other heuristics are proven to 
> improve performance.

Note that I'm asking about a minimum buffer size, for example if
hisi_acc reports only 10s of KB for an estimated device size, why would
we still allocate VFIO_MIG_DATA_BUFFER_SIZE here?  Thanks,

Alex




* Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails
  2022-11-17 17:11     ` Avihai Horon
@ 2022-11-17 18:18       ` Alex Williamson
  2022-11-20  9:39         ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-17 18:18 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 17 Nov 2022 19:11:47 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 16/11/2022 20:36, Alex Williamson wrote:
> >
> > On Thu, 3 Nov 2022 18:16:17 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> If vfio_migration_set_state() fails to set the device in the requested
> >> state it tries to put it in a recover state. If setting the device in
> >> the recover state fails as well, hw_error is triggered and the VM is
> >> aborted.
> >>
> >> To improve user experience and avoid VM data loss, reset the device with
> >> VFIO_RESET_DEVICE instead of aborting the VM.
> >>
> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c | 14 ++++++++++++--
> >>   1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index f8c3228314..e8068b9147 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
> >>
> >>           mig_state->device_state = recover_state;
> >>           if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> >> -            hw_error("%s: Failed setting device in recover state, err: %s",
> >> -                     vbasedev->name, strerror(errno));
> >> +            error_report(
> >> +                "%s: Failed setting device in recover state, err: %s. Resetting device",
> >> +                         vbasedev->name, strerror(errno));
> >> +
> >> +            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
> >> +                hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
> >> +                         strerror(errno));
> >> +            }
> >> +
> >> +            migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> >> +
> >> +            return -1;
> >>           }
> >>
> >>           migration->device_state = recover_state;  
> > This addresses one of my comments on 12/ and should probably be rolled
> > in there.  
> 
> Not sure which comment you are referring to. Could you elaborate?

Hmm, I guess I thought this was in the section immediately following
where I questioned going to recovery state.  I'm still not sure why
this is a separate patch from the initial implementation of the
function in 12/ though.  Thanks,
Alex




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-17 17:24       ` Jason Gunthorpe
@ 2022-11-20  8:46         ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-20  8:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, qemu-devel, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins


On 17/11/2022 19:24, Jason Gunthorpe wrote:
> On Thu, Nov 17, 2022 at 07:07:10PM +0200, Avihai Horon wrote:
>>>> +    }
>>>> +
>>>> +    if (mig_state->data_fd != -1) {
>>>> +        if (migration->data_fd != -1) {
>>>> +            /*
>>>> +             * This can happen if the device is asynchronously reset and
>>>> +             * terminates a data transfer.
>>>> +             */
>>>> +            error_report("%s: data_fd out of sync", vbasedev->name);
>>>> +            close(mig_state->data_fd);
>>>> +
>>>> +            return -1;
>>> Should we go to recover_state here?  Is migration->device_state
>>> invalid?  -EBADF?
>> Yes, we should.
>> Although VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE ioctl above succeeded, setting
>> the device state didn't *really* succeed, as the data_fd went out of sync.
>> So we should go to recover_state and return -EBADF.
> The state did succeed and it is now "new_state". Getting an
> unexpected data_fd means it did something like RUNNING->PRE_COPY_P2P
> when the code was expecting PRE_COPY->PRE_COPY_P2P.
>
> It is actually in PRE_COPY_P2P but the in-progress migration must be
> stopped and the kernel would have made the migration->data_fd
> permanently return some error when it went async to RUNNING.
>
> The recovery is to restart the migration (of this device?) from the
> start.

Yes, and restart is what's happening here - the -EBADF that we return 
will cause the migration to be aborted.
I didn't mean that we should go to recover_state *instead* of returning 
an error.

But thinking about it again, I think you are right - recover_state should be 
used only if we can't set the device in new_state (i.e., there is some 
error in device functionality).
In the "data_fd out of sync" case, we did set the device in new_state 
(no error in device functionality), but data_fd got mixed up, so we 
should just abort migration and restart it again.

So bottom line, I think we should just return -EBADF here to abort 
migration.




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-17 17:38       ` Alex Williamson
@ 2022-11-20  9:34         ` Avihai Horon
  2022-11-24 12:41           ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-20  9:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 17/11/2022 19:38, Alex Williamson wrote:
> On Thu, 17 Nov 2022 19:07:10 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>> On 16/11/2022 20:29, Alex Williamson wrote:
>>> On Thu, 3 Nov 2022 18:16:15 +0200
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index e784374453..62afc23a8c 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -44,8 +44,84 @@
>>>>    #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>>>>    #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>>>>
>>>> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)
>>> Add comment explaining heuristic of this size.
>> This is an arbitrary size we picked with mlx5 state size in mind.
>> Increasing this size to higher values (128M, 1G) didn't improve
>> performance in our testing.
>>
>> How about this comment:
>> This is an initial value that doesn't consume much memory and provides
>> good performance.
>>
>> Do you have other suggestion?
> I'd lean more towards your description above, ex:
>
> /*
>   * This is an arbitrary size based on migration of mlx5 devices, where
>   * the worst case total device migration size is on the order of 100s
>   * of MB.  Testing with larger values, ex. 128MB and 1GB, did not show
>   * a performance improvement.
>   */
>
> I think that provides sufficient information for someone who might come
> later to have an understanding of the basis if they want to try to
> optimize further.

OK, sounds good, I will add a comment like this.

>>>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>>            return -EINVAL;
>>>>        }
>>>>
>>>> -    ret = vfio_get_dev_region_info(vbasedev,
>>>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>>>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>>>> -                                   &info);
>>>> -    if (ret) {
>>>> -        return ret;
>>>> -    }
>>>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>>>> +    if (!ret) {
>>>> +        /* Migration v2 */
>>>> +        /* Basic migration functionality must be supported */
>>>> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>>>> +            return -EOPNOTSUPP;
>>>> +        }
>>>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>>>> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
>>>> +        vbasedev->migration->data_buffer =
>>>> +            g_malloc0(vbasedev->migration->data_buffer_size);
>>> So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
>>> later addition of estimated device data size make any changes here?
>>> I'd think we'd want to scale the buffer to the minimum of the reported
>>> data size and some well documented heuristic for an upper bound.
>> As I wrote above, increasing this size to higher values (128M, 1G)
>> didn't improve performance in our testing.
>> We can always change it later on if some other heuristics are proven to
>> improve performance.
> Note that I'm asking about a minimum buffer size, for example if
> hisi_acc reports only 10s of KB for an estimated device size, why would
> we still allocate VFIO_MIG_DATA_BUFFER_SIZE here?  Thanks,

This buffer is rather small and has a small memory footprint.
Do you think it is worth the extra complexity of resizing the buffer?




* Re: [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails
  2022-11-17 18:18       ` Alex Williamson
@ 2022-11-20  9:39         ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-20  9:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 17/11/2022 20:18, Alex Williamson wrote:
> On Thu, 17 Nov 2022 19:11:47 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> On 16/11/2022 20:36, Alex Williamson wrote:
>>> On Thu, 3 Nov 2022 18:16:17 +0200
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>
>>>> If vfio_migration_set_state() fails to set the device in the requested
>>>> state it tries to put it in a recover state. If setting the device in
>>>> the recover state fails as well, hw_error is triggered and the VM is
>>>> aborted.
>>>>
>>>> To improve user experience and avoid VM data loss, reset the device with
>>>> VFIO_RESET_DEVICE instead of aborting the VM.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> ---
>>>>    hw/vfio/migration.c | 14 ++++++++++++--
>>>>    1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index f8c3228314..e8068b9147 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -92,8 +92,18 @@ static int vfio_migration_set_state(VFIODevice *vbasedev,
>>>>
>>>>            mig_state->device_state = recover_state;
>>>>            if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>>>> -            hw_error("%s: Failed setting device in recover state, err: %s",
>>>> -                     vbasedev->name, strerror(errno));
>>>> +            error_report(
>>>> +                "%s: Failed setting device in recover state, err: %s. Resetting device",
>>>> +                         vbasedev->name, strerror(errno));
>>>> +
>>>> +            if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
>>>> +                hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
>>>> +                         strerror(errno));
>>>> +            }
>>>> +
>>>> +            migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>> +
>>>> +            return -1;
>>>>            }
>>>>
>>>>            migration->device_state = recover_state;
>>> This addresses one of my comments on 12/ and should probably be rolled
>>> in there.
>> Not sure which comment you are referring to. Could you elaborate?
> Hmm, I guess I thought this was in the section immediately following
> where I questioned going to recovery state.  I'm still not sure why
> this is a separate patch from the initial implementation of the
> function in 12/ though.

This adds new functionality compared to v1, so I thought it should be
in its own patch.

I can squash it to patch 12 if you want.




* Re: [PATCH v3 01/17] migration: Remove res_compatible parameter
  2022-11-10 13:36     ` Avihai Horon
@ 2022-11-21  7:20       ` Avihai Horon
  2022-11-23 18:23       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-21  7:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, Juan Quintela
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Cornelia Huck, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng,
	Eric Blake, John Snow, qemu-s390x, qemu-block, Kunkun Jiang,
	Zhang, Chen, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

Ping.

On 10/11/2022 15:36, Avihai Horon wrote:
>
> On 08/11/2022 19:52, Vladimir Sementsov-Ogievskiy wrote:
>> On 11/3/22 19:16, Avihai Horon wrote:
>>> From: Juan Quintela <quintela@redhat.com>
>>>
>>> It was only used for RAM, and in that case, it means that this amount
>>> of data was sent for memory.
>>
>> Not clear for me, what means "this amount of data was sent for 
>> memory"... That amount of data was not yet sent, actually.
>>
> Yes, this should be changed to something like:
>
> "It was only used for RAM, and in that case, it means that this amount
> of data still needs to be sent for memory, and can be sent in any phase
> of migration. The same functionality can be achieved without 
> res_compatible,
> so just delete the field in all callers and change the definition of 
> res_postcopy accordingly.".
>>> Just delete the field in all callers.
>>>
>>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>>> ---
>>>   hw/s390x/s390-stattrib.c       |  6 ++----
>>>   hw/vfio/migration.c            | 10 ++++------
>>>   hw/vfio/trace-events           |  2 +-
>>>   include/migration/register.h   | 20 ++++++++++----------
>>>   migration/block-dirty-bitmap.c |  7 +++----
>>>   migration/block.c              |  7 +++----
>>>   migration/migration.c          |  9 ++++-----
>>>   migration/ram.c                |  8 +++-----
>>>   migration/savevm.c             | 14 +++++---------
>>>   migration/savevm.h             |  4 +---
>>>   migration/trace-events         |  2 +-
>>>   11 files changed, 37 insertions(+), 52 deletions(-)
>>>
>>
>> [..]
>>
>>> diff --git a/include/migration/register.h 
>>> b/include/migration/register.h
>>> index c1dcff0f90..1950fee6a8 100644
>>> --- a/include/migration/register.h
>>> +++ b/include/migration/register.h
>>> @@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
>>>       int (*save_setup)(QEMUFile *f, void *opaque);
>>>       void (*save_live_pending)(QEMUFile *f, void *opaque,
>>>                                 uint64_t threshold_size,
>>> -                              uint64_t *res_precopy_only,
>>> -                              uint64_t *res_compatible,
>>> -                              uint64_t *res_postcopy_only);
>>> +                              uint64_t *rest_precopy,
>>> +                              uint64_t *rest_postcopy);
>>>       /* Note for save_live_pending:
>>> -     * - res_precopy_only is for data which must be migrated in 
>>> precopy phase
>>> -     *     or in stopped state, in other words - before target vm 
>>> start
>>> -     * - res_compatible is for data which may be migrated in any phase
>>> -     * - res_postcopy_only is for data which must be migrated in 
>>> postcopy phase
>>> -     *     or in stopped state, in other words - after source vm stop
>>> +     * - res_precopy is for data which must be migrated in precopy
>>> +     *     phase or in stopped state, in other words - before target
>>> +     *     vm start
>>> +     * - res_postcopy is for data which must be migrated in postcopy
>>> +     *     phase or in stopped state, in other words - after source vm
>>> +     *     stop
>>>        *
>>> -     * Sum of res_postcopy_only, res_compatible and 
>>> res_postcopy_only is the
>>> -     * whole amount of pending data.
>>> +     * Sum of res_precopy and res_postcopy is the whole amount of
>>> +     * pending data.
>>>        */
>>>
>>>
>>
>> [..]
>>
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index dc1de9ddbc..20167e1102 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void 
>>> *opaque)
>>>   }
>>>
>>>   static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t 
>>> max_size,
>>> -                             uint64_t *res_precopy_only,
>>> -                             uint64_t *res_compatible,
>>> -                             uint64_t *res_postcopy_only)
>>> +                             uint64_t *res_precopy, uint64_t 
>>> *res_postcopy)
>>>   {
>>>       RAMState **temp = opaque;
>>>       RAMState *rs = *temp;
>>> @@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void 
>>> *opaque, uint64_t max_size,
>>>
>>>       if (migrate_postcopy_ram()) {
>>>           /* We can do postcopy, and all the data is postcopiable */
>>> -        *res_compatible += remaining_size;
>>> +        *res_postcopy += remaining_size;
>>
>> That's seems to be not quite correct.
>>
>> res_postcopy is defined as "data which must be migrated in postcopy", 
>> but that's not true here, as RAM can be migrated both in precopy and 
>> postcopy.
>>
>> Still we really can include "compat" into "postcopy" just because in 
>> the logic of migration_iteration_run() we don't actually distinguish 
>> "compat" and "post". The logic only depends on "total" and "pre".
>>
>> So, if we want to combine "compat" into "post", we should redefine 
>> "post" in the comment in include/migration/register.h, something like 
>> this:
>>
>> - res_precopy is for data which MUST be migrated in precopy
>>   phase or in stopped state, in other words - before target
>>   vm start
>>
>> - res_postcopy is for all data except for declared in res_precopy.
>>   res_postcopy data CAN be migrated in postcopy, i.e. after target
>>   vm start.
>>
>>
> You are right, the definition of res_postcopy should be changed.
>
> Yet, I am not sure this patch really makes things clearer or simpler.
> Juan, what do you think?
>
> Thanks!
>>>       } else {
>>> -        *res_precopy_only += remaining_size;
>>> +        *res_precopy += remaining_size;
>>>       }
>>>   }
>>>
>>
>>
>> -- 
>> Best regards,
>> Vladimir
>>



* Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong
  2022-11-10 13:38     ` Avihai Horon
@ 2022-11-21  7:21       ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-21  7:21 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel, Juan Quintela
  Cc: Alex Williamson, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Cornelia Huck, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng,
	Eric Blake, John Snow, qemu-s390x, qemu-block, Kunkun Jiang,
	Zhang, Chen, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

Ping.

On 10/11/2022 15:38, Avihai Horon wrote:
>
> On 08/11/2022 20:36, Vladimir Sementsov-Ogievskiy wrote:
>> On 11/3/22 19:16, Avihai Horon wrote:
>>> From: Juan Quintela <quintela@redhat.com>
>>>
>>> And it appears that what is wrong is the code. During bulk stage we
>>> need to make sure that some block is dirty, but no games with
>>> max_size at all.
>>
>> :) That made me interested in, why we need this one block, so I 
>> decided to search through the history.
>>
>> And what I see? Haha, that was my commit 04636dc410b163c 
>> "migration/block: fix pending() return value" [1], which you actually 
>> revert with this patch.
>>
>> So, at least we should note, that it's a revert of [1].
>>
>> Still that this will reintroduce the bug fixed by [1].
>>
>> As I understand the problem is (was) that in block_save_complete() we 
>> finalize only dirty blocks, but don't finalize the bulk phase if it's 
>> not finalized yet. So, we can fix block_save_complete() to finalize 
>> the bulk phase, instead of hacking with pending in [1].
>>
>> Interesting, why we need this one block, described in the comment you 
>> refer to? Was it an incomplete workaround for the same problem, 
>> described in [1]? If so, we can fix block_save_complete() and remove 
>> this if() together with the comment from block_save_pending().
>>
> I am not familiar with block migration.
> I can drop this patch in the next version.
>
> Juan/Stefan, could you help here?
>
>>>
>>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   migration/block.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/migration/block.c b/migration/block.c
>>> index b3d680af75..39ce4003c6 100644
>>> --- a/migration/block.c
>>> +++ b/migration/block.c
>>> @@ -879,8 +879,8 @@ static void block_save_pending(void *opaque, 
>>> uint64_t max_size,
>>>       blk_mig_unlock();
>>>
>>>       /* Report at least one block pending during bulk phase */
>>> -    if (pending <= max_size && !block_mig_state.bulk_completed) {
>>> -        pending = max_size + BLK_MIG_BLOCK_SIZE;
>>> +    if (!pending && !block_mig_state.bulk_completed) {
>>> +        pending = BLK_MIG_BLOCK_SIZE;
>>>       }
>>>
>>>       trace_migration_block_save_pending(pending);
>>
>> -- 
>> Best regards,
>> Vladimir
>>



* Re: [PATCH v3 01/17] migration: Remove res_compatible parameter
  2022-11-10 13:36     ` Avihai Horon
  2022-11-21  7:20       ` Avihai Horon
@ 2022-11-23 18:23       ` Dr. David Alan Gilbert
  2022-11-24 12:19         ` Avihai Horon
  1 sibling, 1 reply; 59+ messages in thread
From: Dr. David Alan Gilbert @ 2022-11-23 18:23 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, Alex Williamson,
	Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow,
	qemu-s390x, qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

* Avihai Horon (avihaih@nvidia.com) wrote:
> 
> On 08/11/2022 19:52, Vladimir Sementsov-Ogievskiy wrote:
> > On 11/3/22 19:16, Avihai Horon wrote:
> > > From: Juan Quintela <quintela@redhat.com>
> > > 
> > > It was only used for RAM, and in that case, it means that this amount
> > > of data was sent for memory.
> > 
> > Not clear for me, what means "this amount of data was sent for
> > memory"... That amount of data was not yet sent, actually.
> > 
> Yes, this should be changed to something like:
> 
> "It was only used for RAM, and in that case, it means that this amount
> of data still needs to be sent for memory, and can be sent in any phase
> of migration. The same functionality can be achieved without res_compatible,
> so just delete the field in all callers and change the definition of
> res_postcopy accordingly.".

Sorry, I recently sent a similar comment in reply to Juan's original
post.
If I understand correctly though, the dirty bitmap code relies on
'postcopy' here being data that is sent only during postcopy.

Dave

> > > Just delete the field in all callers.
> > > 
> > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > ---
> > >   hw/s390x/s390-stattrib.c       |  6 ++----
> > >   hw/vfio/migration.c            | 10 ++++------
> > >   hw/vfio/trace-events           |  2 +-
> > >   include/migration/register.h   | 20 ++++++++++----------
> > >   migration/block-dirty-bitmap.c |  7 +++----
> > >   migration/block.c              |  7 +++----
> > >   migration/migration.c          |  9 ++++-----
> > >   migration/ram.c                |  8 +++-----
> > >   migration/savevm.c             | 14 +++++---------
> > >   migration/savevm.h             |  4 +---
> > >   migration/trace-events         |  2 +-
> > >   11 files changed, 37 insertions(+), 52 deletions(-)
> > > 
> > 
> > [..]
> > 
> > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > index c1dcff0f90..1950fee6a8 100644
> > > --- a/include/migration/register.h
> > > +++ b/include/migration/register.h
> > > @@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
> > >       int (*save_setup)(QEMUFile *f, void *opaque);
> > >       void (*save_live_pending)(QEMUFile *f, void *opaque,
> > >                                 uint64_t threshold_size,
> > > -                              uint64_t *res_precopy_only,
> > > -                              uint64_t *res_compatible,
> > > -                              uint64_t *res_postcopy_only);
> > > +                              uint64_t *rest_precopy,
> > > +                              uint64_t *rest_postcopy);
> > >       /* Note for save_live_pending:
> > > -     * - res_precopy_only is for data which must be migrated in
> > > precopy phase
> > > -     *     or in stopped state, in other words - before target vm start
> > > -     * - res_compatible is for data which may be migrated in any phase
> > > -     * - res_postcopy_only is for data which must be migrated in
> > > postcopy phase
> > > -     *     or in stopped state, in other words - after source vm stop
> > > +     * - res_precopy is for data which must be migrated in precopy
> > > +     *     phase or in stopped state, in other words - before target
> > > +     *     vm start
> > > +     * - res_postcopy is for data which must be migrated in postcopy
> > > +     *     phase or in stopped state, in other words - after source vm
> > > +     *     stop
> > >        *
> > > -     * Sum of res_postcopy_only, res_compatible and
> > > res_postcopy_only is the
> > > -     * whole amount of pending data.
> > > +     * Sum of res_precopy and res_postcopy is the whole amount of
> > > +     * pending data.
> > >        */
> > > 
> > > 
> > 
> > [..]
> > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index dc1de9ddbc..20167e1102 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void
> > > *opaque)
> > >   }
> > > 
> > >   static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t
> > > max_size,
> > > -                             uint64_t *res_precopy_only,
> > > -                             uint64_t *res_compatible,
> > > -                             uint64_t *res_postcopy_only)
> > > +                             uint64_t *res_precopy, uint64_t
> > > *res_postcopy)
> > >   {
> > >       RAMState **temp = opaque;
> > >       RAMState *rs = *temp;
> > > @@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void
> > > *opaque, uint64_t max_size,
> > > 
> > >       if (migrate_postcopy_ram()) {
> > >           /* We can do postcopy, and all the data is postcopiable */
> > > -        *res_compatible += remaining_size;
> > > +        *res_postcopy += remaining_size;
> > 
> > That's seems to be not quite correct.
> > 
> > res_postcopy is defined as "data which must be migrated in postcopy",
> > but that's not true here, as RAM can be migrated both in precopy and
> > postcopy.
> > 
> > Still we really can include "compat" into "postcopy" just because in the
> > logic of migration_iteration_run() we don't actually distinguish
> > "compat" and "post". The logic only depends on "total" and "pre".
> > 
> > So, if we want to combine "compat" into "post", we should redefine
> > "post" in the comment in include/migration/register.h, something like
> > this:
> > 
> > - res_precopy is for data which MUST be migrated in precopy
> >   phase or in stopped state, in other words - before target
> >   vm start
> > 
> > - res_postcopy is for all data except for declared in res_precopy.
> >   res_postcopy data CAN be migrated in postcopy, i.e. after target
> >   vm start.
> > 
> > 
> You are right, the definition of res_postcopy should be changed.
> 
> Yet, I am not sure this patch really makes things clearer or simpler.
> Juan, what do you think?
> 
> Thanks!
> > >       } else {
> > > -        *res_precopy_only += remaining_size;
> > > +        *res_precopy += remaining_size;
> > >       }
> > >   }
> > > 
> > 
> > 
> > -- 
> > Best regards,
> > Vladimir
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-03 16:16 ` [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
  2022-11-16 18:29   ` Alex Williamson
@ 2022-11-23 18:59   ` Dr. David Alan Gilbert
  2022-11-24 12:25     ` Avihai Horon
  1 sibling, 1 reply; 59+ messages in thread
From: Dr. David Alan Gilbert @ 2022-11-23 18:59 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela, Michael S. Tsirkin,
	Cornelia Huck, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng,
	Eric Blake, Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

* Avihai Horon (avihaih@nvidia.com) wrote:

<snip>

> +    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
> +    if (!ret) {
> +        trace_vfio_load_state_device_data(vbasedev->name, data_size);
> +
> +    }

I notice you had a few cases like that; I wouldn't bother making that
conditional - just add 'ret' to the trace parameters; that way if it
fails then you can see that in the trace, and it's simpler anyway.

Dave

> +
> +    return ret;
> +}
> +
>  static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>                                 uint64_t data_size)
>  {
> @@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static void vfio_migration_cleanup(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    close(migration->data_fd);
> +    migration->data_fd = -1;
> +}
> +
>  static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>  
>  /* ---------------------------------------------------------------------- */
>  
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_cleanup(vbasedev);
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
>  static void vfio_v1_save_cleanup(void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
> +static void vfio_save_pending(void *opaque, uint64_t threshold_size,
> +                              uint64_t *res_precopy, uint64_t *res_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    /*
> +     * VFIO migration protocol v2 currently doesn't have an API to get pending
> +     * device state size. Until such API is introduced, report some big
> +     * arbitrary pending size so the device will be taken into account for
> +     * downtime limit calculations.
> +     */
> +    *res_postcopy += VFIO_MIG_PENDING_SIZE;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
> +}
> +
>  static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
>                                   uint64_t *res_precopy, uint64_t *res_postcopy)
>  {
> @@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>      return 0;
>  }
>  
> +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
> +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> +{
> +    ssize_t data_size;
> +
> +    data_size = read(migration->data_fd, migration->data_buffer,
> +                     migration->data_buffer_size);
> +    if (data_size < 0) {
> +        return -1;
> +    }
> +    if (data_size == 0) {
> +        return 1;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +    qemu_put_be64(f, data_size);
> +    qemu_put_buffer(f, migration->data_buffer, data_size);
> +    bytes_transferred += data_size;
> +
> +    trace_vfio_save_block(migration->vbasedev->name, data_size);
> +
> +    return qemu_file_get_error(f);
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    enum vfio_device_mig_state recover_state;
> +    int ret;
> +
> +    /* We reach here with device state STOP only */
> +    recover_state = VFIO_DEVICE_STATE_STOP;
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> +                                   recover_state);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    do {
> +        ret = vfio_save_block(f, vbasedev->migration);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    } while (!ret);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    recover_state = VFIO_DEVICE_STATE_ERROR;
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> +                                   recover_state);
> +    if (!ret) {
> +        trace_vfio_save_complete_precopy(vbasedev->name);
> +    }
> +
> +    return ret;
> +}
> +
>  static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>      }
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> +                                   vbasedev->migration->device_state);
> +}
> +
>  static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_migration_cleanup(vbasedev);
> +    trace_vfio_load_cleanup(vbasedev->name);
> +
> +    return 0;
> +}
> +
>  static int vfio_v1_load_cleanup(void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>              uint64_t data_size = qemu_get_be64(f);
>  
>              if (data_size) {
> -                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
> +                if (vbasedev->migration->v2) {
> +                    ret = vfio_load_buffer(f, vbasedev, data_size);
> +                } else {
> +                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
> +                }
>                  if (ret < 0) {
>                      return ret;
>                  }
> @@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>      return ret;
>  }
>  
> +static const SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_state = vfio_save_state,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
> +};
> +
>  static SaveVMHandlers savevm_vfio_v1_handlers = {
>      .save_setup = vfio_v1_save_setup,
>      .save_cleanup = vfio_v1_save_cleanup,
> @@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
>  
>  /* ---------------------------------------------------------------------- */
>  
> +static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    enum vfio_device_mig_state new_state;
> +    int ret;
> +
> +    if (running) {
> +        new_state = VFIO_DEVICE_STATE_RUNNING;
> +    } else {
> +        new_state = VFIO_DEVICE_STATE_STOP;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, new_state,
> +                                   VFIO_DEVICE_STATE_ERROR);
> +    if (ret) {
> +        /*
> +         * Migration should be aborted in this case, but vm_state_notify()
> +         * currently does not support reporting failures.
> +         */
> +        if (migrate_get_current()->to_dst_file) {
> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> +        }
> +    }
> +
> +    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                              mig_state_to_str(new_state));
> +}
> +
>  static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_FAILED:
>          bytes_transferred = 0;
> -        ret = vfio_migration_v1_set_state(vbasedev,
> -                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
> -                                            VFIO_DEVICE_STATE_V1_RESUMING),
> -                                          VFIO_DEVICE_STATE_V1_RUNNING);
> -        if (ret) {
> -            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        if (migration->v2) {
> +            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
> +                                     VFIO_DEVICE_STATE_ERROR);
> +        } else {
> +            ret = vfio_migration_v1_set_state(vbasedev,
> +                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
> +                                                VFIO_DEVICE_STATE_V1_RESUMING),
> +                                              VFIO_DEVICE_STATE_V1_RUNNING);
> +            if (ret) {
> +                error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +            }
>          }
>      }
>  }
> @@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
>  
> -    vfio_region_exit(&migration->region);
> -    vfio_region_finalize(&migration->region);
> +    if (migration->v2) {
> +        g_free(migration->data_buffer);
> +    } else {
> +        vfio_region_exit(&migration->region);
> +        vfio_region_finalize(&migration->region);
> +    }
>      g_free(vbasedev->migration);
>      vbasedev->migration = NULL;
>  }
>  
> +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
> +{
> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> +                                  sizeof(struct vfio_device_feature_migration),
> +                              sizeof(uint64_t))] = {};
> +    struct vfio_device_feature *feature = (void *)buf;
> +    struct vfio_device_feature_migration *mig = (void *)feature->data;
> +
> +    feature->argsz = sizeof(buf);
> +    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +        return -EOPNOTSUPP;
> +    }
> +
> +    *mig_flags = mig->flags;
> +
> +    return 0;
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev)
>  {
>      int ret;
> @@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>      char id[256] = "";
>      g_autofree char *path = NULL, *oid = NULL;
>      struct vfio_region_info *info = NULL;
> +    uint64_t mig_flags;
>  
>      if (!vbasedev->ops->vfio_get_object) {
>          return -EINVAL;
> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>          return -EINVAL;
>      }
>  
> -    ret = vfio_get_dev_region_info(vbasedev,
> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> -                                   &info);
> -    if (ret) {
> -        return ret;
> -    }
> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> +    if (!ret) {
> +        /* Migration v2 */
> +        /* Basic migration functionality must be supported */
> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> +            return -EOPNOTSUPP;
> +        }
> +        vbasedev->migration = g_new0(VFIOMigration, 1);
> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
> +        vbasedev->migration->data_buffer =
> +            g_malloc0(vbasedev->migration->data_buffer_size);
> +        vbasedev->migration->data_fd = -1;
> +        vbasedev->migration->v2 = true;
> +    } else {
> +        /* Migration v1 */
> +        ret = vfio_get_dev_region_info(vbasedev,
> +                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> +                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> +                                       &info);
> +        if (ret) {
> +            return ret;
> +        }
>  
> -    vbasedev->migration = g_new0(VFIOMigration, 1);
> -    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
> -    vbasedev->migration->vm_running = runstate_is_running();
> +        vbasedev->migration = g_new0(VFIOMigration, 1);
> +        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
> +        vbasedev->migration->vm_running = runstate_is_running();
>  
> -    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
> -                            info->index, "migration");
> -    if (ret) {
> -        error_report("%s: Failed to setup VFIO migration region %d: %s",
> -                     vbasedev->name, info->index, strerror(-ret));
> -        goto err;
> -    }
> +        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
> +                                info->index, "migration");
> +        if (ret) {
> +            error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                         vbasedev->name, info->index, strerror(-ret));
> +            goto err;
> +        }
>  
> -    if (!vbasedev->migration->region.size) {
> -        error_report("%s: Invalid zero-sized VFIO migration region %d",
> -                     vbasedev->name, info->index);
> -        ret = -EINVAL;
> -        goto err;
> -    }
> +        if (!vbasedev->migration->region.size) {
> +            error_report("%s: Invalid zero-sized VFIO migration region %d",
> +                         vbasedev->name, info->index);
> +            ret = -EINVAL;
> +            goto err;
> +        }
>  
> -    g_free(info);
> +        g_free(info);
> +    }
>  
>      migration = vbasedev->migration;
>      migration->vbasedev = vbasedev;
> @@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>      }
>      strpadcpy(id, sizeof(id), path, '\0');
>  
> -    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> -                         &savevm_vfio_v1_handlers, vbasedev);
> +    if (migration->v2) {
> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> +                             &savevm_vfio_handlers, vbasedev);
> +
> +        migration->vm_state = qdev_add_vm_change_state_handler(
> +            vbasedev->dev, vfio_vmstate_change, vbasedev);
> +    } else {
> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> +                             &savevm_vfio_v1_handlers, vbasedev);
> +
> +        migration->vm_state = qdev_add_vm_change_state_handler(
> +            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
> +    }
>  
> -    migration->vm_state = qdev_add_vm_change_state_handler(
> -        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>      migration->migration_state.notify = vfio_migration_state_notifier;
>      add_migration_state_change_notifier(&migration->migration_state);
>      return 0;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index d88d2b4053..9ef84e24b2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name) " (%s)"
> +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
>  vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>  vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>  vfio_save_setup(const char *name) " (%s)"
> @@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>  vfio_load_device_config_state(const char *name) " (%s)"
>  vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
>  vfio_load_cleanup(const char *name) " (%s)"
>  vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
>  vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
> +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bbaf72ba00..2ec3346fea 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
>      int vm_running;
>      Notifier migration_state;
>      uint64_t pending_bytes;
> +    enum vfio_device_mig_state device_state;
> +    int data_fd;
> +    void *data_buffer;
> +    size_t data_buffer_size;
> +    bool v2;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> -- 
> 2.21.3
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 01/17] migration: Remove res_compatible parameter
  2022-11-23 18:23       ` Dr. David Alan Gilbert
@ 2022-11-24 12:19         ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-24 12:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, Alex Williamson,
	Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake, John Snow,
	qemu-s390x, qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/11/2022 20:23, Dr. David Alan Gilbert wrote:
> External email: Use caution opening links or attachments
>
>
> * Avihai Horon (avihaih@nvidia.com) wrote:
>> On 08/11/2022 19:52, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>>
>>> On 11/3/22 19:16, Avihai Horon wrote:
>>>> From: Juan Quintela <quintela@redhat.com>
>>>>
>>>> It was only used for RAM, and in that case, it means that this amount
>>>> of data was sent for memory.
>>> It's not clear to me what "this amount of data was sent for
>>> memory" means... That amount of data was not actually sent yet.
>>>
>> Yes, this should be changed to something like:
>>
>> "It was only used for RAM, and in that case, it means that this amount
>> of data still needs to be sent for memory, and can be sent in any phase
>> of migration. The same functionality can be achieved without res_compatible,
>> so just delete the field in all callers and change the definition of
>> res_postcopy accordingly.".
> Sorry, I recently sent a similar comment in reply to Juan's original
> post.
> If I understand correctly though, the dirty bitmap code relies on
> 'postcopy' here meaning data that is sent only during postcopy.

Looks like this patch requires some further discussion.
Since it's not mandatory and I don't want it to block this series, I
can drop it; some variant of it can be added later on.

Thanks all for the effort!

> Dave
>
>>>> Just delete the field in all callers.
>>>>
>>>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>>>> ---
>>>>    hw/s390x/s390-stattrib.c       |  6 ++----
>>>>    hw/vfio/migration.c            | 10 ++++------
>>>>    hw/vfio/trace-events           |  2 +-
>>>>    include/migration/register.h   | 20 ++++++++++----------
>>>>    migration/block-dirty-bitmap.c |  7 +++----
>>>>    migration/block.c              |  7 +++----
>>>>    migration/migration.c          |  9 ++++-----
>>>>    migration/ram.c                |  8 +++-----
>>>>    migration/savevm.c             | 14 +++++---------
>>>>    migration/savevm.h             |  4 +---
>>>>    migration/trace-events         |  2 +-
>>>>    11 files changed, 37 insertions(+), 52 deletions(-)
>>>>
>>> [..]
>>>
>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>> index c1dcff0f90..1950fee6a8 100644
>>>> --- a/include/migration/register.h
>>>> +++ b/include/migration/register.h
>>>> @@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
>>>>        int (*save_setup)(QEMUFile *f, void *opaque);
>>>>        void (*save_live_pending)(QEMUFile *f, void *opaque,
>>>>                                  uint64_t threshold_size,
>>>> -                              uint64_t *res_precopy_only,
>>>> -                              uint64_t *res_compatible,
>>>> -                              uint64_t *res_postcopy_only);
>>>> +                              uint64_t *rest_precopy,
>>>> +                              uint64_t *rest_postcopy);
>>>>        /* Note for save_live_pending:
>>>> -     * - res_precopy_only is for data which must be migrated in
>>>> precopy phase
>>>> -     *     or in stopped state, in other words - before target vm start
>>>> -     * - res_compatible is for data which may be migrated in any phase
>>>> -     * - res_postcopy_only is for data which must be migrated in
>>>> postcopy phase
>>>> -     *     or in stopped state, in other words - after source vm stop
>>>> +     * - res_precopy is for data which must be migrated in precopy
>>>> +     *     phase or in stopped state, in other words - before target
>>>> +     *     vm start
>>>> +     * - res_postcopy is for data which must be migrated in postcopy
>>>> +     *     phase or in stopped state, in other words - after source vm
>>>> +     *     stop
>>>>         *
>>>> -     * Sum of res_postcopy_only, res_compatible and
>>>> res_postcopy_only is the
>>>> -     * whole amount of pending data.
>>>> +     * Sum of res_precopy and res_postcopy is the whole amount of
>>>> +     * pending data.
>>>>         */
>>>>
>>>>
>>> [..]
>>>
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index dc1de9ddbc..20167e1102 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void
>>>> *opaque)
>>>>    }
>>>>
>>>>    static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t
>>>> max_size,
>>>> -                             uint64_t *res_precopy_only,
>>>> -                             uint64_t *res_compatible,
>>>> -                             uint64_t *res_postcopy_only)
>>>> +                             uint64_t *res_precopy, uint64_t
>>>> *res_postcopy)
>>>>    {
>>>>        RAMState **temp = opaque;
>>>>        RAMState *rs = *temp;
>>>> @@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void
>>>> *opaque, uint64_t max_size,
>>>>
>>>>        if (migrate_postcopy_ram()) {
>>>>            /* We can do postcopy, and all the data is postcopiable */
>>>> -        *res_compatible += remaining_size;
>>>> +        *res_postcopy += remaining_size;
>>> That seems not quite correct.
>>>
>>> res_postcopy is defined as "data which must be migrated in postcopy",
>>> but that's not true here, as RAM can be migrated both in precopy and
>>> postcopy.
>>>
>>> Still, we really can fold "compat" into "post", because the logic of
>>> migration_iteration_run() doesn't actually distinguish "compat" from
>>> "post". The logic only depends on "total" and "pre".
>>>
>>> So, if we want to combine "compat" into "post", we should redefine
>>> "post" in the comment in include/migration/register.h, something like
>>> this:
>>>
>>> - res_precopy is for data which MUST be migrated in precopy
>>>    phase or in stopped state, in other words - before target
>>>    vm start
>>>
>>> - res_postcopy is for all data except for declared in res_precopy.
>>>    res_postcopy data CAN be migrated in postcopy, i.e. after target
>>>    vm start.
>>>
>>>
>> You are right, the definition of res_postcopy should be changed.
>>
>> Yet, I am not sure this patch really makes things clearer or simpler.
>> Juan, what do you think?
>>
>> Thanks!
>>>>        } else {
>>>> -        *res_precopy_only += remaining_size;
>>>> +        *res_precopy += remaining_size;
>>>>        }
>>>>    }
>>>>
>>>
>>> --
>>> Best regards,
>>> Vladimir
>>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-23 18:59   ` Dr. David Alan Gilbert
@ 2022-11-24 12:25     ` Avihai Horon
  2022-11-24 13:28       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-24 12:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Alex Williamson, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela, Michael S. Tsirkin,
	Cornelia Huck, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng,
	Eric Blake, Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/11/2022 20:59, Dr. David Alan Gilbert wrote:
>
>
> * Avihai Horon (avihaih@nvidia.com) wrote:
>
> <snip>
>
>> +    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
>> +    if (!ret) {
>> +        trace_vfio_load_state_device_data(vbasedev->name, data_size);
>> +
>> +    }
> I notice you had a few cases like that; I wouldn't bother making that
> conditional - just add 'ret' to the trace parameters; that way if it
> fails then you can see that in the trace, and it's simpler anyway.

If we add ret to traces such as this, shouldn’t we add ret to the other 
traces as well, to keep a consistent trace format?
In that case, is it worth the trouble?

Alternatively, we can print the traces regardless of success or failure 
of the function to better reflect the flow of execution.
WDYT?

Thanks.

> Dave
>
>> +
>> +    return ret;
>> +}
>> +
>>   static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>                                  uint64_t data_size)
>>   {
>> @@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>
>> +static void vfio_migration_cleanup(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    close(migration->data_fd);
>> +    migration->data_fd = -1;
>> +}
>> +
>>   static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>>
>>   /* ---------------------------------------------------------------------- */
>>
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>>       return 0;
>>   }
>>
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_migration_cleanup(vbasedev);
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>>   static void vfio_v1_save_cleanup(void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>
>> +#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
>> +static void vfio_save_pending(void *opaque, uint64_t threshold_size,
>> +                              uint64_t *res_precopy, uint64_t *res_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    /*
>> +     * VFIO migration protocol v2 currently doesn't have an API to get pending
>> +     * device state size. Until such API is introduced, report some big
>> +     * arbitrary pending size so the device will be taken into account for
>> +     * downtime limit calculations.
>> +     */
>> +    *res_postcopy += VFIO_MIG_PENDING_SIZE;
>> +
>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
>> +}
>> +
>>   static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
>>                                    uint64_t *res_precopy, uint64_t *res_postcopy)
>>   {
>> @@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>       return 0;
>>   }
>>
>> +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
>> +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>> +{
>> +    ssize_t data_size;
>> +
>> +    data_size = read(migration->data_fd, migration->data_buffer,
>> +                     migration->data_buffer_size);
>> +    if (data_size < 0) {
>> +        return -1;
>> +    }
>> +    if (data_size == 0) {
>> +        return 1;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +    qemu_put_be64(f, data_size);
>> +    qemu_put_buffer(f, migration->data_buffer, data_size);
>> +    bytes_transferred += data_size;
>> +
>> +    trace_vfio_save_block(migration->vbasedev->name, data_size);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    enum vfio_device_mig_state recover_state;
>> +    int ret;
>> +
>> +    /* We reach here with device state STOP only */
>> +    recover_state = VFIO_DEVICE_STATE_STOP;
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>> +                                   recover_state);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    do {
>> +        ret = vfio_save_block(f, vbasedev->migration);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    } while (!ret);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    recover_state = VFIO_DEVICE_STATE_ERROR;
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> +                                   recover_state);
>> +    if (!ret) {
>> +        trace_vfio_save_complete_precopy(vbasedev->name);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>       }
>>   }
>>
>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>> +                                   vbasedev->migration->device_state);
>> +}
>> +
>>   static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>>       return ret;
>>   }
>>
>> +static int vfio_load_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_migration_cleanup(vbasedev);
>> +    trace_vfio_load_cleanup(vbasedev->name);
>> +
>> +    return 0;
>> +}
>> +
>>   static int vfio_v1_load_cleanup(void *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>               uint64_t data_size = qemu_get_be64(f);
>>
>>               if (data_size) {
>> -                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
>> +                if (vbasedev->migration->v2) {
>> +                    ret = vfio_load_buffer(f, vbasedev, data_size);
>> +                } else {
>> +                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
>> +                }
>>                   if (ret < 0) {
>>                       return ret;
>>                   }
>> @@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>       return ret;
>>   }
>>
>> +static const SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>> +    .save_state = vfio_save_state,
>> +    .load_setup = vfio_load_setup,
>> +    .load_cleanup = vfio_load_cleanup,
>> +    .load_state = vfio_load_state,
>> +};
>> +
>>   static SaveVMHandlers savevm_vfio_v1_handlers = {
>>       .save_setup = vfio_v1_save_setup,
>>       .save_cleanup = vfio_v1_save_cleanup,
>> @@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
>>
>>   /* ---------------------------------------------------------------------- */
>>
>> +static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    enum vfio_device_mig_state new_state;
>> +    int ret;
>> +
>> +    if (running) {
>> +        new_state = VFIO_DEVICE_STATE_RUNNING;
>> +    } else {
>> +        new_state = VFIO_DEVICE_STATE_STOP;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, new_state,
>> +                                   VFIO_DEVICE_STATE_ERROR);
>> +    if (ret) {
>> +        /*
>> +         * Migration should be aborted in this case, but vm_state_notify()
>> +         * currently does not support reporting failures.
>> +         */
>> +        if (migrate_get_current()->to_dst_file) {
>> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>> +        }
>> +    }
>> +
>> +    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>> +                              mig_state_to_str(new_state));
>> +}
>> +
>>   static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>       case MIGRATION_STATUS_CANCELLED:
>>       case MIGRATION_STATUS_FAILED:
>>           bytes_transferred = 0;
>> -        ret = vfio_migration_v1_set_state(vbasedev,
>> -                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
>> -                                            VFIO_DEVICE_STATE_V1_RESUMING),
>> -                                          VFIO_DEVICE_STATE_V1_RUNNING);
>> -        if (ret) {
>> -            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +        if (migration->v2) {
>> +            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
>> +                                     VFIO_DEVICE_STATE_ERROR);
>> +        } else {
>> +            ret = vfio_migration_v1_set_state(vbasedev,
>> +                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
>> +                                                VFIO_DEVICE_STATE_V1_RESUMING),
>> +                                              VFIO_DEVICE_STATE_V1_RUNNING);
>> +            if (ret) {
>> +                error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +            }
>>           }
>>       }
>>   }
>> @@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>>
>> -    vfio_region_exit(&migration->region);
>> -    vfio_region_finalize(&migration->region);
>> +    if (migration->v2) {
>> +        g_free(migration->data_buffer);
>> +    } else {
>> +        vfio_region_exit(&migration->region);
>> +        vfio_region_finalize(&migration->region);
>> +    }
>>       g_free(vbasedev->migration);
>>       vbasedev->migration = NULL;
>>   }
>>
>> +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
>> +{
>> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>> +                                  sizeof(struct vfio_device_feature_migration),
>> +                              sizeof(uint64_t))] = {};
>> +    struct vfio_device_feature *feature = (void *)buf;
>> +    struct vfio_device_feature_migration *mig = (void *)feature->data;
>> +
>> +    feature->argsz = sizeof(buf);
>> +    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
>> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    *mig_flags = mig->flags;
>> +
>> +    return 0;
>> +}
>> +
>>   static int vfio_migration_init(VFIODevice *vbasedev)
>>   {
>>       int ret;
>> @@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       char id[256] = "";
>>       g_autofree char *path = NULL, *oid = NULL;
>>       struct vfio_region_info *info = NULL;
>> +    uint64_t mig_flags;
>>
>>       if (!vbasedev->ops->vfio_get_object) {
>>           return -EINVAL;
>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>           return -EINVAL;
>>       }
>>
>> -    ret = vfio_get_dev_region_info(vbasedev,
>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>> -                                   &info);
>> -    if (ret) {
>> -        return ret;
>> -    }
>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>> +    if (!ret) {
>> +        /* Migration v2 */
>> +        /* Basic migration functionality must be supported */
>> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>> +            return -EOPNOTSUPP;
>> +        }
>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
>> +        vbasedev->migration->data_buffer =
>> +            g_malloc0(vbasedev->migration->data_buffer_size);
>> +        vbasedev->migration->data_fd = -1;
>> +        vbasedev->migration->v2 = true;
>> +    } else {
>> +        /* Migration v1 */
>> +        ret = vfio_get_dev_region_info(vbasedev,
>> +                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>> +                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>> +                                       &info);
>> +        if (ret) {
>> +            return ret;
>> +        }
>>
>> -    vbasedev->migration = g_new0(VFIOMigration, 1);
>> -    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
>> -    vbasedev->migration->vm_running = runstate_is_running();
>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>> +        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
>> +        vbasedev->migration->vm_running = runstate_is_running();
>>
>> -    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
>> -                            info->index, "migration");
>> -    if (ret) {
>> -        error_report("%s: Failed to setup VFIO migration region %d: %s",
>> -                     vbasedev->name, info->index, strerror(-ret));
>> -        goto err;
>> -    }
>> +        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
>> +                                info->index, "migration");
>> +        if (ret) {
>> +            error_report("%s: Failed to setup VFIO migration region %d: %s",
>> +                         vbasedev->name, info->index, strerror(-ret));
>> +            goto err;
>> +        }
>>
>> -    if (!vbasedev->migration->region.size) {
>> -        error_report("%s: Invalid zero-sized VFIO migration region %d",
>> -                     vbasedev->name, info->index);
>> -        ret = -EINVAL;
>> -        goto err;
>> -    }
>> +        if (!vbasedev->migration->region.size) {
>> +            error_report("%s: Invalid zero-sized VFIO migration region %d",
>> +                         vbasedev->name, info->index);
>> +            ret = -EINVAL;
>> +            goto err;
>> +        }
>>
>> -    g_free(info);
>> +        g_free(info);
>> +    }
>>
>>       migration = vbasedev->migration;
>>       migration->vbasedev = vbasedev;
>> @@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       }
>>       strpadcpy(id, sizeof(id), path, '\0');
>>
>> -    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>> -                         &savevm_vfio_v1_handlers, vbasedev);
>> +    if (migration->v2) {
>> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>> +                             &savevm_vfio_handlers, vbasedev);
>> +
>> +        migration->vm_state = qdev_add_vm_change_state_handler(
>> +            vbasedev->dev, vfio_vmstate_change, vbasedev);
>> +    } else {
>> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>> +                             &savevm_vfio_v1_handlers, vbasedev);
>> +
>> +        migration->vm_state = qdev_add_vm_change_state_handler(
>> +            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>> +    }
>>
>> -    migration->vm_state = qdev_add_vm_change_state_handler(
>> -        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>>       migration->migration_state.notify = vfio_migration_state_notifier;
>>       add_migration_state_change_notifier(&migration->migration_state);
>>       return 0;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index d88d2b4053..9ef84e24b2 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
>>
>>   # migration.c
>>   vfio_migration_probe(const char *name) " (%s)"
>> +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
>>   vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
>> +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>>   vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>>   vfio_save_setup(const char *name) " (%s)"
>> @@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>>   vfio_load_device_config_state(const char *name) " (%s)"
>>   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>   vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>> +vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
>>   vfio_load_cleanup(const char *name) " (%s)"
>>   vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
>>   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>> +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index bbaf72ba00..2ec3346fea 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
>>       int vm_running;
>>       Notifier migration_state;
>>       uint64_t pending_bytes;
>> +    enum vfio_device_mig_state device_state;
>> +    int data_fd;
>> +    void *data_buffer;
>> +    size_t data_buffer_size;
>> +    bool v2;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> --
>> 2.21.3
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-20  9:34         ` Avihai Horon
@ 2022-11-24 12:41           ` Avihai Horon
  2022-11-28 18:50             ` Alex Williamson
  0 siblings, 1 reply; 59+ messages in thread
From: Avihai Horon @ 2022-11-24 12:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 20/11/2022 11:34, Avihai Horon wrote:
>
> On 17/11/2022 19:38, Alex Williamson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On Thu, 17 Nov 2022 19:07:10 +0200
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>> On 16/11/2022 20:29, Alex Williamson wrote:
>>>> On Thu, 3 Nov 2022 18:16:15 +0200
>>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>> index e784374453..62afc23a8c 100644
>>>>> --- a/hw/vfio/migration.c
>>>>> +++ b/hw/vfio/migration.c
>>>>> @@ -44,8 +44,84 @@
>>>>>    #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>>>>>    #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>>>>>
>>>>> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)
>>>> Add comment explaining heuristic of this size.
>>> This is an arbitrary size we picked with mlx5 state size in mind.
>>> Increasing this size to higher values (128M, 1G) didn't improve
>>> performance in our testing.
>>>
>>> How about this comment:
>>> This is an initial value that doesn't consume much memory and provides
>>> good performance.
>>>
>>> Do you have other suggestion?
>> I'd lean more towards your description above, ex:
>>
>> /*
>>   * This is an arbitrary size based on migration of mlx5 devices, where
>>   * the worst case total device migration size is on the order of 100s
>>   * of MB.  Testing with larger values, ex. 128MB and 1GB, did not show
>>   * a performance improvement.
>>   */
>>
>> I think that provides sufficient information for someone who might come
>> later to have an understanding of the basis if they want to try to
>> optimize further.
>
> OK, sounds good, I will add a comment like this.
>
>>>>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>>>            return -EINVAL;
>>>>>        }
>>>>>
>>>>> -    ret = vfio_get_dev_region_info(vbasedev,
>>>>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>>>>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>>>>> -                                   &info);
>>>>> -    if (ret) {
>>>>> -        return ret;
>>>>> -    }
>>>>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>>>>> +    if (!ret) {
>>>>> +        /* Migration v2 */
>>>>> +        /* Basic migration functionality must be supported */
>>>>> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>>>>> +            return -EOPNOTSUPP;
>>>>> +        }
>>>>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>>>>> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>>> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
>>>>> +        vbasedev->migration->data_buffer =
>>>>> +            g_malloc0(vbasedev->migration->data_buffer_size);
>>>> So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
>>>> later addition of estimated device data size make any changes here?
>>>> I'd think we'd want to scale the buffer to the minimum of the reported
>>>> data size and some well documented heuristic for an upper bound.
>>> As I wrote above, increasing this size to higher values (128M, 1G)
>>> didn't improve performance in our testing.
>>> We can always change it later on if some other heuristics are proven to
>>> improve performance.
>> Note that I'm asking about a minimum buffer size, for example if
>> hisi_acc reports only 10s of KB for an estimated device size, why would
>> we still allocate VFIO_MIG_DATA_BUFFER_SIZE here?  Thanks,
>
> This buffer is rather small and has a negligible memory footprint.
> Do you think it is worth the extra complexity of resizing the buffer?
>
Alex, WDYT?
Note that the reported estimated size is dynamic and might change from 
one query to the next, potentially leaving us with a smaller buffer than 
needed.

Also, as part of v4 I moved this allocation to vfio_save_setup(), so the 
buffer is allocated only during migration (when it's actually used) and 
only on the src side.

Thanks.




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-24 12:25     ` Avihai Horon
@ 2022-11-24 13:28       ` Dr. David Alan Gilbert
  2022-11-24 14:07         ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Dr. David Alan Gilbert @ 2022-11-24 13:28 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela, Michael S. Tsirkin,
	Cornelia Huck, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng,
	Eric Blake, Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins

* Avihai Horon (avihaih@nvidia.com) wrote:
> 
> On 23/11/2022 20:59, Dr. David Alan Gilbert wrote:
> > 
> > 
> > * Avihai Horon (avihaih@nvidia.com) wrote:
> > 
> > <snip>
> > 
> > > +    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
> > > +    if (!ret) {
> > > +        trace_vfio_load_state_device_data(vbasedev->name, data_size);
> > > +
> > > +    }
> > I notice you had a few cases like that; I wouldn't bother making that
> > conditional - just add 'ret' to the trace parameters; that way if it
> > fails then you can see that in the trace, and it's simpler anyway.
> 
> If we add ret to traces such as this, shouldn’t we add ret to the other
> traces as well, to keep a consistent trace format?
> In that case, is it worth the trouble?

Traces are for humans; it's nice to keep the same format, but not
required.
> 
> Alternatively, we can print the traces regardless of success or failure of
> the function to better reflect the flow of execution.
> WDYT?

I'd just add ret to the ones you're changing.
The important thing about traces is that they've got what you need to
debug things!

Dave

> 
> Thanks.
> 
> > Dave
> > 
> > > +
> > > +    return ret;
> > > +}
> > > +
> > >   static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> > >                                  uint64_t data_size)
> > >   {
> > > @@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> > >       return qemu_file_get_error(f);
> > >   }
> > > 
> > > +static void vfio_migration_cleanup(VFIODevice *vbasedev)
> > > +{
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +
> > > +    close(migration->data_fd);
> > > +    migration->data_fd = -1;
> > > +}
> > > +
> > >   static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
> > >   {
> > >       VFIOMigration *migration = vbasedev->migration;
> > > @@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
> > > 
> > >   /* ---------------------------------------------------------------------- */
> > > 
> > > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    trace_vfio_save_setup(vbasedev->name);
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > +
> > > +    return qemu_file_get_error(f);
> > > +}
> > > +
> > >   static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
> > >   {
> > >       VFIODevice *vbasedev = opaque;
> > > @@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
> > >       return 0;
> > >   }
> > > 
> > > +static void vfio_save_cleanup(void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    vfio_migration_cleanup(vbasedev);
> > > +    trace_vfio_save_cleanup(vbasedev->name);
> > > +}
> > > +
> > >   static void vfio_v1_save_cleanup(void *opaque)
> > >   {
> > >       VFIODevice *vbasedev = opaque;
> > > @@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
> > >       trace_vfio_save_cleanup(vbasedev->name);
> > >   }
> > > 
> > > +#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
> > > +static void vfio_save_pending(void *opaque, uint64_t threshold_size,
> > > +                              uint64_t *res_precopy, uint64_t *res_postcopy)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    /*
> > > +     * VFIO migration protocol v2 currently doesn't have an API to get pending
> > > +     * device state size. Until such API is introduced, report some big
> > > +     * arbitrary pending size so the device will be taken into account for
> > > +     * downtime limit calculations.
> > > +     */
> > > +    *res_postcopy += VFIO_MIG_PENDING_SIZE;
> > > +
> > > +    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
> > > +}
> > > +
> > >   static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
> > >                                    uint64_t *res_precopy, uint64_t *res_postcopy)
> > >   {
> > > @@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> > >       return 0;
> > >   }
> > > 
> > > +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
> > > +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> > > +{
> > > +    ssize_t data_size;
> > > +
> > > +    data_size = read(migration->data_fd, migration->data_buffer,
> > > +                     migration->data_buffer_size);
> > > +    if (data_size < 0) {
> > > +        return -1;
> > > +    }
> > > +    if (data_size == 0) {
> > > +        return 1;
> > > +    }
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> > > +    qemu_put_be64(f, data_size);
> > > +    qemu_put_buffer(f, migration->data_buffer, data_size);
> > > +    bytes_transferred += data_size;
> > > +
> > > +    trace_vfio_save_block(migration->vbasedev->name, data_size);
> > > +
> > > +    return qemu_file_get_error(f);
> > > +}
> > > +
> > > +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +    enum vfio_device_mig_state recover_state;
> > > +    int ret;
> > > +
> > > +    /* We reach here with device state STOP only */
> > > +    recover_state = VFIO_DEVICE_STATE_STOP;
> > > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> > > +                                   recover_state);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    do {
> > > +        ret = vfio_save_block(f, vbasedev->migration);
> > > +        if (ret < 0) {
> > > +            return ret;
> > > +        }
> > > +    } while (!ret);
> > > +
> > > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > +    ret = qemu_file_get_error(f);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    recover_state = VFIO_DEVICE_STATE_ERROR;
> > > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> > > +                                   recover_state);
> > > +    if (!ret) {
> > > +        trace_vfio_save_complete_precopy(vbasedev->name);
> > > +    }
> > > +
> > > +    return ret;
> > > +}
> > > +
> > >   static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
> > >   {
> > >       VFIODevice *vbasedev = opaque;
> > > @@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
> > >       }
> > >   }
> > > 
> > > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
> > > +                                   vbasedev->migration->device_state);
> > > +}
> > > +
> > >   static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
> > >   {
> > >       VFIODevice *vbasedev = opaque;
> > > @@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
> > >       return ret;
> > >   }
> > > 
> > > +static int vfio_load_cleanup(void *opaque)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    vfio_migration_cleanup(vbasedev);
> > > +    trace_vfio_load_cleanup(vbasedev->name);
> > > +
> > > +    return 0;
> > > +}
> > > +
> > >   static int vfio_v1_load_cleanup(void *opaque)
> > >   {
> > >       VFIODevice *vbasedev = opaque;
> > > @@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > >               uint64_t data_size = qemu_get_be64(f);
> > > 
> > >               if (data_size) {
> > > -                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
> > > +                if (vbasedev->migration->v2) {
> > > +                    ret = vfio_load_buffer(f, vbasedev, data_size);
> > > +                } else {
> > > +                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
> > > +                }
> > >                   if (ret < 0) {
> > >                       return ret;
> > >                   }
> > > @@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > >       return ret;
> > >   }
> > > 
> > > +static const SaveVMHandlers savevm_vfio_handlers = {
> > > +    .save_setup = vfio_save_setup,
> > > +    .save_cleanup = vfio_save_cleanup,
> > > +    .save_live_pending = vfio_save_pending,
> > > +    .save_live_complete_precopy = vfio_save_complete_precopy,
> > > +    .save_state = vfio_save_state,
> > > +    .load_setup = vfio_load_setup,
> > > +    .load_cleanup = vfio_load_cleanup,
> > > +    .load_state = vfio_load_state,
> > > +};
> > > +
> > >   static SaveVMHandlers savevm_vfio_v1_handlers = {
> > >       .save_setup = vfio_v1_save_setup,
> > >       .save_cleanup = vfio_v1_save_cleanup,
> > > @@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
> > > 
> > >   /* ---------------------------------------------------------------------- */
> > > 
> > > +static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +    enum vfio_device_mig_state new_state;
> > > +    int ret;
> > > +
> > > +    if (running) {
> > > +        new_state = VFIO_DEVICE_STATE_RUNNING;
> > > +    } else {
> > > +        new_state = VFIO_DEVICE_STATE_STOP;
> > > +    }
> > > +
> > > +    ret = vfio_migration_set_state(vbasedev, new_state,
> > > +                                   VFIO_DEVICE_STATE_ERROR);
> > > +    if (ret) {
> > > +        /*
> > > +         * Migration should be aborted in this case, but vm_state_notify()
> > > +         * currently does not support reporting failures.
> > > +         */
> > > +        if (migrate_get_current()->to_dst_file) {
> > > +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> > > +        }
> > > +    }
> > > +
> > > +    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> > > +                              mig_state_to_str(new_state));
> > > +}
> > > +
> > >   static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
> > >   {
> > >       VFIODevice *vbasedev = opaque;
> > > @@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> > >       case MIGRATION_STATUS_CANCELLED:
> > >       case MIGRATION_STATUS_FAILED:
> > >           bytes_transferred = 0;
> > > -        ret = vfio_migration_v1_set_state(vbasedev,
> > > -                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
> > > -                                            VFIO_DEVICE_STATE_V1_RESUMING),
> > > -                                          VFIO_DEVICE_STATE_V1_RUNNING);
> > > -        if (ret) {
> > > -            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> > > +        if (migration->v2) {
> > > +            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
> > > +                                     VFIO_DEVICE_STATE_ERROR);
> > > +        } else {
> > > +            ret = vfio_migration_v1_set_state(vbasedev,
> > > +                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
> > > +                                                VFIO_DEVICE_STATE_V1_RESUMING),
> > > +                                              VFIO_DEVICE_STATE_V1_RUNNING);
> > > +            if (ret) {
> > > +                error_report("%s: Failed to set state RUNNING", vbasedev->name);
> > > +            }
> > >           }
> > >       }
> > >   }
> > > @@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
> > >   {
> > >       VFIOMigration *migration = vbasedev->migration;
> > > 
> > > -    vfio_region_exit(&migration->region);
> > > -    vfio_region_finalize(&migration->region);
> > > +    if (migration->v2) {
> > > +        g_free(migration->data_buffer);
> > > +    } else {
> > > +        vfio_region_exit(&migration->region);
> > > +        vfio_region_finalize(&migration->region);
> > > +    }
> > >       g_free(vbasedev->migration);
> > >       vbasedev->migration = NULL;
> > >   }
> > > 
> > > +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
> > > +{
> > > +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> > > +                                  sizeof(struct vfio_device_feature_migration),
> > > +                              sizeof(uint64_t))] = {};
> > > +    struct vfio_device_feature *feature = (void *)buf;
> > > +    struct vfio_device_feature_migration *mig = (void *)feature->data;
> > > +
> > > +    feature->argsz = sizeof(buf);
> > > +    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
> > > +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> > > +        return -EOPNOTSUPP;
> > > +    }
> > > +
> > > +    *mig_flags = mig->flags;
> > > +
> > > +    return 0;
> > > +}
> > > +
> > >   static int vfio_migration_init(VFIODevice *vbasedev)
> > >   {
> > >       int ret;
> > > @@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> > >       char id[256] = "";
> > >       g_autofree char *path = NULL, *oid = NULL;
> > >       struct vfio_region_info *info = NULL;
> > > +    uint64_t mig_flags;
> > > 
> > >       if (!vbasedev->ops->vfio_get_object) {
> > >           return -EINVAL;
> > > @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> > >           return -EINVAL;
> > >       }
> > > 
> > > -    ret = vfio_get_dev_region_info(vbasedev,
> > > -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> > > -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> > > -                                   &info);
> > > -    if (ret) {
> > > -        return ret;
> > > -    }
> > > +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> > > +    if (!ret) {
> > > +        /* Migration v2 */
> > > +        /* Basic migration functionality must be supported */
> > > +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> > > +            return -EOPNOTSUPP;
> > > +        }
> > > +        vbasedev->migration = g_new0(VFIOMigration, 1);
> > > +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> > > +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
> > > +        vbasedev->migration->data_buffer =
> > > +            g_malloc0(vbasedev->migration->data_buffer_size);
> > > +        vbasedev->migration->data_fd = -1;
> > > +        vbasedev->migration->v2 = true;
> > > +    } else {
> > > +        /* Migration v1 */
> > > +        ret = vfio_get_dev_region_info(vbasedev,
> > > +                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> > > +                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> > > +                                       &info);
> > > +        if (ret) {
> > > +            return ret;
> > > +        }
> > > 
> > > -    vbasedev->migration = g_new0(VFIOMigration, 1);
> > > -    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
> > > -    vbasedev->migration->vm_running = runstate_is_running();
> > > +        vbasedev->migration = g_new0(VFIOMigration, 1);
> > > +        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
> > > +        vbasedev->migration->vm_running = runstate_is_running();
> > > 
> > > -    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
> > > -                            info->index, "migration");
> > > -    if (ret) {
> > > -        error_report("%s: Failed to setup VFIO migration region %d: %s",
> > > -                     vbasedev->name, info->index, strerror(-ret));
> > > -        goto err;
> > > -    }
> > > +        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
> > > +                                info->index, "migration");
> > > +        if (ret) {
> > > +            error_report("%s: Failed to setup VFIO migration region %d: %s",
> > > +                         vbasedev->name, info->index, strerror(-ret));
> > > +            goto err;
> > > +        }
> > > 
> > > -    if (!vbasedev->migration->region.size) {
> > > -        error_report("%s: Invalid zero-sized VFIO migration region %d",
> > > -                     vbasedev->name, info->index);
> > > -        ret = -EINVAL;
> > > -        goto err;
> > > -    }
> > > +        if (!vbasedev->migration->region.size) {
> > > +            error_report("%s: Invalid zero-sized VFIO migration region %d",
> > > +                         vbasedev->name, info->index);
> > > +            ret = -EINVAL;
> > > +            goto err;
> > > +        }
> > > 
> > > -    g_free(info);
> > > +        g_free(info);
> > > +    }
> > > 
> > >       migration = vbasedev->migration;
> > >       migration->vbasedev = vbasedev;
> > > @@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> > >       }
> > >       strpadcpy(id, sizeof(id), path, '\0');
> > > 
> > > -    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> > > -                         &savevm_vfio_v1_handlers, vbasedev);
> > > +    if (migration->v2) {
> > > +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> > > +                             &savevm_vfio_handlers, vbasedev);
> > > +
> > > +        migration->vm_state = qdev_add_vm_change_state_handler(
> > > +            vbasedev->dev, vfio_vmstate_change, vbasedev);
> > > +    } else {
> > > +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
> > > +                             &savevm_vfio_v1_handlers, vbasedev);
> > > +
> > > +        migration->vm_state = qdev_add_vm_change_state_handler(
> > > +            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
> > > +    }
> > > 
> > > -    migration->vm_state = qdev_add_vm_change_state_handler(
> > > -        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
> > >       migration->migration_state.notify = vfio_migration_state_notifier;
> > >       add_migration_state_change_notifier(&migration->migration_state);
> > >       return 0;
> > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > > index d88d2b4053..9ef84e24b2 100644
> > > --- a/hw/vfio/trace-events
> > > +++ b/hw/vfio/trace-events
> > > @@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
> > > 
> > >   # migration.c
> > >   vfio_migration_probe(const char *name) " (%s)"
> > > +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
> > >   vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
> > > +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> > >   vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> > >   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> > >   vfio_save_setup(const char *name) " (%s)"
> > > @@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
> > >   vfio_load_device_config_state(const char *name) " (%s)"
> > >   vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> > >   vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> > > +vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
> > >   vfio_load_cleanup(const char *name) " (%s)"
> > >   vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
> > >   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
> > > +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index bbaf72ba00..2ec3346fea 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
> > >       int vm_running;
> > >       Notifier migration_state;
> > >       uint64_t pending_bytes;
> > > +    enum vfio_device_mig_state device_state;
> > > +    int data_fd;
> > > +    void *data_buffer;
> > > +    size_t data_buffer_size;
> > > +    bool v2;
> > >   } VFIOMigration;
> > > 
> > >   typedef struct VFIOAddressSpace {
> > > --
> > > 2.21.3
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-24 13:28       ` Dr. David Alan Gilbert
@ 2022-11-24 14:07         ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-24 14:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Alex Williamson, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela, Michael S. Tsirkin,
	Cornelia Huck, Paolo Bonzini, Stefan Hajnoczi, Fam Zheng,
	Eric Blake, Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x,
	qemu-block, Kunkun Jiang, Zhang, Chen, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Shay Drory, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 24/11/2022 15:28, Dr. David Alan Gilbert wrote:
> * Avihai Horon (avihaih@nvidia.com) wrote:
>> On 23/11/2022 20:59, Dr. David Alan Gilbert wrote:
>>> * Avihai Horon (avihaih@nvidia.com) wrote:
>>>
>>> <snip>
>>>
>>>> +    ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
>>>> +    if (!ret) {
>>>> +        trace_vfio_load_state_device_data(vbasedev->name, data_size);
>>>> +
>>>> +    }
>>> I notice you had a few cases like that; I wouldn't bother making that
>>> conditional - just add 'ret' to the trace parameters; that way if it
>>> fails then you can see that in the trace, and it's simpler anyway.
>> If we add ret to traces such as this, shouldn’t we add ret to the other
>> traces as well, to keep the trace format consistent?
>> If so, is it worth the trouble?
> Traces are for humans; it's nice to keep the same format, but not
> required.
>> Alternatively, we can print the traces regardless of success or failure of
>> the function to better reflect the flow of execution.
>> WDYT?
> I'd just add ret to the ones you're changing.
> The important thing about tracing is that they've got what you need to
> debug things!

OK, fair enough. I will add ret to these traces.

Thanks.

> Dave
>
>> Thanks.
>>
>>> Dave
>>>
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>    static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
>>>>                                   uint64_t data_size)
>>>>    {
>>>> @@ -394,6 +484,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>>>        return qemu_file_get_error(f);
>>>>    }
>>>>
>>>> +static void vfio_migration_cleanup(VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    close(migration->data_fd);
>>>> +    migration->data_fd = -1;
>>>> +}
>>>> +
>>>>    static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>>>>    {
>>>>        VFIOMigration *migration = vbasedev->migration;
>>>> @@ -405,6 +503,18 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
>>>>
>>>>    /* ---------------------------------------------------------------------- */
>>>>
>>>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    trace_vfio_save_setup(vbasedev->name);
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    return qemu_file_get_error(f);
>>>> +}
>>>> +
>>>>    static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>>>>    {
>>>>        VFIODevice *vbasedev = opaque;
>>>> @@ -448,6 +558,14 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
>>>>        return 0;
>>>>    }
>>>>
>>>> +static void vfio_save_cleanup(void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    vfio_migration_cleanup(vbasedev);
>>>> +    trace_vfio_save_cleanup(vbasedev->name);
>>>> +}
>>>> +
>>>>    static void vfio_v1_save_cleanup(void *opaque)
>>>>    {
>>>>        VFIODevice *vbasedev = opaque;
>>>> @@ -456,6 +574,23 @@ static void vfio_v1_save_cleanup(void *opaque)
>>>>        trace_vfio_save_cleanup(vbasedev->name);
>>>>    }
>>>>
>>>> +#define VFIO_MIG_PENDING_SIZE (512 * 1024 * 1024)
>>>> +static void vfio_save_pending(void *opaque, uint64_t threshold_size,
>>>> +                              uint64_t *res_precopy, uint64_t *res_postcopy)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    /*
>>>> +     * VFIO migration protocol v2 currently doesn't have an API to get pending
>>>> +     * device state size. Until such API is introduced, report some big
>>>> +     * arbitrary pending size so the device will be taken into account for
>>>> +     * downtime limit calculations.
>>>> +     */
>>>> +    *res_postcopy += VFIO_MIG_PENDING_SIZE;
>>>> +
>>>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy, *res_postcopy);
>>>> +}
>>>> +
>>>>    static void vfio_v1_save_pending(void *opaque, uint64_t threshold_size,
>>>>                                     uint64_t *res_precopy, uint64_t *res_postcopy)
>>>>    {
>>>> @@ -520,6 +655,67 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>>        return 0;
>>>>    }
>>>>
>>>> +/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
>>>> +static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>> +{
>>>> +    ssize_t data_size;
>>>> +
>>>> +    data_size = read(migration->data_fd, migration->data_buffer,
>>>> +                     migration->data_buffer_size);
>>>> +    if (data_size < 0) {
>>>> +        return -1;
>>>> +    }
>>>> +    if (data_size == 0) {
>>>> +        return 1;
>>>> +    }
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>>>> +    qemu_put_be64(f, data_size);
>>>> +    qemu_put_buffer(f, migration->data_buffer, data_size);
>>>> +    bytes_transferred += data_size;
>>>> +
>>>> +    trace_vfio_save_block(migration->vbasedev->name, data_size);
>>>> +
>>>> +    return qemu_file_get_error(f);
>>>> +}
>>>> +
>>>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    enum vfio_device_mig_state recover_state;
>>>> +    int ret;
>>>> +
>>>> +    /* We reach here with device state STOP only */
>>>> +    recover_state = VFIO_DEVICE_STATE_STOP;
>>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>>> +                                   recover_state);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    do {
>>>> +        ret = vfio_save_block(f, vbasedev->migration);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +    } while (!ret);
>>>> +
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +    ret = qemu_file_get_error(f);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    recover_state = VFIO_DEVICE_STATE_ERROR;
>>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>>>> +                                   recover_state);
>>>> +    if (!ret) {
>>>> +        trace_vfio_save_complete_precopy(vbasedev->name);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>    static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
>>>>    {
>>>>        VFIODevice *vbasedev = opaque;
>>>> @@ -589,6 +785,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
>>>>        }
>>>>    }
>>>>
>>>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>>>> +                                   vbasedev->migration->device_state);
>>>> +}
>>>> +
>>>>    static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>>>>    {
>>>>        VFIODevice *vbasedev = opaque;
>>>> @@ -616,6 +820,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
>>>>        return ret;
>>>>    }
>>>>
>>>> +static int vfio_load_cleanup(void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +
>>>> +    vfio_migration_cleanup(vbasedev);
>>>> +    trace_vfio_load_cleanup(vbasedev->name);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    static int vfio_v1_load_cleanup(void *opaque)
>>>>    {
>>>>        VFIODevice *vbasedev = opaque;
>>>> @@ -658,7 +872,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>>                uint64_t data_size = qemu_get_be64(f);
>>>>
>>>>                if (data_size) {
>>>> -                ret = vfio_v1_load_buffer(f, vbasedev, data_size);
>>>> +                if (vbasedev->migration->v2) {
>>>> +                    ret = vfio_load_buffer(f, vbasedev, data_size);
>>>> +                } else {
>>>> +                    ret = vfio_v1_load_buffer(f, vbasedev, data_size);
>>>> +                }
>>>>                    if (ret < 0) {
>>>>                        return ret;
>>>>                    }
>>>> @@ -679,6 +897,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>>        return ret;
>>>>    }
>>>>
>>>> +static const SaveVMHandlers savevm_vfio_handlers = {
>>>> +    .save_setup = vfio_save_setup,
>>>> +    .save_cleanup = vfio_save_cleanup,
>>>> +    .save_live_pending = vfio_save_pending,
>>>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>>> +    .save_state = vfio_save_state,
>>>> +    .load_setup = vfio_load_setup,
>>>> +    .load_cleanup = vfio_load_cleanup,
>>>> +    .load_state = vfio_load_state,
>>>> +};
>>>> +
>>>>    static SaveVMHandlers savevm_vfio_v1_handlers = {
>>>>        .save_setup = vfio_v1_save_setup,
>>>>        .save_cleanup = vfio_v1_save_cleanup,
>>>> @@ -693,6 +922,34 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
>>>>
>>>>    /* ---------------------------------------------------------------------- */
>>>>
>>>> +static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    enum vfio_device_mig_state new_state;
>>>> +    int ret;
>>>> +
>>>> +    if (running) {
>>>> +        new_state = VFIO_DEVICE_STATE_RUNNING;
>>>> +    } else {
>>>> +        new_state = VFIO_DEVICE_STATE_STOP;
>>>> +    }
>>>> +
>>>> +    ret = vfio_migration_set_state(vbasedev, new_state,
>>>> +                                   VFIO_DEVICE_STATE_ERROR);
>>>> +    if (ret) {
>>>> +        /*
>>>> +         * Migration should be aborted in this case, but vm_state_notify()
>>>> +         * currently does not support reporting failures.
>>>> +         */
>>>> +        if (migrate_get_current()->to_dst_file) {
>>>> +            qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>>>> +                              mig_state_to_str(new_state));
>>>> +}
>>>> +
>>>>    static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
>>>>    {
>>>>        VFIODevice *vbasedev = opaque;
>>>> @@ -766,12 +1023,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>>>        case MIGRATION_STATUS_CANCELLED:
>>>>        case MIGRATION_STATUS_FAILED:
>>>>            bytes_transferred = 0;
>>>> -        ret = vfio_migration_v1_set_state(vbasedev,
>>>> -                                          ~(VFIO_DEVICE_STATE_V1_SAVING |
>>>> -                                            VFIO_DEVICE_STATE_V1_RESUMING),
>>>> -                                          VFIO_DEVICE_STATE_V1_RUNNING);
>>>> -        if (ret) {
>>>> -            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>>>> +        if (migration->v2) {
>>>> +            vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
>>>> +                                     VFIO_DEVICE_STATE_ERROR);
>>>> +        } else {
>>>> +            ret = vfio_migration_v1_set_state(vbasedev,
>>>> +                                              ~(VFIO_DEVICE_STATE_V1_SAVING |
>>>> +                                                VFIO_DEVICE_STATE_V1_RESUMING),
>>>> +                                              VFIO_DEVICE_STATE_V1_RUNNING);
>>>> +            if (ret) {
>>>> +                error_report("%s: Failed to set state RUNNING", vbasedev->name);
>>>> +            }
>>>>            }
>>>>        }
>>>>    }
>>>> @@ -780,12 +1042,35 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
>>>>    {
>>>>        VFIOMigration *migration = vbasedev->migration;
>>>>
>>>> -    vfio_region_exit(&migration->region);
>>>> -    vfio_region_finalize(&migration->region);
>>>> +    if (migration->v2) {
>>>> +        g_free(migration->data_buffer);
>>>> +    } else {
>>>> +        vfio_region_exit(&migration->region);
>>>> +        vfio_region_finalize(&migration->region);
>>>> +    }
>>>>        g_free(vbasedev->migration);
>>>>        vbasedev->migration = NULL;
>>>>    }
>>>>
>>>> +static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
>>>> +{
>>>> +    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>>>> +                                  sizeof(struct vfio_device_feature_migration),
>>>> +                              sizeof(uint64_t))] = {};
>>>> +    struct vfio_device_feature *feature = (void *)buf;
>>>> +    struct vfio_device_feature_migration *mig = (void *)feature->data;
>>>> +
>>>> +    feature->argsz = sizeof(buf);
>>>> +    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
>>>> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +
>>>> +    *mig_flags = mig->flags;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    static int vfio_migration_init(VFIODevice *vbasedev)
>>>>    {
>>>>        int ret;
>>>> @@ -794,6 +1079,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>>        char id[256] = "";
>>>>        g_autofree char *path = NULL, *oid = NULL;
>>>>        struct vfio_region_info *info = NULL;
>>>> +    uint64_t mig_flags;
>>>>
>>>>        if (!vbasedev->ops->vfio_get_object) {
>>>>            return -EINVAL;
>>>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>>            return -EINVAL;
>>>>        }
>>>>
>>>> -    ret = vfio_get_dev_region_info(vbasedev,
>>>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>>>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>>>> -                                   &info);
>>>> -    if (ret) {
>>>> -        return ret;
>>>> -    }
>>>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>>>> +    if (!ret) {
>>>> +        /* Migration v2 */
>>>> +        /* Basic migration functionality must be supported */
>>>> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>>>> +            return -EOPNOTSUPP;
>>>> +        }
>>>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>>>> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
>>>> +        vbasedev->migration->data_buffer =
>>>> +            g_malloc0(vbasedev->migration->data_buffer_size);
>>>> +        vbasedev->migration->data_fd = -1;
>>>> +        vbasedev->migration->v2 = true;
>>>> +    } else {
>>>> +        /* Migration v1 */
>>>> +        ret = vfio_get_dev_region_info(vbasedev,
>>>> +                                       VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
>>>> +                                       VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
>>>> +                                       &info);
>>>> +        if (ret) {
>>>> +            return ret;
>>>> +        }
>>>>
>>>> -    vbasedev->migration = g_new0(VFIOMigration, 1);
>>>> -    vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
>>>> -    vbasedev->migration->vm_running = runstate_is_running();
>>>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
>>>> +        vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
>>>> +        vbasedev->migration->vm_running = runstate_is_running();
>>>>
>>>> -    ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
>>>> -                            info->index, "migration");
>>>> -    if (ret) {
>>>> -        error_report("%s: Failed to setup VFIO migration region %d: %s",
>>>> -                     vbasedev->name, info->index, strerror(-ret));
>>>> -        goto err;
>>>> -    }
>>>> +        ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
>>>> +                                info->index, "migration");
>>>> +        if (ret) {
>>>> +            error_report("%s: Failed to setup VFIO migration region %d: %s",
>>>> +                         vbasedev->name, info->index, strerror(-ret));
>>>> +            goto err;
>>>> +        }
>>>>
>>>> -    if (!vbasedev->migration->region.size) {
>>>> -        error_report("%s: Invalid zero-sized VFIO migration region %d",
>>>> -                     vbasedev->name, info->index);
>>>> -        ret = -EINVAL;
>>>> -        goto err;
>>>> -    }
>>>> +        if (!vbasedev->migration->region.size) {
>>>> +            error_report("%s: Invalid zero-sized VFIO migration region %d",
>>>> +                         vbasedev->name, info->index);
>>>> +            ret = -EINVAL;
>>>> +            goto err;
>>>> +        }
>>>>
>>>> -    g_free(info);
>>>> +        g_free(info);
>>>> +    }
>>>>
>>>>        migration = vbasedev->migration;
>>>>        migration->vbasedev = vbasedev;
>>>> @@ -844,11 +1147,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>>        }
>>>>        strpadcpy(id, sizeof(id), path, '\0');
>>>>
>>>> -    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>>>> -                         &savevm_vfio_v1_handlers, vbasedev);
>>>> +    if (migration->v2) {
>>>> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>>>> +                             &savevm_vfio_handlers, vbasedev);
>>>> +
>>>> +        migration->vm_state = qdev_add_vm_change_state_handler(
>>>> +            vbasedev->dev, vfio_vmstate_change, vbasedev);
>>>> +    } else {
>>>> +        register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
>>>> +                             &savevm_vfio_v1_handlers, vbasedev);
>>>> +
>>>> +        migration->vm_state = qdev_add_vm_change_state_handler(
>>>> +            vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>>>> +    }
>>>>
>>>> -    migration->vm_state = qdev_add_vm_change_state_handler(
>>>> -        vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
>>>>        migration->migration_state.notify = vfio_migration_state_notifier;
>>>>        add_migration_state_change_notifier(&migration->migration_state);
>>>>        return 0;
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index d88d2b4053..9ef84e24b2 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -149,7 +149,9 @@ vfio_display_edid_write_error(void) ""
>>>>
>>>>    # migration.c
>>>>    vfio_migration_probe(const char *name) " (%s)"
>>>> +vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
>>>>    vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
>>>> +vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>>>>    vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>>>    vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>>>>    vfio_save_setup(const char *name) " (%s)"
>>>> @@ -163,6 +165,8 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>>>>    vfio_load_device_config_state(const char *name) " (%s)"
>>>>    vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>>>>    vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
>>>> +vfio_load_state_device_data(const char *name, uint64_t data_size) " (%s) size 0x%"PRIx64
>>>>    vfio_load_cleanup(const char *name) " (%s)"
>>>>    vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
>>>>    vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>>>> +vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index bbaf72ba00..2ec3346fea 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -66,6 +66,11 @@ typedef struct VFIOMigration {
>>>>        int vm_running;
>>>>        Notifier migration_state;
>>>>        uint64_t pending_bytes;
>>>> +    enum vfio_device_mig_state device_state;
>>>> +    int data_fd;
>>>> +    void *data_buffer;
>>>> +    size_t data_buffer_size;
>>>> +    bool v2;
>>>>    } VFIOMigration;
>>>>
>>>>    typedef struct VFIOAddressSpace {
>>>> --
>>>> 2.21.3
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>



* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-24 12:41           ` Avihai Horon
@ 2022-11-28 18:50             ` Alex Williamson
  2022-11-28 19:40               ` Jason Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-28 18:50 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Shay Drory, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, 24 Nov 2022 14:41:00 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 20/11/2022 11:34, Avihai Horon wrote:
> >
> > On 17/11/2022 19:38, Alex Williamson wrote:  
> >> On Thu, 17 Nov 2022 19:07:10 +0200
> >> Avihai Horon <avihaih@nvidia.com> wrote:  
> >>> On 16/11/2022 20:29, Alex Williamson wrote:  
> >>>> On Thu, 3 Nov 2022 18:16:15 +0200
> >>>> Avihai Horon <avihaih@nvidia.com> wrote:  
> >>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >>>>> index e784374453..62afc23a8c 100644
> >>>>> --- a/hw/vfio/migration.c
> >>>>> +++ b/hw/vfio/migration.c
> >>>>> @@ -44,8 +44,84 @@
> >>>>>    #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
> >>>>>    #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
> >>>>>
> >>>>> +#define VFIO_MIG_DATA_BUFFER_SIZE (1024 * 1024)  
> >>>> Add comment explaining heuristic of this size.  
> >>> This is an arbitrary size we picked with mlx5 state size in mind.
> >>> Increasing this size to higher values (128M, 1G) didn't improve
> >>> performance in our testing.
> >>>
> >>> How about this comment:
> >>> This is an initial value that doesn't consume much memory and provides
> >>> good performance.
> >>>
> >>> Do you have other suggestion?  
> >> I'd lean more towards your description above, ex:
> >>
> >> /*
> >>   * This is an arbitrary size based on migration of mlx5 devices, where
> >>   * the worst case total device migration size is on the order of 100s
> >>   * of MB.  Testing with larger values, ex. 128MB and 1GB, did not show
> >>   * a performance improvement.
> >>   */
> >>
> >> I think that provides sufficient information for someone who might come
> >> later to have an understanding of the basis if they want to try to
> >> optimize further.  
> >
> > OK, sounds good, I will add a comment like this.
> >  
> >>>>> @@ -804,34 +1090,51 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> >>>>>            return -EINVAL;
> >>>>>        }
> >>>>>
> >>>>> -    ret = vfio_get_dev_region_info(vbasedev,
> >>>>> -                                   VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
> >>>>> -                                   VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
> >>>>> -                                   &info);
> >>>>> -    if (ret) {
> >>>>> -        return ret;
> >>>>> -    }
> >>>>> +    ret = vfio_migration_query_flags(vbasedev, &mig_flags);
> >>>>> +    if (!ret) {
> >>>>> +        /* Migration v2 */
> >>>>> +        /* Basic migration functionality must be supported */
> >>>>> +        if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> >>>>> +            return -EOPNOTSUPP;
> >>>>> +        }
> >>>>> +        vbasedev->migration = g_new0(VFIOMigration, 1);
> >>>>> +        vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> >>>>> +        vbasedev->migration->data_buffer_size = VFIO_MIG_DATA_BUFFER_SIZE;
> >>>>> +        vbasedev->migration->data_buffer =
> >>>>> +            g_malloc0(vbasedev->migration->data_buffer_size);
> >>>> So VFIO_MIG_DATA_BUFFER_SIZE is our chunk size, but why doesn't the
> >>>> later addition of estimated device data size make any changes here?
> >>>> I'd think we'd want to scale the buffer to the minimum of the reported
> >>>> data size and some well documented heuristic for an upper bound.  
> >>> As I wrote above, increasing this size to higher values (128M, 1G)
> >>> didn't improve performance in our testing.
> >>> We can always change it later on if some other heuristics are proven to
> >>> improve performance.  
> >> Note that I'm asking about a minimum buffer size, for example if
> >> hisi_acc reports only 10s of KB for an estimated device size, why would
> >> we still allocate VFIO_MIG_DATA_BUFFER_SIZE here?  Thanks,  
> >
> > This buffer is rather small and has little memory footprint.
> > Do you think it is worth the extra complexity of resizing the buffer?
> >  
> Alex, WDYT?
> Note that the reported estimated size is dynamic and might change from
> one query to the next, potentially leaving us with a smaller buffer than
> needed.
> 
> Also, as part of v4 I moved this allocation to vfio_save_setup(), so it
> will be allocated only during migration (when it's actually used) and
> only on the source side.

There's a claim here about added complexity that I'm not really seeing.
It looks like we simply make an ioctl call here and scale our buffer
based on the minimum of the returned device estimate or our upper
bound.

The previous comments that exceptionally large buffers don't
significantly affect migration performance also suggest that even if
the device estimate later changes, we'll likely be OK with the initial
device estimate anyway.  Periodically re-checking the device estimate
and re-allocating up to a high-water mark could potentially be future
work.  Thanks,

Alex




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-28 18:50             ` Alex Williamson
@ 2022-11-28 19:40               ` Jason Gunthorpe
  2022-11-28 20:36                 ` Alex Williamson
  0 siblings, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2022-11-28 19:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:

> There's a claim here about added complexity that I'm not really seeing.
> It looks like we simply make an ioctl call here and scale our buffer
> based on the minimum of the returned device estimate or our upper
> bound.

I'm not keen on this: for something like mlx5, which has a small precopy
size and a large post-copy size, it risks running with an under-allocated
buffer, which is harmful to performance.

It is a mmap space; if we don't touch the pages, they don't get
allocated from the OS.  I think this is micro-optimizing.

Jason



* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-28 19:40               ` Jason Gunthorpe
@ 2022-11-28 20:36                 ` Alex Williamson
  2022-11-28 20:56                   ` Jason Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-28 20:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

On Mon, 28 Nov 2022 15:40:23 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
> 
> > There's a claim here about added complexity that I'm not really seeing.
> > It looks like we simply make an ioctl call here and scale our buffer
> > based on the minimum of the returned device estimate or our upper
> > bound.  
> 
> I'm not keen on this: for something like mlx5, which has a small precopy
> size and a large post-copy size, it risks running with an under-allocated
> buffer, which is harmful to performance.

I'm trying to weed out whether there are device assumptions in the
implementation; it seems like maybe we found one.  MIG_DATA_SIZE specifies
that it's an estimated data size for stop-copy, so shouldn't that
provide the buffer size you're looking for?  Thanks,

Alex




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-28 20:36                 ` Alex Williamson
@ 2022-11-28 20:56                   ` Jason Gunthorpe
  2022-11-28 21:10                     ` Alex Williamson
  0 siblings, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2022-11-28 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avihai Horon, qemu-devel, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

On Mon, Nov 28, 2022 at 01:36:30PM -0700, Alex Williamson wrote:
> On Mon, 28 Nov 2022 15:40:23 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
> > 
> > > There's a claim here about added complexity that I'm not really seeing.
> > > It looks like we simply make an ioctl call here and scale our buffer
> > > based on the minimum of the returned device estimate or our upper
> > > bound.  
> > 
> > I'm not keen on this; for something like mlx5, which has a small precopy
> > size and a large post-copy size, it risks running with an
> > under-allocated buffer, which is harmful to performance.
> 
> I'm trying to weed out whether there are device assumptions in the
> implementation; it seems like maybe we found one.  

I don't think there are assumptions. Any correct kernel driver should
be able to do this transfer out of the FD byte-at-a-time.

This buffer size is just a random selection for now until we get
multi-fd and can sit down, benchmark and optimize this properly.

The ideal realization of this has no buffer at all.

> MIG_DATA_SIZE specifies that it's an estimated data size for
> stop-copy, so shouldn't that provide the buffer size you're looking
> for? 

Yes, it should, and it should be OK for mlx5.

Jason
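The byte-at-a-time contract described above can be pictured as an ordinary bounded-buffer copy loop. This is a hypothetical helper, loosely in the spirit of the series' qemu_file_get_to_fd() but not QEMU code; the point is that correctness is independent of buf_size, and only syscall count (and thus throughput) changes:

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): drain a device's migration data
 * FD into an output FD through a bounded buffer.  Because the kernel
 * contract lets userspace read the stream in chunks of any size, the
 * same bytes come out whether buf_size is 1 or 1 MiB.
 */
static ssize_t drain_fd(int src_fd, int dst_fd, size_t buf_size)
{
    char *buf = malloc(buf_size);
    ssize_t total = 0;

    if (buf == NULL) {
        return -1;
    }
    for (;;) {
        ssize_t n = read(src_fd, buf, buf_size);

        if (n < 0) {            /* read error */
            total = -1;
            break;
        }
        if (n == 0) {           /* end of device state stream */
            break;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(dst_fd, buf + off, n - off);

            if (w < 0) {        /* failed write */
                free(buf);
                return -1;
            }
            off += w;           /* handle short writes */
        }
        total += n;
    }
    free(buf);
    return total;
}
```

An under-sized buffer is therefore a performance problem, not a correctness problem, which is exactly the trade-off being debated.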



* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-28 20:56                   ` Jason Gunthorpe
@ 2022-11-28 21:10                     ` Alex Williamson
  2022-11-29 10:40                       ` Avihai Horon
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2022-11-28 21:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Avihai Horon, qemu-devel, Halil Pasic, Christian Borntraeger,
	Eric Farman, Richard Henderson, David Hildenbrand,
	Ilya Leoshkevich, Thomas Huth, Juan Quintela,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
	Paolo Bonzini, Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins

On Mon, 28 Nov 2022 16:56:39 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Nov 28, 2022 at 01:36:30PM -0700, Alex Williamson wrote:
> > On Mon, 28 Nov 2022 15:40:23 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
> > >   
> > > > There's a claim here about added complexity that I'm not really seeing.
> > > > It looks like we simply make an ioctl call here and scale our buffer
> > > > based on the minimum of the returned device estimate or our upper
> > > > bound.    
> > > 
> > > I'm not keen on this; for something like mlx5, which has a small precopy
> > > size and a large post-copy size, it risks running with an
> > > under-allocated buffer, which is harmful to performance.  
> > 
> > I'm trying to weed out whether there are device assumptions in the
> > implementation; it seems like maybe we found one.    
> 
> I don't think there are assumptions. Any correct kernel driver should
> be able to do this transfer out of the FD byte-at-a-time.
> 
> This buffer size is just a random selection for now until we get
> multi-fd and can sit down, benchmark and optimize this properly.

We can certainly still do that, but I'm still failing to see how
buffer_size = min(MIG_DATA_SIZE, 1MB) is such an imposition on
complexity or such an over-eager optimization.  Thanks,

Alex




* Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
  2022-11-28 21:10                     ` Alex Williamson
@ 2022-11-29 10:40                       ` Avihai Horon
  0 siblings, 0 replies; 59+ messages in thread
From: Avihai Horon @ 2022-11-29 10:40 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: qemu-devel, Halil Pasic, Christian Borntraeger, Eric Farman,
	Richard Henderson, David Hildenbrand, Ilya Leoshkevich,
	Thomas Huth, Juan Quintela, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	Stefan Hajnoczi, Fam Zheng, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, qemu-s390x, qemu-block,
	Kunkun Jiang, Zhang, Chen, Yishai Hadas, Maor Gottlieb,
	Shay Drory, Kirti Wankhede, Tarun Gupta, Joao Martins


On 28/11/2022 23:10, Alex Williamson wrote:
> On Mon, 28 Nov 2022 16:56:39 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Mon, Nov 28, 2022 at 01:36:30PM -0700, Alex Williamson wrote:
>>> On Mon, 28 Nov 2022 15:40:23 -0400
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> On Mon, Nov 28, 2022 at 11:50:03AM -0700, Alex Williamson wrote:
>>>>
>>>>> There's a claim here about added complexity that I'm not really seeing.
>>>>> It looks like we simply make an ioctl call here and scale our buffer
>>>>> based on the minimum of the returned device estimate or our upper
>>>>> bound.
>>>> I'm not keen on this; for something like mlx5, which has a small precopy
>>>> size and a large post-copy size, it risks running with an
>>>> under-allocated buffer, which is harmful to performance.
>>> I'm trying to weed out whether there are device assumptions in the
>>> implementation; it seems like maybe we found one.
>> I don't think there are assumptions. Any correct kernel driver should
>> be able to do this transfer out of the FD byte-at-a-time.
>>
>> This buffer size is just a random selection for now until we get
>> multi-fd and can sit down, benchmark and optimize this properly.
> We can certainly still do that, but I'm still failing to see how
> buffer_size = min(MIG_DATA_SIZE, 1MB) is such an imposition on
> complexity or such an over-eager optimization.

OK, I will adjust the min buffer size based on the MIG_DATA_SIZE ioctl.

Thanks.
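The adjustment agreed on here might look something like the following sketch; the helper name is illustrative, and the 1 MiB ceiling is taken from the min(MIG_DATA_SIZE, 1MB) expression discussed above:

```c
#include <stdint.h>

/* Illustrative 1 MiB ceiling, from the thread's min(MIG_DATA_SIZE, 1MB). */
#define VFIO_MIG_BUF_LIMIT (1024u * 1024u)

/*
 * Hypothetical helper: size the stop-copy buffer from the device's
 * MIG_DATA_SIZE estimate, capped at the fixed upper bound.  A zero
 * estimate (the driver gave no answer) falls back to the cap.
 */
static uint64_t vfio_mig_buf_size(uint64_t stop_copy_estimate)
{
    if (stop_copy_estimate == 0 || stop_copy_estimate > VFIO_MIG_BUF_LIMIT) {
        return VFIO_MIG_BUF_LIMIT;
    }
    return stop_copy_estimate;
}
```

Small devices thus avoid a needlessly large allocation, while mlx5-style devices with a large stop-copy size still get the full 1 MiB buffer.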




end of thread, other threads:[~2022-11-29 10:42 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-03 16:16 [PATCH v3 00/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2022-11-03 16:16 ` [PATCH v3 01/17] migration: Remove res_compatible parameter Avihai Horon
2022-11-08 17:52   ` Vladimir Sementsov-Ogievskiy
2022-11-10 13:36     ` Avihai Horon
2022-11-21  7:20       ` Avihai Horon
2022-11-23 18:23       ` Dr. David Alan Gilbert
2022-11-24 12:19         ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter Avihai Horon
2022-11-08 17:57   ` Vladimir Sementsov-Ogievskiy
2022-11-03 16:16 ` [PATCH v3 03/17] migration: Block migration comment or code is wrong Avihai Horon
2022-11-08 18:36   ` Vladimir Sementsov-Ogievskiy
2022-11-08 18:38     ` Vladimir Sementsov-Ogievskiy
2022-11-10 13:38     ` Avihai Horon
2022-11-21  7:21       ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 04/17] migration: Simplify migration_iteration_run() Avihai Horon
2022-11-08 18:56   ` Vladimir Sementsov-Ogievskiy
2022-11-10 13:42     ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 05/17] vfio/migration: Fix wrong enum usage Avihai Horon
2022-11-08 19:05   ` Vladimir Sementsov-Ogievskiy
2022-11-10 13:47     ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
2022-11-08 19:08   ` Vladimir Sementsov-Ogievskiy
2022-11-03 16:16 ` [PATCH v3 07/17] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
2022-11-15 23:36   ` Alex Williamson
2022-11-16 13:29     ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd() Avihai Horon
2022-11-08 20:26   ` Vladimir Sementsov-Ogievskiy
2022-11-03 16:16 ` [PATCH v3 09/17] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
2022-11-03 16:16 ` [PATCH v3 10/17] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
2022-11-15 23:56   ` Alex Williamson
2022-11-16 13:39     ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 11/17] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
2022-11-03 16:16 ` [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2022-11-16 18:29   ` Alex Williamson
2022-11-17 17:07     ` Avihai Horon
2022-11-17 17:24       ` Jason Gunthorpe
2022-11-20  8:46         ` Avihai Horon
2022-11-17 17:38       ` Alex Williamson
2022-11-20  9:34         ` Avihai Horon
2022-11-24 12:41           ` Avihai Horon
2022-11-28 18:50             ` Alex Williamson
2022-11-28 19:40               ` Jason Gunthorpe
2022-11-28 20:36                 ` Alex Williamson
2022-11-28 20:56                   ` Jason Gunthorpe
2022-11-28 21:10                     ` Alex Williamson
2022-11-29 10:40                       ` Avihai Horon
2022-11-23 18:59   ` Dr. David Alan Gilbert
2022-11-24 12:25     ` Avihai Horon
2022-11-24 13:28       ` Dr. David Alan Gilbert
2022-11-24 14:07         ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 13/17] vfio/migration: Remove VFIO migration protocol v1 Avihai Horon
2022-11-03 16:16 ` [PATCH v3 14/17] vfio/migration: Reset device if setting recover state fails Avihai Horon
2022-11-16 18:36   ` Alex Williamson
2022-11-17 17:11     ` Avihai Horon
2022-11-17 18:18       ` Alex Williamson
2022-11-20  9:39         ` Avihai Horon
2022-11-03 16:16 ` [PATCH v3 15/17] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
2022-11-03 16:16 ` [PATCH v3 16/17] docs/devel: Align vfio-migration docs to VFIO migration v2 Avihai Horon
2022-11-03 16:16 ` [PATCH v3 17/17] vfio/migration: Query device data size in vfio_save_pending() Avihai Horon
