QEMU-Devel Archive on lore.kernel.org
* [PATCH QEMU v26 00/17] Add migration support for VFIO devices
@ 2020-09-22 23:24 Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (17 more replies)
  0 siblings, 18 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hi,

This patch set adds migration support for VFIO devices in QEMU.

This patch set includes the following patches:
Patch 1-2:
- A few code refactors.

Patch 3:
- Added save and restore functions for the PCI configuration space. Used
  pci_device_save() and pci_device_load() so that the config space cache is
  saved and restored.

Patch 4-9:
- Generic migration functionality for VFIO devices.
  * This patch set adds the functionality for PCI devices, but it can be
    extended to other VFIO devices.
  * Added all the basic functions required for the pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During the save setup phase and the resume/load setup phase, the migration
    region is queried and used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    iteration functionality during the pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
    iteration to read data from the VFIO device driver continues until the
    pending bytes returned by the driver reach zero.
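The stop-and-copy iteration described above can be sketched as follows. This is an illustrative sketch only: FakeDev and both helper names are stand-ins, not the actual QEMU or VFIO kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for a device whose driver reports pending
 * migration data through the migration region. */
typedef struct {
    uint64_t pending_bytes;   /* what the driver would report */
} FakeDev;

/* Simulate reading one chunk of device data; the driver shrinks
 * pending_bytes as data is consumed. */
static uint64_t read_chunk(FakeDev *dev, uint64_t max)
{
    uint64_t n = dev->pending_bytes < max ? dev->pending_bytes : max;

    dev->pending_bytes -= n;
    return n;
}

/* Stop-and-copy loop: keep reading until the driver reports zero pending
 * bytes, mirroring the iteration in .save_live_complete_precopy. */
static uint64_t save_complete(FakeDev *dev)
{
    uint64_t total = 0;

    while (dev->pending_bytes) {
        /* in QEMU, each chunk would be written to the QEMUFile stream */
        total += read_chunk(dev, 4096);
    }
    return total;
}
```

The loop terminates because the driver is only asked to report data it has not yet handed out; vCPUs are already stopped at this point, so pending_bytes cannot grow.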

Patch 10:
- Set DIRTY_MEMORY_MIGRATION flag in dirty log mask for migration with vIOMMU
  enabled.

Patch 11:
- Get migration capability from kernel module.

Patch 12-14:
- Add function to start and stop dirty pages tracking.
- Add vfio_listener_log_sync to mark dirty pages. The dirty pages bitmap is
  queried per container. All pages pinned by the vendor driver through the
  vfio_pin_pages external API have to be marked as dirty during migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by the vendor driver can also be written by the
  device. As of now there is no device with hardware support for dirty page
  tracking, so all pages pinned by the vendor driver must be considered dirty.
  In QEMU, marking pages dirty is only done when the device is in the
  stop-and-copy phase: if pages were marked dirty during the pre-copy phase
  and their content transferred from source to destination, there would be no
  way to identify newly dirtied pages from the point they were copied until
  the device stops. To avoid repeated copies of the same content, pinned
  pages are marked dirty only during the stop-and-copy phase.
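A minimal sketch of this policy, assuming a toy fixed-size bitmap and a phase flag; DirtyTracker and sync_pinned_pages are illustrative, not the series' actual per-container tracking API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t dirty;        /* one bit per page, toy-sized bitmap */
    bool stop_and_copy;    /* true once the VM enters stop-and-copy */
} DirtyTracker;

/* Pages pinned by the vendor driver may be written by the device at any
 * time; without hardware dirty tracking they must all be assumed dirty.
 * They are only reported once the device has stopped, since reporting
 * them earlier would just cause the same content to be copied again. */
static void sync_pinned_pages(DirtyTracker *t, const int *pinned, int n)
{
    int i;

    if (!t->stop_and_copy) {
        return;  /* pre-copy: skip, to avoid repeated copies */
    }
    for (i = 0; i < n; i++) {
        t->dirty |= UINT64_C(1) << pinned[i];
    }
}
```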

Patch 15:
- With vIOMMU, an IO virtual address range can get unmapped while in the
  pre-copy phase of migration. In that case, the unmap ioctl should return
  the pages pinned in that range and QEMU should report the corresponding
  guest physical pages dirty.
Note: Comments from v25 for this patch are not addressed in this series.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714646.html
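A rough sketch of this unmap-time reporting, assuming 4 KiB pages and a toy bitmap; unmap_get_dirty_bitmap is a hypothetical stand-in for the proposed unmap ioctl, shown only to illustrate the bookkeeping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096

/* When an IOVA range is unmapped during pre-copy, report which pages in
 * that range were pinned, so the caller can mark the matching guest
 * physical pages dirty before the mapping disappears. */
static uint64_t unmap_get_dirty_bitmap(uint64_t iova, uint64_t size,
                                       const uint64_t *pinned, size_t n)
{
    uint64_t bitmap = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (pinned[i] >= iova && pinned[i] < iova + size) {
            /* bit index = page offset of the pinned address in the range */
            bitmap |= UINT64_C(1) << ((pinned[i] - iova) / TOY_PAGE_SIZE);
        }
    }
    return bitmap;   /* caller marks these pages dirty, then unmaps */
}
```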

Patch 16:
- Make the VFIO PCI device migration capable. If the migration region is not
  provided by the driver, migration is blocked.

Patch 17:
- Added VFIO device stats to MigrationInfo
Note: Comments from v25 for this patch are not addressed yet.
https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html


Yet TODO:
Since there is no device with hardware support for system memory dirty
bitmap tracking, there is currently no other API from the vendor driver to
the VFIO IOMMU module to report dirty pages. In the future, when such
hardware support is implemented, a kernel API will be required so that the
vendor driver can report dirty pages to the VFIO module during the migration
phases.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If the device is active, get pending bytes by .save_live_pending()
    If pending bytes >= threshold_size, call .save_live_iterate()
    Data of the VFIO device for the pre-copy phase is copied.
    Iterate till pending bytes converge and are less than the threshold.
                        |
    On migration completion, vCPUs are stopped and .save_live_complete_precopy
    is called for each active device. The VFIO device is then transitioned to
    the _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For the VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules the cleanup bottom half and exits.

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                        |
    For each device, .load_state is called for that device's section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)
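The _RUNNING/_SAVING/_RESUMING values in the tuples above are bit flags in the proposed vfio_device_migration_info UAPI. The following sketch encodes the save-path transitions; the flag values mirror the proposed kernel UAPI, while state_valid is an illustrative helper (the real validity rules live in the vendor driver):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device state bits as in the proposed vfio_device_migration_info UAPI. */
#define VFIO_DEVICE_STATE_STOP     0
#define VFIO_DEVICE_STATE_RUNNING  (1 << 0)
#define VFIO_DEVICE_STATE_SAVING   (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING (1 << 2)

/* Save path: _RUNNING -> _RUNNING|_SAVING (pre-copy) -> _SAVING
 * (stop-and-copy) -> _STOP. */
static const uint32_t save_path[] = {
    VFIO_DEVICE_STATE_RUNNING,
    VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING,
    VFIO_DEVICE_STATE_SAVING,
    VFIO_DEVICE_STATE_STOP,
};

/* A state with both _SAVING and _RESUMING set is never meaningful: a
 * device cannot be producing and consuming migration data at once. */
static int state_valid(uint32_t state)
{
    uint32_t both = VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING;

    return (state & both) != both;
}
```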

Note that:
- Post-copy migration is not supported.


v25 -> v26
- Removed emulated_config_bits cache and vdev->pdev.wmask from config space save
  load functions.
- Used VMStateDescription for config space save and load functionality.
- Major fixes from previous version review.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714625.html

v23 -> v25
- Updated config space save and load to save the config cache, emulated bits
  cache and wmask cache.
- Created the id string, as suggested by Dr. Dave, so that it includes the
  bus path.
- Updated save and load functions to read/write data to mixed regions, mapped
  or trapped.
- When vIOMMU is enabled, created a mapped iova range list which also keeps
  the translated address. This list is used to mark dirty pages. This reduces
  downtime significantly with vIOMMU enabled compared to the migration
  patches from the previous version.
- Removed the get_address_limit() function from the v23 patch as it is not
  required now.

v22 -> v23
- Fixed issue reported by Yan
https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd531c@nvidia.com/
- Sending this version to test v23 kernel version patches:
https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankhede@nvidia.com/

v18 -> v22
- A few fixes from the v18 review, but not all concerns are addressed yet.
  I'll address them in subsequent iterations.
- Sending this version to test v22 kernel version patches:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhede@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
    https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split the patch set into 2 sets, kernel and QEMU.
- The dirty pages bitmap is queried from the IOMMU container rather than from
  the vendor driver per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence of access of the members of
  structure vfio_device_migration_info, to be followed based on Alex's
  suggestion.
- Updated the get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps for
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple VFIO devices assigned to a VM.

v3 -> v4:
- Added one more bit for the _RESUMING flag to be set explicitly.
- The data_offset field is read-only for user space applications.
- data_size is read for every iteration before reading data from the
  migration region, removing the assumption that data extends to the end of
  the migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load, and similarly unmap them in the
  cleanup routines.
- Handled a race condition that caused data corruption in the migration
  region during device state save by adding a mutex and serializing the
  save_buffer and get_dirty_pages routines.
- Skipped calling the get_dirty_pages routine for the mapped MMIO region of
  the device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined the MIGRATION region type and sub-type, which should be used with
  the region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at
  the 0th offset of the migration region.
- Replaced ioctl with read/write for the trapped part of the migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to the pci file.
- Added iteration to get the dirty page bitmap until bitmaps for all
  requested pages are copied.

Thanks,
Kirti



Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  vfio: Get migration capability flags for container
  vfio: Add function to start and stop dirty pages tracking
  vfio: create mapped iova list when vIOMMU is enabled
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  vfio: Make vfio-pci device migration capable
  qapi: Add VFIO devices migration stats in Migration stats

 hw/vfio/common.c              | 426 ++++++++++++++++++--
 hw/vfio/meson.build           |   1 +
 hw/vfio/migration.c           | 892 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 170 +++++++-
 hw/vfio/pci.h                 |   2 +-
 hw/vfio/trace-events          |  20 +
 include/hw/vfio/vfio-common.h |  30 ++
 include/qemu/vfio-helpers.h   |   3 +
 migration/migration.c         |  14 +
 monitor/hmp-cmds.c            |   6 +
 qapi/migration.json           |  17 +
 softmmu/memory.c              |   2 +-
 12 files changed, 1539 insertions(+), 44 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 01/17] vfio: Add function to unmap VFIO region
  2020-09-22 23:24 [PATCH QEMU v26 00/17] Add migration support for VFIO devices Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

This function will be used for the migration region.
The migration region is mmapped when migration starts and will be unmapped
when migration is complete.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/common.c              | 32 ++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 13471ae29436..c6e98b8d61be 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -924,6 +924,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
     return 0;
 }
 
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+                            region->mmaps[index].offset,
+                            region->mmaps[index].offset +
+                            region->mmaps[index].size - 1);
+    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+    object_unparent(OBJECT(&region->mmaps[index].mem));
+    region->mmaps[index].mmap = NULL;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
     int i, prot = 0;
@@ -954,10 +966,7 @@ int vfio_region_mmap(VFIORegion *region)
             region->mmaps[i].mmap = NULL;
 
             for (i--; i >= 0; i--) {
-                memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-                munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-                object_unparent(OBJECT(&region->mmaps[i].mem));
-                region->mmaps[i].mmap = NULL;
+                vfio_subregion_unmap(region, i);
             }
 
             return ret;
@@ -982,6 +991,21 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            vfio_subregion_unmap(region, i);
+        }
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 93a0bc2522f8..a0c7b49a2ebc 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -113,6 +113,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff5593c..dc95f527b583 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




* [PATCH v26 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps
  2020-09-22 23:24 [PATCH QEMU v26 00/17] Add migration support for VFIO devices Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hook vfio_get_object callback for PCI devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/pci.c                 | 8 ++++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0d83eb0e47bb..bffd5bfe3b78 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2394,10 +2394,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
     }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index dc95f527b583..fe99c36a693a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
     void (*vfio_compute_needs_reset)(VFIODevice *vdev);
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
+    Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-09-22 23:24 [PATCH QEMU v26 00/17] Add migration support for VFIO devices Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-23  6:38   ` Zenghui Yu
  2020-09-24 22:49   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

These functions save and restore PCI device specific data, i.e. the config
space of the PCI device.
VMStateDescription is used to save and restore the interrupt state.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c                 | 134 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h                 |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 3 files changed, 137 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bffd5bfe3b78..9968cc553391 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -2401,11 +2402,142 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
+static int vfio_get_pci_irq_state(QEMUFile *f, void *pv, size_t size,
+                             const VMStateField *field)
+{
+    VFIOPCIDevice *vdev = container_of(pv, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        msix_load(pdev, f);
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+    }
+    return 0;
+}
+
+static int vfio_put_pci_irq_state(QEMUFile *f, void *pv, size_t size,
+                             const VMStateField *field, QJSON *vmdesc)
+{
+    VFIOPCIDevice *vdev = container_of(pv, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        msix_save(pdev, f);
+    }
+
+    return 0;
+}
+
+static const VMStateInfo vmstate_info_vfio_pci_irq_state = {
+    .name = "VFIO PCI irq state",
+    .get  = vfio_get_pci_irq_state,
+    .put  = vfio_put_pci_irq_state,
+};
+
+const VMStateDescription vmstate_vfio_pci_config = {
+    .name = "VFIOPCIDevice",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_INT32_POSITIVE_LE(version_id, VFIOPCIDevice),
+        VMSTATE_BUFFER_UNSAFE_INFO(interrupt, VFIOPCIDevice, 1,
+                                   vmstate_info_vfio_pci_irq_state,
+                                   sizeof(int32_t)),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+
+
+    pci_device_save(pdev, f);
+    vmstate_save_state(f, &vmstate_vfio_pci_config, vbasedev, NULL);
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint16_t pci_cmd;
+    int ret, i;
+
+    ret = pci_device_load(pdev, f);
+    if (ret) {
+        return ret;
+    }
+
+    /* restore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = pci_default_read_config(pdev,
+                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+
+    ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vbasedev,
+                             vdev->version_id);
+
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+    return ret;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
     .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
@@ -2755,6 +2887,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = 0;
+    vdev->version_id = 1;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bce71a9ac93f..9f46af7e153f 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -156,6 +156,7 @@ struct VFIOPCIDevice {
     uint32_t display_yres;
     int32_t bootindex;
     uint32_t igd_gms;
+    int32_t version_id;     /* Version id needed for VMState */
     OffAutoPCIBAR msix_relo;
     uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fe99c36a693a..ba6169cd926e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
     Object *(*vfio_get_object)(VFIODevice *vdev);
+    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [PATCH v26 04/17] vfio: Add migration region initialization and finalize function
  2020-09-22 23:24 [PATCH QEMU v26 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-24 14:08   ` Cornelia Huck
  2020-09-25 20:20   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query succeeds and the
migration region initialization is successful, then migration is supported;
otherwise migration is blocked.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/meson.build           |   1 +
 hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   5 ++
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 157 insertions(+)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 37efa74018bc..da9af297a0c5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'spapr.c',
+  'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..2f760f1f9c47
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,142 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
+                            "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        ret = -EINVAL;
+        error_report("%s: Invalid region size of VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret = -EINVAL;
+
+    if (!vbasedev->ops->vfio_get_object) {
+        return ret;
+    }
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+
+    ret = vfio_migration_region_init(vbasedev, info->index);
+    if (ret) {
+        error_report("%s: Failed to initialise migration region",
+                     vbasedev->name);
+        g_free(vbasedev->migration);
+        vbasedev->migration = NULL;
+    }
+
+    return ret;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info = NULL;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    g_free(info);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    g_free(info);
+
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a0c7b49a2ebc..8fe913175d85 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -145,3 +145,8 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+
+# migration.c
+vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ba6169cd926e..8275c4c68f45 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,10 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    VFIORegion region;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -113,6 +117,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-24 15:02   ` Cornelia Huck
  2020-09-25 20:20   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 06/17] vfio: Add migration state change notifier Kirti Wankhede
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

The VM state change handler is called whenever the VM's run state changes. It
is used to set the VFIO device state to _RUNNING.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +-
 include/hw/vfio/vfio-common.h |   4 ++
 3 files changed, 142 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2f760f1f9c47..a30d628ba963 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include <linux/vfio.h>
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -22,6 +23,58 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 #include "trace.h"
+#include "hw/hw.h"
+
+static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
+                                  off_t off, bool iswrite)
+{
+    int ret;
+
+    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
+                    pread(vbasedev->fd, val, count, off);
+    if (ret < count) {
+        error_report("vfio_mig_%s%d %s: failed at offset 0x%lx, err: %s",
+                     iswrite ? "write" : "read", count * 8,
+                     vbasedev->name, off, strerror(errno));
+        return (ret < 0) ? ret : -EINVAL;
+    }
+    return 0;
+}
+
+static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
+                       off_t off, bool iswrite)
+{
+    int ret, done = 0;
+    __u8 *tbuf = buf;
+
+    while (count) {
+        int bytes = 0;
+
+        if (count >= 8 && !(off % 8)) {
+            bytes = 8;
+        } else if (count >= 4 && !(off % 4)) {
+            bytes = 4;
+        } else if (count >= 2 && !(off % 2)) {
+            bytes = 2;
+        } else {
+            bytes = 1;
+        }
+
+        ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
+        if (ret) {
+            return ret;
+        }
+
+        count -= bytes;
+        done += bytes;
+        off += bytes;
+        tbuf += bytes;
+    }
+    return done;
+}
+
+#define vfio_mig_read(f, v, c, o)       vfio_mig_rw(f, (__u8 *)v, c, o, false)
+#define vfio_mig_write(f, v, c, o)      vfio_mig_rw(f, (__u8 *)v, c, o, true)
 
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
@@ -70,6 +123,82 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+                                    uint32_t value)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    off_t dev_state_off = region->fd_offset +
+                      offsetof(struct vfio_device_migration_info, device_state);
+    uint32_t device_state;
+    int ret;
+
+    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+                        dev_state_off);
+    if (ret < 0) {
+        return ret;
+    }
+
+    device_state = (device_state & mask) | value;
+
+    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+        return -EINVAL;
+    }
+
+    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
+                         dev_state_off);
+    if (ret < 0) {
+        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
+                          dev_state_off);
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
+            hw_error("%s: Device is in error state 0x%x",
+                     vbasedev->name, device_state);
+            return -EFAULT;
+        }
+    }
+
+    vbasedev->device_state = device_state;
+    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->vm_running != running) {
+        int ret;
+        uint32_t value = 0, mask = 0;
+
+        if (running) {
+            value = VFIO_DEVICE_STATE_RUNNING;
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
+                mask = ~VFIO_DEVICE_STATE_RESUMING;
+            }
+        } else {
+            mask = ~VFIO_DEVICE_STATE_RUNNING;
+        }
+
+        ret = vfio_migration_set_state(vbasedev, mask, value);
+        if (ret) {
+            /*
+             * vm_state_notify() doesn't support reporting failure. If such
+             * error reporting support is added in future, migration should
+             * be aborted.
+             */
+            error_report("%s: Failed to set device state 0x%x",
+                         vbasedev->name, value & mask);
+        }
+        vbasedev->vm_running = running;
+        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                                  value & mask);
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -87,8 +216,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
                      vbasedev->name);
         g_free(vbasedev->migration);
         vbasedev->migration = NULL;
+        return ret;
     }
 
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
     return ret;
 }
 
@@ -131,6 +263,10 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8fe913175d85..6524734bf7b4 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,4 +149,5 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
-
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8275c4c68f45..25e3b1a3b90a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -119,6 +120,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    VMChangeStateEntry *vm_state;
+    uint32_t device_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 06/17] vfio: Add migration state change notifier
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (4 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-25 20:20   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added a migration state change notifier to get notifications on migration
state changes. These states are translated to the VFIO device state and
conveyed to the vendor driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  5 +++--
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a30d628ba963..f650fe9fc3c8 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    trace_vfio_migration_state_notifier(vbasedev->name,
+                                        MigrationStatus_str(s->state));
+
+    switch (s->state) {
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev,
+                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
+                      VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        }
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
     return ret;
 }
 
@@ -263,6 +287,11 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 6524734bf7b4..bcb3fa7314d7 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
-vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
-vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 25e3b1a3b90a..49c7c7a0e29a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -123,6 +123,7 @@ typedef struct VFIODevice {
     VMChangeStateEntry *vm_state;
     uint32_t device_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (5 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 06/17] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-24 15:15   ` Philippe Mathieu-Daudé
                     ` (2 more replies)
  2020-09-22 23:24 ` [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (10 subsequent siblings)
  17 siblings, 3 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Define flags to be used as delimiters in the migration file stream.
Added .save_setup and .save_cleanup functions, which map and unmap the
migration region at the source during the saving or pre-copy phase.
Set the VFIO device state depending on the VM's state: during live migration,
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
is set for the VFIO device; during save-restore, the VM is paused, so the
_SAVING state is set.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |  2 ++
 2 files changed, 93 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index f650fe9fc3c8..8e8adaa25779 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
+#include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
@@ -25,6 +28,17 @@
 #include "trace.h"
 #include "hw/hw.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
                                   off_t off, bool iswrite)
 {
@@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    trace_vfio_save_setup(vbasedev->name);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            error_report("%s: Falling back to slow path", vbasedev->name);
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state SAVING", vbasedev->name);
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.mmaps) {
+        vfio_region_unmap(&migration->region);
+    }
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
     int ret = -EINVAL;
+    char id[256] = "";
+    Object *obj;
 
     if (!vbasedev->ops->vfio_get_object) {
         return ret;
@@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+
+    if (obj) {
+        DeviceState *dev = DEVICE(obj);
+        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
+
+        if (oid) {
+            pstrcpy(id, sizeof(id), oid);
+            pstrcat(id, sizeof(id), "/");
+            g_free(oid);
+        }
+    }
+    pstrcat(id, sizeof(id), "vfio");
+
+    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
+                         vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
     vbasedev->migration_state.notify = vfio_migration_state_notifier;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bcb3fa7314d7..982d8dccb219 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
+vfio_save_setup(const char *name) " (%s)"
+vfio_save_cleanup(const char *name) " (%s)"
-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (6 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-23 11:42   ` Wang, Zhi A
  2020-09-25 21:02   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 09/17] vfio: Add load " Kirti Wankhede
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through the steps below.
- read data_offset - indicates to the kernel driver that it should write data
  to the staging buffer.
- read data_size - the amount of data in bytes written by the vendor driver
  in the migration region.
- read data_size bytes of data from data_offset in the migration region.
- write the data packet to the file stream as below:
  {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
   VFIO_MIG_FLAG_END_OF_STATE}

In _SAVING device state or stop-and-copy phase:
a. read the config space of the device and save it to the migration file
   stream. This doesn't need to come from the vendor driver; any other
   special config state from the driver can be saved as data in a following
   iteration.
b. read pending_bytes. If pending_bytes > 0, go through the steps below.
c. read data_offset - indicates to the kernel driver that it should write
   data to the staging buffer.
d. read data_size - the amount of data in bytes written by the vendor driver
   in the migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. write the data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving to the next step.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 273 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   6 +
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 280 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8e8adaa25779..4611bb972228 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -180,6 +180,154 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
+                                   uint64_t data_size, uint64_t *size)
+{
+    void *ptr = NULL;
+    uint64_t limit = 0;
+    int i;
+
+    if (!region->mmaps) {
+        if (size) {
+            *size = data_size;
+        }
+        return ptr;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        VFIOMmap *map = region->mmaps + i;
+
+        if ((data_offset >= map->offset) &&
+            (data_offset < map->offset + map->size)) {
+
+            /* check if data_offset is within sparse mmap areas */
+            ptr = map->mmap + data_offset - map->offset;
+            if (size) {
+                *size = MIN(data_size, map->offset + map->size - data_offset);
+            }
+            break;
+        } else if ((data_offset < map->offset) &&
+                   (!limit || limit > map->offset)) {
+            /*
+             * data_offset is not within sparse mmap areas, find size of
+             * non-mapped area. Check through all list since region->mmaps list
+             * is not sorted.
+             */
+            limit = map->offset;
+        }
+    }
+
+    if (!ptr && size) {
+        *size = limit ? limit - data_offset : data_size;
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t data_offset = 0, data_size = 0, sz;
+    int ret;
+
+    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret < 0) {
+        return ret;
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    qemu_put_be64(f, data_size);
+    sz = data_size;
+
+    while (sz) {
+        void *buf = NULL;
+        uint64_t sec_size;
+        bool buf_allocated = false;
+
+        buf = get_data_section_size(region, data_offset, sz, &sec_size);
+
+        if (!buf) {
+            buf = g_try_malloc(sec_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer", __func__);
+                return -ENOMEM;
+            }
+            buf_allocated = true;
+
+            ret = vfio_mig_read(vbasedev, buf, sec_size,
+                                region->fd_offset + data_offset);
+            if (ret < 0) {
+                g_free(buf);
+                return ret;
+            }
+        }
+
+        qemu_put_buffer(f, buf, sec_size);
+
+        if (buf_allocated) {
+            g_free(buf);
+        }
+        sz -= sec_size;
+        data_offset += sec_size;
+    }
+
+    ret = qemu_file_get_error(f);
+
+    if (!ret && size) {
+        *size = data_size;
+    }
+
+    return ret;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if (ret < 0) {
+        migration->pending_bytes = 0;
+        return ret;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -232,9 +380,134 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t data_size;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    if (migration->pending_bytes == 0) {
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+
+        if (migration->pending_bytes == 0) {
+            /* indicates data finished, goto complete phase */
+            return 1;
+        }
+    }
+
+    ret = vfio_save_buffer(f, vbasedev, &data_size);
+
+    if (ret) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t data_size;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev, &data_size);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        }
+
+        if (data_size == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 982d8dccb219..118b5547c921 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,3 +154,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 49c7c7a0e29a..471e444a364c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -60,6 +60,7 @@ typedef struct VFIORegion {
 
 typedef struct VFIOMigration {
     VFIORegion region;
+    uint64_t pending_bytes;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (7 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-10-01 10:07   ` Cornelia Huck
  2020-09-22 23:24 ` [PATCH v26 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Sequence during the _RESUMING device state:
While data for this device is available, repeat the steps below:
a. read data_offset, which tells the user application where to write data.
b. write data of data_size to the migration region starting at data_offset.
c. write data_size, which indicates to the vendor driver that data has been
   written to the staging buffer.

To the user, the data is opaque. The user must write data in the same order
in which it was received.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c  | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 173 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4611bb972228..ffd70282dd0e 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -328,6 +328,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        int ret;
+
+        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+        if (ret) {
+            error_report("%s: Failed to load device config space",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -502,12 +529,155 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.mmaps) {
+        ret = vfio_region_mmap(&migration->region);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            error_report("%s: Falling back to slow path", vbasedev->name);
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
+                                   VFIO_DEVICE_STATE_RESUMING);
+    if (ret) {
+        error_report("%s: Failed to set state RESUMING", vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+
+        trace_vfio_load_state(vbasedev->name, data);
+
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region;
+            uint64_t data_offset = 0, size;
+
+            data_size = size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+                                region->fd_offset +
+                                offsetof(struct vfio_device_migration_info,
+                                         data_offset));
+            if (ret < 0) {
+                return ret;
+            }
+
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
+
+            while (size) {
+                void *buf = NULL;
+                uint64_t sec_size;
+                bool buf_alloc = false;
+
+                buf = get_data_section_size(region, data_offset, size,
+                                            &sec_size);
+
+                if (!buf) {
+                    buf = g_try_malloc(sec_size);
+                    if (!buf) {
+                        error_report("%s: Error allocating buffer ", __func__);
+                        return -ENOMEM;
+                    }
+                    buf_alloc = true;
+                }
+
+                qemu_get_buffer(f, buf, sec_size);
+
+                if (buf_alloc) {
+                    ret = vfio_mig_write(vbasedev, buf, sec_size,
+                                         region->fd_offset + data_offset);
+                    g_free(buf);
+
+                    if (ret < 0) {
+                        return ret;
+                    }
+                }
+                size -= sec_size;
+                data_offset += sec_size;
+            }
+
+            ret = vfio_mig_write(vbasedev, &data_size, sizeof(data_size),
+                                 region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret < 0) {
+                return ret;
+            }
+            break;
+        }
+
+        default:
+            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
+            return -EINVAL;
+        }
+
+        data = qemu_get_be64(f);
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 118b5547c921..94ba4696f0c6 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -160,3 +160,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
 vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
-- 
2.7.0




* [PATCH v26 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (8 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 09/17] vfio: Add load " Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 11/17] vfio: Get migration capability flags for container Kirti Wankhede
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask
was not set correctly and the memory listener's log_sync callback never
got called.
This patch returns the log mask with DIRTY_MEMORY_MIGRATION set when
the IOMMU is enabled.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index d030eb6f7cea..0c1460394ace 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1777,7 +1777,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
-    if (global_dirty_log && mr->ram_block) {
+    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
         mask |= (1 << DIRTY_MEMORY_MIGRATION);
     }
     return mask;
-- 
2.7.0




* [PATCH v26 11/17] vfio: Get migration capability flags for container
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (9 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, Eric Auger, changpeng.liu, eskultet, Shameer Kolothum,
	Ken.Xue, jonathan.davies, pbonzini

Added helper functions to get the IOMMU info capability chain.
Added a function to get migration capability information from that
capability chain for the IOMMU container.

A similar change was proposed earlier:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html

Disable migration for devices if the IOMMU module doesn't support the
migration capability.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c              | 90 +++++++++++++++++++++++++++++++++++++++----
 hw/vfio/migration.c           |  7 +++-
 include/hw/vfio/vfio-common.h |  3 ++
 3 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c6e98b8d61be..d4959c036dd1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1228,6 +1228,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     return 0;
 }
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+                               struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto again;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+                                         struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+    if (!hdr) {
+        return;
+    }
+
+    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+                            header);
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty.
+     */
+    if (cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE) {
+        container->dirty_pages_supported = true;
+        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+    }
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
@@ -1297,6 +1366,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->space = space;
     container->fd = fd;
     container->error = NULL;
+    container->dirty_pages_supported = false;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
 
@@ -1309,7 +1379,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-        struct vfio_iommu_type1_info info;
+        struct vfio_iommu_type1_info *info;
 
         /*
          * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
@@ -1318,15 +1388,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-        /* Ignore errors */
-        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+        ret = vfio_get_iommu_info(container, &info);
+
+        if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
             /* Assume 4k IOVA page size */
-            info.iova_pgsizes = 4096;
+            info->iova_pgsizes = 4096;
         }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
-        container->pgsizes = info.iova_pgsizes;
+        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+        container->pgsizes = info->iova_pgsizes;
+
+        if (!ret) {
+            vfio_get_iommu_info_migration(container, info);
+        }
+        g_free(info);
         break;
     }
     case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ffd70282dd0e..4306f6316417 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -786,9 +786,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
+    VFIOContainer *container = vbasedev->group->container;
     struct vfio_region_info *info = NULL;
     Error *local_err = NULL;
-    int ret;
+    int ret = -ENOTSUP;
+
+    if (!container->dirty_pages_supported) {
+        goto add_blocker;
+    }
 
     ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
                                    VFIO_REGION_SUBTYPE_MIGRATION, &info);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 471e444a364c..0a1651eda2d0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -79,6 +79,9 @@ typedef struct VFIOContainer {
     unsigned iommu_type;
     Error *error;
     bool initialized;
+    bool dirty_pages_supported;
+    uint64_t dirty_pgsizes;
+    uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
-- 
2.7.0




* [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (10 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 11/17] vfio: Get migration capability flags for container Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-25 21:55   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Call the VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty page tracking
for VFIO devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4306f6316417..822b68b4e015 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include <linux/vfio.h>
+#include <sys/ioctl.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -355,6 +356,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+    int ret;
+    VFIOContainer *container = vbasedev->group->container;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+        } else {
+            return -EINVAL;
+        }
+    } else {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+    return ret;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -386,6 +413,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    ret = vfio_set_dirty_page_tracking(vbasedev, true);
+    if (ret) {
+        return ret;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     ret = qemu_file_get_error(f);
@@ -401,6 +433,8 @@ static void vfio_save_cleanup(void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
+    vfio_set_dirty_page_tracking(vbasedev, false);
+
     if (migration->region.mmaps) {
         vfio_region_unmap(&migration->region);
     }
@@ -734,6 +768,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
         if (ret) {
             error_report("%s: Failed to set state RUNNING", vbasedev->name);
         }
+
+        vfio_set_dirty_page_tracking(vbasedev, false);
     }
 }
 
-- 
2.7.0




* [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (11 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-25 22:23   ` Alex Williamson
  2020-09-22 23:24 ` [PATCH v26 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Create a list of mapped iova ranges when the vIOMMU is enabled, saving the
translated address for each mapped iova. A node is added to the list on MAP
and removed from the list on UNMAP.
This list is used to track dirty pages during migration.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
 include/hw/vfio/vfio-common.h |  8 ++++++
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d4959c036dd1..dc56cded2d95 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -407,8 +407,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-                           bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+                               ram_addr_t *ram_addr, bool *read_only)
 {
     MemoryRegion *mr;
     hwaddr xlat;
@@ -439,8 +439,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
         return false;
     }
 
-    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    *read_only = !writable || mr->readonly;
+    if (vaddr) {
+        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    }
+
+    if (ram_addr) {
+        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
+    }
+
+    if (read_only) {
+        *read_only = !writable || mr->readonly;
+    }
 
     return true;
 }
@@ -450,7 +459,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
-    bool read_only;
     void *vaddr;
     int ret;
 
@@ -466,7 +474,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        ram_addr_t ram_addr;
+        bool read_only;
+
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
             goto out;
         }
         /*
@@ -484,8 +495,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
                          container, iova,
                          iotlb->addr_mask + 1, vaddr, ret);
+        } else {
+            VFIOIovaRange *iova_range;
+
+            iova_range = g_malloc0(sizeof(*iova_range));
+            iova_range->iova = iova;
+            iova_range->size = iotlb->addr_mask + 1;
+            iova_range->ram_addr = ram_addr;
+
+            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
         }
     } else {
+        VFIOIovaRange *iova_range, *tmp;
+
+        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
+            if (iova_range->iova >= iova &&
+                iova_range->iova + iova_range->size <= iova +
+                                                       iotlb->addr_mask + 1) {
+                QLIST_REMOVE(iova_range, next);
+                g_free(iova_range);
+            }
+        }
+
         ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
@@ -642,6 +673,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
             g_free(giommu);
             goto fail;
         }
+        QLIST_INIT(&giommu->iova_list);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
         memory_region_iommu_replay(giommu->iommu, &giommu->n);
 
@@ -740,6 +772,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (MEMORY_REGION(giommu->iommu) == section->mr &&
                 giommu->n.start == section->offset_within_region) {
+                VFIOIovaRange *iova_range, *tmp;
+
+                QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
+                    QLIST_REMOVE(iova_range, next);
+                    g_free(iova_range);
+                }
+
                 memory_region_unregister_iommu_notifier(section->mr,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
@@ -1541,6 +1580,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
         QLIST_REMOVE(container, next);
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+            VFIOIovaRange *iova_range, *itmp;
+
+            QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, itmp) {
+                QLIST_REMOVE(iova_range, next);
+                g_free(iova_range);
+            }
+
             memory_region_unregister_iommu_notifier(
                     MEMORY_REGION(giommu->iommu), &giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0a1651eda2d0..aa7524fe2cc5 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -89,11 +89,19 @@ typedef struct VFIOContainer {
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
+typedef struct VFIOIovaRange {
+    hwaddr iova;
+    size_t size;
+    ram_addr_t ram_addr;
+    QLIST_ENTRY(VFIOIovaRange) next;
+} VFIOIovaRange;
+
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     IOMMUMemoryRegion *iommu;
     hwaddr iommu_offset;
     IOMMUNotifier n;
+    QLIST_HEAD(, VFIOIovaRange) iova_list;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
-- 
2.7.0




* [PATCH v26 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (12 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

vfio_listener_log_sync gets the list of dirty pages from the container using
the VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and marks those pages dirty when all
devices are stopped and saving state.
Return early for the RAM block section of a mapped MMIO region.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c     | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   1 +
 2 files changed, 137 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index dc56cded2d95..c36f275951c4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -37,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -287,6 +289,33 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    MigrationState *ms = migrate_get_current();
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -851,9 +880,116 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr)
+{
+    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+    struct vfio_iommu_type1_dirty_bitmap_get *range;
+    uint64_t pages;
+    int ret;
+
+    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+    range->iova = iova;
+    range->size = size;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+    range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+    pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    range->bitmap.data = g_try_malloc0(range->bitmap.size);
+    if (!range->bitmap.data) {
+        ret = -ENOMEM;
+        goto err_out;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (ret) {
+        error_report("Failed to get dirty bitmap for iova: 0x%llx "
+                "size: 0x%llx err: %d",
+                range->iova, range->size, errno);
+        goto err_out;
+    }
+
+    cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+                                            ram_addr, pages);
+
+    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+                                range->bitmap.size, ram_addr);
+err_out:
+    g_free(range->bitmap.data);
+    g_free(dbitmap);
+
+    return ret;
+}
+
+static int vfio_sync_dirty_bitmap(VFIOContainer *container,
+                                  MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu = NULL;
+    ram_addr_t ram_addr;
+    uint64_t iova, size;
+    int ret = 0;
+
+    if (memory_region_is_iommu(section->mr)) {
+
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (MEMORY_REGION(giommu->iommu) == section->mr &&
+                giommu->n.start == section->offset_within_region) {
+                VFIOIovaRange *iova_range;
+
+                QLIST_FOREACH(iova_range, &giommu->iova_list, next) {
+                    ret = vfio_get_dirty_bitmap(container, iova_range->iova,
+                                        iova_range->size, iova_range->ram_addr);
+                    if (ret) {
+                        break;
+                    }
+                }
+                break;
+            }
+        }
+
+    } else {
+        iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+        size = int128_get64(section->size);
+
+        ram_addr = memory_region_get_ram_addr(section->mr) +
+                   section->offset_within_region + iova -
+                   TARGET_PAGE_ALIGN(section->offset_within_address_space);
+
+        ret = vfio_get_dirty_bitmap(container, iova, size, ram_addr);
+    }
+
+    return ret;
+}
+
+static void vfio_listerner_log_sync(MemoryListener *listener,
+        MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+    if (vfio_listener_skipped_section(section)) {
+        return;
+    }
+
+    if (vfio_devices_all_stopped_and_saving(container)) {
+        vfio_sync_dirty_bitmap(container, section);
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listerner_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 94ba4696f0c6..6140bb726343 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -163,3 +163,4 @@ vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-- 
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v26 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (13 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-22 23:24 ` [PATCH v26 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

With vIOMMU, an IO virtual address range can get unmapped during the
pre-copy phase of migration. In that case, the unmap ioctl should return
the pages pinned in that range, and QEMU should find their corresponding
guest physical addresses and report those dirty.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---

Note: Comments from v25 for this patch are not addressed in this series.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714646.html
Need to investigate more on the points raised in the previous version.

 hw/vfio/common.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c36f275951c4..7eeaa368187a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -315,11 +315,88 @@ static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
     return true;
 }
 
+static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    MigrationState *ms = migrate_get_current();
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+    int ret;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+
+    bitmap->pgsize = TARGET_PAGE_SIZE;
+    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                   BITS_PER_BYTE;
+
+    if (bitmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    bitmap->data = g_try_malloc0(bitmap->size);
+    if (!bitmap->data) {
+        ret = -ENOMEM;
+        goto unmap_exit;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+                iotlb->translated_addr, pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+    }
+
+    g_free(bitmap->data);
+unmap_exit:
+    g_free(unmap);
+    return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -328,6 +405,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    if (iotlb && container->dirty_pages_supported &&
+        vfio_devices_all_running_and_saving(container)) {
+        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -375,7 +457,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -546,7 +628,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
             }
         }
 
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -857,7 +939,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 
     if (try_unmap) {
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-- 
2.7.0




* [PATCH v26 16/17] vfio: Make vfio-pci device migration capable
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (14 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-25 12:17   ` Cornelia Huck
  2020-09-22 23:24 ` [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  2020-09-23  7:06 ` [PATCH QEMU v25 00/17] Add migration support for VFIO devices Zenghui Yu
  17 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

If the device is not a failover primary device, call
vfio_migration_probe() to enable migration support for devices that
support it, and vfio_migration_finalize() to tear it down again.
Removed the vfio_pci_vmstate structure.
Removed the migration blocker from the VFIO PCI device-specific
structure and use the migration blocker from the generic VFIO device
structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/pci.c | 28 ++++++++--------------------
 hw/vfio/pci.h |  1 -
 2 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9968cc553391..2418a448ebca 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2872,17 +2872,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
-    if (!pdev->failover_pair_id) {
-        error_setg(&vdev->migration_blocker,
-                "VFIO device doesn't support migration");
-        ret = migrate_add_blocker(vdev->migration_blocker, errp);
-        if (ret) {
-            error_free(vdev->migration_blocker);
-            vdev->migration_blocker = NULL;
-            return;
-        }
-    }
-
     vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
@@ -3152,6 +3141,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Migration disabled", vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3166,11 +3162,6 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-        vdev->migration_blocker = NULL;
-    }
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3182,10 +3173,6 @@ static void vfio_instance_finalize(Object *obj)
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-    }
     /*
      * XXX Leaking igd_opregion is not an oversight, we can't remove the
      * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3213,6 +3200,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9f46af7e153f..0e3782b8e38a 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -173,7 +173,6 @@ struct VFIOPCIDevice {
     bool no_vfio_ioeventfd;
     bool enable_ramfb;
     VFIODisplay *dpy;
-    Error *migration_blocker;
     Notifier irqchip_change_notifier;
 };
 
-- 
2.7.0




* [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (15 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-09-22 23:24 ` Kirti Wankhede
  2020-09-24 15:14   ` Eric Blake
                     ` (2 more replies)
  2020-09-23  7:06 ` [PATCH QEMU v25 00/17] Add migration support for VFIO devices Zenghui Yu
  17 siblings, 3 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-09-22 23:24 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added the number of bytes transferred to the target VM by all VFIO
devices to the migration statistics.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---

Note: Comments from v25 for this patch are not addressed yet.
https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html

Alex, need more pointers on the documentation part raised by Markus Armbruster.


 hw/vfio/common.c            | 20 ++++++++++++++++++++
 hw/vfio/migration.c         | 10 ++++++++++
 include/qemu/vfio-helpers.h |  3 +++
 migration/migration.c       | 14 ++++++++++++++
 monitor/hmp-cmds.c          |  6 ++++++
 qapi/migration.json         | 17 +++++++++++++++++
 6 files changed, 70 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7eeaa368187a..286cdaac8674 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "qemu/vfio-helpers.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+bool vfio_mig_active(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->migration_blocker) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
 {
     VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 822b68b4e015..c4226fa8b183 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -28,6 +28,7 @@
 #include "pci.h"
 #include "trace.h"
 #include "hw/hw.h"
+#include "qemu/vfio-helpers.h"
 
 /*
  * Flags used as delimiter:
@@ -40,6 +41,8 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+static int64_t bytes_transferred;
+
 static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
                                   off_t off, bool iswrite)
 {
@@ -289,6 +292,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
         *size = data_size;
     }
 
+    bytes_transferred += data_size;
     return ret;
 }
 
@@ -770,6 +774,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
         }
 
         vfio_set_dirty_page_tracking(vbasedev, false);
+        bytes_transferred = 0;
     }
 }
 
@@ -820,6 +825,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
 /* ---------------------------------------------------------------------- */
 
+int64_t vfio_mig_bytes_transferred(void)
+{
+    return bytes_transferred;
+}
+
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
     VFIOContainer *container = vbasedev->group->container;
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 1f057c2b9e40..26a7df0767b1 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
 int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
                            int irq_type, Error **errp);
 
+bool vfio_mig_active(void);
+int64_t vfio_mig_bytes_transferred(void);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 58a5452471f9..b204bb1f6cd9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -56,6 +56,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "qemu/vfio-helpers.h"
 
 #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
 
@@ -996,6 +997,17 @@ static void populate_disk_info(MigrationInfo *info)
     }
 }
 
+static void populate_vfio_info(MigrationInfo *info)
+{
+#ifdef CONFIG_LINUX
+    if (vfio_mig_active()) {
+        info->has_vfio = true;
+        info->vfio = g_malloc0(sizeof(*info->vfio));
+        info->vfio->transferred = vfio_mig_bytes_transferred();
+    }
+#endif
+}
+
 static void fill_source_migration_info(MigrationInfo *info)
 {
     MigrationState *s = migrate_get_current();
@@ -1020,6 +1032,7 @@ static void fill_source_migration_info(MigrationInfo *info)
         populate_time_info(info, s);
         populate_ram_info(info, s);
         populate_disk_info(info);
+        populate_vfio_info(info);
         break;
     case MIGRATION_STATUS_COLO:
         info->has_status = true;
@@ -1028,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_COMPLETED:
         populate_time_info(info, s);
         populate_ram_info(info, s);
+        populate_vfio_info(info);
         break;
     case MIGRATION_STATUS_FAILED:
         info->has_status = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 7711726fd222..40d60d6a6651 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
         }
         monitor_printf(mon, "]\n");
     }
+
+    if (info->has_vfio) {
+        monitor_printf(mon, "vfio device transferred: %" PRIu64 " kbytes\n",
+                       info->vfio->transferred >> 10);
+    }
+
     qapi_free_MigrationInfo(info);
 }
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 675f70bb6734..3535977123d3 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -147,6 +147,18 @@
             'active', 'postcopy-active', 'postcopy-paused',
             'postcopy-recover', 'completed', 'failed', 'colo',
             'pre-switchover', 'device', 'wait-unplug' ] }
+##
+# @VfioStats:
+#
+# Detailed VFIO devices migration statistics
+#
+# @transferred: amount of bytes transferred to the target VM by VFIO devices
+#
+# Since: 5.1
+#
+##
+{ 'struct': 'VfioStats',
+  'data': {'transferred': 'int' } }
 
 ##
 # @MigrationInfo:
@@ -208,11 +220,16 @@
 #
 # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
 #
+# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
+#        only returned if VFIO device is present, migration is supported by all
+#         VFIO devices and status is 'active' or 'completed' (since 5.1)
+#
 # Since: 0.14.0
 ##
 { 'struct': 'MigrationInfo',
   'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
            '*disk': 'MigrationStats',
+           '*vfio': 'VfioStats',
            '*xbzrle-cache': 'XBZRLECacheStats',
            '*total-time': 'int',
            '*expected-downtime': 'int',
-- 
2.7.0




* Re: [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-09-22 23:24 ` [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-09-23  6:38   ` Zenghui Yu
  2020-09-24 22:49   ` Alex Williamson
  1 sibling, 0 replies; 73+ messages in thread
From: Zenghui Yu @ 2020-09-23  6:38 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk, pasic,
	felipe, wanghaibin.wang, Ken.Xue, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, quintela, zhi.a.wang, jonathan.davies,
	pbonzini

Hi Kirti,

A few trivial comments from the first read through.

On 2020/9/23 7:24, Kirti Wankhede wrote:
> These functions save and restore PCI device specific data - config
> space of PCI device.
> Used VMStateDescription to save and restore interrupt state.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>   hw/vfio/pci.c                 | 134 ++++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.h                 |   1 +
>   include/hw/vfio/vfio-common.h |   2 +
>   3 files changed, 137 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bffd5bfe3b78..9968cc553391 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>   #include "trace.h"
>   #include "qapi/error.h"
>   #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>   
>   #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>   
> @@ -2401,11 +2402,142 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>       return OBJECT(vdev);
>   }
>   
> +static int vfio_get_pci_irq_state(QEMUFile *f, void *pv, size_t size,
> +                             const VMStateField *field)
> +{
> +    VFIOPCIDevice *vdev = container_of(pv, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset;

Maybe rename it to 'control' to match the PCI term?

> +
> +        msix_load(pdev, f);
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +    }
> +    return 0;
> +}
> +
> +static int vfio_put_pci_irq_state(QEMUFile *f, void *pv, size_t size,
> +                             const VMStateField *field, QJSON *vmdesc)
> +{
> +    VFIOPCIDevice *vdev = container_of(pv, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> +        msix_save(pdev, f);
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateInfo vmstate_info_vfio_pci_irq_state = {
> +    .name = "VFIO PCI irq state",
> +    .get  = vfio_get_pci_irq_state,
> +    .put  = vfio_put_pci_irq_state,
> +};
> +
> +const VMStateDescription vmstate_vfio_pci_config = {
> +    .name = "VFIOPCIDevice",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_INT32_POSITIVE_LE(version_id, VFIOPCIDevice),
> +        VMSTATE_BUFFER_UNSAFE_INFO(interrupt, VFIOPCIDevice, 1,
> +                                   vmstate_info_vfio_pci_irq_state,
> +                                   sizeof(int32_t)),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +

Two blank lines.

> +    pci_device_save(pdev, f);
> +    vmstate_save_state(f, &vmstate_vfio_pci_config, vbasedev, NULL);
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int ret, i;
> +
> +    ret = pci_device_load(pdev, f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* retore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = pci_default_read_config(pdev,
> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +
> +    ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vbasedev,
> +                             vdev->version_id);
> +
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return ret;
> +}
> +
>   static VFIODeviceOps vfio_pci_ops = {
>       .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>       .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>       .vfio_eoi = vfio_intx_eoi,
>       .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>   };
>   
>   int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> @@ -2755,6 +2887,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>       vdev->vbasedev.ops = &vfio_pci_ops;
>       vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
>       vdev->vbasedev.dev = DEVICE(vdev);
> +    vdev->vbasedev.device_state = 0;

This shouldn't belong to this patch.


Thanks,
Zenghui



* Re: [PATCH QEMU v25 00/17] Add migration support for VFIO devices
  2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (16 preceding siblings ...)
  2020-09-22 23:24 ` [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
@ 2020-09-23  7:06 ` Zenghui Yu
  17 siblings, 0 replies; 73+ messages in thread
From: Zenghui Yu @ 2020-09-23  7:06 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk, pasic,
	felipe, wanghaibin.wang, Ken.Xue, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, quintela, zhi.a.wang, jonathan.davies,
	pbonzini

On 2020/9/23 7:24, Kirti Wankhede wrote:

> Patch 4-9:
> - Generic migration functionality for VFIO device.
>   * This patch set adds functionality for PCI devices, but can be
>     extended to other VFIO devices.
>   * Added all the basic functions required for pre-copy, stop-and-copy and
>     resume phases of migration.
>   * Added state change notifier and from that notifier function, VFIO
>     device's state changed is conveyed to VFIO device driver.
>   * During save setup phase and resume/load setup phase, migration region
>     is queried and is used to read/write VFIO device data.
>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>     functionality of iteration during pre-copy phase.
>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>     iteration to read data from VFIO device driver is implemented till pending
>     bytes returned by driver are not zero.

s/are not zero/are zero/ ?

[...]

> Live migration resume path:
>     Incoming migration calls .load_setup for each device
>     (RESTORE_VM, _ACTIVE, STOPPED)

The _RESUMING device state is missed here?

>                         |
>     For each device, .load_state is called for that device section data
>                         |
>     At the end, called .load_cleanup for each device and vCPUs are started.
>                         |
>         (RUNNING, _NONE, _RUNNING)


Thanks,
Zenghui


^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-09-22 23:24 ` [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-09-23 11:42   ` Wang, Zhi A
  2020-10-21 14:30     ` Kirti Wankhede
  2020-09-25 21:02   ` Alex Williamson
  1 sibling, 1 reply; 73+ messages in thread
From: Wang, Zhi A @ 2020-09-23 11:42 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, Liu, Yi L, quintela, Yang, Ziye, armbru, mlevitsk, pasic,
	felipe, Ken.Xue, Tian, Kevin, Zhao,  Yan Y, dgilbert, Liu,
	Changpeng, eskultet, jonathan.davies, pbonzini

I hit a problem when trying this patch. It happens when a device doesn't report any pending bytes in the iteration stage, i.e. the device has no iteration phase at all. QEMU on the destination machine then complains about running out of memory. After some investigation, it seems the vendor-specific bit stream is incomplete: QEMU on the destination wrongly takes a signature as the size of the section and fails to allocate the memory. Not sure if others have met the same problem.

I solved this problem with the following fix; the QEMU version I am using is v5.0.0.0.

commit 13a80adc2cdddd48d76acf6a5dd715bcbf42b577
Author: Zhi Wang <zhi.wang.linux@gmail.com>
Date:   Tue Sep 15 15:58:45 2020 +0300

    fix
    
    Signed-off-by: Zhi Wang <zhi.wang.linux@gmail.com>

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 09eec9c..e741319 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -453,10 +458,12 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
             return ret;
         }
 
-        if (migration->pending_bytes == 0) {
-            /* indicates data finished, goto complete phase */
-            return 1;
-        }
+	if (migration->pending_bytes == 0) {
+		/* indicates data finished, goto complete phase */
+		qemu_put_be64(f, 0);
+		qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+		return 1;
+	}
     }
 
     data_size = vfio_save_buffer(f, vbasedev); 
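
The magnitude of the misparse is easy to see. A rough sketch (the flag value is copied from this series; the helper name is mine): without the trailing { 0, END_OF_STATE } pair, the destination reads the next 64-bit flag where it expects a section size, which is an allocation request of roughly 16 EiB, hence the out-of-memory report.

```c
#include <assert.h>
#include <stdint.h>

/* Flag value copied from patch 07/17 of this series. */
#define VFIO_MIG_FLAG_END_OF_STATE 0xffffffffef100001ULL

/*
 * If the source omits the trailing { 0, END_OF_STATE } pair, the
 * destination's loader consumes the flag itself as a big-endian
 * section size and tries to allocate it.
 */
static uint64_t misread_section_size(void)
{
    return VFIO_MIG_FLAG_END_OF_STATE; /* taken as a be64 size */
}
```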

-----Original Message-----
From: Kirti Wankhede <kwankhede@nvidia.com> 
Sent: Wednesday, September 23, 2020 2:24 AM
To: alex.williamson@redhat.com; cjia@nvidia.com
Cc: Tian, Kevin <kevin.tian@intel.com>; Yang, Ziye <ziye.yang@intel.com>; Liu, Changpeng <changpeng.liu@intel.com>; Liu, Yi L <yi.l.liu@intel.com>; mlevitsk@redhat.com; eskultet@redhat.com; cohuck@redhat.com; dgilbert@redhat.com; jonathan.davies@nutanix.com; eauger@redhat.com; aik@ozlabs.ru; pasic@linux.ibm.com; felipe@nutanix.com; Zhengxiao.zx@Alibaba-inc.com; shuangtai.tst@alibaba-inc.com; Ken.Xue@amd.com; Wang, Zhi A <zhi.a.wang@intel.com>; Zhao, Yan Y <yan.y.zhao@intel.com>; pbonzini@redhat.com; quintela@redhat.com; eblake@redhat.com; armbru@redhat.com; peterx@redhat.com; qemu-devel@nongnu.org; Kirti Wankhede <kwankhede@nvidia.com>
Subject: [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers

Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy functions. These functions handle the pre-copy and stop-and-copy phases.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through below steps.
- read data_offset - indicates kernel driver to write data to staging
  buffer.
- read data_size - amount of data in bytes written by vendor driver in
  migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write data packet to file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data, VFIO_MIG_FLAG_END_OF_STATE }

In _SAVING device state or stop-and-copy phase
a. read config space of device and save to migration file stream. This
   doesn't need to be from vendor driver. Any other special config state
   from driver can be saved as data in following iteration.
b. read pending_bytes. If pending_bytes > 0, go through below steps.
c. read data_offset - indicates kernel driver to write data to staging
   buffer.
d. read data_size - amount of data in bytes written by vendor driver in
   migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read data_size bytes of data from data_offset before moving to the next steps.
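
For illustration, the packet framing described above can be modeled with a toy in-memory stand-in (DemoFile replaces QEMUFile, the helper names are invented; only the flag values are copied from this series):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Delimiter flags copied from patch 07/17 of this series. */
#define VFIO_MIG_FLAG_END_OF_STATE    0xffffffffef100001ULL
#define VFIO_MIG_FLAG_DEV_DATA_STATE  0xffffffffef100004ULL

/* Tiny in-memory stand-in for QEMUFile and qemu_put_be64()/qemu_get_be64(). */
typedef struct {
    uint8_t buf[256];
    size_t pos;
} DemoFile;

static void put_be64(DemoFile *f, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        f->buf[f->pos++] = (uint8_t)(v >> (i * 8));
    }
}

static uint64_t get_be64(DemoFile *f)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | f->buf[f->pos++];
    }
    return v;
}

/* Frame one device-data packet the way the pre-copy loop does:
 * { DEV_DATA_STATE, data_size, <data bytes>, END_OF_STATE }. */
static void save_packet(DemoFile *f, const uint8_t *data, uint64_t size)
{
    put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
    put_be64(f, size);
    memcpy(f->buf + f->pos, data, size);
    f->pos += size;
    put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
}

/* Round-trip check: returns 0 when the load side sees well-formed framing. */
static int demo_roundtrip(void)
{
    DemoFile f = { .pos = 0 };
    uint8_t out[8];
    uint64_t size;

    save_packet(&f, (const uint8_t *)"abcd", 4);

    f.pos = 0;
    if (get_be64(&f) != VFIO_MIG_FLAG_DEV_DATA_STATE) {
        return -1;
    }
    size = get_be64(&f);
    memcpy(out, f.buf + f.pos, size);
    f.pos += size;
    if (get_be64(&f) != VFIO_MIG_FLAG_END_OF_STATE) {
        return -1;
    }
    return memcmp(out, "abcd", 4) == 0 ? 0 : -1;
}
```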

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 273 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   6 +
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 280 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8e8adaa25779..4611bb972228 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -180,6 +180,154 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
+                                   uint64_t data_size, uint64_t *size) 
+{
+    void *ptr = NULL;
+    uint64_t limit = 0;
+    int i;
+
+    if (!region->mmaps) {
+        if (size) {
+            *size = data_size;
+        }
+        return ptr;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        VFIOMmap *map = region->mmaps + i;
+
+        if ((data_offset >= map->offset) &&
+            (data_offset < map->offset + map->size)) {
+
+            /* check if data_offset is within sparse mmap areas */
+            ptr = map->mmap + data_offset - map->offset;
+            if (size) {
+                *size = MIN(data_size, map->offset + map->size - data_offset);
+            }
+            break;
+        } else if ((data_offset < map->offset) &&
+                   (!limit || limit > map->offset)) {
+            /*
+             * data_offset is not within sparse mmap areas, find size of
+             * non-mapped area. Check through all list since region->mmaps list
+             * is not sorted.
+             */
+            limit = map->offset;
+        }
+    }
+
+    if (!ptr && size) {
+        *size = limit ? limit - data_offset : data_size;
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t data_offset = 0, data_size = 0, sz;
+    int ret;
+
+    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret < 0) {
+        return ret;
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    qemu_put_be64(f, data_size);
+    sz = data_size;
+
+    while (sz) {
+        void *buf = NULL;
+        uint64_t sec_size;
+        bool buf_allocated = false;
+
+        buf = get_data_section_size(region, data_offset, sz, &sec_size);
+
+        if (!buf) {
+            buf = g_try_malloc(sec_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                return -ENOMEM;
+            }
+            buf_allocated = true;
+
+            ret = vfio_mig_read(vbasedev, buf, sec_size,
+                                region->fd_offset + data_offset);
+            if (ret < 0) {
+                g_free(buf);
+                return ret;
+            }
+        }
+
+        qemu_put_buffer(f, buf, sec_size);
+
+        if (buf_allocated) {
+            g_free(buf);
+        }
+        sz -= sec_size;
+        data_offset += sec_size;
+    }
+
+    ret = qemu_file_get_error(f);
+
+    if (!ret && size) {
+        *size = data_size;
+    }
+
+    return ret;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if (ret < 0) {
+        migration->pending_bytes = 0;
+        return ret;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -232,9 +380,134 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t data_size;
+    int ret;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    if (migration->pending_bytes == 0) {
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+
+        if (migration->pending_bytes == 0) {
+            /* indicates data finished, goto complete phase */
+            return 1;
+        }
+    }
+
+    ret = vfio_save_buffer(f, vbasedev, &data_size);
+
+    if (ret) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(errno));
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    uint64_t data_size;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev, &data_size);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        }
+
+        if (data_size == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 982d8dccb219..118b5547c921 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,3 +154,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 49c7c7a0e29a..471e444a364c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -60,6 +60,7 @@ typedef struct VFIORegion {
 
 typedef struct VFIOMigration {
     VFIORegion region;
+    uint64_t pending_bytes;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
--
2.7.0



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 04/17] vfio: Add migration region initialization and finalize function
  2020-09-22 23:24 ` [PATCH v26 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-09-24 14:08   ` Cornelia Huck
  2020-10-17 20:14     ` Kirti Wankhede
  2020-09-25 20:20   ` Alex Williamson
  1 sibling, 1 reply; 73+ messages in thread
From: Cornelia Huck @ 2020-09-24 14:08 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Whether the VFIO device supports migration or not is decided based of
> migration region query. If migration region query is successful and migration
> region initialization is successful then migration is supported else
> migration is blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   5 ++
>  include/hw/vfio/vfio-common.h |   9 +++
>  4 files changed, 157 insertions(+)
>  create mode 100644 hw/vfio/migration.c

(...)

> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
> +                            "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));

Using strerror on a hardcoded error value is probably not terribly
helpful. I think printing either region.size (if you plan to extend
this check later) or something like "Invalid zero-sized VFIO migration
region" would make more sense.

> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}

(...)

Apart from that, looks good to me.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-09-22 23:24 ` [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-09-24 15:02   ` Cornelia Huck
  2020-09-29 11:03     ` Dr. David Alan Gilbert
  2020-09-25 20:20   ` Alex Williamson
  1 sibling, 1 reply; 73+ messages in thread
From: Cornelia Huck @ 2020-09-24 15:02 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:07 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +-
>  include/hw/vfio/vfio-common.h |   4 ++
>  3 files changed, 142 insertions(+), 1 deletion(-)
> 

(...)

> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> +                                    uint32_t value)

I think I've mentioned that before, but this function could really
benefit from a comment what mask and value mean. 

> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    off_t dev_state_off = region->fd_offset +
> +                      offsetof(struct vfio_device_migration_info, device_state);
> +    uint32_t device_state;
> +    int ret;
> +
> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> +                        dev_state_off);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    device_state = (device_state & mask) | value;
> +
> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> +                         dev_state_off);
> +    if (ret < 0) {
> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> +                          dev_state_off);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> +            hw_error("%s: Device is in error state 0x%x",
> +                     vbasedev->name, device_state);
> +            return -EFAULT;

Is -EFAULT a good return value here? Maybe -EIO?

> +        }
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t value = 0, mask = 0;
> +
> +        if (running) {
> +            value = VFIO_DEVICE_STATE_RUNNING;
> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> +                mask = ~VFIO_DEVICE_STATE_RESUMING;

I've been staring at this for some time and I think that the desired
result is
- set _RUNNING
- if _RESUMING was set, clear it, but leave the other bits intact
- if _RESUMING was not set, clear everything previously set
This would really benefit from a comment (or am I the only one
struggling here?)
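
To spell out my reading (illustrative only; the helper name is mine, and the bit values are restated from the v1 migration uAPI in <linux/vfio.h>), the mask/value semantics boil down to:

```c
#include <assert.h>
#include <stdint.h>

/* Bit values restated from the v1 migration uAPI in <linux/vfio.h>. */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)

/*
 * Core of vfio_migration_set_state(): keep only the bits selected by
 * @mask, then OR in @value.  A mask of 0 therefore discards all
 * previously set bits; a mask of ~bit clears just that bit.
 */
static uint32_t next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}
```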

> +            }
> +        } else {
> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> +        if (ret) {
> +            /*
> +             * vm_state_notify() doesn't support reporting failure. If such
> +             * error reporting support added in furure, migration should be
> +             * aborted.


"We should abort the migration in this case, but vm_state_notify()
currently does not support reporting failures."

?

Can/should we mark the failing device in some way?

> +             */
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, value & mask);
> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  value & mask);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {

(...)



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-09-22 23:24 ` [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
@ 2020-09-24 15:14   ` Eric Blake
  2020-09-25 22:55   ` Alex Williamson
  2020-09-29 10:40   ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-09-24 15:14 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On 9/22/20 6:24 PM, Kirti Wankhede wrote:
> Added amount of bytes transferred to the target VM by all VFIO devices
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
> 
> Note: Comments from v25 for this patch are not addressed yet.
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html
> 
> Alex, need more pointer on documentation part raised Markus Armbruster.
> 
> 

> +++ b/qapi/migration.json
> @@ -147,6 +147,18 @@
>               'active', 'postcopy-active', 'postcopy-paused',
>               'postcopy-recover', 'completed', 'failed', 'colo',
>               'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1

This should be Since: 5.2

> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'transferred': 'int' } }
>   
>   ##
>   # @MigrationInfo:
> @@ -208,11 +220,16 @@
>   #
>   # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>   #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)

and here

> +#
>   # Since: 0.14.0
>   ##
>   { 'struct': 'MigrationInfo',
>     'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>              '*disk': 'MigrationStats',
> +           '*vfio': 'VfioStats',
>              '*xbzrle-cache': 'XBZRLECacheStats',
>              '*total-time': 'int',
>              '*expected-downtime': 'int',
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-22 23:24 ` [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-09-24 15:15   ` Philippe Mathieu-Daudé
  2020-09-29 10:19     ` Dr. David Alan Gilbert
  2020-09-25 11:53   ` Cornelia Huck
  2020-09-25 20:20   ` Alex Williamson
  2 siblings, 1 reply; 73+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-09-24 15:15 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk, pasic,
	felipe, Ken.Xue, kevin.tian, yan.y.zhao, dgilbert, changpeng.liu,
	quintela, zhi.a.wang, jonathan.davies, pbonzini

On 9/23/20 1:24 AM, Kirti Wankhede wrote:
> Define flags to be used as delimeter in migration file stream.

Typo "delimiter".

> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f650fe9fc3c8..8e8adaa25779 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -25,6 +28,17 @@
>  #include "trace.h"
>  #include "hw/hw.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            error_report("%s: Falling back to slow path", vbasedev->name);
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
>      int ret = -EINVAL;
> +    char id[256] = "";
> +    Object *obj;
>  
>      if (!vbasedev->ops->vfio_get_object) {
>          return ret;
> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +
> +    if (obj) {
> +        DeviceState *dev = DEVICE(obj);
> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> +
> +        if (oid) {
> +            pstrcpy(id, sizeof(id), oid);
> +            pstrcat(id, sizeof(id), "/");
> +            g_free(oid);
> +        }
> +    }
> +    pstrcat(id, sizeof(id), "vfio");

Alternatively (easier to review, matter of taste):

 g_autofree char *path = NULL;

 if (oid) {
   path = g_strdup_printf("%s/vfio",
                          vmstate_if_get_id(VMSTATE_IF(obj)));
 } else {
   path = g_strdup("vfio");
 }
 strpadcpy(id, sizeof(id), path, '\0');

> +
> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> +                         vbasedev);
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bcb3fa7314d7..982d8dccb219 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> +vfio_save_setup(const char *name) " (%s)"
> +vfio_save_cleanup(const char *name) " (%s)"
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-09-22 23:24 ` [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
  2020-09-23  6:38   ` Zenghui Yu
@ 2020-09-24 22:49   ` Alex Williamson
  2020-10-21  9:30     ` Zenghui Yu
  1 sibling, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-24 22:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:05 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> These functions save and restore PCI device-specific data, i.e. the
> config space of the PCI device.
> A VMStateDescription is used to save and restore the interrupt state.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 134 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h                 |   1 +
>  include/hw/vfio/vfio-common.h |   2 +
>  3 files changed, 137 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bffd5bfe3b78..9968cc553391 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>  
> @@ -2401,11 +2402,142 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static int vfio_get_pci_irq_state(QEMUFile *f, void *pv, size_t size,
> +                             const VMStateField *field)
> +{
> +    VFIOPCIDevice *vdev = container_of(pv, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        /* restore MSI configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
> +
> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);

Aside from the flags register which includes the enable bit, is there
any purpose to reading the other registers from emulated config space
and writing them back through vfio?

> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        msix_load(pdev, f);
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +    }
> +    return 0;


It seems this could be simplified down to:

if (msi_enabled(pdev)) {
    vfio_msi_enable(vdev);
} else if (msix_enabled(pdev)) {
    msix_load(pdev, f);
    vfio_msix_enable(vdev);
}

But that sort of begs the question whether both MSI and MSI-X should be
handled via subsections, where MSI-X could make use of VMSTATE_MSIX and
a post_load callback for each would test to see if the capability is
enabled and call the appropriate vfio_msi{x}_enable() function.  That
would also make it a lot more clear how additional capabilities with
QEMU emulation state would be handled in the future.

> +}
> +
> +static int vfio_put_pci_irq_state(QEMUFile *f, void *pv, size_t size,
> +                             const VMStateField *field, QJSON *vmdesc)
> +{
> +    VFIOPCIDevice *vdev = container_of(pv, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    qemu_put_be32(f, vdev->interrupt);

As above, it seems that vdev->interrupt can be inferred by looking at
config space.

> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> +        msix_save(pdev, f);
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateInfo vmstate_info_vfio_pci_irq_state = {
> +    .name = "VFIO PCI irq state",
> +    .get  = vfio_get_pci_irq_state,
> +    .put  = vfio_put_pci_irq_state,
> +};
> +
> +const VMStateDescription vmstate_vfio_pci_config = {
> +    .name = "VFIOPCIDevice",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_INT32_POSITIVE_LE(version_id, VFIOPCIDevice),
> +        VMSTATE_BUFFER_UNSAFE_INFO(interrupt, VFIOPCIDevice, 1,
> +                                   vmstate_info_vfio_pci_irq_state,
> +                                   sizeof(int32_t)),

Seems like we're copying vmstate_pci_device here rather than using
VMSTATE_PCI_DEVICE, shouldn't we be using the latter instead?

> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +
> +    pci_device_save(pdev, f);
> +    vmstate_save_state(f, &vmstate_vfio_pci_config, vbasedev, NULL);
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint16_t pci_cmd;
> +    int ret, i;
> +
> +    ret = pci_device_load(pdev, f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* restore PCI BAR configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = pci_default_read_config(pdev,
> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }

Is the intention here to trigger the sub-page support?  If so we should
have a comment because otherwise there's no reason to write it back,
right?  Another option might be to simply call the sub-page update
directly.

> +
> +    ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vbasedev,
> +                             vdev->version_id);
> +
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return ret;
> +}
> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> @@ -2755,6 +2887,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      vdev->vbasedev.ops = &vfio_pci_ops;
>      vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
>      vdev->vbasedev.dev = DEVICE(vdev);
> +    vdev->vbasedev.device_state = 0;

Why is this here?

> +    vdev->version_id = 1;

I'm not sure how this is meant to work or if it's even necessary if we
use VMSTATE_PCI_DEVICE and infer the interrupt configuration from
config space.

>  
>      tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
>      len = readlink(tmp, group_path, sizeof(group_path));
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index bce71a9ac93f..9f46af7e153f 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -156,6 +156,7 @@ struct VFIOPCIDevice {
>      uint32_t display_yres;
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    int32_t version_id;     /* Version id needed for VMState */
>      OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index fe99c36a693a..ba6169cd926e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {




* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-22 23:24 ` [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
  2020-09-24 15:15   ` Philippe Mathieu-Daudé
@ 2020-09-25 11:53   ` Cornelia Huck
  2020-10-18 20:55     ` Kirti Wankhede
  2020-09-25 20:20   ` Alex Williamson
  2 siblings, 1 reply; 73+ messages in thread
From: Cornelia Huck @ 2020-09-25 11:53 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Define flags to be used as delimiters in the migration file stream.
> Added .save_setup and .save_cleanup functions; the migration region is
> mapped and unmapped in these functions on the source side during the
> saving/pre-copy phase.
> The VFIO device state is set depending on the VM's state: during live
> migration the VM is running when .save_setup is called, so the
> _SAVING | _RUNNING state is set for the VFIO device; during save-restore
> the VM is paused, so only the _SAVING state is set.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 93 insertions(+)
> 

(...)

> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO

Where is this value coming from?

> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)

I think we need some more documentation on what these values mean and
how they are used. From reading ahead a bit, it seems there is always
supposed to be a pair of DEV_*_STATE and END_OF_STATE framing some kind
of data?
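If it helps, the bit layout the comment describes can be restated mechanically (the decoder helpers below are mine, purely to make the decomposition explicit; they are not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)

/* Upper 32 bits: all 1s; next 16 bits: the 0xef10 magic; low 16 bits: flag. */
static uint32_t mig_flag_magic_hi(uint64_t f) { return (uint32_t)(f >> 32); }
static uint16_t mig_flag_magic_lo(uint64_t f) { return (uint16_t)(f >> 16); }
static uint16_t mig_flag_bits(uint64_t f)     { return (uint16_t)f; }
```

So the markers all share the 0xffffffffef10 prefix and differ only in the low 16 flag bits, which is what makes them usable as stream delimiters.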

(...)




* Re: [PATCH v26 16/17] vfio: Make vfio-pci device migration capable
  2020-09-22 23:24 ` [PATCH v26 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-09-25 12:17   ` Cornelia Huck
  0 siblings, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-09-25 12:17 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:18 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> If the device is not a failover primary device, call
> vfio_migration_probe() to enable migration support for devices that
> support it, and vfio_migration_finalize() to tear it down again.
> Removed the vfio_pci_vmstate structure.
> Removed the migration blocker from the VFIO PCI device-specific
> structure and use the migration blocker from the generic VFIO device
> structure.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/pci.c | 28 ++++++++--------------------
>  hw/vfio/pci.h |  1 -
>  2 files changed, 8 insertions(+), 21 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>




* Re: [PATCH v26 04/17] vfio: Add migration region initialization and finalize function
  2020-09-22 23:24 ` [PATCH v26 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
  2020-09-24 14:08   ` Cornelia Huck
@ 2020-09-25 20:20   ` Alex Williamson
  2020-09-28  9:39     ` Cornelia Huck
  2020-10-17 20:17     ` Kirti Wankhede
  1 sibling, 2 replies; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 20:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Whether the VFIO device supports migration is decided based on the
> migration region query. If the migration region query and the migration
> region initialization are both successful, migration is supported;
> otherwise migration is blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   5 ++
>  include/hw/vfio/vfio-common.h |   9 +++
>  4 files changed, 157 insertions(+)
>  create mode 100644 hw/vfio/migration.c
> 
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 37efa74018bc..da9af297a0c5 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>  vfio_ss.add(files(
>    'common.c',
>    'spapr.c',
> +  'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>    'display.c',
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> new file mode 100644
> index 000000000000..2f760f1f9c47
> --- /dev/null
> +++ b/hw/vfio/migration.c
> @@ -0,0 +1,142 @@
> +/*
> + * Migration support for VFIO devices
> + *
> + * Copyright NVIDIA, Inc. 2020
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "cpu.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include "migration/register.h"
> +#include "migration/blocker.h"
> +#include "migration/misc.h"
> +#include "qapi/error.h"
> +#include "exec/ramlist.h"
> +#include "exec/ram_addr.h"
> +#include "pci.h"
> +#include "trace.h"
> +
> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (!migration) {
> +        return;
> +    }
> +
> +    if (migration->region.size) {
> +        vfio_region_exit(&migration->region);
> +        vfio_region_finalize(&migration->region);
> +    }
> +}
> +
> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;

Unnecessary initialization.

> +    int ret = -EINVAL;

return -EINVAL below, this doesn't need to be initialized, use it for
storing actual return values.

> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }

vfio_migration_init() tests whether the vbasedev->ops supports
vfio_get_object, then calls this, then calls vfio_get_object itself
(added in a later patch, with a strange inconsistency in failure modes).
Wouldn't it make more sense for vfio_migration_init() to pass the
Object since that function also needs it (eventually) and actually does
the existence test?

> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
> +                            "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }

If the caller were to pass obj, this is nothing more than a wrapper for
calling vfio_region_setup(), which suggests to me we might not even
need this as a separate function outside of vfio_migration_init().

> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +
> +static int vfio_migration_init(VFIODevice *vbasedev,
> +                               struct vfio_region_info *info)
> +{
> +    int ret = -EINVAL;
> +
> +    if (!vbasedev->ops->vfio_get_object) {
> +        return ret;
> +    }
> +
> +    vbasedev->migration = g_new0(VFIOMigration, 1);
> +
> +    ret = vfio_migration_region_init(vbasedev, info->index);
> +    if (ret) {
> +        error_report("%s: Failed to initialise migration region",
> +                     vbasedev->name);
> +        g_free(vbasedev->migration);
> +        vbasedev->migration = NULL;
> +    }
> +
> +    return ret;
> +}
> +
> +/* ---------------------------------------------------------------------- */
> +
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> +{
> +    struct vfio_region_info *info = NULL;

Not sure this initialization is strictly necessary either, but it also
seems to be a common convention for this function, so either way.

Connie, does vfio_ccw_get_region() leak this?  It appears to call
vfio_get_dev_region_info() and vfio_get_region_info() several times with
the same pointer without freeing it between uses.

Thanks,
Alex

> +    Error *local_err = NULL;
> +    int ret;
> +
> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    ret = vfio_migration_init(vbasedev, info);
> +    if (ret) {
> +        goto add_blocker;
> +    }
> +
> +    trace_vfio_migration_probe(vbasedev->name, info->index);
> +    g_free(info);
> +    return 0;
> +
> +add_blocker:
> +    error_setg(&vbasedev->migration_blocker,
> +               "VFIO device doesn't support migration");
> +    g_free(info);
> +
> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        error_free(vbasedev->migration_blocker);
> +        vbasedev->migration_blocker = NULL;
> +    }
> +    return ret;
> +}
> +
> +void vfio_migration_finalize(VFIODevice *vbasedev)
> +{
> +    if (vbasedev->migration_blocker) {
> +        migrate_del_blocker(vbasedev->migration_blocker);
> +        error_free(vbasedev->migration_blocker);
> +        vbasedev->migration_blocker = NULL;
> +    }
> +
> +    vfio_migration_region_exit(vbasedev);
> +    g_free(vbasedev->migration);
> +}
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index a0c7b49a2ebc..8fe913175d85 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -145,3 +145,8 @@ vfio_display_edid_link_up(void) ""
>  vfio_display_edid_link_down(void) ""
>  vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>  vfio_display_edid_write_error(void) ""
> +
> +
> +# migration.c
> +vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> +
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ba6169cd926e..8275c4c68f45 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -57,6 +57,10 @@ typedef struct VFIORegion {
>      uint8_t nr; /* cache the region number for debug */
>  } VFIORegion;
>  
> +typedef struct VFIOMigration {
> +    VFIORegion region;
> +} VFIOMigration;
> +
>  typedef struct VFIOAddressSpace {
>      AddressSpace *as;
>      QLIST_HEAD(, VFIOContainer) containers;
> @@ -113,6 +117,8 @@ typedef struct VFIODevice {
>      unsigned int num_irqs;
>      unsigned int num_regions;
>      unsigned int flags;
> +    VFIOMigration *migration;
> +    Error *migration_blocker;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> @@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>  int vfio_spapr_remove_window(VFIOContainer *container,
>                               hwaddr offset_within_address_space);
>  
> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> +void vfio_migration_finalize(VFIODevice *vbasedev);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-09-22 23:24 ` [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
  2020-09-24 15:02   ` Cornelia Huck
@ 2020-09-25 20:20   ` Alex Williamson
  2020-10-17 20:30     ` Kirti Wankhede
  1 sibling, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 20:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:07 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VM state change handler gets called on change in VM's state. This is used to set
> VFIO device state to _RUNNING.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +-
>  include/hw/vfio/vfio-common.h |   4 ++
>  3 files changed, 142 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 2f760f1f9c47..a30d628ba963 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -10,6 +10,7 @@
>  #include "qemu/osdep.h"
>  #include <linux/vfio.h>
>  
> +#include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> @@ -22,6 +23,58 @@
>  #include "exec/ram_addr.h"
>  #include "pci.h"
>  #include "trace.h"
> +#include "hw/hw.h"
> +
> +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
> +                                  off_t off, bool iswrite)
> +{
> +    int ret;
> +
> +    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> +                    pread(vbasedev->fd, val, count, off);
> +    if (ret < count) {
> +        error_report("vfio_mig_%s%d %s: failed at offset 0x%lx, err: %s",
> +                     iswrite ? "write" : "read", count * 8,
> +                     vbasedev->name, off, strerror(errno));

This would suggest from the log that there's, for example, a
vfio_mig_read8 function, which doesn't exist.

> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
> +                       off_t off, bool iswrite)
> +{
> +    int ret, done = 0;
> +    __u8 *tbuf = buf;
> +
> +    while (count) {
> +        int bytes = 0;
> +
> +        if (count >= 8 && !(off % 8)) {
> +            bytes = 8;
> +        } else if (count >= 4 && !(off % 4)) {
> +            bytes = 4;
> +        } else if (count >= 2 && !(off % 2)) {
> +            bytes = 2;
> +        } else {
> +            bytes = 1;
> +        }
> +
> +        ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        count -= bytes;
> +        done += bytes;
> +        off += bytes;
> +        tbuf += bytes;
> +    }
> +    return done;
> +}
> +
> +#define vfio_mig_read(f, v, c, o)       vfio_mig_rw(f, (__u8 *)v, c, o, false)
> +#define vfio_mig_write(f, v, c, o)      vfio_mig_rw(f, (__u8 *)v, c, o, true)
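As a side note, the chunking above always picks the widest naturally aligned access for the next transfer; that selection can be modeled standalone (the helper name below is mine, mirroring the size-selection ladder in vfio_mig_rw()):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Widest naturally aligned access size for the next transfer:
 * 8 bytes if at least 8 remain and the offset is 8-byte aligned,
 * else 4, else 2, else a single byte.
 */
static int next_access_size(size_t count, long off)
{
    if (count >= 8 && !(off % 8)) {
        return 8;
    } else if (count >= 4 && !(off % 4)) {
        return 4;
    } else if (count >= 2 && !(off % 2)) {
        return 2;
    }
    return 1;
}
```

For example, a 7-byte transfer starting at an odd offset degrades to a 1-byte access, then widens as the offset becomes aligned.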
>  
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
> @@ -70,6 +123,82 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> +                                    uint32_t value)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    off_t dev_state_off = region->fd_offset +
> +                      offsetof(struct vfio_device_migration_info, device_state);
> +    uint32_t device_state;
> +    int ret;
> +
> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> +                        dev_state_off);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    device_state = (device_state & mask) | value;

Agree with Connie that it's not immediately obvious how the mask and
value args are used.  I don't have a naming convention that would be
more clear, and the names do make some sense once they're understood,
but a comment indicating that mask bits are preserved, value bits are
set, and the remaining bits are cleared would probably help the reader.
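To spell out the semantics being asked for, the update amounts to the one-liner below (standalone form and helper name are mine; the example bit values follow the v1 migration uapi, where _RUNNING is 0x1, _SAVING is 0x2 and _RESUMING is 0x4):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Semantics of vfio_migration_set_state(vbasedev, mask, value):
 * bits set in 'mask' are preserved from the current state, bits in
 * 'value' are set, and everything else is cleared.
 */
static uint32_t apply_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}
```

E.g. stopping the VM uses mask = ~_RUNNING with value = 0 (clear _RUNNING, keep the rest), while resuming uses mask = ~_RESUMING with value = _RUNNING.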

> +
> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> +                         dev_state_off);
> +    if (ret < 0) {
> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> +                          dev_state_off);
> +        if (ret < 0) {
> +            return ret;

Seems like we're in pretty bad shape here, should this be combined with
below to trigger a hw_error?

> +        }
> +
> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> +            hw_error("%s: Device is in error state 0x%x",
> +                     vbasedev->name, device_state);
> +            return -EFAULT;
> +        }
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;

So we return success even if we failed to write the desired state as
long as we were able to read back any non-error state?
vbasedev->device_state remains correct, but it seems confusing from a
caller's perspective that a set-state can succeed and yet it's still
necessary to check the state.

> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t value = 0, mask = 0;
> +
> +        if (running) {
> +            value = VFIO_DEVICE_STATE_RUNNING;
> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> +            }
> +        } else {
> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> +        if (ret) {
> +            /*
> +             * vm_state_notify() doesn't support reporting failure. If such
> +             * error reporting support is added in future, migration should
> +             * be aborted.
> +             */
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, value & mask);
> +        }

Here for instance we assume that success means the device is now in the
desired state, but we'd actually need to evaluate
vbasedev->device_state to determine that.

> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  value & mask);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -87,8 +216,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                       vbasedev->name);
>          g_free(vbasedev->migration);
>          vbasedev->migration = NULL;
> +        return ret;
>      }
>  
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
>      return ret;
>  }
>  
> @@ -131,6 +263,10 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8fe913175d85..6524734bf7b4 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,4 +149,5 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> -
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 8275c4c68f45..25e3b1a3b90a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    VMChangeStateEntry *vm_state;
> +    uint32_t device_state;
> +    int vm_running;

Could these be placed in VFIOMigration?  Thanks,

Alex

>  } VFIODevice;
>  
>  struct VFIODeviceOps {



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 06/17] vfio: Add migration state change notifier
  2020-09-22 23:24 ` [PATCH v26 06/17] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-09-25 20:20   ` Alex Williamson
  2020-10-17 20:35     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 20:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added a migration state change notifier to get notifications on migration
> state changes. These states are translated to the VFIO device state and
> conveyed to the vendor driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  5 +++--
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a30d628ba963..f650fe9fc3c8 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    trace_vfio_migration_state_notifier(vbasedev->name,
> +                                        MigrationStatus_str(s->state));
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> +                      VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        }

Here again the caller assumes success means the device has entered the
desired state, but as implemented it only means the device is in some
non-error state.

> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>      return ret;
>  }
>  
> @@ -263,6 +287,11 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +
> +    if (vbasedev->migration_state.notify) {
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
>      if (vbasedev->vm_state) {
>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>      }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 6524734bf7b4..bcb3fa7314d7 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 25e3b1a3b90a..49c7c7a0e29a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>      VMChangeStateEntry *vm_state;
>      uint32_t device_state;
>      int vm_running;
> +    Notifier migration_state;

Can this live in VFIOMigration?  Thanks,

Alex

>  } VFIODevice;
>  
>  struct VFIODeviceOps {




* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-22 23:24 ` [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
  2020-09-24 15:15   ` Philippe Mathieu-Daudé
  2020-09-25 11:53   ` Cornelia Huck
@ 2020-09-25 20:20   ` Alex Williamson
  2020-10-18 17:40     ` Kirti Wankhede
  2 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 20:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Define flags to be used as delimiters in the migration file stream.
> Added .save_setup and .save_cleanup functions. The migration region is mapped
> and unmapped from these functions at the source during the saving or pre-copy
> phase.
> The VFIO device state is set depending on the VM's state. During live
> migration the VM is running when .save_setup is called, so the
> _SAVING | _RUNNING state is set for the VFIO device. During save-restore the
> VM is paused, so only the _SAVING state is set.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f650fe9fc3c8..8e8adaa25779 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -25,6 +28,17 @@
>  #include "trace.h"
>  #include "hw/hw.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();

Please add a comment identifying why the iothread mutex lock is
necessary here.

> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,

We don't support multiple migration regions; is it useful to include
the region index here?

> +                         strerror(-ret));
> +            error_report("%s: Falling back to slow path", vbasedev->name);
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }

Again, doesn't match the function semantics that success only means the
device is in a non-error state, maybe the one that was asked for.

> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

What's the overall purpose of writing these markers into the migration
stream?  vfio_load_state() doesn't do anything with this other than
validate that the end-of-state immediately follows.  Is this a
placeholder for something in the future?
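As implemented, the marker pair only gives the load side a framing check. A self-contained sketch of that check (stand-in be64 helpers instead of QEMUFile; the flag values are copied from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define MIG_FLAG_END_OF_STATE    0xffffffffef100001ULL
#define MIG_FLAG_DEV_SETUP_STATE 0xffffffffef100003ULL

/* Stand-ins for qemu_put_be64()/qemu_get_be64() over a plain buffer. */
void put_be64(uint8_t *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        buf[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

uint64_t get_be64(const uint8_t *buf)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | buf[i];
    }
    return v;
}

/* All the load side can do with the current stream: verify that
 * END_OF_STATE directly follows the setup marker; any payload written
 * between the two markers would need explicit parsing here. */
int validate_setup_section(const uint8_t *buf)
{
    return get_be64(buf) == MIG_FLAG_DEV_SETUP_STATE &&
           get_be64(buf + 8) == MIG_FLAG_END_OF_STATE;
}
```

If data were ever added between the markers, the load side would need a real parser at this point, which is presumably the forward-compatibility value of the scheme being asked about.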

> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
>      int ret = -EINVAL;
> +    char id[256] = "";
> +    Object *obj;
>  
>      if (!vbasedev->ops->vfio_get_object) {
>          return ret;
> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +
> +    if (obj) {
> +        DeviceState *dev = DEVICE(obj);
> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> +
> +        if (oid) {
> +            pstrcpy(id, sizeof(id), oid);
> +            pstrcat(id, sizeof(id), "/");
> +            g_free(oid);
> +        }
> +    }

Here's where vfio_migration_init() starts using vfio_get_object() as I
referenced back on patch 04.  We might as well get the object before
calling vfio_migration_region_init() and then pass the object.  The
conditional branch to handle obj is strange here too, it's fatal if
vfio_migration_region_init() doesn't find an object, why do we handle
it as optional here?  Also, what is this doing?  Comments would be
nice...  Thanks,

Alex

> +    pstrcat(id, sizeof(id), "vfio");
> +
> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> +                         vbasedev);
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bcb3fa7314d7..982d8dccb219 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> +vfio_save_setup(const char *name) " (%s)"
> +vfio_save_cleanup(const char *name) " (%s)"




* Re: [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-09-22 23:24 ` [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
  2020-09-23 11:42   ` Wang, Zhi A
@ 2020-09-25 21:02   ` Alex Williamson
  2020-10-18 18:00     ` Kirti Wankhede
  1 sibling, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 21:02 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:10 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - this directs the kernel driver to write data to the
>   staging buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - this directs the kernel driver to write data to the
>    staging buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it is the user's responsibility to read
> data_size bytes of data starting at data_offset before moving to the next
> step.
> 
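The b-g loop above can be simulated with a toy driver model (hypothetical types and helpers; the real flow reads pending_bytes, data_offset and data_size from the migration region rather than a struct):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the vendor driver's migration state. */
typedef struct {
    uint64_t pending_bytes;   /* data left to transfer */
    uint64_t chunk;           /* data_size reported per iteration */
} FakeDev;

/* One b..f iteration: report data_size and drain it from pending. */
uint64_t fake_save_buffer(FakeDev *d)
{
    uint64_t data_size = d->pending_bytes < d->chunk ? d->pending_bytes
                                                     : d->chunk;
    d->pending_bytes -= data_size;
    return data_size;
}

/* Steps b-g: loop until the driver reports no pending data.
 * Returns the number of data packets that would be written; the real
 * code then writes the END_OF_STATE marker (step h). */
int drain_device(FakeDev *d)
{
    int packets = 0;

    while (d->pending_bytes > 0) {
        uint64_t data_size = fake_save_buffer(d);

        if (data_size == 0) {
            break;            /* driver has nothing more to offer */
        }
        packets++;
    }
    return packets;
}
```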
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 273 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   6 +
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 280 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 8e8adaa25779..4611bb972228 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -180,6 +180,154 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
> +                                   uint64_t data_size, uint64_t *size)
> +{
> +    void *ptr = NULL;
> +    uint64_t limit = 0;
> +    int i;
> +
> +    if (!region->mmaps) {
> +        if (size) {
> +            *size = data_size;
> +        }
> +        return ptr;
> +    }
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        VFIOMmap *map = region->mmaps + i;
> +
> +        if ((data_offset >= map->offset) &&
> +            (data_offset < map->offset + map->size)) {
> +
> +            /* check if data_offset is within sparse mmap areas */
> +            ptr = map->mmap + data_offset - map->offset;
> +            if (size) {
> +                *size = MIN(data_size, map->offset + map->size - data_offset);
> +            }
> +            break;
> +        } else if ((data_offset < map->offset) &&
> +                   (!limit || limit > map->offset)) {
> +            /*
> +             * data_offset is not within sparse mmap areas, find size of
> +             * non-mapped area. Check through all list since region->mmaps list
> +             * is not sorted.
> +             */
> +            limit = map->offset;
> +        }
> +    }
> +
> +    if (!ptr && size) {
> +        *size = limit ? limit - data_offset : data_size;

'limit - data_offset' doesn't take data_size into account, this should
be MIN(data_size, limit - data_offset).
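The suggested clamp, pulled out into a self-contained helper (hypothetical name) so the data_size bound is explicit:

```c
#include <assert.h>
#include <stdint.h>

#define MIN_U64(a, b) ((a) < (b) ? (a) : (b))

/* Size to read through the slow path when data_offset falls before the
 * nearest sparse-mmap area starting at 'limit' (0 if none), capped at
 * data_size as the review suggests. */
uint64_t nonmapped_size(uint64_t data_offset, uint64_t data_size,
                        uint64_t limit)
{
    return limit ? MIN_U64(data_size, limit - data_offset) : data_size;
}
```

Without the cap, a request smaller than the gap before the next mapped area would be over-read.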

> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t data_offset = 0, data_size = 0, sz;
> +    int ret;
> +
> +    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    qemu_put_be64(f, data_size);
> +    sz = data_size;
> +
> +    while (sz) {
> +        void *buf = NULL;

Unnecessary initialization.

> +        uint64_t sec_size;
> +        bool buf_allocated = false;
> +
> +        buf = get_data_section_size(region, data_offset, sz, &sec_size);
> +
> +        if (!buf) {
> +            buf = g_try_malloc(sec_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +            buf_allocated = true;
> +
> +            ret = vfio_mig_read(vbasedev, buf, sec_size,
> +                                region->fd_offset + data_offset);
> +            if (ret < 0) {
> +                g_free(buf);
> +                return ret;
> +            }
> +        }
> +
> +        qemu_put_buffer(f, buf, sec_size);
> +
> +        if (buf_allocated) {
> +            g_free(buf);
> +        }
> +        sz -= sec_size;
> +        data_offset += sec_size;
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +
> +    if (!ret && size) {
> +        *size = data_size;
> +    }
> +
> +    return ret;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if (ret < 0) {
> +        migration->pending_bytes = 0;
> +        return ret;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -232,9 +380,134 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    uint64_t data_size;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    if (migration->pending_bytes == 0) {
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if (migration->pending_bytes == 0) {
> +            /* indicates data finished, goto complete phase */
> +            return 1;
> +        }
> +    }
> +
> +    ret = vfio_save_buffer(f, vbasedev, &data_size);
> +
> +    if (ret) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    uint64_t data_size;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }

This also assumes success implies desired state.

> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev, &data_size);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        }
> +
> +        if (data_size == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }

And another.  Thanks,

Alex

> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 982d8dccb219..118b5547c921 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -154,3 +154,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>  vfio_save_setup(const char *name) " (%s)"
>  vfio_save_cleanup(const char *name) " (%s)"
> +vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 49c7c7a0e29a..471e444a364c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>  
>  typedef struct VFIOMigration {
>      VFIORegion region;
> +    uint64_t pending_bytes;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {




* Re: [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-09-22 23:24 ` [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-09-25 21:55   ` Alex Williamson
  2020-10-18 20:52     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 21:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:14 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> for VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4306f6316417..822b68b4e015 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -11,6 +11,7 @@
>  #include "qemu/main-loop.h"
>  #include "qemu/cutils.h"
>  #include <linux/vfio.h>
> +#include <sys/ioctl.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
> @@ -355,6 +356,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> +{
> +    int ret;
> +    VFIOContainer *container = vbasedev->group->container;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (start) {
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +        } else {
> +            return -EINVAL;
> +        }
> +    } else {
> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }

Maybe doesn't matter in the long run, but do you want to use -errno for
the return rather than -1 from the ioctl on error?  Thanks,

Alex
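A minimal sketch of that pattern (hypothetical wrapper; the real call passes a struct vfio_iommu_type1_dirty_bitmap): returning -errno preserves the failure reason where the raw ioctl() return only yields -1.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>

/* ioctl() returns -1 and sets errno on failure; converting to -errno
 * lets callers see the actual reason instead of a bare -1. */
int set_dirty_tracking(int fd, unsigned long req)
{
    int ret = ioctl(fd, req);

    return ret < 0 ? -errno : 0;
}
```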

> +    return ret;
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -386,6 +413,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> +    ret = vfio_set_dirty_page_tracking(vbasedev, true);
> +    if (ret) {
> +        return ret;
> +    }
> +
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
>      ret = qemu_file_get_error(f);
> @@ -401,6 +433,8 @@ static void vfio_save_cleanup(void *opaque)
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>  
> +    vfio_set_dirty_page_tracking(vbasedev, false);
> +
>      if (migration->region.mmaps) {
>          vfio_region_unmap(&migration->region);
>      }
> @@ -734,6 +768,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> +
> +        vfio_set_dirty_page_tracking(vbasedev, false);
>      }
>  }
>  




* Re: [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-09-22 23:24 ` [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
@ 2020-09-25 22:23   ` Alex Williamson
  2020-10-19  6:01     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 22:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Create a mapped IOVA list when vIOMMU is enabled. For each mapped IOVA,
> save the translated address. Add a node to the list on MAP and remove the
> node from the list on UNMAP.
> This list is used to track dirty pages during migration.
> 
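The MAP/UNMAP bookkeeping described above can be sketched as a plain singly linked list (hypothetical helpers; the patch itself uses QLIST macros and g_malloc0, and error handling is omitted here):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-mapping VFIOIovaRange node. */
typedef struct IovaRange {
    uint64_t iova, size, ram_addr;
    struct IovaRange *next;
} IovaRange;

/* MAP notification: record iova -> ram_addr at the list head. */
void iova_list_insert(IovaRange **head, uint64_t iova,
                      uint64_t size, uint64_t ram_addr)
{
    IovaRange *r = calloc(1, sizeof(*r));

    r->iova = iova;
    r->size = size;
    r->ram_addr = ram_addr;
    r->next = *head;
    *head = r;
}

/* UNMAP notification: drop the matching range, if tracked. */
int iova_list_remove(IovaRange **head, uint64_t iova, uint64_t size)
{
    for (IovaRange **p = head; *p; p = &(*p)->next) {
        if ((*p)->iova == iova && (*p)->size == size) {
            IovaRange *dead = *p;

            *p = dead->next;
            free(dead);
            return 1;
        }
    }
    return 0;
}
```

During migration, walking this list gives the iova-to-ram_addr translations needed to mark guest pages dirty.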
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
>  include/hw/vfio/vfio-common.h |  8 ++++++
>  2 files changed, 60 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index d4959c036dd1..dc56cded2d95 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -407,8 +407,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                           bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                               ram_addr_t *ram_addr, bool *read_only)
>  {
>      MemoryRegion *mr;
>      hwaddr xlat;
> @@ -439,8 +439,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>          return false;
>      }
>  
> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    *read_only = !writable || mr->readonly;
> +    if (vaddr) {
> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    }
> +
> +    if (ram_addr) {
> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +    }
> +
> +    if (read_only) {
> +        *read_only = !writable || mr->readonly;
> +    }
>  
>      return true;
>  }
> @@ -450,7 +459,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>      VFIOContainer *container = giommu->container;
>      hwaddr iova = iotlb->iova + giommu->iommu_offset;
> -    bool read_only;
>      void *vaddr;
>      int ret;
>  
> @@ -466,7 +474,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        ram_addr_t ram_addr;
> +        bool read_only;
> +
> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
>              goto out;
>          }
>          /*
> @@ -484,8 +495,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
>                           container, iova,
>                           iotlb->addr_mask + 1, vaddr, ret);
> +        } else {
> +            VFIOIovaRange *iova_range;
> +
> +            iova_range = g_malloc0(sizeof(*iova_range));
> +            iova_range->iova = iova;
> +            iova_range->size = iotlb->addr_mask + 1;
> +            iova_range->ram_addr = ram_addr;
> +
> +            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
>          }
>      } else {
> +        VFIOIovaRange *iova_range, *tmp;
> +
> +        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> +            if (iova_range->iova >= iova &&
> +                iova_range->iova + iova_range->size <= iova +
> +                                                       iotlb->addr_mask + 1) {
> +                QLIST_REMOVE(iova_range, next);
> +                g_free(iova_range);
> +            }
> +        }
> +


This is some pretty serious overhead... can't we trigger a replay when
migration is enabled to build this information then?  We're looking at
potentially thousands of entries, so a list is probably also not a good
choice.  I don't think it's acceptable to incur this even when not
migrating (i.e. the vast majority of the time).  Thanks,

Alex

>          ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> @@ -642,6 +673,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>              g_free(giommu);
>              goto fail;
>          }
> +        QLIST_INIT(&giommu->iova_list);
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n);
>  
> @@ -740,6 +772,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (MEMORY_REGION(giommu->iommu) == section->mr &&
>                  giommu->n.start == section->offset_within_region) {
> +                VFIOIovaRange *iova_range, *tmp;
> +
> +                QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> +                    QLIST_REMOVE(iova_range, next);
> +                    g_free(iova_range);
> +                }
> +
>                  memory_region_unregister_iommu_notifier(section->mr,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> @@ -1541,6 +1580,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +            VFIOIovaRange *iova_range, *itmp;
> +
> +            QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, itmp) {
> +                QLIST_REMOVE(iova_range, next);
> +                g_free(iova_range);
> +            }
> +
>              memory_region_unregister_iommu_notifier(
>                      MEMORY_REGION(giommu->iommu), &giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0a1651eda2d0..aa7524fe2cc5 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -89,11 +89,19 @@ typedef struct VFIOContainer {
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
>  
> +typedef struct VFIOIovaRange {
> +    hwaddr iova;
> +    size_t size;
> +    ram_addr_t ram_addr;
> +    QLIST_ENTRY(VFIOIovaRange) next;
> +} VFIOIovaRange;
> +
>  typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      IOMMUMemoryRegion *iommu;
>      hwaddr iommu_offset;
>      IOMMUNotifier n;
> +    QLIST_HEAD(, VFIOIovaRange) iova_list;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-09-22 23:24 ` [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  2020-09-24 15:14   ` Eric Blake
@ 2020-09-25 22:55   ` Alex Williamson
  2020-09-29 10:40   ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2020-09-25 22:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:19 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added the number of bytes transferred to the target VM by all VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
> 
> Note: Comments from v25 for this patch are not addressed yet.
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html
> 
> Alex, I need more pointers on the documentation part raised by Markus Armbruster.

I'm not sure what you're asking of me here.  Are we being asked to
justify that a vfio device has state that needs to be migrated and the
ability to dirty pages?
 
> 
>  hw/vfio/common.c            | 20 ++++++++++++++++++++
>  hw/vfio/migration.c         | 10 ++++++++++
>  include/qemu/vfio-helpers.h |  3 +++
>  migration/migration.c       | 14 ++++++++++++++
>  monitor/hmp-cmds.c          |  6 ++++++
>  qapi/migration.json         | 17 +++++++++++++++++
>  6 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7eeaa368187a..286cdaac8674 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -39,6 +39,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "qemu/vfio-helpers.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +bool vfio_mig_active(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    if (QLIST_EMPTY(&vfio_group_list)) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (vbasedev->migration_blocker) {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
>  static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
>  {
>      VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 822b68b4e015..c4226fa8b183 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -28,6 +28,7 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "hw/hw.h"
> +#include "qemu/vfio-helpers.h"
>  
>  /*
>   * Flags used as delimiter:
> @@ -40,6 +41,8 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>  
> +static int64_t bytes_transferred;
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -289,6 +292,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>          *size = data_size;
>      }
>  
> +    bytes_transferred += data_size;
>      return ret;
>  }
>  
> @@ -770,6 +774,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          }
>  
>          vfio_set_dirty_page_tracking(vbasedev, false);
> +        bytes_transferred = 0;
>      }
>  }
>  
> @@ -820,6 +825,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  
>  /* ---------------------------------------------------------------------- */
>  
> +int64_t vfio_mig_bytes_transferred(void)
> +{
> +    return bytes_transferred;
> +}
> +
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
>      VFIOContainer *container = vbasedev->group->container;
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e40..26a7df0767b1 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>                             int irq_type, Error **errp);
>  
> +bool vfio_mig_active(void);
> +int64_t vfio_mig_bytes_transferred(void);
> +
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 58a5452471f9..b204bb1f6cd9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -56,6 +56,7 @@
>  #include "net/announce.h"
>  #include "qemu/queue.h"
>  #include "multifd.h"
> +#include "qemu/vfio-helpers.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> @@ -996,6 +997,17 @@ static void populate_disk_info(MigrationInfo *info)
>      }
>  }
>  
> +static void populate_vfio_info(MigrationInfo *info)
> +{
> +#ifdef CONFIG_LINUX
> +    if (vfio_mig_active()) {
> +        info->has_vfio = true;
> +        info->vfio = g_malloc0(sizeof(*info->vfio));
> +        info->vfio->transferred = vfio_mig_bytes_transferred();
> +    }
> +#endif
> +}
> +
>  static void fill_source_migration_info(MigrationInfo *info)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -1020,6 +1032,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
>          populate_disk_info(info);
> +        populate_vfio_info(info);
>          break;
>      case MIGRATION_STATUS_COLO:
>          info->has_status = true;
> @@ -1028,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_COMPLETED:
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
> +        populate_vfio_info(info);
>          break;
>      case MIGRATION_STATUS_FAILED:
>          info->has_status = true;
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index 7711726fd222..40d60d6a6651 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>          }
>          monitor_printf(mon, "]\n");
>      }
> +
> +    if (info->has_vfio) {
> +        monitor_printf(mon, "vfio device transferred: %" PRIu64 " kbytes\n",
> +                       info->vfio->transferred >> 10);
> +    }
> +
>      qapi_free_MigrationInfo(info);
>  }
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 675f70bb6734..3535977123d3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -147,6 +147,18 @@
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1
> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'transferred': 'int' } }
>  
>  ##
>  # @MigrationInfo:
> @@ -208,11 +220,16 @@
>  #
>  # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>  #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)

This could be clarified a little:

...only returned if all VFIO devices attached to a VM support migration
and migration status is 'active' or 'completed' (since 5.2)

Is that accurate?  Thanks,

Alex
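
Assuming Alex's reading is accurate, the clarified doc comment would read
roughly as follows (the release number, 5.1 vs 5.2, was still being settled
in the thread):

```
# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
#        only returned if all VFIO devices attached to a VM support migration
#        and migration status is 'active' or 'completed' (since 5.2)
```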

> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
>    'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>             '*disk': 'MigrationStats',
> +           '*vfio': 'VfioStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
>             '*expected-downtime': 'int',



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 04/17] vfio: Add migration region initialization and finalize function
  2020-09-25 20:20   ` Alex Williamson
@ 2020-09-28  9:39     ` Cornelia Huck
  2020-10-17 20:17     ` Kirti Wankhede
  1 sibling, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-09-28  9:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

On Fri, 25 Sep 2020 14:20:06 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 23 Sep 2020 04:54:06 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:

> > +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> > +{
> > +    struct vfio_region_info *info = NULL;  
> 
> Not sure this initialization is strictly necessary either, but it also
> seems to be a common convention for this function, so either way.
> 
> Connie, does vfio_ccw_get_region() leak this?  It appears to call
> vfio_get_dev_region_info() and vfio_get_region_info() several times with
> the same pointer without freeing it between uses.

Ugh, indeed, info is reallocated without freeing it in between. I'll
cook up a patch, thanks for spotting.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-24 15:15   ` Philippe Mathieu-Daudé
@ 2020-09-29 10:19     ` Dr. David Alan Gilbert
  2020-10-17 20:36       ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-29 10:19 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, eskultet, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, alex.williamson, changpeng.liu, quintela, Ken.Xue,
	jonathan.davies, pbonzini

* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> On 9/23/20 1:24 AM, Kirti Wankhede wrote:
> > Define flags to be used as delimeter in migration file stream.
> 
> Typo "delimiter".
> 
> > Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> > region from these functions at source during saving or pre-copy phase.
> > Set VFIO device state depending on VM's state. During live migration, VM is
> > running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> > device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/trace-events |  2 ++
> >  2 files changed, 93 insertions(+)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index f650fe9fc3c8..8e8adaa25779 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -8,12 +8,15 @@
> >   */
> >  
> >  #include "qemu/osdep.h"
> > +#include "qemu/main-loop.h"
> > +#include "qemu/cutils.h"
> >  #include <linux/vfio.h>
> >  
> >  #include "sysemu/runstate.h"
> >  #include "hw/vfio/vfio-common.h"
> >  #include "cpu.h"
> >  #include "migration/migration.h"
> > +#include "migration/vmstate.h"
> >  #include "migration/qemu-file.h"
> >  #include "migration/register.h"
> >  #include "migration/blocker.h"
> > @@ -25,6 +28,17 @@
> >  #include "trace.h"
> >  #include "hw/hw.h"
> >  
> > +/*
> > + * Flags used as delimiter:
> > + * 0xffffffff => MSB 32-bit all 1s
> > + * 0xef10     => emulated (virtual) function IO
> > + * 0x0000     => 16-bits reserved for flags
> > + */
> > +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> > +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> > +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> > +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> > +
> >  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
> >                                    off_t off, bool iswrite)
> >  {
> > @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >      return 0;
> >  }
> >  
> > +/* ---------------------------------------------------------------------- */
> > +
> > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret;
> > +
> > +    trace_vfio_save_setup(vbasedev->name);
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > +
> > +    if (migration->region.mmaps) {
> > +        qemu_mutex_lock_iothread();
> > +        ret = vfio_region_mmap(&migration->region);
> > +        qemu_mutex_unlock_iothread();
> > +        if (ret) {
> > +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > +                         vbasedev->name, migration->region.nr,
> > +                         strerror(-ret));
> > +            error_report("%s: Falling back to slow path", vbasedev->name);
> > +        }
> > +    }
> > +
> > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> > +                                   VFIO_DEVICE_STATE_SAVING);
> > +    if (ret) {
> > +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> > +        return ret;
> > +    }
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > +
> > +    ret = qemu_file_get_error(f);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_save_cleanup(void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +
> > +    if (migration->region.mmaps) {
> > +        vfio_region_unmap(&migration->region);
> > +    }
> > +    trace_vfio_save_cleanup(vbasedev->name);
> > +}
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_cleanup = vfio_save_cleanup,
> > +};
> > +
> > +/* ---------------------------------------------------------------------- */
> > +
> >  static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >  {
> >      VFIODevice *vbasedev = opaque;
> > @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >                                 struct vfio_region_info *info)
> >  {
> >      int ret = -EINVAL;
> > +    char id[256] = "";
> > +    Object *obj;
> >  
> >      if (!vbasedev->ops->vfio_get_object) {
> >          return ret;
> > @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >          return ret;
> >      }
> >  
> > +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> > +
> > +    if (obj) {
> > +        DeviceState *dev = DEVICE(obj);
> > +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> > +
> > +        if (oid) {
> > +            pstrcpy(id, sizeof(id), oid);
> > +            pstrcat(id, sizeof(id), "/");
> > +            g_free(oid);
> > +        }
> > +    }
> > +    pstrcat(id, sizeof(id), "vfio");
> 
> Alternatively (easier to review, matter of taste):
> 
>  g_autofree char *path = NULL;
> 
>  if (oid) {
>    path = g_strdup_printf("%s/vfio",
>                           vmstate_if_get_id(VMSTATE_IF(obj)));
>  } else {
>    path = g_strdup("vfio");
>  }
>  strpadcpy(id, sizeof(id), path, '\0');

Maybe, although it's a straight copy of the magic in unregister_savevm.

Dave

> > +
> > +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> > +                         vbasedev);
> >      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >                                                            vbasedev);
> >      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index bcb3fa7314d7..982d8dccb219 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> >  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> >  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> > +vfio_save_setup(const char *name) " (%s)"
> > +vfio_save_cleanup(const char *name) " (%s)"
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-09-22 23:24 ` [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  2020-09-24 15:14   ` Eric Blake
  2020-09-25 22:55   ` Alex Williamson
@ 2020-09-29 10:40   ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-29 10:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added the number of bytes transferred to the target VM by all VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
> 
> Note: Comments from v25 for this patch are not addressed yet.
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html
> 
> Alex, I need more pointers on the documentation part raised by Markus Armbruster.

I think I'm OK with this from the migration side, except for the minor
5.2's below, so

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Markus is right that we do have lots more information that falls out of
the migration stats, but I'm not sure what more you have to collect.

Dave

> 
>  hw/vfio/common.c            | 20 ++++++++++++++++++++
>  hw/vfio/migration.c         | 10 ++++++++++
>  include/qemu/vfio-helpers.h |  3 +++
>  migration/migration.c       | 14 ++++++++++++++
>  monitor/hmp-cmds.c          |  6 ++++++
>  qapi/migration.json         | 17 +++++++++++++++++
>  6 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7eeaa368187a..286cdaac8674 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -39,6 +39,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "qemu/vfio-helpers.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +bool vfio_mig_active(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    if (QLIST_EMPTY(&vfio_group_list)) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (vbasedev->migration_blocker) {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
>  static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
>  {
>      VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 822b68b4e015..c4226fa8b183 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -28,6 +28,7 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "hw/hw.h"
> +#include "qemu/vfio-helpers.h"
>  
>  /*
>   * Flags used as delimiter:
> @@ -40,6 +41,8 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>  
> +static int64_t bytes_transferred;
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -289,6 +292,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>          *size = data_size;
>      }
>  
> +    bytes_transferred += data_size;
>      return ret;
>  }
>  
> @@ -770,6 +774,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          }
>  
>          vfio_set_dirty_page_tracking(vbasedev, false);
> +        bytes_transferred = 0;
>      }
>  }
>  
> @@ -820,6 +825,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  
>  /* ---------------------------------------------------------------------- */
>  
> +int64_t vfio_mig_bytes_transferred(void)
> +{
> +    return bytes_transferred;
> +}
> +
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
>      VFIOContainer *container = vbasedev->group->container;
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e40..26a7df0767b1 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>                             int irq_type, Error **errp);
>  
> +bool vfio_mig_active(void);
> +int64_t vfio_mig_bytes_transferred(void);
> +
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 58a5452471f9..b204bb1f6cd9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -56,6 +56,7 @@
>  #include "net/announce.h"
>  #include "qemu/queue.h"
>  #include "multifd.h"
> +#include "qemu/vfio-helpers.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> @@ -996,6 +997,17 @@ static void populate_disk_info(MigrationInfo *info)
>      }
>  }
>  
> +static void populate_vfio_info(MigrationInfo *info)
> +{
> +#ifdef CONFIG_LINUX
> +    if (vfio_mig_active()) {
> +        info->has_vfio = true;
> +        info->vfio = g_malloc0(sizeof(*info->vfio));
> +        info->vfio->transferred = vfio_mig_bytes_transferred();
> +    }
> +#endif
> +}
> +
>  static void fill_source_migration_info(MigrationInfo *info)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -1020,6 +1032,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
>          populate_disk_info(info);
> +        populate_vfio_info(info);
>          break;
>      case MIGRATION_STATUS_COLO:
>          info->has_status = true;
> @@ -1028,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_COMPLETED:
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
> +        populate_vfio_info(info);
>          break;
>      case MIGRATION_STATUS_FAILED:
>          info->has_status = true;
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index 7711726fd222..40d60d6a6651 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>          }
>          monitor_printf(mon, "]\n");
>      }
> +
> +    if (info->has_vfio) {
> +        monitor_printf(mon, "vfio device transferred: %" PRIu64 " kbytes\n",
> +                       info->vfio->transferred >> 10);
> +    }
> +
>      qapi_free_MigrationInfo(info);
>  }
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 675f70bb6734..3535977123d3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -147,6 +147,18 @@
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1
> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'transferred': 'int' } }
>  
>  ##
>  # @MigrationInfo:
> @@ -208,11 +220,16 @@
>  #
>  # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>  #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
>    'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>             '*disk': 'MigrationStats',
> +           '*vfio': 'VfioStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
>             '*expected-downtime': 'int',
> -- 
> 2.7.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-09-24 15:02   ` Cornelia Huck
@ 2020-09-29 11:03     ` Dr. David Alan Gilbert
  2020-10-17 20:24       ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-29 11:03 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Cornelia Huck (cohuck@redhat.com) wrote:
> On Wed, 23 Sep 2020 04:54:07 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > VM state change handler gets called on change in VM's state. This is used to set
> > VFIO device state to _RUNNING.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/trace-events          |   3 +-
> >  include/hw/vfio/vfio-common.h |   4 ++
> >  3 files changed, 142 insertions(+), 1 deletion(-)
> > 
> 
> (...)
> 
> > +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> > +                                    uint32_t value)
> 
> I think I've mentioned that before, but this function could really
> benefit from a comment explaining what mask and value mean.
> 
> > +{
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    VFIORegion *region = &migration->region;
> > +    off_t dev_state_off = region->fd_offset +
> > +                      offsetof(struct vfio_device_migration_info, device_state);
> > +    uint32_t device_state;
> > +    int ret;
> > +
> > +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> > +                        dev_state_off);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    device_state = (device_state & mask) | value;
> > +
> > +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> > +                         dev_state_off);
> > +    if (ret < 0) {
> > +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> > +                          dev_state_off);
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +
> > +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> > +            hw_error("%s: Device is in error state 0x%x",
> > +                     vbasedev->name, device_state);
> > +            return -EFAULT;
> 
> Is -EFAULT a good return value here? Maybe -EIO?
> 
> > +        }
> > +    }
> > +
> > +    vbasedev->device_state = device_state;
> > +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> > +    return 0;
> > +}
> > +
> > +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +
> > +    if ((vbasedev->vm_running != running)) {
> > +        int ret;
> > +        uint32_t value = 0, mask = 0;
> > +
> > +        if (running) {
> > +            value = VFIO_DEVICE_STATE_RUNNING;
> > +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> > +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> 
> I've been staring at this for some time and I think that the desired
> result is
> - set _RUNNING
> - if _RESUMING was set, clear it, but leave the other bits intact
> - if _RESUMING was not set, clear everything previously set
> This would really benefit from a comment (or am I the only one
> struggling here?)
> 
> > +            }
> > +        } else {
> > +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> > +        }
> > +
> > +        ret = vfio_migration_set_state(vbasedev, mask, value);
> > +        if (ret) {
> > +            /*
> > +             * vm_state_notify() doesn't support reporting failure. If such
> > +             * error reporting support added in furure, migration should be
> > +             * aborted.
> 
> 
> "We should abort the migration in this case, but vm_state_notify()
> currently does not support reporting failures."
> 
> ?
> 
> Can/should we mark the failing device in some way?

I think you can call qemu_file_set_error on the migration stream to
force an error.

Dave

> > +             */
> > +            error_report("%s: Failed to set device state 0x%x",
> > +                         vbasedev->name, value & mask);
> > +        }
> > +        vbasedev->vm_running = running;
> > +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> > +                                  value & mask);
> > +    }
> > +}
> > +
> >  static int vfio_migration_init(VFIODevice *vbasedev,
> >                                 struct vfio_region_info *info)
> >  {
> 
> (...)
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v26 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-09-22 23:24 ` [PATCH v26 09/17] vfio: Add load " Kirti Wankhede
@ 2020-10-01 10:07   ` Cornelia Huck
  2020-10-18 20:47     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Cornelia Huck @ 2020-10-01 10:07 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Wed, 23 Sep 2020 04:54:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Sequence  during _RESUMING device state:
> While data for this device is available, repeat below steps:
> a. read data_offset from where user application should write data.
> b. write data of data_size to migration region from data_offset.
> c. write data_size, which indicates to the vendor driver that data has been
>    written to the staging buffer.
> 
> For user, data is opaque. User should write data in the same order as
> received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c  | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 173 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4611bb972228..ffd70282dd0e 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -328,6 +328,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t data;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +        int ret;
> +
> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +        if (ret) {
> +            error_report("%s: Failed to load device config space",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    }
> +
> +    data = qemu_get_be64(f);
> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("%s: Failed loading device config space, "
> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);

I'm confused here: If we don't have a vfio_load_config callback, or if
that callback did not read everything, we also might end up with a
value that's not END_OF_STATE... in that case, the problem is not with
the stream, but rather with the consumer?

> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_load_device_config_state(vbasedev->name);
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -502,12 +529,155 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.mmaps) {
> +        ret = vfio_region_mmap(&migration->region);
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            error_report("%s: Falling back to slow path", vbasedev->name);
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_RESUMING);
> +    if (ret) {
> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> +    }
> +    return ret;

If I follow the code correctly, the cleanup callback will not be
invoked if you return != 0 here... should you clean up possible
mappings on error here?

> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region;
> +            uint64_t data_offset = 0, size;

I think this function would benefit from splitting this off into a
function handling DEV_DATA_STATE. It is quite hard to follow through
all the checks and find out when we continue, and when we break off.

Some documentation about the markers would also be really helpful.
The logic seems to be:
- DEV_CONFIG_STATE has config data and must be ended by END_OF_STATE
- DEV_SETUP_STATE has only END_OF_STATE, no data
- DEV_DATA_STATE has... data; if there's any END_OF_STATE, it's buried
  far down in the called functions


> +
> +            data_size = size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> +                                region->fd_offset +
> +                                offsetof(struct vfio_device_migration_info,
> +                                         data_offset));
> +            if (ret < 0) {
> +                return ret;
> +            }
> +
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
> +
> +            while (size) {
> +                void *buf = NULL;
> +                uint64_t sec_size;
> +                bool buf_alloc = false;
> +
> +                buf = get_data_section_size(region, data_offset, size,
> +                                            &sec_size);
> +
> +                if (!buf) {
> +                    buf = g_try_malloc(sec_size);
> +                    if (!buf) {
> +                        error_report("%s: Error allocating buffer ", __func__);
> +                        return -ENOMEM;
> +                    }
> +                    buf_alloc = true;
> +                }
> +
> +                qemu_get_buffer(f, buf, sec_size);
> +
> +                if (buf_alloc) {
> +                    ret = vfio_mig_write(vbasedev, buf, sec_size,
> +                                         region->fd_offset + data_offset);
> +                    g_free(buf);
> +
> +                    if (ret < 0) {
> +                        return ret;
> +                    }
> +                }
> +                size -= sec_size;
> +                data_offset += sec_size;
> +            }
> +
> +            ret = vfio_mig_write(vbasedev, &data_size, sizeof(data_size),
> +                                 region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            break;
> +        }
> +
> +        default:
> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
> +            return -EINVAL;
> +        }
> +
> +        data = qemu_get_be64(f);
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,

Unrelated to this patch: It's a bit odd that load_cleanup() (unlike
save_cleanup()) has a return code (that seems unused).

> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */




* Re: [PATCH v26 04/17] vfio: Add migration region initialization and finalize function
  2020-09-24 14:08   ` Cornelia Huck
@ 2020-10-17 20:14     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-17 20:14 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini



On 9/24/2020 7:38 PM, Cornelia Huck wrote:
> On Wed, 23 Sep 2020 04:54:06 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Whether the VFIO device supports migration or not is decided based on the
>> migration region query. If the migration region query succeeds and migration
>> region initialization succeeds, then migration is supported; otherwise
>> migration is blocked.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/meson.build           |   1 +
>>   hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   5 ++
>>   include/hw/vfio/vfio-common.h |   9 +++
>>   4 files changed, 157 insertions(+)
>>   create mode 100644 hw/vfio/migration.c
> 
> (...)
> 
>> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
>> +    int ret = -EINVAL;
>> +
>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +    if (!obj) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
>> +                            "migration");
>> +    if (ret) {
>> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
>> +                     vbasedev->name, index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.size) {
>> +        ret = -EINVAL;
>> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
>> +                     vbasedev->name, index, strerror(-ret));
> 
> Using strerror on a hardcoded error value is probably not terribly
> helpful. I think printing either region.size (if you plan to extend
> this check later) or something like "Invalid zero-sized VFIO migration
> region" would make more sense.
> 

Updating the error string as you suggested.


>> +        goto err;
>> +    }
>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
> 
> (...)
> 
> Apart from that, looks good to me.
> 

Thanks.

Kirti



* Re: [PATCH v26 04/17] vfio: Add migration region initialization and finalize function
  2020-09-25 20:20   ` Alex Williamson
  2020-09-28  9:39     ` Cornelia Huck
@ 2020-10-17 20:17     ` Kirti Wankhede
  1 sibling, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-17 20:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 1:50 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:06 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Whether the VFIO device supports migration or not is decided based on the
>> migration region query. If the migration region query succeeds and migration
>> region initialization succeeds, then migration is supported; otherwise
>> migration is blocked.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/meson.build           |   1 +
>>   hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   5 ++
>>   include/hw/vfio/vfio-common.h |   9 +++
>>   4 files changed, 157 insertions(+)
>>   create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 37efa74018bc..da9af297a0c5 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>>   vfio_ss.add(files(
>>     'common.c',
>>     'spapr.c',
>> +  'migration.c',
>>   ))
>>   vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>>     'display.c',
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..2f760f1f9c47
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,142 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2020
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +#include "trace.h"
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (!migration) {
>> +        return;
>> +    }
>> +
>> +    if (migration->region.size) {
>> +        vfio_region_exit(&migration->region);
>> +        vfio_region_finalize(&migration->region);
>> +    }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    Object *obj = NULL;
> 
> Unnecessary initialization.
> 
>> +    int ret = -EINVAL;
> 
> return -EINVAL below, this doesn't need to be initialized, use it for
> storing actual return values.
> 
>> +
>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +    if (!obj) {
>> +        return ret;
>> +    }
> 
> vfio_migration_init() tests whether the vbasedev->ops supports
> vfio_get_object, then calls this, then calls vfio_get_object itself
> (added in a later patch, with a strange inconsistency in failure modes).
> Wouldn't it make more sense for vfio_migration_init() to pass the
> Object since that function also needs it (eventually) and actually does
> the existence test?
> 
>> +
>> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
>> +                            "migration");
>> +    if (ret) {
>> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
>> +                     vbasedev->name, index, strerror(-ret));
>> +        goto err;
>> +    }
>> +
>> +    if (!migration->region.size) {
>> +        ret = -EINVAL;
>> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
>> +                     vbasedev->name, index, strerror(-ret));
>> +        goto err;
>> +    }
> 
> If the caller were to pass obj, this is nothing more than a wrapper for
> calling vfio_region_setup(), which suggests to me we might not even
> need this as a separate function outside of vfio_migration_init().
> 

Removed vfio_migration_region_init() and moved vfio_region_setup() into
vfio_migration_init().

Thanks,
Kirti

>> +
>> +    return 0;
>> +
>> +err:
>> +    vfio_migration_region_exit(vbasedev);
>> +    return ret;
>> +}
>> +
>> +static int vfio_migration_init(VFIODevice *vbasedev,
>> +                               struct vfio_region_info *info)
>> +{
>> +    int ret = -EINVAL;
>> +
>> +    if (!vbasedev->ops->vfio_get_object) {
>> +        return ret;
>> +    }
>> +
>> +    vbasedev->migration = g_new0(VFIOMigration, 1);
>> +
>> +    ret = vfio_migration_region_init(vbasedev, info->index);
>> +    if (ret) {
>> +        error_report("%s: Failed to initialise migration region",
>> +                     vbasedev->name);
>> +        g_free(vbasedev->migration);
>> +        vbasedev->migration = NULL;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    struct vfio_region_info *info = NULL;
> 
> Not sure this initialization is strictly necessary either, but it also
> seems to be a common convention for this function, so either way.
> 
> Connie, does vfio_ccw_get_region() leak this?  It appears to call
> vfio_get_dev_region_info() and vfio_get_region_info() several times with
> the same pointer without freeing it between uses.
> 
> Thanks,
> Alex
> 
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
>> +                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
>> +    if (ret) {
>> +        goto add_blocker;
>> +    }
>> +
>> +    ret = vfio_migration_init(vbasedev, info);
>> +    if (ret) {
>> +        goto add_blocker;
>> +    }
>> +
>> +    g_free(info);
>> +    trace_vfio_migration_probe(vbasedev->name, info->index);
>> +    return 0;
>> +
>> +add_blocker:
>> +    error_setg(&vbasedev->migration_blocker,
>> +               "VFIO device doesn't support migration");
>> +    g_free(info);
>> +
>> +    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        error_free(vbasedev->migration_blocker);
>> +        vbasedev->migration_blocker = NULL;
>> +    }
>> +    return ret;
>> +}
>> +
>> +void vfio_migration_finalize(VFIODevice *vbasedev)
>> +{
>> +    if (vbasedev->migration_blocker) {
>> +        migrate_del_blocker(vbasedev->migration_blocker);
>> +        error_free(vbasedev->migration_blocker);
>> +        vbasedev->migration_blocker = NULL;
>> +    }
>> +
>> +    vfio_migration_region_exit(vbasedev);
>> +    g_free(vbasedev->migration);
>> +}
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index a0c7b49a2ebc..8fe913175d85 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -145,3 +145,8 @@ vfio_display_edid_link_up(void) ""
>>   vfio_display_edid_link_down(void) ""
>>   vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
>>   vfio_display_edid_write_error(void) ""
>> +
>> +
>> +# migration.c
>> +vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>> +
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index ba6169cd926e..8275c4c68f45 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -57,6 +57,10 @@ typedef struct VFIORegion {
>>       uint8_t nr; /* cache the region number for debug */
>>   } VFIORegion;
>>   
>> +typedef struct VFIOMigration {
>> +    VFIORegion region;
>> +} VFIOMigration;
>> +
>>   typedef struct VFIOAddressSpace {
>>       AddressSpace *as;
>>       QLIST_HEAD(, VFIOContainer) containers;
>> @@ -113,6 +117,8 @@ typedef struct VFIODevice {
>>       unsigned int num_irqs;
>>       unsigned int num_regions;
>>       unsigned int flags;
>> +    VFIOMigration *migration;
>> +    Error *migration_blocker;
>>   } VFIODevice;
>>   
>>   struct VFIODeviceOps {
>> @@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>   int vfio_spapr_remove_window(VFIOContainer *container,
>>                                hwaddr offset_within_address_space);
>>   
>> +int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>> +void vfio_migration_finalize(VFIODevice *vbasedev);
>> +
>>   #endif /* HW_VFIO_VFIO_COMMON_H */
> 



* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-09-29 11:03     ` Dr. David Alan Gilbert
@ 2020-10-17 20:24       ` Kirti Wankhede
  2020-10-20 10:51         ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-17 20:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, alex.williamson,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote:
> * Cornelia Huck (cohuck@redhat.com) wrote:
>> On Wed, 23 Sep 2020 04:54:07 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> VM state change handler gets called on change in VM's state. This is used to set
>>> VFIO device state to _RUNNING.
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>> ---
>>>   hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/trace-events          |   3 +-
>>>   include/hw/vfio/vfio-common.h |   4 ++
>>>   3 files changed, 142 insertions(+), 1 deletion(-)
>>>
>>
>> (...)
>>
>>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>> +                                    uint32_t value)
>>
>> I think I've mentioned that before, but this function could really
>> benefit from a comment what mask and value mean.
>>

Adding a comment as:

/*
 * Write device_state field to inform the vendor driver about the
 * device state to be transitioned to.
 * vbasedev: VFIO device
 * mask : bits set in the mask are preserved in device_state
 * value: bits set in the value are set in device_state
 * Remaining bits in device_state are cleared.
 */


>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    VFIORegion *region = &migration->region;
>>> +    off_t dev_state_off = region->fd_offset +
>>> +                      offsetof(struct vfio_device_migration_info, device_state);
>>> +    uint32_t device_state;
>>> +    int ret;
>>> +
>>> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
>>> +                        dev_state_off);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    device_state = (device_state & mask) | value;
>>> +
>>> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
>>> +                         dev_state_off);
>>> +    if (ret < 0) {
>>> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
>>> +                          dev_state_off);
>>> +        if (ret < 0) {
>>> +            return ret;
>>> +        }
>>> +
>>> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
>>> +            hw_error("%s: Device is in error state 0x%x",
>>> +                     vbasedev->name, device_state);
>>> +            return -EFAULT;
>>
>> Is -EFAULT a good return value here? Maybe -EIO?
>>

Ok. Changing to -EIO.

>>> +        }
>>> +    }
>>> +
>>> +    vbasedev->device_state = device_state;
>>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>>> +    return 0;
>>> +}
>>> +
>>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +
>>> +    if ((vbasedev->vm_running != running)) {
>>> +        int ret;
>>> +        uint32_t value = 0, mask = 0;
>>> +
>>> +        if (running) {
>>> +            value = VFIO_DEVICE_STATE_RUNNING;
>>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
>>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
>>
>> I've been staring at this for some time and I think that the desired
>> result is
>> - set _RUNNING
>> - if _RESUMING was set, clear it, but leave the other bits intact

Up to here, you're correct.

>> - if _RESUMING was not set, clear everything previously set
>> This would really benefit from a comment (or am I the only one
>> struggling here?)
>>

Here mask should be ~0. Correcting it.


>>> +            }
>>> +        } else {
>>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
>>> +        }
>>> +
>>> +        ret = vfio_migration_set_state(vbasedev, mask, value);
>>> +        if (ret) {
>>> +            /*
>>> +             * vm_state_notify() doesn't support reporting failure. If such
>>> +             * error reporting support added in furure, migration should be
>>> +             * aborted.
>>
>>
>> "We should abort the migration in this case, but vm_state_notify()
>> currently does not support reporting failures."
>>
>> ?
>>

Ok. Updating comment as suggested here.

>> Can/should we mark the failing device in some way?
> 
> I think you can call qemu_file_set_error on the migration stream to
> force an error.
> 

It should be as below, right?
qemu_file_set_error(migrate_get_current()->to_dst_file, ret);


Thanks,
Kirti

> Dave
> 
>>> +             */
>>> +            error_report("%s: Failed to set device state 0x%x",
>>> +                         vbasedev->name, value & mask);
>>> +        }
>>> +        vbasedev->vm_running = running;
>>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>>> +                                  value & mask);
>>> +    }
>>> +}
>>> +
>>>   static int vfio_migration_init(VFIODevice *vbasedev,
>>>                                  struct vfio_region_info *info)
>>>   {
>>
>> (...)



* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-09-25 20:20   ` Alex Williamson
@ 2020-10-17 20:30     ` Kirti Wankhede
  2020-10-17 23:44       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-17 20:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 1:50 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:07 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> VM state change handler gets called on change in VM's state. This is used to set
>> VFIO device state to _RUNNING.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   3 +-
>>   include/hw/vfio/vfio-common.h |   4 ++
>>   3 files changed, 142 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 2f760f1f9c47..a30d628ba963 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -10,6 +10,7 @@
>>   #include "qemu/osdep.h"
>>   #include <linux/vfio.h>
>>   
>> +#include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "cpu.h"
>>   #include "migration/migration.h"
>> @@ -22,6 +23,58 @@
>>   #include "exec/ram_addr.h"
>>   #include "pci.h"
>>   #include "trace.h"
>> +#include "hw/hw.h"
>> +
>> +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>> +                                  off_t off, bool iswrite)
>> +{
>> +    int ret;
>> +
>> +    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
>> +                    pread(vbasedev->fd, val, count, off);
>> +    if (ret < count) {
>> +        error_report("vfio_mig_%s%d %s: failed at offset 0x%lx, err: %s",
>> +                     iswrite ? "write" : "read", count * 8,
>> +                     vbasedev->name, off, strerror(errno));
> 
> This would suggest from the log that there's, for example, a
> vfio_mig_read8 function, which doesn't exist.
> 

Changing to:
error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
              iswrite ? "write" : "read", count,
              vbasedev->name, off, strerror(errno));
Hope this addresses your concern.
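As a side note, the access-size selection in vfio_mig_rw() is easy to check in isolation; a standalone sketch (the helper name pick_access_size is mine, not QEMU's) that mirrors the logic:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Mirror of the size selection in vfio_mig_rw(): pick the widest
 * naturally aligned access that fits in the remaining byte count. */
static int pick_access_size(size_t count, off_t off)
{
    if (count >= 8 && !(off % 8)) {
        return 8;
    } else if (count >= 4 && !(off % 4)) {
        return 4;
    } else if (count >= 2 && !(off % 2)) {
        return 2;
    }
    return 1;
}
```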

>> +        return (ret < 0) ? ret : -EINVAL;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
>> +                       off_t off, bool iswrite)
>> +{
>> +    int ret, done = 0;
>> +    __u8 *tbuf = buf;
>> +
>> +    while (count) {
>> +        int bytes = 0;
>> +
>> +        if (count >= 8 && !(off % 8)) {
>> +            bytes = 8;
>> +        } else if (count >= 4 && !(off % 4)) {
>> +            bytes = 4;
>> +        } else if (count >= 2 && !(off % 2)) {
>> +            bytes = 2;
>> +        } else {
>> +            bytes = 1;
>> +        }
>> +
>> +        ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +
>> +        count -= bytes;
>> +        done += bytes;
>> +        off += bytes;
>> +        tbuf += bytes;
>> +    }
>> +    return done;
>> +}
>> +
>> +#define vfio_mig_read(f, v, c, o)       vfio_mig_rw(f, (__u8 *)v, c, o, false)
>> +#define vfio_mig_write(f, v, c, o)      vfio_mig_rw(f, (__u8 *)v, c, o, true)
>>   
>>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>   {
>> @@ -70,6 +123,82 @@ err:
>>       return ret;
>>   }
>>   
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>> +                                    uint32_t value)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    off_t dev_state_off = region->fd_offset +
>> +                      offsetof(struct vfio_device_migration_info, device_state);
>> +    uint32_t device_state;
>> +    int ret;
>> +
>> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
>> +                        dev_state_off);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    device_state = (device_state & mask) | value;
> 
> Agree with Connie that mask and value args are not immediately obvious
> how they're used.  I don't have a naming convention that would be more
> clear and the names do make some sense once they're understood, but a
> comment to indicate mask bits are preserved, value bits are set,
> remaining bits are cleared would probably help the reader.
> 

Added comment.
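For reference, the convention the comment documents (mask bits preserved, value bits set, remaining bits cleared) can be exercised outside QEMU. The bit values below mirror the v1 VFIO migration UAPI but are defined locally for the sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the v1 VFIO migration UAPI state bits. */
#define STATE_RUNNING  (1u << 0)
#define STATE_SAVING   (1u << 1)
#define STATE_RESUMING (1u << 2)

/* Bits set in @mask are preserved from the current state, bits in
 * @value are set, and all remaining bits are cleared. */
static uint32_t next_state(uint32_t cur, uint32_t mask, uint32_t value)
{
    return (cur & mask) | value;
}
```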

>> +
>> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
>> +                         dev_state_off);
>> +    if (ret < 0) {
>> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
>> +                          dev_state_off);
>> +        if (ret < 0) {
>> +            return ret;
> 
> Seems like we're in pretty bad shape here, should this be combined with
> below to trigger a hw_error?
> 

Ok.

>> +        }
>> +
>> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
>> +            hw_error("%s: Device is in error state 0x%x",
>> +                     vbasedev->name, device_state);
>> +            return -EFAULT;
>> +        }
>> +    }
>> +
>> +    vbasedev->device_state = device_state;
>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>> +    return 0;
> 
> So we return success even if we failed to write the desired state as
> long as we were able to read back any non-error state?
> vbasedev->device_state remains correct, but it seems confusing form a
> caller perspective that a set-state can succeed but it's then necessary
> to check the state.
> 

Correcting here. If vfio_mig_write() returns an error,
vfio_migration_set_state() will now return that error.

>> +}
>> +
>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if ((vbasedev->vm_running != running)) {
>> +        int ret;
>> +        uint32_t value = 0, mask = 0;
>> +
>> +        if (running) {
>> +            value = VFIO_DEVICE_STATE_RUNNING;
>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
>> +            }
>> +        } else {
>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
>> +        }
>> +
>> +        ret = vfio_migration_set_state(vbasedev, mask, value);
>> +        if (ret) {
>> +            /*
>> +             * vm_state_notify() doesn't support reporting failure. If such
>> +             * error reporting support added in furure, migration should be
>> +             * aborted.
>> +             */
>> +            error_report("%s: Failed to set device state 0x%x",
>> +                         vbasedev->name, value & mask);
>> +        }
> 
> Here for instance we assume that success means the device is now in the
> desired state, but we'd actually need to evaluate
> vbasedev->device_state to determine that.
> 

Updating.

>> +        vbasedev->vm_running = running;
>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>> +                                  value & mask);
>> +    }
>> +}
>> +
>>   static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>> @@ -87,8 +216,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>                        vbasedev->name);
>>           g_free(vbasedev->migration);
>>           vbasedev->migration = NULL;
>> +        return ret;
>>       }
>>   
>> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>> +                                                          vbasedev);
>>       return ret;
>>   }
>>   
>> @@ -131,6 +263,10 @@ add_blocker:
>>   
>>   void vfio_migration_finalize(VFIODevice *vbasedev)
>>   {
>> +    if (vbasedev->vm_state) {
>> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
>> +    }
>> +
>>       if (vbasedev->migration_blocker) {
>>           migrate_del_blocker(vbasedev->migration_blocker);
>>           error_free(vbasedev->migration_blocker);
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 8fe913175d85..6524734bf7b4 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,4 +149,5 @@ vfio_display_edid_write_error(void) ""
>>   
>>   # migration.c
>>   vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>> -
>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 8275c4c68f45..25e3b1a3b90a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -29,6 +29,7 @@
>>   #ifdef CONFIG_LINUX
>>   #include <linux/vfio.h>
>>   #endif
>> +#include "sysemu/sysemu.h"
>>   
>>   #define VFIO_MSG_PREFIX "vfio %s: "
>>   
>> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
>>       unsigned int flags;
>>       VFIOMigration *migration;
>>       Error *migration_blocker;
>> +    VMChangeStateEntry *vm_state;
>> +    uint32_t device_state;
>> +    int vm_running;
> 
> Could these be placed in VFIOMigration?  Thanks,
>

I think device_state should be part of VFIODevice since it's about the
device rather than only migration; the others can be moved to VFIOMigration.

Thanks,
Kirti


> Alex
> 
>>   } VFIODevice;
>>   
>>   struct VFIODeviceOps {
> 



* Re: [PATCH v26 06/17] vfio: Add migration state change notifier
  2020-09-25 20:20   ` Alex Williamson
@ 2020-10-17 20:35     ` Kirti Wankhede
  2020-10-19 17:57       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-17 20:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 1:50 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added migration state change notifier to get notification on migration state
>> change. These states are translated to VFIO device state and conveyed to vendor
>> driver.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |  5 +++--
>>   include/hw/vfio/vfio-common.h |  1 +
>>   3 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index a30d628ba963..f650fe9fc3c8 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>       }
>>   }
>>   
>> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>> +{
>> +    MigrationState *s = data;
>> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>> +    int ret;
>> +
>> +    trace_vfio_migration_state_notifier(vbasedev->name,
>> +                                        MigrationStatus_str(s->state));
>> +
>> +    switch (s->state) {
>> +    case MIGRATION_STATUS_CANCELLING:
>> +    case MIGRATION_STATUS_CANCELLED:
>> +    case MIGRATION_STATUS_FAILED:
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
>> +                      VFIO_DEVICE_STATE_RUNNING);
>> +        if (ret) {
>> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +        }
> 
> Here again the caller assumes success means the device has entered the
> desired state, but as implemented it only means the device is in some
> non-error state.
> 
>> +    }
>> +}
>> +
>>   static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>> @@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>   
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>>       return ret;
>>   }
>>   
>> @@ -263,6 +287,11 @@ add_blocker:
>>   
>>   void vfio_migration_finalize(VFIODevice *vbasedev)
>>   {
>> +
>> +    if (vbasedev->migration_state.notify) {
>> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
>> +    }
>> +
>>       if (vbasedev->vm_state) {
>>           qemu_del_vm_change_state_handler(vbasedev->vm_state);
>>       }
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 6524734bf7b4..bcb3fa7314d7 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
>>   
>>   # migration.c
>>   vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 25e3b1a3b90a..49c7c7a0e29a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>>       VMChangeStateEntry *vm_state;
>>       uint32_t device_state;
>>       int vm_running;
>> +    Notifier migration_state;
> 
> Can this live in VFIOMigration?  Thanks,
> 

No. The callback vfio_migration_state_notifier() receives a Notifier
argument and uses container_of(), as below, to reach the corresponding
device structure, so the member has to be in VFIODevice.

VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
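A standalone sketch of this pattern, with simplified stand-ins for Notifier and VFIODevice, showing how container_of() recovers the enclosing structure from a pointer to an embedded member:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Notifier {
    void (*notify)(struct Notifier *notifier, void *data);
} Notifier;

typedef struct {
    int id;
    Notifier migration_state;   /* embedded, as in VFIODevice */
} FakeVFIODevice;

/* container_of as used in QEMU: subtract the member's offset from the
 * member pointer to get back to the enclosing structure. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static int notified_id;

static void state_notifier(Notifier *notifier, void *data)
{
    FakeVFIODevice *dev =
        container_of(notifier, FakeVFIODevice, migration_state);
    notified_id = dev->id;
    (void)data;
}
```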

Thanks,
Kirti

> Alex
> 
>>   } VFIODevice;
>>   
>>   struct VFIODeviceOps {
> 



* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-29 10:19     ` Dr. David Alan Gilbert
@ 2020-10-17 20:36       ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-17 20:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Philippe Mathieu-Daudé
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk,
	pasic, felipe, Ken.Xue, kevin.tian, yan.y.zhao, alex.williamson,
	changpeng.liu, quintela, zhi.a.wang, jonathan.davies, pbonzini



On 9/29/2020 3:49 PM, Dr. David Alan Gilbert wrote:
> * Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
>> On 9/23/20 1:24 AM, Kirti Wankhede wrote:
>>> Define flags to be used as delimeter in migration file stream.
>>
>> Typo "delimiter".
>>
>>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>>> region from these functions at source during saving or pre-copy phase.
>>> Set VFIO device state depending on VM's state. During live migration, VM is
>>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>> ---
>>>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/trace-events |  2 ++
>>>   2 files changed, 93 insertions(+)
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index f650fe9fc3c8..8e8adaa25779 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -8,12 +8,15 @@
>>>    */
>>>   
>>>   #include "qemu/osdep.h"
>>> +#include "qemu/main-loop.h"
>>> +#include "qemu/cutils.h"
>>>   #include <linux/vfio.h>
>>>   
>>>   #include "sysemu/runstate.h"
>>>   #include "hw/vfio/vfio-common.h"
>>>   #include "cpu.h"
>>>   #include "migration/migration.h"
>>> +#include "migration/vmstate.h"
>>>   #include "migration/qemu-file.h"
>>>   #include "migration/register.h"
>>>   #include "migration/blocker.h"
>>> @@ -25,6 +28,17 @@
>>>   #include "trace.h"
>>>   #include "hw/hw.h"
>>>   
>>> +/*
>>> + * Flags used as delimiter:
>>> + * 0xffffffff => MSB 32-bit all 1s
>>> + * 0xef10     => emulated (virtual) function IO
>>> + * 0x0000     => 16-bits reserved for flags
>>> + */
>>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>>> +
>>>   static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>>>                                     off_t off, bool iswrite)
>>>   {
>>> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>>       return 0;
>>>   }
>>>   
>>> +/* ---------------------------------------------------------------------- */
>>> +
>>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    int ret;
>>> +
>>> +    trace_vfio_save_setup(vbasedev->name);
>>> +
>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>>> +
>>> +    if (migration->region.mmaps) {
>>> +        qemu_mutex_lock_iothread();
>>> +        ret = vfio_region_mmap(&migration->region);
>>> +        qemu_mutex_unlock_iothread();
>>> +        if (ret) {
>>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>>> +                         vbasedev->name, migration->region.nr,
>>> +                         strerror(-ret));
>>> +            error_report("%s: Falling back to slow path", vbasedev->name);
>>> +        }
>>> +    }
>>> +
>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
>>> +                                   VFIO_DEVICE_STATE_SAVING);
>>> +    if (ret) {
>>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>>> +        return ret;
>>> +    }
>>> +
>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> +
>>> +    ret = qemu_file_get_error(f);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void vfio_save_cleanup(void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    if (migration->region.mmaps) {
>>> +        vfio_region_unmap(&migration->region);
>>> +    }
>>> +    trace_vfio_save_cleanup(vbasedev->name);
>>> +}
>>> +
>>> +static SaveVMHandlers savevm_vfio_handlers = {
>>> +    .save_setup = vfio_save_setup,
>>> +    .save_cleanup = vfio_save_cleanup,
>>> +};
>>> +
>>> +/* ---------------------------------------------------------------------- */
>>> +
>>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>>                                  struct vfio_region_info *info)
>>>   {
>>>       int ret = -EINVAL;
>>> +    char id[256] = "";
>>> +    Object *obj;
>>>   
>>>       if (!vbasedev->ops->vfio_get_object) {
>>>           return ret;
>>> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>>           return ret;
>>>       }
>>>   
>>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>>> +
>>> +    if (obj) {
>>> +        DeviceState *dev = DEVICE(obj);
>>> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
>>> +
>>> +        if (oid) {
>>> +            pstrcpy(id, sizeof(id), oid);
>>> +            pstrcat(id, sizeof(id), "/");
>>> +            g_free(oid);
>>> +        }
>>> +    }
>>> +    pstrcat(id, sizeof(id), "vfio");
>>
>> Alternatively (easier to review, matter of taste):
>>
>>   g_autofree char *path = NULL;
>>
>>   if (oid) {
>>     path = g_strdup_printf("%s/vfio",
>>                            vmstate_if_get_id(VMSTATE_IF(obj)););
>>   } else {
>>     path = g_strdup("vfio");
>>   }
>>   strpadcpy(id, sizeof(id), path, '\0');
> 
> Maybe, although it's a straight copy of the magic in unregister_savevm.
> 

Ok. Changing it as Philippe suggested.
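For anyone following along without glib at hand, the id construction boils down to the following (plain-libc sketch; the function name is mine, the real code uses g_strdup_printf/strpadcpy as Philippe shows):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the savevm id: "<vmstate-id>/vfio" when the device exposes an
 * id, plain "vfio" otherwise. */
static void build_savevm_id(char *id, size_t size, const char *oid)
{
    if (oid) {
        snprintf(id, size, "%s/vfio", oid);
    } else {
        snprintf(id, size, "vfio");
    }
}
```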

Thanks,
Kirti

> Dave
> 
>>> +
>>> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>>> +                         vbasedev);
>>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>>                                                             vbasedev);
>>>       vbasedev->migration_state.notify = vfio_migration_state_notifier;
>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>> index bcb3fa7314d7..982d8dccb219 100644
>>> --- a/hw/vfio/trace-events
>>> +++ b/hw/vfio/trace-events
>>> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>>>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>>>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>>> +vfio_save_setup(const char *name) " (%s)"
>>> +vfio_save_cleanup(const char *name) " (%s)"
>>>
>>



* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-17 20:30     ` Kirti Wankhede
@ 2020-10-17 23:44       ` Alex Williamson
  2020-10-18 17:43         ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-10-17 23:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 18 Oct 2020 02:00:44 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/26/2020 1:50 AM, Alex Williamson wrote:
> > On Wed, 23 Sep 2020 04:54:07 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> VM state change handler gets called on change in VM's state. This is used to set
> >> VFIO device state to _RUNNING.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> ---
> >>   hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events          |   3 +-
> >>   include/hw/vfio/vfio-common.h |   4 ++
> >>   3 files changed, 142 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 2f760f1f9c47..a30d628ba963 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -10,6 +10,7 @@
> >>   #include "qemu/osdep.h"
> >>   #include <linux/vfio.h>
> >>   
> >> +#include "sysemu/runstate.h"
> >>   #include "hw/vfio/vfio-common.h"
> >>   #include "cpu.h"
> >>   #include "migration/migration.h"
> >> @@ -22,6 +23,58 @@
> >>   #include "exec/ram_addr.h"
> >>   #include "pci.h"
> >>   #include "trace.h"
> >> +#include "hw/hw.h"
> >> +
> >> +static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
> >> +                                  off_t off, bool iswrite)
> >> +{
> >> +    int ret;
> >> +
> >> +    ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
> >> +                    pread(vbasedev->fd, val, count, off);
> >> +    if (ret < count) {
> >> +        error_report("vfio_mig_%s%d %s: failed at offset 0x%lx, err: %s",
> >> +                     iswrite ? "write" : "read", count * 8,
> >> +                     vbasedev->name, off, strerror(errno));  
> > 
> > This would suggest from the log that there's, for example, a
> > vfio_mig_read8 function, which doesn't exist.
> >   
> 
> Changing to:
> error_report("vfio_mig_%s %d byte %s: failed at offset 0x%lx, err: %s",
>               iswrite ? "write" : "read", count,
>               vbasedev->name, off, strerror(errno));
> Hope this address your concern.
> 
> >> +        return (ret < 0) ? ret : -EINVAL;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
> >> +                       off_t off, bool iswrite)
> >> +{
> >> +    int ret, done = 0;
> >> +    __u8 *tbuf = buf;
> >> +
> >> +    while (count) {
> >> +        int bytes = 0;
> >> +
> >> +        if (count >= 8 && !(off % 8)) {
> >> +            bytes = 8;
> >> +        } else if (count >= 4 && !(off % 4)) {
> >> +            bytes = 4;
> >> +        } else if (count >= 2 && !(off % 2)) {
> >> +            bytes = 2;
> >> +        } else {
> >> +            bytes = 1;
> >> +        }
> >> +
> >> +        ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
> >> +        if (ret) {
> >> +            return ret;
> >> +        }
> >> +
> >> +        count -= bytes;
> >> +        done += bytes;
> >> +        off += bytes;
> >> +        tbuf += bytes;
> >> +    }
> >> +    return done;
> >> +}
> >> +
> >> +#define vfio_mig_read(f, v, c, o)       vfio_mig_rw(f, (__u8 *)v, c, o, false)
> >> +#define vfio_mig_write(f, v, c, o)      vfio_mig_rw(f, (__u8 *)v, c, o, true)
> >>   
> >>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
> >>   {
> >> @@ -70,6 +123,82 @@ err:
> >>       return ret;
> >>   }
> >>   
> >> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >> +                                    uint32_t value)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region;
> >> +    off_t dev_state_off = region->fd_offset +
> >> +                      offsetof(struct vfio_device_migration_info, device_state);
> >> +    uint32_t device_state;
> >> +    int ret;
> >> +
> >> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> >> +                        dev_state_off);
> >> +    if (ret < 0) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    device_state = (device_state & mask) | value;  
> > 
> > Agree with Connie that mask and value args are not immediately obvious
> > how they're used.  I don't have a naming convention that would be more
> > clear and the names do make some sense once they're understood, but a
> > comment to indicate mask bits are preserved, value bits are set,
> > remaining bits are cleared would probably help the reader.
> >   
> 
> Added comment.
> 
> >> +
> >> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> >> +                         dev_state_off);
> >> +    if (ret < 0) {
> >> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> >> +                          dev_state_off);
> >> +        if (ret < 0) {
> >> +            return ret;  
> > 
> > Seems like we're in pretty bad shape here, should this be combined with
> > below to trigger a hw_error?
> >   
> 
> Ok.
> 
> >> +        }
> >> +
> >> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> >> +            hw_error("%s: Device is in error state 0x%x",
> >> +                     vbasedev->name, device_state);
> >> +            return -EFAULT;
> >> +        }
> >> +    }
> >> +
> >> +    vbasedev->device_state = device_state;
> >> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> >> +    return 0;  
> > 
> > So we return success even if we failed to write the desired state as
> > long as we were able to read back any non-error state?
> > vbasedev->device_state remains correct, but it seems confusing form a
> > caller perspective that a set-state can succeed but it's then necessary
> > to check the state.
> >   
> 
> Correcting here. If vfio_mig_write() returns an error,
> vfio_migration_set_state() will now return that error.
> 
> >> +}
> >> +
> >> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +
> >> +    if ((vbasedev->vm_running != running)) {
> >> +        int ret;
> >> +        uint32_t value = 0, mask = 0;
> >> +
> >> +        if (running) {
> >> +            value = VFIO_DEVICE_STATE_RUNNING;
> >> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> >> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> >> +            }
> >> +        } else {
> >> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> >> +        }
> >> +
> >> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> >> +        if (ret) {
> >> +            /*
> >> +             * vm_state_notify() doesn't support reporting failure. If such
> >> +             * error reporting support added in furure, migration should be
> >> +             * aborted.
> >> +             */
> >> +            error_report("%s: Failed to set device state 0x%x",
> >> +                         vbasedev->name, value & mask);
> >> +        }  
> > 
> > Here for instance we assume that success means the device is now in the
> > desired state, but we'd actually need to evaluate
> > vbasedev->device_state to determine that.
> >   
> 
> Updating.
> 
> >> +        vbasedev->vm_running = running;
> >> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> >> +                                  value & mask);
> >> +    }
> >> +}
> >> +
> >>   static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                  struct vfio_region_info *info)
> >>   {
> >> @@ -87,8 +216,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>                        vbasedev->name);
> >>           g_free(vbasedev->migration);
> >>           vbasedev->migration = NULL;
> >> +        return ret;
> >>       }
> >>   
> >> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >> +                                                          vbasedev);
> >>       return ret;
> >>   }
> >>   
> >> @@ -131,6 +263,10 @@ add_blocker:
> >>   
> >>   void vfio_migration_finalize(VFIODevice *vbasedev)
> >>   {
> >> +    if (vbasedev->vm_state) {
> >> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >> +    }
> >> +
> >>       if (vbasedev->migration_blocker) {
> >>           migrate_del_blocker(vbasedev->migration_blocker);
> >>           error_free(vbasedev->migration_blocker);
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 8fe913175d85..6524734bf7b4 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -149,4 +149,5 @@ vfio_display_edid_write_error(void) ""
> >>   
> >>   # migration.c
> >>   vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> >> -
> >> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 8275c4c68f45..25e3b1a3b90a 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -29,6 +29,7 @@
> >>   #ifdef CONFIG_LINUX
> >>   #include <linux/vfio.h>
> >>   #endif
> >> +#include "sysemu/sysemu.h"
> >>   
> >>   #define VFIO_MSG_PREFIX "vfio %s: "
> >>   
> >> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
> >>       unsigned int flags;
> >>       VFIOMigration *migration;
> >>       Error *migration_blocker;
> >> +    VMChangeStateEntry *vm_state;
> >> +    uint32_t device_state;
> >> +    int vm_running;  
> > 
> > Could these be placed in VFIOMigration?  Thanks,
> >  
> 
I think device_state should be part of VFIODevice since it's about the 
device rather than only related to migration; the others can be moved to 
VFIOMigration.

But these are only valid when migration is supported and thus when
VFIOMigration exists.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-25 20:20   ` Alex Williamson
@ 2020-10-18 17:40     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-18 17:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 1:50 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:09 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Define flags to be used as delimiters in the migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 93 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index f650fe9fc3c8..8e8adaa25779 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -8,12 +8,15 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/main-loop.h"
>> +#include "qemu/cutils.h"
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "cpu.h"
>>   #include "migration/migration.h"
>> +#include "migration/vmstate.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/register.h"
>>   #include "migration/blocker.h"
>> @@ -25,6 +28,17 @@
>>   #include "trace.h"
>>   #include "hw/hw.h"
>>   
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>   static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>>                                     off_t off, bool iswrite)
>>   {
>> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region);
>> +        qemu_mutex_unlock_iothread();
> 
> Please add a comment identifying why the iothread mutex lock is
> necessary here.
> 
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.nr,
> 
> We don't support multiple migration regions, is it useful to include
> the region index here?
> 

Ok. Removing region.nr


>> +                         strerror(-ret));
>> +            error_report("%s: Falling back to slow path", vbasedev->name);
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
>> +                                   VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>> +        return ret;
>> +    }
> 
> Again, doesn't match the function semantics that success only means the
> device is in a non-error state, maybe the one that was asked for.
> 

Fixed in patch 05.

>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> 
> What's the overall purpose of writing these markers into the migration
> stream?  vfio_load_state() doesn't do anything with this other than
> validate that the end-of-state immediately follows.  Is this a
> placeholder for something in the future?
> 

It's not a placeholder; it is used in vfio_load_state() to determine up to 
what point to loop when fetching data for each state. Otherwise, how would 
we know when to stop reading data from the stream for that VFIO device?

>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.mmaps) {
>> +        vfio_region_unmap(&migration->region);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>>       int ret = -EINVAL;
>> +    char id[256] = "";
>> +    Object *obj;
>>   
>>       if (!vbasedev->ops->vfio_get_object) {
>>           return ret;
>> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return ret;
>>       }
>>   
>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +
>> +    if (obj) {
>> +        DeviceState *dev = DEVICE(obj);
>> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
>> +
>> +        if (oid) {
>> +            pstrcpy(id, sizeof(id), oid);
>> +            pstrcat(id, sizeof(id), "/");
>> +            g_free(oid);
>> +        }
>> +    }
> 
> Here's where vfio_migration_init() starts using vfio_get_object() as I
> referenced back on patch 04.  We might as well get the object before
> calling vfio_migration_region_init() and then pass the object.  The
> conditional branch to handle obj is strange here too, it's fatal if
> vfio_migration_region_init() doesn't find an object, why do we handle
> it as optional here?  Also, what is this doing?  Comments would be
> nice...  Thanks,
> 

Changing it as I mentioned in other patch reply.

Thanks.

> Alex
> 
>> +    pstrcat(id, sizeof(id), "vfio");
>> +
>> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>> +                         vbasedev);
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>>       vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index bcb3fa7314d7..982d8dccb219 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>> +vfio_save_setup(const char *name) " (%s)"
>> +vfio_save_cleanup(const char *name) " (%s)"
> 



* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-17 23:44       ` Alex Williamson
@ 2020-10-18 17:43         ` Kirti Wankhede
  2020-10-19 17:51           ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-18 17:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

<snip>

>>>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>>>> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 8275c4c68f45..25e3b1a3b90a 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -29,6 +29,7 @@
>>>>    #ifdef CONFIG_LINUX
>>>>    #include <linux/vfio.h>
>>>>    #endif
>>>> +#include "sysemu/sysemu.h"
>>>>    
>>>>    #define VFIO_MSG_PREFIX "vfio %s: "
>>>>    
>>>> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
>>>>        unsigned int flags;
>>>>        VFIOMigration *migration;
>>>>        Error *migration_blocker;
>>>> +    VMChangeStateEntry *vm_state;
>>>> +    uint32_t device_state;
>>>> +    int vm_running;
>>>
>>> Could these be placed in VFIOMigration?  Thanks,
>>>   
>>
>> I think device_state should be part of VFIODevice since its about device
>> rather than only related to migration, others can be moved to VFIOMigration.
> 
> But these are only valid when migration is supported and thus when
> VFIOMigration exists.  Thanks,
> 

Even though it is only used when migration is supported, it's a device 
attribute.

Thanks,
Kirti




* Re: [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-09-25 21:02   ` Alex Williamson
@ 2020-10-18 18:00     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-18 18:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 2:32 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:10 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes. If pending_bytes > 0, go through below steps.
>> - read data_offset - indicates kernel driver to write data to staging
>>    buffer.
>> - read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> - read data_size bytes of data from data_offset in the migration region.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>     doesn't need to be from vendor driver. Any other special config state
>>     from driver can be saved as data in following iteration.
>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
>> c. read data_offset - indicates kernel driver to write data to staging
>>     buffer.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>     migration region.
>> e. read data_size bytes of data from data_offset in the migration region.
>> f. Write data packet as below:
>>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> g. iterate through steps b to f while (pending_bytes > 0)
>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When the data region is mapped, it is the user's responsibility to read
>> data_size bytes of data from data_offset before moving to the next steps.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c           | 273 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   6 +
>>   include/hw/vfio/vfio-common.h |   1 +
>>   3 files changed, 280 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 8e8adaa25779..4611bb972228 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -180,6 +180,154 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
>> +                                   uint64_t data_size, uint64_t *size)
>> +{
>> +    void *ptr = NULL;
>> +    uint64_t limit = 0;
>> +    int i;
>> +
>> +    if (!region->mmaps) {
>> +        if (size) {
>> +            *size = data_size;
>> +        }
>> +        return ptr;
>> +    }
>> +
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        VFIOMmap *map = region->mmaps + i;
>> +
>> +        if ((data_offset >= map->offset) &&
>> +            (data_offset < map->offset + map->size)) {
>> +
>> +            /* check if data_offset is within sparse mmap areas */
>> +            ptr = map->mmap + data_offset - map->offset;
>> +            if (size) {
>> +                *size = MIN(data_size, map->offset + map->size - data_offset);
>> +            }
>> +            break;
>> +        } else if ((data_offset < map->offset) &&
>> +                   (!limit || limit > map->offset)) {
>> +            /*
>> +             * data_offset is not within sparse mmap areas, find size of
>> +             * non-mapped area. Check through all list since region->mmaps list
>> +             * is not sorted.
>> +             */
>> +            limit = map->offset;
>> +        }
>> +    }
>> +
>> +    if (!ptr && size) {
>> +        *size = limit ? limit - data_offset : data_size;
> 
> 'limit - data_offset' doesn't take data_size into account, this should
> be MIN(data_size, limit - data_offset).
> 

Done.

>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t data_offset = 0, data_size = 0, sz;
>> +    int ret;
>> +
>> +    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>> +                           migration->pending_bytes);
>> +
>> +    qemu_put_be64(f, data_size);
>> +    sz = data_size;
>> +
>> +    while (sz) {
>> +        void *buf = NULL;
> 
> Unnecessary initialization.
> 
>> +        uint64_t sec_size;
>> +        bool buf_allocated = false;
>> +
>> +        buf = get_data_section_size(region, data_offset, sz, &sec_size);
>> +
>> +        if (!buf) {
>> +            buf = g_try_malloc(sec_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                return -ENOMEM;
>> +            }
>> +            buf_allocated = true;
>> +
>> +            ret = vfio_mig_read(vbasedev, buf, sec_size,
>> +                                region->fd_offset + data_offset);
>> +            if (ret < 0) {
>> +                g_free(buf);
>> +                return ret;
>> +            }
>> +        }
>> +
>> +        qemu_put_buffer(f, buf, sec_size);
>> +
>> +        if (buf_allocated) {
>> +            g_free(buf);
>> +        }
>> +        sz -= sec_size;
>> +        data_offset += sec_size;
>> +    }
>> +
>> +    ret = qemu_file_get_error(f);
>> +
>> +    if (!ret && size) {
>> +        *size = data_size;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int vfio_update_pending(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t pending_bytes = 0;
>> +    int ret;
>> +
>> +    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             pending_bytes));
>> +    if (ret < 0) {
>> +        migration->pending_bytes = 0;
>> +        return ret;
>> +    }
>> +
>> +    migration->pending_bytes = pending_bytes;
>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
>> +        vbasedev->ops->vfio_save_config(vbasedev, f);
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    trace_vfio_save_device_config_state(vbasedev->name);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -232,9 +380,134 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>   
>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>> +                              uint64_t threshold_size,
>> +                              uint64_t *res_precopy_only,
>> +                              uint64_t *res_compatible,
>> +                              uint64_t *res_postcopy_only)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return;
>> +    }
>> +
>> +    *res_precopy_only += migration->pending_bytes;
>> +
>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
>> +                            *res_postcopy_only, *res_compatible);
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    uint64_t data_size;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +
>> +    if (migration->pending_bytes == 0) {
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +
>> +        if (migration->pending_bytes == 0) {
>> +            /* indicates data finished, goto complete phase */
>> +            return 1;
>> +        }
>> +    }
>> +
>> +    ret = vfio_save_buffer(f, vbasedev, &data_size);
>> +
>> +    if (ret) {
>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
>> +                     strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    uint64_t data_size;
>> +    int ret;
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
>> +                                   VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOP and SAVING",
>> +                     vbasedev->name);
>> +        return ret;
>> +    }
> 
> This also assumes success implies desired state.
> 
>> +
>> +    ret = vfio_save_device_config_state(f, opaque);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    while (migration->pending_bytes > 0) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +        ret = vfio_save_buffer(f, vbasedev, &data_size);
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to save buffer", vbasedev->name);
>> +            return ret;
>> +        }
>> +
>> +        if (data_size == 0) {
>> +            break;
>> +        }
>> +
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
>> +        return ret;
>> +    }
> 
> And another.  Thanks,
>


Fixing in patch 5.

Thanks,
Kirti


> Alex
> 
>> +
>> +    trace_vfio_save_complete_precopy(vbasedev->name);
>> +    return ret;
>> +}
>> +
>>   static SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_iterate = vfio_save_iterate,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>   };
>>   
>>   /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 982d8dccb219..118b5547c921 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -154,3 +154,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>>   vfio_save_setup(const char *name) " (%s)"
>>   vfio_save_cleanup(const char *name) " (%s)"
>> +vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
>> +vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
>> +vfio_save_device_config_state(const char *name) " (%s)"
>> +vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>> +vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
>> +vfio_save_complete_precopy(const char *name) " (%s)"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 49c7c7a0e29a..471e444a364c 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>>   
>>   typedef struct VFIOMigration {
>>       VFIORegion region;
>> +    uint64_t pending_bytes;
>>   } VFIOMigration;
>>   
>>   typedef struct VFIOAddressSpace {
> 



* Re: [PATCH v26 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-10-01 10:07   ` Cornelia Huck
@ 2020-10-18 20:47     ` Kirti Wankhede
  2020-10-20 16:25       ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-18 20:47 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini



On 10/1/2020 3:37 PM, Cornelia Huck wrote:
> On Wed, 23 Sep 2020 04:54:11 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Sequence  during _RESUMING device state:
>> While data for this device is available, repeat below steps:
>> a. read data_offset from where user application should write data.
>> b. write data of data_size to migration region from data_offset.
>> c. write data_size, which indicates to the vendor driver that data is
>>     written in the staging buffer.
>>
>> For user, data is opaque. User should write data in the same order as
>> received.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c  | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   3 +
>>   2 files changed, 173 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 4611bb972228..ffd70282dd0e 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -328,6 +328,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    uint64_t data;
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>> +        int ret;
>> +
>> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
>> +        if (ret) {
>> +            error_report("%s: Failed to load device config space",
>> +                         vbasedev->name);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    data = qemu_get_be64(f);
>> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +        error_report("%s: Failed loading device config space, "
>> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> 
> I'm confused here: If we don't have a vfio_load_config callback, or if
> that callback did not read everything, we also might end up with a
> value that's not END_OF_STATE... in that case, the problem is not with
> the stream, but rather with the consumer?

Right, hence "end flag incorrect" is reported.

> 
>> +        return -EINVAL;
>> +    }
>> +
>> +    trace_vfio_load_device_config_state(vbasedev->name);
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -502,12 +529,155 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       return ret;
>>   }
>>   
>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +
>> +    if (migration->region.mmaps) {
>> +        ret = vfio_region_mmap(&migration->region);
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.nr,
>> +                         strerror(-ret));
>> +            error_report("%s: Falling back to slow path", vbasedev->name);
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
>> +                                   VFIO_DEVICE_STATE_RESUMING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
>> +    }
>> +    return ret;
> 
> If I follow the code correctly, the cleanup callback will not be
> invoked if you return != 0 here... should you clean up possible
> mappings on error here?
> 

Makes sense, adding region unmap on error.

>> +}
>> +
>> +static int vfio_load_cleanup(void *opaque)
>> +{
>> +    vfio_save_cleanup(opaque);
>> +    return 0;
>> +}
>> +
>> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +    uint64_t data, data_size;
>> +
>> +    data = qemu_get_be64(f);
>> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +
>> +        trace_vfio_load_state(vbasedev->name, data);
>> +
>> +        switch (data) {
>> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>> +        {
>> +            ret = vfio_load_device_config_state(f, opaque);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +            break;
>> +        }
>> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>> +        {
>> +            data = qemu_get_be64(f);
>> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
>> +                return ret;
>> +            } else {
>> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
>> +                             vbasedev->name, data);
>> +                return -EINVAL;
>> +            }
>> +            break;
>> +        }
>> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
>> +        {
>> +            VFIORegion *region = &migration->region;
>> +            uint64_t data_offset = 0, size;
> 
> I think this function would benefit from splitting this off into a
> function handling DEV_DATA_STATE. It is quite hard to follow through
> all the checks and find out when we continue, and when we break off.
> 

Each switch case has a break; we break out on success, whereas we return 
an error whenever we encounter a case where (ret < 0).


> Some documentation about the markers would also be really helpful.

Sure adding it in patch 07, where these are defined.

> The logic seems to be:
> - DEV_CONFIG_STATE has config data and must be ended by END_OF_STATE
Right

> - DEV_SETUP_STATE has only END_OF_STATE, no data
Right now there is no data, but this is a provision to add data if 
required in the future.

> - DEV_DATA_STATE has... data; if there's any END_OF_STATE, it's buried
>    far down in the called functions
>

That is not correct; END_OF_STATE comes after the data. I moved the data 
buffer loading logic into vfio_load_buffer(), so the DEV_DATA_STATE case 
looks simplified as below. Hope this helps.

         case VFIO_MIG_FLAG_DEV_DATA_STATE:
         {
             uint64_t data_size;

             data_size = qemu_get_be64(f);
             if (data_size == 0) {
                 break;
             }

             ret = vfio_load_buffer(f, vbasedev, data_size);
             if (ret < 0) {
                 return ret;
             }
             break;
         }

Also, in my next version, vfio_load_buffer() handles the case where 
data_size is greater than the data section of the migration region at the 
destination.

Thanks,
Kirti

> 
>> +
>> +            data_size = size = qemu_get_be64(f);
>> +            if (data_size == 0) {
>> +                break;
>> +            }
>> +
>> +            ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
>> +                                region->fd_offset +
>> +                                offsetof(struct vfio_device_migration_info,
>> +                                         data_offset));
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +
>> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
>> +                                              data_size);
>> +
>> +            while (size) {
>> +                void *buf = NULL;
>> +                uint64_t sec_size;
>> +                bool buf_alloc = false;
>> +
>> +                buf = get_data_section_size(region, data_offset, size,
>> +                                            &sec_size);
>> +
>> +                if (!buf) {
>> +                    buf = g_try_malloc(sec_size);
>> +                    if (!buf) {
>> +                        error_report("%s: Error allocating buffer ", __func__);
>> +                        return -ENOMEM;
>> +                    }
>> +                    buf_alloc = true;
>> +                }
>> +
>> +                qemu_get_buffer(f, buf, sec_size);
>> +
>> +                if (buf_alloc) {
>> +                    ret = vfio_mig_write(vbasedev, buf, sec_size,
>> +                                         region->fd_offset + data_offset);
>> +                    g_free(buf);
>> +
>> +                    if (ret < 0) {
>> +                        return ret;
>> +                    }
>> +                }
>> +                size -= sec_size;
>> +                data_offset += sec_size;
>> +            }
>> +
>> +            ret = vfio_mig_write(vbasedev, &data_size, sizeof(data_size),
>> +                                 region->fd_offset +
>> +                       offsetof(struct vfio_device_migration_info, data_size));
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +            break;
>> +        }
>> +
>> +        default:
>> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>> +            return -EINVAL;
>> +        }
>> +
>> +        data = qemu_get_be64(f);
>> +        ret = qemu_file_get_error(f);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>>       .save_live_pending = vfio_save_pending,
>>       .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>> +    .load_setup = vfio_load_setup,
>> +    .load_cleanup = vfio_load_cleanup,
> 
> Unrelated to this patch: It's a bit odd that load_cleanup() (unlike
> save_cleanup()) has a return code (that seems unused).
> 
>> +    .load_state = vfio_load_state,
>>   };
>>   
>>   /* ---------------------------------------------------------------------- */
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-09-25 21:55   ` Alex Williamson
@ 2020-10-18 20:52     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-18 20:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 3:25 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:14 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
>> for VFIO devices.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 4306f6316417..822b68b4e015 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -11,6 +11,7 @@
>>   #include "qemu/main-loop.h"
>>   #include "qemu/cutils.h"
>>   #include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>>   
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>> @@ -355,6 +356,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
>> +{
>> +    int ret;
>> +    VFIOContainer *container = vbasedev->group->container;
>> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
>> +        .argsz = sizeof(dirty),
>> +    };
>> +
>> +    if (start) {
>> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>> +        } else {
>> +            return -EINVAL;
>> +        }
>> +    } else {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
>> +    }
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>> +    if (ret) {
>> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>> +                     dirty.flags, errno);
>> +    }
> 
> Maybe doesn't matter in the long run, but do you want to use -errno for
> the return rather than -1 from the ioctl on error?  Thanks,
> 

Makes sense. Changing it.

Thanks,
Kirti

> Alex
> 
>> +    return ret;
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -386,6 +413,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>           return ret;
>>       }
>>   
>> +    ret = vfio_set_dirty_page_tracking(vbasedev, true);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>   
>>       ret = qemu_file_get_error(f);
>> @@ -401,6 +433,8 @@ static void vfio_save_cleanup(void *opaque)
>>       VFIODevice *vbasedev = opaque;
>>       VFIOMigration *migration = vbasedev->migration;
>>   
>> +    vfio_set_dirty_page_tracking(vbasedev, false);
>> +
>>       if (migration->region.mmaps) {
>>           vfio_region_unmap(&migration->region);
>>       }
>> @@ -734,6 +768,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>           if (ret) {
>>               error_report("%s: Failed to set state RUNNING", vbasedev->name);
>>           }
>> +
>> +        vfio_set_dirty_page_tracking(vbasedev, false);
>>       }
>>   }
>>   
> 



* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-09-25 11:53   ` Cornelia Huck
@ 2020-10-18 20:55     ` Kirti Wankhede
  2020-10-20 15:51       ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-18 20:55 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini



On 9/25/2020 5:23 PM, Cornelia Huck wrote:
> On Wed, 23 Sep 2020 04:54:09 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Define flags to be used as delimeter in migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 93 insertions(+)
>>
> 
> (...)
> 
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
> 
> Where is this value coming from?
> 

Delimiter flags should be unique in the migration stream, and this is a 
magic number: (e)mulated (f)unction (10), representing IO.

>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> 
> I think we need some more documentation what these values mean and how
> they are used. From reading ahead a bit, it seems there is always
> supposed to be a pair of DEV_*_STATE and END_OF_STATE framing some kind
> of data?
> 

Adding comment as below, hope it helps.

/*
  * Flags used as delimiters for VFIO devices should be unique in the
  * migration stream. These flags are composed as:
  * 0xffffffff => MSB 32-bit all 1s
  * 0xef10     => Magic ID, represents emulated (virtual) function IO
  * 0x0000     => 16-bits reserved for flags
  *
  * The flags _DEV_CONFIG_STATE, _DEV_SETUP_STATE and _DEV_DATA_STATE mark
  * the start of the respective states in the migration stream.
  * The flag _END_OF_STATE indicates the end of the current state, which
  * could be any of the above states.
  */

Thanks,
Kirti




* Re: [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-09-25 22:23   ` Alex Williamson
@ 2020-10-19  6:01     ` Kirti Wankhede
  2020-10-19 17:24       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-19  6:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 9/26/2020 3:53 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:15 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Create mapped iova list when vIOMMU is enabled. For each mapped iova
>> save translated address. Add node to list on MAP and remove node from
>> list on UNMAP.
>> This list is used to track dirty pages during migration.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
>>   include/hw/vfio/vfio-common.h |  8 ++++++
>>   2 files changed, 60 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index d4959c036dd1..dc56cded2d95 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -407,8 +407,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>   }
>>   
>>   /* Called with rcu_read_lock held.  */
>> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>> -                           bool *read_only)
>> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> +                               ram_addr_t *ram_addr, bool *read_only)
>>   {
>>       MemoryRegion *mr;
>>       hwaddr xlat;
>> @@ -439,8 +439,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>>           return false;
>>       }
>>   
>> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> -    *read_only = !writable || mr->readonly;
>> +    if (vaddr) {
>> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> +    }
>> +
>> +    if (ram_addr) {
>> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>> +    }
>> +
>> +    if (read_only) {
>> +        *read_only = !writable || mr->readonly;
>> +    }
>>   
>>       return true;
>>   }
>> @@ -450,7 +459,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>       VFIOContainer *container = giommu->container;
>>       hwaddr iova = iotlb->iova + giommu->iommu_offset;
>> -    bool read_only;
>>       void *vaddr;
>>       int ret;
>>   
>> @@ -466,7 +474,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       rcu_read_lock();
>>   
>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
>> +        ram_addr_t ram_addr;
>> +        bool read_only;
>> +
>> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
>>               goto out;
>>           }
>>           /*
>> @@ -484,8 +495,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>                            "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>                            container, iova,
>>                            iotlb->addr_mask + 1, vaddr, ret);
>> +        } else {
>> +            VFIOIovaRange *iova_range;
>> +
>> +            iova_range = g_malloc0(sizeof(*iova_range));
>> +            iova_range->iova = iova;
>> +            iova_range->size = iotlb->addr_mask + 1;
>> +            iova_range->ram_addr = ram_addr;
>> +
>> +            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
>>           }
>>       } else {
>> +        VFIOIovaRange *iova_range, *tmp;
>> +
>> +        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
>> +            if (iova_range->iova >= iova &&
>> +                iova_range->iova + iova_range->size <= iova +
>> +                                                       iotlb->addr_mask + 1) {
>> +                QLIST_REMOVE(iova_range, next);
>> +                g_free(iova_range);
>> +            }
>> +        }
>> +
> 
> 
> This is some pretty serious overhead... can't we trigger a replay when
> migration is enabled to build this information then? 

Are you suggesting to call memory_region_iommu_replay() before 
vfio_sync_dirty_bitmap(), which would call vfio_iommu_map_notify(), where 
the iova list of mappings is maintained? Then, in the notifier, check if 
migration_is_running() and container->dirty_pages_supported == true, and 
only then create the iova mapping tree? In that case, how would we know 
that this is triggered by
vfio_sync_dirty_bitmap()
  -> memory_region_iommu_replay()
and that we don't have to call vfio_dma_map()?

> We're looking at
> potentially thousands of entries, so a list is probably also not a good
> choice. 

Changing it to a tree.

Thanks,
Kirti

> I don't think it's acceptable to incur this even when not
> migrating (ie. the vast majority of the time).  Thanks,
> 
> Alex
> 
>>           ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>>           if (ret) {
>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> @@ -642,6 +673,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>               g_free(giommu);
>>               goto fail;
>>           }
>> +        QLIST_INIT(&giommu->iova_list);
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>>           memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>   
>> @@ -740,6 +772,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>>               if (MEMORY_REGION(giommu->iommu) == section->mr &&
>>                   giommu->n.start == section->offset_within_region) {
>> +                VFIOIovaRange *iova_range, *tmp;
>> +
>> +                QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
>> +                    QLIST_REMOVE(iova_range, next);
>> +                    g_free(iova_range);
>> +                }
>> +
>>                   memory_region_unregister_iommu_notifier(section->mr,
>>                                                           &giommu->n);
>>                   QLIST_REMOVE(giommu, giommu_next);
>> @@ -1541,6 +1580,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>           QLIST_REMOVE(container, next);
>>   
>>           QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
>> +            VFIOIovaRange *iova_range, *itmp;
>> +
>> +            QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, itmp) {
>> +                QLIST_REMOVE(iova_range, next);
>> +                g_free(iova_range);
>> +            }
>> +
>>               memory_region_unregister_iommu_notifier(
>>                       MEMORY_REGION(giommu->iommu), &giommu->n);
>>               QLIST_REMOVE(giommu, giommu_next);
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 0a1651eda2d0..aa7524fe2cc5 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -89,11 +89,19 @@ typedef struct VFIOContainer {
>>       QLIST_ENTRY(VFIOContainer) next;
>>   } VFIOContainer;
>>   
>> +typedef struct VFIOIovaRange {
>> +    hwaddr iova;
>> +    size_t size;
>> +    ram_addr_t ram_addr;
>> +    QLIST_ENTRY(VFIOIovaRange) next;
>> +} VFIOIovaRange;
>> +
>>   typedef struct VFIOGuestIOMMU {
>>       VFIOContainer *container;
>>       IOMMUMemoryRegion *iommu;
>>       hwaddr iommu_offset;
>>       IOMMUNotifier n;
>> +    QLIST_HEAD(, VFIOIovaRange) iova_list;
>>       QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>>   } VFIOGuestIOMMU;
>>   
> 



* Re: [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-10-19  6:01     ` Kirti Wankhede
@ 2020-10-19 17:24       ` Alex Williamson
  2020-10-19 19:15         ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-10-19 17:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Mon, 19 Oct 2020 11:31:03 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/26/2020 3:53 AM, Alex Williamson wrote:
> > On Wed, 23 Sep 2020 04:54:15 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Create mapped iova list when vIOMMU is enabled. For each mapped iova
> >> save translated address. Add node to list on MAP and remove node from
> >> list on UNMAP.
> >> This list is used to track dirty pages during migration.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> ---
> >>   hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
> >>   include/hw/vfio/vfio-common.h |  8 ++++++
> >>   2 files changed, 60 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index d4959c036dd1..dc56cded2d95 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -407,8 +407,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>   }
> >>   
> >>   /* Called with rcu_read_lock held.  */
> >> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> >> -                           bool *read_only)
> >> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> >> +                               ram_addr_t *ram_addr, bool *read_only)
> >>   {
> >>       MemoryRegion *mr;
> >>       hwaddr xlat;
> >> @@ -439,8 +439,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> >>           return false;
> >>       }
> >>   
> >> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >> -    *read_only = !writable || mr->readonly;
> >> +    if (vaddr) {
> >> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >> +    }
> >> +
> >> +    if (ram_addr) {
> >> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> >> +    }
> >> +
> >> +    if (read_only) {
> >> +        *read_only = !writable || mr->readonly;
> >> +    }
> >>   
> >>       return true;
> >>   }
> >> @@ -450,7 +459,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>       VFIOContainer *container = giommu->container;
> >>       hwaddr iova = iotlb->iova + giommu->iommu_offset;
> >> -    bool read_only;
> >>       void *vaddr;
> >>       int ret;
> >>   
> >> @@ -466,7 +474,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>       rcu_read_lock();
> >>   
> >>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> >> +        ram_addr_t ram_addr;
> >> +        bool read_only;
> >> +
> >> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
> >>               goto out;
> >>           }
> >>           /*
> >> @@ -484,8 +495,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>                            "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>                            container, iova,
> >>                            iotlb->addr_mask + 1, vaddr, ret);
> >> +        } else {
> >> +            VFIOIovaRange *iova_range;
> >> +
> >> +            iova_range = g_malloc0(sizeof(*iova_range));
> >> +            iova_range->iova = iova;
> >> +            iova_range->size = iotlb->addr_mask + 1;
> >> +            iova_range->ram_addr = ram_addr;
> >> +
> >> +            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
> >>           }
> >>       } else {
> >> +        VFIOIovaRange *iova_range, *tmp;
> >> +
> >> +        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> >> +            if (iova_range->iova >= iova &&
> >> +                iova_range->iova + iova_range->size <= iova +
> >> +                                                       iotlb->addr_mask + 1) {
> >> +                QLIST_REMOVE(iova_range, next);
> >> +                g_free(iova_range);
> >> +            }
> >> +        }
> >> +  
> > 
> > 
> > This is some pretty serious overhead... can't we trigger a replay when
> > migration is enabled to build this information then?   
> 
> Are you suggesting to call memory_region_iommu_replay() before 
> vfio_sync_dirty_bitmap(), which would call vfio_iommu_map_notify() where 
> iova list of mapping is maintained? Then in the notifer check if 
> migration_is_running() and container->dirty_pages_supported == true, 
> then only create iova mapping tree? In this case how would we know that 
> this is triggered by
> vfio_sync_dirty_bitmap()
>   -> memory_region_iommu_replay()  
> and we don't have to call vfio_dma_map()?

memory_region_iommu_replay() calls a notifier of our choice, so we
could create a notifier specifically for creating this tree when dirty
logging is enabled.  Thanks,

Alex

> > We're looking at
> > potentially thousands of entries, so a list is probably also not a good
> > choice.   
> 
> Changing it to tree.
> 
> Thanks,
> Kirti
> 
> > I don't think it's acceptable to incur this even when not
> > migrating (ie. the vast majority of the time).  Thanks,
> > 
> > Alex
> >   
> >>           ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >>           if (ret) {
> >>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >> @@ -642,6 +673,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>               g_free(giommu);
> >>               goto fail;
> >>           }
> >> +        QLIST_INIT(&giommu->iova_list);
> >>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >>           memory_region_iommu_replay(giommu->iommu, &giommu->n);
> >>   
> >> @@ -740,6 +772,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >>               if (MEMORY_REGION(giommu->iommu) == section->mr &&
> >>                   giommu->n.start == section->offset_within_region) {
> >> +                VFIOIovaRange *iova_range, *tmp;
> >> +
> >> +                QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> >> +                    QLIST_REMOVE(iova_range, next);
> >> +                    g_free(iova_range);
> >> +                }
> >> +
> >>                   memory_region_unregister_iommu_notifier(section->mr,
> >>                                                           &giommu->n);
> >>                   QLIST_REMOVE(giommu, giommu_next);
> >> @@ -1541,6 +1580,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >>           QLIST_REMOVE(container, next);
> >>   
> >>           QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> >> +            VFIOIovaRange *iova_range, *itmp;
> >> +
> >> +            QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, itmp) {
> >> +                QLIST_REMOVE(iova_range, next);
> >> +                g_free(iova_range);
> >> +            }
> >> +
> >>               memory_region_unregister_iommu_notifier(
> >>                       MEMORY_REGION(giommu->iommu), &giommu->n);
> >>               QLIST_REMOVE(giommu, giommu_next);
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 0a1651eda2d0..aa7524fe2cc5 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -89,11 +89,19 @@ typedef struct VFIOContainer {
> >>       QLIST_ENTRY(VFIOContainer) next;
> >>   } VFIOContainer;
> >>   
> >> +typedef struct VFIOIovaRange {
> >> +    hwaddr iova;
> >> +    size_t size;
> >> +    ram_addr_t ram_addr;
> >> +    QLIST_ENTRY(VFIOIovaRange) next;
> >> +} VFIOIovaRange;
> >> +
> >>   typedef struct VFIOGuestIOMMU {
> >>       VFIOContainer *container;
> >>       IOMMUMemoryRegion *iommu;
> >>       hwaddr iommu_offset;
> >>       IOMMUNotifier n;
> >> +    QLIST_HEAD(, VFIOIovaRange) iova_list;
> >>       QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> >>   } VFIOGuestIOMMU;
> >>     
> >   
> 




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-18 17:43         ` Kirti Wankhede
@ 2020-10-19 17:51           ` Alex Williamson
  2020-10-20 10:23             ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-10-19 17:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 18 Oct 2020 23:13:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> <snip>
> 
> >>>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >>>> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>> index 8275c4c68f45..25e3b1a3b90a 100644
> >>>> --- a/include/hw/vfio/vfio-common.h
> >>>> +++ b/include/hw/vfio/vfio-common.h
> >>>> @@ -29,6 +29,7 @@
> >>>>    #ifdef CONFIG_LINUX
> >>>>    #include <linux/vfio.h>
> >>>>    #endif
> >>>> +#include "sysemu/sysemu.h"
> >>>>    
> >>>>    #define VFIO_MSG_PREFIX "vfio %s: "
> >>>>    
> >>>> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
> >>>>        unsigned int flags;
> >>>>        VFIOMigration *migration;
> >>>>        Error *migration_blocker;
> >>>> +    VMChangeStateEntry *vm_state;
> >>>> +    uint32_t device_state;
> >>>> +    int vm_running;  
> >>>
> >>> Could these be placed in VFIOMigration?  Thanks,
> >>>     
> >>
> >> I think device_state should be part of VFIODevice since its about device
> >> rather than only related to migration, others can be moved to VFIOMigration.  
> > 
> > But these are only valid when migration is supported and thus when
> > VFIOMigration exists.  Thanks,
> >   
> 
Even though it is used only when migration is supported, it is a device 
attribute.

device_state is a local copy of the migration region register, so it
serves no purpose when a migration region is not present.  In fact the
initial value would indicate the device is stopped, which is incorrect.
vm_running is never initialized and cannot be set other than through a
migration region update of device_state, so at least two of these
values show incorrect state when migration is not supported by the
device.  vm_state is unused when migration isn't present, so if nothing
else the pointer here is wasteful.  It's not clear to me what
justification is being presented here as a "device's attribute",
supporting migration as indicated by a non-NULL migration pointer is
also a device attribute and these are attributes further defining the
state of that support.

BTW, device_state is used in patch 03/ but only defined here in 05/, so
the series would fail to compile on bisect.  Thanks,

Alex




* Re: [PATCH v26 06/17] vfio: Add migration state change notifier
  2020-10-17 20:35     ` Kirti Wankhede
@ 2020-10-19 17:57       ` Alex Williamson
  2020-10-20 10:55         ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2020-10-19 17:57 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 18 Oct 2020 02:05:03 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/26/2020 1:50 AM, Alex Williamson wrote:
> > On Wed, 23 Sep 2020 04:54:08 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Added migration state change notifier to get notification on migration state
> >> change. These states are translated to VFIO device state and conveyed to vendor
> >> driver.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> ---
> >>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events          |  5 +++--
> >>   include/hw/vfio/vfio-common.h |  1 +
> >>   3 files changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index a30d628ba963..f650fe9fc3c8 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>       }
> >>   }
> >>   
> >> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> >> +{
> >> +    MigrationState *s = data;
> >> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> >> +    int ret;
> >> +
> >> +    trace_vfio_migration_state_notifier(vbasedev->name,
> >> +                                        MigrationStatus_str(s->state));
> >> +
> >> +    switch (s->state) {
> >> +    case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_CANCELLED:
> >> +    case MIGRATION_STATUS_FAILED:
> >> +        ret = vfio_migration_set_state(vbasedev,
> >> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> >> +                      VFIO_DEVICE_STATE_RUNNING);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> >> +        }  
> > 
> > Here again the caller assumes success means the device has entered the
> > desired state, but as implemented it only means the device is in some
> > non-error state.
> >   
> >> +    }
> >> +}
> >> +
> >>   static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                  struct vfio_region_info *info)
> >>   {
> >> @@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>   
> >>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >>                                                             vbasedev);
> >> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> >> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> >>       return ret;
> >>   }
> >>   
> >> @@ -263,6 +287,11 @@ add_blocker:
> >>   
> >>   void vfio_migration_finalize(VFIODevice *vbasedev)
> >>   {
> >> +
> >> +    if (vbasedev->migration_state.notify) {
> >> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> >> +    }
> >> +
> >>       if (vbasedev->vm_state) {
> >>           qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >>       }
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 6524734bf7b4..bcb3fa7314d7 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
> >>   
> >>   # migration.c
> >>   vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> >> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> >> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 25e3b1a3b90a..49c7c7a0e29a 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> >>       VMChangeStateEntry *vm_state;
> >>       uint32_t device_state;
> >>       int vm_running;
> >> +    Notifier migration_state;  
> > 
> > Can this live in VFIOMigration?  Thanks,
> >   
> 
> No, the callback vfio_migration_state_notifier() takes a notifier argument, 
> and to reach the corresponding device structure as below, it has to be in 
> VFIODevice.
> 
> VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);

An alternative would be to place migration_state within VFIOMigration,
along with a pointer back to vbasedev (like we do in VFIORegion) then
the notifier could use container_of to get the VFIOMigration structure,
from which we could get to the VFIODevice via the vbasedev pointer.
This would better compartmentalize the migration related data into a
single structure.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-10-19 17:24       ` Alex Williamson
@ 2020-10-19 19:15         ` Kirti Wankhede
  2020-10-19 20:07           ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-19 19:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 10/19/2020 10:54 PM, Alex Williamson wrote:
> On Mon, 19 Oct 2020 11:31:03 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/26/2020 3:53 AM, Alex Williamson wrote:
>>> On Wed, 23 Sep 2020 04:54:15 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> Create mapped iova list when vIOMMU is enabled. For each mapped iova
>>>> save translated address. Add node to list on MAP and remove node from
>>>> list on UNMAP.
>>>> This list is used to track dirty pages during migration.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> ---
>>>>    hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
>>>>    include/hw/vfio/vfio-common.h |  8 ++++++
>>>>    2 files changed, 60 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index d4959c036dd1..dc56cded2d95 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -407,8 +407,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>    }
>>>>    
>>>>    /* Called with rcu_read_lock held.  */
>>>> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>>>> -                           bool *read_only)
>>>> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>>>> +                               ram_addr_t *ram_addr, bool *read_only)
>>>>    {
>>>>        MemoryRegion *mr;
>>>>        hwaddr xlat;
>>>> @@ -439,8 +439,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>>>>            return false;
>>>>        }
>>>>    
>>>> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>> -    *read_only = !writable || mr->readonly;
>>>> +    if (vaddr) {
>>>> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>>>> +    }
>>>> +
>>>> +    if (ram_addr) {
>>>> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>>>> +    }
>>>> +
>>>> +    if (read_only) {
>>>> +        *read_only = !writable || mr->readonly;
>>>> +    }
>>>>    
>>>>        return true;
>>>>    }
>>>> @@ -450,7 +459,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>>>        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>        VFIOContainer *container = giommu->container;
>>>>        hwaddr iova = iotlb->iova + giommu->iommu_offset;
>>>> -    bool read_only;
>>>>        void *vaddr;
>>>>        int ret;
>>>>    
>>>> @@ -466,7 +474,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>>>        rcu_read_lock();
>>>>    
>>>>        if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>>> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
>>>> +        ram_addr_t ram_addr;
>>>> +        bool read_only;
>>>> +
>>>> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
>>>>                goto out;
>>>>            }
>>>>            /*
>>>> @@ -484,8 +495,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>>>                             "0x%"HWADDR_PRIx", %p) = %d (%m)",
>>>>                             container, iova,
>>>>                             iotlb->addr_mask + 1, vaddr, ret);
>>>> +        } else {
>>>> +            VFIOIovaRange *iova_range;
>>>> +
>>>> +            iova_range = g_malloc0(sizeof(*iova_range));
>>>> +            iova_range->iova = iova;
>>>> +            iova_range->size = iotlb->addr_mask + 1;
>>>> +            iova_range->ram_addr = ram_addr;
>>>> +
>>>> +            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
>>>>            }
>>>>        } else {
>>>> +        VFIOIovaRange *iova_range, *tmp;
>>>> +
>>>> +        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
>>>> +            if (iova_range->iova >= iova &&
>>>> +                iova_range->iova + iova_range->size <= iova +
>>>> +                                                       iotlb->addr_mask + 1) {
>>>> +                QLIST_REMOVE(iova_range, next);
>>>> +                g_free(iova_range);
>>>> +            }
>>>> +        }
>>>> +
>>>
>>>
>>> This is some pretty serious overhead... can't we trigger a replay when
>>> migration is enabled to build this information then?
>>
>> Are you suggesting to call memory_region_iommu_replay() before
>> vfio_sync_dirty_bitmap(), which would call vfio_iommu_map_notify() where
>> iova list of mapping is maintained? Then in the notifier check if
>> migration_is_running() and container->dirty_pages_supported == true,
>> then only create iova mapping tree? In this case how would we know that
>> this is triggered by
>> vfio_sync_dirty_bitmap()
>>    -> memory_region_iommu_replay()
>> and we don't have to call vfio_dma_map()?
> 
> memory_region_iommu_replay() calls a notifier of our choice, so we
> could create a notifier specifically for creating this tree when dirty
> logging is enabled.  Thanks,
> 

This would also mean changes in intel_iommu.c so that it walks 
through the iova_tree and calls the notifier for each entry in the iova_tree. 
What about other platforms? Would we have to handle such cases for AMD, 
ARM, PPC, etc.?
I don't see a replay callback for AMD; that would result in a walk at the 
minimum IOMMU-supported page size granularity - which is similar to what I 
tried to implement 2-3 versions back.
Does that mean making such a change would improve performance for the Intel 
IOMMU but worsen it for AMD/PPC?

I'm changing the list to a tree as the first level of improvement in this patch.

Can we do the change you suggested above later, as the next level of improvement?

Thanks,
Kirti



* Re: [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-10-19 19:15         ` Kirti Wankhede
@ 2020-10-19 20:07           ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2020-10-19 20:07 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Tue, 20 Oct 2020 00:45:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/19/2020 10:54 PM, Alex Williamson wrote:
> > On Mon, 19 Oct 2020 11:31:03 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/26/2020 3:53 AM, Alex Williamson wrote:  
> >>> On Wed, 23 Sep 2020 04:54:15 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> Create mapped iova list when vIOMMU is enabled. For each mapped iova
> >>>> save translated address. Add node to list on MAP and remove node from
> >>>> list on UNMAP.
> >>>> This list is used to track dirty pages during migration.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> ---
> >>>>    hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
> >>>>    include/hw/vfio/vfio-common.h |  8 ++++++
> >>>>    2 files changed, 60 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>> index d4959c036dd1..dc56cded2d95 100644
> >>>> --- a/hw/vfio/common.c
> >>>> +++ b/hw/vfio/common.c
> >>>> @@ -407,8 +407,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>    }
> >>>>    
> >>>>    /* Called with rcu_read_lock held.  */
> >>>> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> >>>> -                           bool *read_only)
> >>>> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> >>>> +                               ram_addr_t *ram_addr, bool *read_only)
> >>>>    {
> >>>>        MemoryRegion *mr;
> >>>>        hwaddr xlat;
> >>>> @@ -439,8 +439,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> >>>>            return false;
> >>>>        }
> >>>>    
> >>>> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>>> -    *read_only = !writable || mr->readonly;
> >>>> +    if (vaddr) {
> >>>> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>>> +    }
> >>>> +
> >>>> +    if (ram_addr) {
> >>>> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> >>>> +    }
> >>>> +
> >>>> +    if (read_only) {
> >>>> +        *read_only = !writable || mr->readonly;
> >>>> +    }
> >>>>    
> >>>>        return true;
> >>>>    }
> >>>> @@ -450,7 +459,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>>>        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>>>        VFIOContainer *container = giommu->container;
> >>>>        hwaddr iova = iotlb->iova + giommu->iommu_offset;
> >>>> -    bool read_only;
> >>>>        void *vaddr;
> >>>>        int ret;
> >>>>    
> >>>> @@ -466,7 +474,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>>>        rcu_read_lock();
> >>>>    
> >>>>        if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >>>> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> >>>> +        ram_addr_t ram_addr;
> >>>> +        bool read_only;
> >>>> +
> >>>> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
> >>>>                goto out;
> >>>>            }
> >>>>            /*
> >>>> @@ -484,8 +495,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>>>                             "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>>>                             container, iova,
> >>>>                             iotlb->addr_mask + 1, vaddr, ret);
> >>>> +        } else {
> >>>> +            VFIOIovaRange *iova_range;
> >>>> +
> >>>> +            iova_range = g_malloc0(sizeof(*iova_range));
> >>>> +            iova_range->iova = iova;
> >>>> +            iova_range->size = iotlb->addr_mask + 1;
> >>>> +            iova_range->ram_addr = ram_addr;
> >>>> +
> >>>> +            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
> >>>>            }
> >>>>        } else {
> >>>> +        VFIOIovaRange *iova_range, *tmp;
> >>>> +
> >>>> +        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> >>>> +            if (iova_range->iova >= iova &&
> >>>> +                iova_range->iova + iova_range->size <= iova +
> >>>> +                                                       iotlb->addr_mask + 1) {
> >>>> +                QLIST_REMOVE(iova_range, next);
> >>>> +                g_free(iova_range);
> >>>> +            }
> >>>> +        }
> >>>> +  
> >>>
> >>>
> >>> This is some pretty serious overhead... can't we trigger a replay when
> >>> migration is enabled to build this information then?  
> >>
> >> Are you suggesting to call memory_region_iommu_replay() before
> >> vfio_sync_dirty_bitmap(), which would call vfio_iommu_map_notify() where
> >> iova list of mapping is maintained? Then in the notifier check if
> >> migration_is_running() and container->dirty_pages_supported == true,
> >> then only create iova mapping tree? In this case how would we know that
> >> this is triggered by
> >> vfio_sync_dirty_bitmap()  
> >>    -> memory_region_iommu_replay()  
> >> and we don't have to call vfio_dma_map()?  
> > 
> > memory_region_iommu_replay() calls a notifier of our choice, so we
> > could create a notifier specifically for creating this tree when dirty
> > logging is enabled.  Thanks,
> >   
> 
> This would also mean changes in intel_iommu.c such that it would walk 
> through the iova_tree and call notifier for each entry in iova_tree.

I think we already have that in vtd_iommu_replay(); an
IOMMUMemoryRegionClass.replay callback is rather a requirement for any
vIOMMU intending to support vfio, AIUI.
 
> What about other platforms? We will have to handle such cases for
> AMD, ARM, PPC etc...?

There's already a requirement for a working replay callback for vfio to
work in any reasonable way; this is just an additional use case of a
callback we already need and use.

> I don't see replay callback for AMD, that would result in minimum
> IOMMU supported page size granularity walk - which is similar to that
> I tried to implement 2-3 versions back.

Patch 1/3:
https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg00545.html
Patch 5/10:
https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg02196.html

> Does that mean doing such change would improve performance for Intel 
> IOMMU but worsen for AMD/PPC?

We're not adding a new requirement; we already call replay, and PPC doesn't
use type1.  What exactly regresses if we introduce another replay user?

> I'm changing list to tree as first level of improvement in this patch.
> 
> Can we do the change you suggested above later as next level of
> improvement?

AIUI above, we're allocating an object and adding it to a list (soon to
be tree) for every vIOMMU mapping, on the off chance that migration
might be used, regardless of devices even supporting migration.  I can
only see that as a runtime performance and size regression.  Thanks,

Alex




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-19 17:51           ` Alex Williamson
@ 2020-10-20 10:23             ` Cornelia Huck
  0 siblings, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-10-20 10:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

On Mon, 19 Oct 2020 11:51:36 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Sun, 18 Oct 2020 23:13:39 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > <snip>
> >   
> > >>>> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> > >>>> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> > >>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > >>>> index 8275c4c68f45..25e3b1a3b90a 100644
> > >>>> --- a/include/hw/vfio/vfio-common.h
> > >>>> +++ b/include/hw/vfio/vfio-common.h
> > >>>> @@ -29,6 +29,7 @@
> > >>>>    #ifdef CONFIG_LINUX
> > >>>>    #include <linux/vfio.h>
> > >>>>    #endif
> > >>>> +#include "sysemu/sysemu.h"
> > >>>>    
> > >>>>    #define VFIO_MSG_PREFIX "vfio %s: "
> > >>>>    
> > >>>> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
> > >>>>        unsigned int flags;
> > >>>>        VFIOMigration *migration;
> > >>>>        Error *migration_blocker;
> > >>>> +    VMChangeStateEntry *vm_state;
> > >>>> +    uint32_t device_state;
> > >>>> +    int vm_running;    
> > >>>
> > >>> Could these be placed in VFIOMigration?  Thanks,
> > >>>       
> > >>
> > >> I think device_state should be part of VFIODevice since it's about the device
> > >> rather than only related to migration; the others can be moved to VFIOMigration.
> > > 
> > > But these are only valid when migration is supported and thus when
> > > VFIOMigration exists.  Thanks,
> > >     
> > 
> > Even though it is used when migration is supported, it's a device attribute.  
> 
> device_state is a local copy of the migration region register, so it
> serves no purpose when a migration region is not present.  In fact the
> initial value would indicate the device is stopped, which is incorrect.
> vm_running is never initialized and cannot be set other than through a
> migration region update of device_state, so at least two of these
> values show incorrect state when migration is not supported by the
> device.  vm_state is unused when migration isn't present, so if nothing
> else the pointer here is wasteful.  It's not clear to me what
> justification is being presented here as a "device's attribute",
> supporting migration as indicated by a non-NULL migration pointer is
> also a device attribute and these are attributes further defining the
> state of that support.

Agreed. Also, it is not obvious from the naming that 'device_state' is
related to migration, and it is easy to assume that this field is
useful even in the non-migration case. Moving it would solve that
problem.




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-17 20:24       ` Kirti Wankhede
@ 2020-10-20 10:51         ` Cornelia Huck
  2020-10-21  5:33           ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Cornelia Huck @ 2020-10-20 10:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	Dr. David Alan Gilbert, alex.williamson, changpeng.liu, eskultet,
	Ken.Xue, jonathan.davies, pbonzini

On Sun, 18 Oct 2020 01:54:56 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote:
> > * Cornelia Huck (cohuck@redhat.com) wrote:  
> >> On Wed, 23 Sep 2020 04:54:07 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>  
> >>> VM state change handler gets called on change in VM's state. This is used to set
> >>> VFIO device state to _RUNNING.
> >>>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>> ---
> >>>   hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
> >>>   hw/vfio/trace-events          |   3 +-
> >>>   include/hw/vfio/vfio-common.h |   4 ++
> >>>   3 files changed, 142 insertions(+), 1 deletion(-)
> >>>  
> >>
> >> (...)
> >>  
> >>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>> +                                    uint32_t value)  
> >>
> >> I think I've mentioned that before, but this function could really
> >> benefit from a comment what mask and value mean.
> >>  
> 
> Adding a comment as:
> 
> /*
>   *  Write device_state field to inform the vendor driver about the device
>   *  state to be transitioned to.
>   *  vbasedev: VFIO device
>   *  mask : bits set in the mask are preserved in device_state
>   *  value: bits set in the value are set in device_state
>   *  Remaining bits in device_state are cleared.
>   */

Maybe:

"Change the device_state register for device @vbasedev. Bits set in
@mask are preserved, bits set in @value are set, and bits not set in
either @mask or @value are cleared in device_state. If the register
cannot be accessed, the resulting state would be invalid, or the device
enters an error state, an error is returned." ?

> 
> 
> >>> +{
> >>> +    VFIOMigration *migration = vbasedev->migration;
> >>> +    VFIORegion *region = &migration->region;
> >>> +    off_t dev_state_off = region->fd_offset +
> >>> +                      offsetof(struct vfio_device_migration_info, device_state);
> >>> +    uint32_t device_state;
> >>> +    int ret;
> >>> +
> >>> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> >>> +                        dev_state_off);
> >>> +    if (ret < 0) {
> >>> +        return ret;
> >>> +    }
> >>> +
> >>> +    device_state = (device_state & mask) | value;
> >>> +
> >>> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
> >>> +                         dev_state_off);
> >>> +    if (ret < 0) {
> >>> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
> >>> +                          dev_state_off);
> >>> +        if (ret < 0) {
> >>> +            return ret;
> >>> +        }
> >>> +
> >>> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> >>> +            hw_error("%s: Device is in error state 0x%x",
> >>> +                     vbasedev->name, device_state);
> >>> +            return -EFAULT;  
> >>
> >> Is -EFAULT a good return value here? Maybe -EIO?
> >>  
> 
> Ok. Changing to -EIO.
> 
> >>> +        }
> >>> +    }
> >>> +
> >>> +    vbasedev->device_state = device_state;
> >>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>> +{
> >>> +    VFIODevice *vbasedev = opaque;
> >>> +
> >>> +    if ((vbasedev->vm_running != running)) {
> >>> +        int ret;
> >>> +        uint32_t value = 0, mask = 0;
> >>> +
> >>> +        if (running) {
> >>> +            value = VFIO_DEVICE_STATE_RUNNING;
> >>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> >>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;  
> >>
> >> I've been staring at this for some time and I think that the desired
> >> result is
> >> - set _RUNNING
> >> - if _RESUMING was set, clear it, but leave the other bits intact  
> 
> Upto here, you're correct.
> 
> >> - if _RESUMING was not set, clear everything previously set
> >> This would really benefit from a comment (or am I the only one
> >> struggling here?)
> >>  
> 
> Here mask should be ~0. Correcting it.

Hm, now I'm confused. With value == _RUNNING, ~_RUNNING and ~0 as mask
should be equivalent, shouldn't they?

> 
> 
> >>> +            }
> >>> +        } else {
> >>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> >>> +        }
> >>> +
> >>> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> >>> +        if (ret) {
> >>> +            /*
> >>> +             * vm_state_notify() doesn't support reporting failure. If such
> >>> +             * error reporting support added in furure, migration should be
> >>> +             * aborted.  
> >>
> >>
> >> "We should abort the migration in this case, but vm_state_notify()
> >> currently does not support reporting failures."
> >>
> >> ?
> >>  
> 
> Ok. Updating comment as suggested here.
> 
> >> Can/should we mark the failing device in some way?  
> > 
> > I think you can call qemu_file_set_error on the migration stream to
> > force an error.
> >   
> 
> It should be as below, right?
> qemu_file_set_error(migrate_get_current()->to_dst_file, ret);

Does this indicate in any way which device was causing problems? (I'm
not sure how visible the error_report would be?)

> 
> 
> Thanks,
> Kirti
> 
> > Dave
> >   
> >>> +             */
> >>> +            error_report("%s: Failed to set device state 0x%x",
> >>> +                         vbasedev->name, value & mask);
> >>> +        }
> >>> +        vbasedev->vm_running = running;
> >>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> >>> +                                  value & mask);
> >>> +    }
> >>> +}
> >>> +
> >>>   static int vfio_migration_init(VFIODevice *vbasedev,
> >>>                                  struct vfio_region_info *info)
> >>>   {  
> >>
> >> (...)  
> 




* Re: [PATCH v26 06/17] vfio: Add migration state change notifier
  2020-10-19 17:57       ` Alex Williamson
@ 2020-10-20 10:55         ` Cornelia Huck
  0 siblings, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-10-20 10:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

On Mon, 19 Oct 2020 11:57:47 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Sun, 18 Oct 2020 02:05:03 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 9/26/2020 1:50 AM, Alex Williamson wrote:  
> > > On Wed, 23 Sep 2020 04:54:08 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:

> > >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > >> index 25e3b1a3b90a..49c7c7a0e29a 100644
> > >> --- a/include/hw/vfio/vfio-common.h
> > >> +++ b/include/hw/vfio/vfio-common.h
> > >> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> > >>       VMChangeStateEntry *vm_state;
> > >>       uint32_t device_state;
> > >>       int vm_running;
> > >> +    Notifier migration_state;    
> > > 
> > > Can this live in VFIOMigration?  Thanks,
> > >     
> > 
> > No, the callback vfio_migration_state_notifier() takes a notifier argument, and 
> > to reach the corresponding device structure as below, it has to be in 
> > VFIODevice.
> > 
> > VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);  
> 
> An alternative would be to place migration_state within VFIOMigration,
> along with a pointer back to vbasedev (like we do in VFIORegion) then
> the notifier could use container_of to get the VFIOMigration structure,
> from which we could get to the VFIODevice via the vbasedev pointer.
> This would better compartmentalize the migration related data into a
> single structure.  Thanks,
> 
> Alex

+1, I think having everything in VFIOMigration would be nicer.




* Re: [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-10-18 20:55     ` Kirti Wankhede
@ 2020-10-20 15:51       ` Cornelia Huck
  0 siblings, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-10-20 15:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Mon, 19 Oct 2020 02:25:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/25/2020 5:23 PM, Cornelia Huck wrote:
> > On Wed, 23 Sep 2020 04:54:09 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Define flags to be used as delimeter in migration file stream.
> >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> >> region from these functions at source during saving or pre-copy phase.
> >> Set VFIO device state depending on VM's state. During live migration, VM is
> >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |  2 ++
> >>   2 files changed, 93 insertions(+)
> >>  
> > 
> > (...)
> >   
> >> +/*
> >> + * Flags used as delimiter:
> >> + * 0xffffffff => MSB 32-bit all 1s
> >> + * 0xef10     => emulated (virtual) function IO  
> > 
> > Where is this value coming from?
> >   
> 
> Delimiter flags should be unique, and this is a magic number that 
> represents (e)mulated (f)unction (10) IO.
> 
> >> + * 0x0000     => 16-bits reserved for flags
> >> + */
> >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)  
> > 
> > I think we need some more documentation what these values mean and how
> > they are used. From reading ahead a bit, it seems there is always
> > supposed to be a pair of DEV_*_STATE and END_OF_STATE framing some kind
> > of data?
> >   
> 
> Adding comment as below, hope it helps.
> 
> /*
>   * Flags used as delimiter for VFIO devices should be unique in
>   * migration stream

Maybe

"Flags to be used as unique delimiters for VFIO devices in the
migration stream" ?

>   * These flags are composed as:
>   * 0xffffffff => MSB 32-bit all 1s
>   * 0xef10     => Magic ID, represents emulated (virtual) function IO
>   * 0x0000     => 16-bits reserved for flags
>   *
>   * Flags _DEV_CONFIG_STATE, _DEV_SETUP_STATE and _DEV_DATA_STATE mark the
>   * start of
>   * respective states in migration stream.
>   * FLAG _END_OF_STATE indicates end of current state, state could be any
>   * of above states.
>   */

"The beginning of state information is marked by _DEV_CONFIG_STATE,
_DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
certain state information is marked by _END_OF_STATE." ?

> 
> Thanks,
> Kirti
> 




* Re: [PATCH v26 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-10-18 20:47     ` Kirti Wankhede
@ 2020-10-20 16:25       ` Cornelia Huck
  0 siblings, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-10-20 16:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Mon, 19 Oct 2020 02:17:43 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/1/2020 3:37 PM, Cornelia Huck wrote:
> > On Wed, 23 Sep 2020 04:54:11 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Sequence  during _RESUMING device state:
> >> While data for this device is available, repeat below steps:
> >> a. read data_offset from where user application should write data.
> >> b. write data of data_size to migration region from data_offset.
> >> c. write data_size, which indicates to the vendor driver that data has
> >>     been written to the staging buffer.
> >>
> >> To the user, the data is opaque. The user should write data in the same
> >> order as it was received.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> ---
> >>   hw/vfio/migration.c  | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |   3 +
> >>   2 files changed, 173 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 4611bb972228..ffd70282dd0e 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -328,6 +328,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >>       return qemu_file_get_error(f);
> >>   }
> >>   
> >> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    uint64_t data;
> >> +
> >> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> >> +        int ret;
> >> +
> >> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to load device config space",
> >> +                         vbasedev->name);
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    data = qemu_get_be64(f);
> >> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +        error_report("%s: Failed loading device config space, "
> >> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);  
> > 
> > I'm confused here: If we don't have a vfio_load_config callback, or if
> > that callback did not read everything, we also might end up with a
> > value that's not END_OF_STATE... in that case, the problem is not with
> > the stream, but rather with the consumer?  
> 
> Right, hence "end flag incorrect" is reported.

Yes, but that's what I find confusing... a missing or incorrect
vfio_load_config callback does not have anything to do with incorrect
end flags as present in the stream, but with the consumer not reading
things correctly. If I got this error, I would go looking whether
there's anything wrong with the stream and the code that produced it,
and that's the wrong direction.

(...)

> >> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret = 0;
> >> +    uint64_t data, data_size;
> >> +
> >> +    data = qemu_get_be64(f);
> >> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +
> >> +        trace_vfio_load_state(vbasedev->name, data);
> >> +
> >> +        switch (data) {
> >> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> >> +        {
> >> +            ret = vfio_load_device_config_state(f, opaque);
> >> +            if (ret) {
> >> +                return ret;
> >> +            }
> >> +            break;
> >> +        }
> >> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> >> +        {
> >> +            data = qemu_get_be64(f);
> >> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> >> +                return ret;
> >> +            } else {
> >> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> >> +                             vbasedev->name, data);
> >> +                return -EINVAL;
> >> +            }
> >> +            break;
> >> +        }
> >> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> >> +        {
> >> +            VFIORegion *region = &migration->region;
> >> +            uint64_t data_offset = 0, size;  
> > 
> > I think this function would benefit from splitting this off into a
> > function handling DEV_DATA_STATE. It is quite hard to follow through
> > all the checks and find out when we continue, and when we break off.
> >   
> 
> Each switch case has a break; we break off in the success cases, whereas
> we return an error whenever we encounter a case where (ret < 0).

Of course, but I don't find it easy to follow when the errors are
happening.

> 
> 
> > Some documentation about the markers would also be really helpful.  
> 
> Sure adding it in patch 07, where these are defined.
> 
> > The logic seems to be:
> > - DEV_CONFIG_STATE has config data and must be ended by END_OF_STATE  
> Right
> 
> > - DEV_SETUP_STATE has only END_OF_STATE, no data  
> Right now there is no data, but this is a provision to add data if
> required in the future.
> 
> > - DEV_DATA_STATE has... data; if there's any END_OF_STATE, it's buried
> >    far down in the called functions
> >  
> 
> This is not correct; END_OF_STATE comes after the data. I moved the data
> buffer loading logic into the function vfio_load_buffer(), so the
> DEV_DATA_STATE case is simplified as below. Hope this helps.
> 
>          case VFIO_MIG_FLAG_DEV_DATA_STATE:
>          {
>              uint64_t data_size;
> 
>              data_size = qemu_get_be64(f);
>              if (data_size == 0) {
>                  break;
>              }
> 
>              ret = vfio_load_buffer(f, vbasedev, data_size);
>              if (ret < 0) {
>                  return ret;
>              }
>              break;
>          }

Hm.

What I find not that easy to follow is the structure here:

while (!end_marker) {
    switch (data) {
    case config_state:
        if (load_config)
            return error;
        break;
    case setup_state:
        read_next_value();
        if (end_marker)
            return 0;
        else
            return error;
        break;
    case data_state:
        size = read_next_value();
        if (!size)
            break;
        if (vfio_load_buffer())
            return error;
        break;
    default:
        return error;
    }
    read_next_value();
    if (qemu_file_get_error())
        return error;
}

So, what I don't understand is:
- Why do we call qemu_file_get_error() only after we went through the
  whole switch? This means it is never called for the
  setup_state/end_marker pair.
- If we look for an end marker for config_state and data_state, it's
  buried in the called functions. How can we be sure they actually do
  look for it? That needs at least a comment.
- If we find a valid setup_state section, we return success
  immediately. If we find valid config_state or data_state sections, we
  keep looking for more sections. Why? This also needs at least a
  comment.

> 
> In my next version, vfio_load_buffer() also handles the case where
> data_size is greater than the data section of the migration region at
> the destination.
> 
> Thanks,
> Kirti
> 
> >   
> >> +
> >> +            data_size = size = qemu_get_be64(f);
> >> +            if (data_size == 0) {
> >> +                break;
> >> +            }
> >> +
> >> +            ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> >> +                                region->fd_offset +
> >> +                                offsetof(struct vfio_device_migration_info,
> >> +                                         data_offset));
> >> +            if (ret < 0) {
> >> +                return ret;
> >> +            }
> >> +
> >> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> >> +                                              data_size);
> >> +
> >> +            while (size) {
> >> +                void *buf = NULL;
> >> +                uint64_t sec_size;
> >> +                bool buf_alloc = false;
> >> +
> >> +                buf = get_data_section_size(region, data_offset, size,
> >> +                                            &sec_size);
> >> +
> >> +                if (!buf) {
> >> +                    buf = g_try_malloc(sec_size);
> >> +                    if (!buf) {
> >> +                        error_report("%s: Error allocating buffer ", __func__);
> >> +                        return -ENOMEM;
> >> +                    }
> >> +                    buf_alloc = true;
> >> +                }
> >> +
> >> +                qemu_get_buffer(f, buf, sec_size);
> >> +
> >> +                if (buf_alloc) {
> >> +                    ret = vfio_mig_write(vbasedev, buf, sec_size,
> >> +                                         region->fd_offset + data_offset);
> >> +                    g_free(buf);
> >> +
> >> +                    if (ret < 0) {
> >> +                        return ret;
> >> +                    }
> >> +                }
> >> +                size -= sec_size;
> >> +                data_offset += sec_size;
> >> +            }
> >> +
> >> +            ret = vfio_mig_write(vbasedev, &data_size, sizeof(data_size),
> >> +                                 region->fd_offset +
> >> +                       offsetof(struct vfio_device_migration_info, data_size));
> >> +            if (ret < 0) {
> >> +                return ret;
> >> +            }
> >> +            break;
> >> +        }
> >> +
> >> +        default:
> >> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
> >> +            return -EINVAL;
> >> +        }
> >> +
> >> +        data = qemu_get_be64(f);
> >> +        ret = qemu_file_get_error(f);
> >> +        if (ret) {
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-20 10:51         ` Cornelia Huck
@ 2020-10-21  5:33           ` Kirti Wankhede
  2020-10-22  7:51             ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-21  5:33 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	Dr. David Alan Gilbert, alex.williamson, changpeng.liu, eskultet,
	Ken.Xue, jonathan.davies, pbonzini



On 10/20/2020 4:21 PM, Cornelia Huck wrote:
> On Sun, 18 Oct 2020 01:54:56 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote:
>>> * Cornelia Huck (cohuck@redhat.com) wrote:
>>>> On Wed, 23 Sep 2020 04:54:07 +0530
>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>   
>>>>> VM state change handler gets called on change in VM's state. This is used to set
>>>>> VFIO device state to _RUNNING.
>>>>>
>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>>> ---
>>>>>    hw/vfio/migration.c           | 136 ++++++++++++++++++++++++++++++++++++++++++
>>>>>    hw/vfio/trace-events          |   3 +-
>>>>>    include/hw/vfio/vfio-common.h |   4 ++
>>>>>    3 files changed, 142 insertions(+), 1 deletion(-)
>>>>>   
>>>>
>>>> (...)
>>>>   
>>>>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>>>> +                                    uint32_t value)
>>>>
>>>> I think I've mentioned that before, but this function could really
>>>> benefit from a comment what mask and value mean.
>>>>   
>>
>> Adding a comment as:
>>
>> /*
>>    *  Write device_state field to inform the vendor driver about the
>> device state
>>    *  to be transitioned to.
>>    *  vbasedev: VFIO device
>>    *  mask : bits set in the mask are preserved in device_state
>>    *  value: bits set in the value are set in device_state
>>    *  Remaining bits in device_state are cleared.
>>    */
> 
> Maybe:
> 
> "Change the device_state register for device @vbasedev. Bits set in
> @mask are preserved, bits set in @value are set, and bits not set in
> either @mask or @value are cleared in device_state. If the register
> cannot be accessed, the resulting state would be invalid, or the device
> enters an error state, an error is returned." ?
> 

Ok.
>>
>>
>>>>> +{
>>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>> +    VFIORegion *region = &migration->region;
>>>>> +    off_t dev_state_off = region->fd_offset +
>>>>> +                      offsetof(struct vfio_device_migration_info, device_state);
>>>>> +    uint32_t device_state;
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
>>>>> +                        dev_state_off);
>>>>> +    if (ret < 0) {
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>> +    device_state = (device_state & mask) | value;
>>>>> +
>>>>> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>> +
>>>>> +    ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
>>>>> +                         dev_state_off);
>>>>> +    if (ret < 0) {
>>>>> +        ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
>>>>> +                          dev_state_off);
>>>>> +        if (ret < 0) {
>>>>> +            return ret;
>>>>> +        }
>>>>> +
>>>>> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
>>>>> +            hw_error("%s: Device is in error state 0x%x",
>>>>> +                     vbasedev->name, device_state);
>>>>> +            return -EFAULT;
>>>>
>>>> Is -EFAULT a good return value here? Maybe -EIO?
>>>>   
>>
>> Ok. Changing to -EIO.
>>
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    vbasedev->device_state = device_state;
>>>>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>>>> +{
>>>>> +    VFIODevice *vbasedev = opaque;
>>>>> +
>>>>> +    if ((vbasedev->vm_running != running)) {
>>>>> +        int ret;
>>>>> +        uint32_t value = 0, mask = 0;
>>>>> +
>>>>> +        if (running) {
>>>>> +            value = VFIO_DEVICE_STATE_RUNNING;
>>>>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
>>>>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
>>>>
>>>> I've been staring at this for some time and I think that the desired
>>>> result is
>>>> - set _RUNNING
>>>> - if _RESUMING was set, clear it, but leave the other bits intact
>>
>> Upto here, you're correct.
>>
>>>> - if _RESUMING was not set, clear everything previously set
>>>> This would really benefit from a comment (or am I the only one
>>>> struggling here?)
>>>>   
>>
>> Here mask should be ~0. Correcting it.
> 
> Hm, now I'm confused. With value == _RUNNING, ~_RUNNING and ~0 as mask
> should be equivalent, shouldn't they?
> 

I too got confused after reading your comment.
Let's walk through the device states and the transitions that can happen here:

if running
 - the device state could be _SAVING, _RESUMING or _STOP. _SAVING and
_RESUMING can't both be set at the same time; that is the error state.
_STOP means 0.
 - A transition from _SAVING to _RUNNING can happen if there is a
migration failure, in which case we have to clear _SAVING.
 - A transition from _RESUMING to _RUNNING can happen on resuming, and
we have to clear _RESUMING.
 - In both of the above cases, we have to set _RUNNING and clear the
other two bits. Then:
mask = ~VFIO_DEVICE_STATE_MASK;
value = VFIO_DEVICE_STATE_RUNNING;

if !running
 - the device state could be either _RUNNING or _SAVING|_RUNNING. Here
we have to clear the _RUNNING bit. Then:
mask = ~VFIO_DEVICE_STATE_RUNNING;
value = 0;

I'll add comment in the code above.


>>
>>
>>>>> +            }
>>>>> +        } else {
>>>>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
>>>>> +        }
>>>>> +
>>>>> +        ret = vfio_migration_set_state(vbasedev, mask, value);
>>>>> +        if (ret) {
>>>>> +            /*
>>>>> +             * vm_state_notify() doesn't support reporting failure. If such
>>>>> +             * error reporting support added in furure, migration should be
>>>>> +             * aborted.
>>>>
>>>>
>>>> "We should abort the migration in this case, but vm_state_notify()
>>>> currently does not support reporting failures."
>>>>
>>>> ?
>>>>   
>>
>> Ok. Updating comment as suggested here.
>>
>>>> Can/should we mark the failing device in some way?
>>>
>>> I think you can call qemu_file_set_error on the migration stream to
>>> force an error.
>>>    
>>
>> It should be as below, right?
>> qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
> 
> Does this indicate in any way which device was causing problems? (I'm
> not sure how visible the error_report would be?)
> 

I think it doesn't indicate which device caused the problem, but it does
set an error on the migration stream.

Thanks,
Kirti

>>
>>
>> Thanks,
>> Kirti
>>
>>> Dave
>>>    
>>>>> +             */
>>>>> +            error_report("%s: Failed to set device state 0x%x",
>>>>> +                         vbasedev->name, value & mask);
>>>>> +        }
>>>>> +        vbasedev->vm_running = running;
>>>>> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>>>>> +                                  value & mask);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>    static int vfio_migration_init(VFIODevice *vbasedev,
>>>>>                                   struct vfio_region_info *info)
>>>>>    {
>>>>
>>>> (...)
>>
> 



* Re: [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-09-24 22:49   ` Alex Williamson
@ 2020-10-21  9:30     ` Zenghui Yu
  2020-10-21 19:03       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Zenghui Yu @ 2020-10-21  9:30 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk,
	pasic, felipe, wanghaibin.wang, Ken.Xue, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, quintela, lushenming, zhi.a.wang,
	jonathan.davies, pbonzini

On 2020/9/25 6:49, Alex Williamson wrote:
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        msix_load(pdev, f);
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);

It isn't clear what purpose this load operation serves.  The config
space has already been restored and we'll see that MSI-X _was_ and _is_
enabled (or disabled).  vfio_msix_enable() will therefore not be invoked
and no vectors would actually be enabled...  Not sure if I have missed
something.

>> +    }
>> +    return 0;
> 
> It seems this could be simplified down to:
> 
> if (msi_enabled(pdev)) {
>      vfio_msi_enable(vdev);
> } else if (msix_enabled(pdev)) {
>      msix_load(pdev, f);
>      vfio_msix_enable(vdev);
> }

And it seems that this has fixed something :-)


Thanks,
Zenghui



* Re: [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-09-23 11:42   ` Wang, Zhi A
@ 2020-10-21 14:30     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-21 14:30 UTC (permalink / raw)
  To: Wang, Zhi A, alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, Liu, Yi L, quintela, Yang, Ziye, armbru, mlevitsk, pasic,
	felipe, Ken.Xue, Tian, Kevin, Zhao,  Yan Y, dgilbert, Liu,
	Changpeng, eskultet, jonathan.davies, pbonzini



On 9/23/2020 5:12 PM, Wang, Zhi A wrote:
> I met a problem when trying this patch. Mostly the problem happens if a device doesn't set any pending bytes in the iteration stage, which indicates that the device doesn't have an iteration stage.

I tried to reproduce this issue at my end and didn't hit this case even
when doing multiple iterations in the pre-copy phase of live migration.
In the pre-copy phase, the vendor driver receives a read(pending_bytes)
callback from .save_live_pending. If the vendor driver doesn't have any
data to send, it returns 0 as pending bytes.

But I could reproduce this issue with savevm/VM snapshot from
qemu-monitor. I agree with your fix and am pulling it into my next iteration.

> The QEMU in the destination machine will complain out-of-memory. After
> some investigation, it seems the vendor-specific bit stream is not
> complete and the QEMU in the destination machine will wrongly take a
> signature as the size of the section and fail to allocate the memory.
> Not sure if others meet the same problem.
> 
> I solved this problem by the following fix and the qemu version I am using is v5.0.0.0.
> 

This alone didn't fix the problem for me; I also have to reset
pending_bytes from vfio_save_iterate(), as .save_live_pending is not
called during savevm/snapshot.

Thanks for testing.
Kirti

> commit 13a80adc2cdddd48d76acf6a5dd715bcbf42b577
> Author: Zhi Wang <zhi.wang.linux@gmail.com>
> Date:   Tue Sep 15 15:58:45 2020 +0300
> 
>      fix
>      
>      Signed-off-by: Zhi Wang <zhi.wang.linux@gmail.com>
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 09eec9c..e741319 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -453,10 +458,12 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>               return ret;
>           }
>   
> -        if (migration->pending_bytes == 0) {
> -            /* indicates data finished, goto complete phase */
> -            return 1;
> -        }
> +	if (migration->pending_bytes == 0) {
> +		/* indicates data finished, goto complete phase */
> +		qemu_put_be64(f, 0);
> +		qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +		return 1;
> +	}
>       }
>   
>       data_size = vfio_save_buffer(f, vbasedev);
> 
> -----Original Message-----
> From: Kirti Wankhede <kwankhede@nvidia.com>
> Sent: Wednesday, September 23, 2020 2:24 AM
> To: alex.williamson@redhat.com; cjia@nvidia.com
> Cc: Tian, Kevin <kevin.tian@intel.com>; Yang, Ziye <ziye.yang@intel.com>; Liu, Changpeng <changpeng.liu@intel.com>; Liu, Yi L <yi.l.liu@intel.com>; mlevitsk@redhat.com; eskultet@redhat.com; cohuck@redhat.com; dgilbert@redhat.com; jonathan.davies@nutanix.com; eauger@redhat.com; aik@ozlabs.ru; pasic@linux.ibm.com; felipe@nutanix.com; Zhengxiao.zx@Alibaba-inc.com; shuangtai.tst@alibaba-inc.com; Ken.Xue@amd.com; Wang, Zhi A <zhi.a.wang@intel.com>; Zhao, Yan Y <yan.y.zhao@intel.com>; pbonzini@redhat.com; quintela@redhat.com; eblake@redhat.com; armbru@redhat.com; peterx@redhat.com; qemu-devel@nongnu.org; Kirti Wankhede <kwankhede@nvidia.com>
> Subject: [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers
> 
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through the steps below.
> - read data_offset - indicates to the kernel driver to write data to the
>    staging buffer.
> - read data_size - amount of data in bytes written by the vendor driver in
>    the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data, VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>     doesn't need to be from vendor driver. Any other special config state
>     from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through the steps below.
> c. read data_offset - indicates to the kernel driver to write data to the
>     staging buffer.
> d. read data_size - amount of data in bytes written by the vendor driver in
>     the migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it is the user's responsibility to read data of data_size from data_offset before moving to the next steps.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>   hw/vfio/migration.c           | 273 ++++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/trace-events          |   6 +
>   include/hw/vfio/vfio-common.h |   1 +
>   3 files changed, 280 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 8e8adaa25779..4611bb972228 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -180,6 +180,154 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>       return 0;
>   }
>   
> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
> +                                   uint64_t data_size, uint64_t *size)
> +{
> +    void *ptr = NULL;
> +    uint64_t limit = 0;
> +    int i;
> +
> +    if (!region->mmaps) {
> +        if (size) {
> +            *size = data_size;
> +        }
> +        return ptr;
> +    }
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        VFIOMmap *map = region->mmaps + i;
> +
> +        if ((data_offset >= map->offset) &&
> +            (data_offset < map->offset + map->size)) {
> +
> +            /* check if data_offset is within sparse mmap areas */
> +            ptr = map->mmap + data_offset - map->offset;
> +            if (size) {
> +                *size = MIN(data_size, map->offset + map->size - data_offset);
> +            }
> +            break;
> +        } else if ((data_offset < map->offset) &&
> +                   (!limit || limit > map->offset)) {
> +            /*
> +             * data_offset is not within sparse mmap areas, find size of
> +             * non-mapped area. Check through all list since region->mmaps list
> +             * is not sorted.
> +             */
> +            limit = map->offset;
> +        }
> +    }
> +
> +    if (!ptr && size) {
> +        *size = limit ? limit - data_offset : data_size;
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t data_offset = 0, data_size = 0, sz;
> +    int ret;
> +
> +    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    qemu_put_be64(f, data_size);
> +    sz = data_size;
> +
> +    while (sz) {
> +        void *buf = NULL;
> +        uint64_t sec_size;
> +        bool buf_allocated = false;
> +
> +        buf = get_data_section_size(region, data_offset, sz,
> +                                    &sec_size);
> +
> +        if (!buf) {
> +            buf = g_try_malloc(sec_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +            buf_allocated = true;
> +
> +            ret = vfio_mig_read(vbasedev, buf, sec_size,
> +                                region->fd_offset + data_offset);
> +            if (ret < 0) {
> +                g_free(buf);
> +                return ret;
> +            }
> +        }
> +
> +        qemu_put_buffer(f, buf, sec_size);
> +
> +        if (buf_allocated) {
> +            g_free(buf);
> +        }
> +        sz -= sec_size;
> +        data_offset += sec_size;
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +
> +    if (!ret && size) {
> +        *size = data_size;
> +    }
> +
> +    return ret;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if (ret < 0) {
> +        migration->pending_bytes = 0;
> +        return ret;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>   /* ---------------------------------------------------------------------- */
>   
>   static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -232,9 +380,134 @@ static void vfio_save_cleanup(void *opaque)
>       trace_vfio_save_cleanup(vbasedev->name);
>   }
>   
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    uint64_t data_size;
> +    int ret;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    if (migration->pending_bytes == 0) {
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if (migration->pending_bytes == 0) {
> +            /* indicates data finished, goto complete phase */
> +            return 1;
> +        }
> +    }
> +
> +    ret = vfio_save_buffer(f, vbasedev, &data_size);
> +
> +    if (ret) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    uint64_t data_size;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev, &data_size);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        }
> +
> +        if (data_size == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>   static SaveVMHandlers savevm_vfio_handlers = {
>       .save_setup = vfio_save_setup,
>       .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>   };
>   
>   /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 982d8dccb219..118b5547c921 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -154,3 +154,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>   vfio_save_setup(const char *name) " (%s)"
>   vfio_save_cleanup(const char *name) " (%s)"
> +vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 49c7c7a0e29a..471e444a364c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>   
>   typedef struct VFIOMigration {
>       VFIORegion region;
> +    uint64_t pending_bytes;
>   } VFIOMigration;
>   
>   typedef struct VFIOAddressSpace {
> --
> 2.7.0
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-10-21  9:30     ` Zenghui Yu
@ 2020-10-21 19:03       ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2020-10-21 19:03 UTC (permalink / raw)
  To: Zenghui Yu
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, eskultet, ziye.yang,
	armbru, mlevitsk, pasic, felipe, wanghaibin.wang, zhi.a.wang,
	kevin.tian, yan.y.zhao, dgilbert, changpeng.liu, quintela,
	lushenming, Ken.Xue, jonathan.davies, pbonzini

On Wed, 21 Oct 2020 17:30:04 +0800
Zenghui Yu <yuzenghui@huawei.com> wrote:

> On 2020/9/25 6:49, Alex Williamson wrote:
> >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >> +        uint16_t offset;
> >> +
> >> +        msix_load(pdev, f);
> >> +        offset = pci_default_read_config(pdev,
> >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >> +        /* load enable bit and maskall bit */
> >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >> +                              offset, 2);  
> 
> It isn't clear what purpose this load operation serves.  The config
> space has already been restored and we'll see that MSI-X _was_ and _is_
> enabled (or disabled).  vfio_msix_enable() will therefore not be invoked
> and no vectors would actually be enabled...  Not sure if I had missed
> something.

Yeah, afaict your interpretation is correct.  I think the intention was
to mimic userspace performing a write to set the enable bit, but
re-writing it doesn't change the vconfig value, so the effect is not
the same.  I think this probably never worked.

> >> +    }
> >> +    return 0;  
> > 
> > It seems this could be simplified down to:
> > 
> > if (msi_enabled(pdev)) {
> >      vfio_msi_enable(vdev);
> > } else if (msix_enabled(pdev)) {
> >      msix_load(pdev, f);
> >      vfio_msix_enable(vdev);
> > }  
> 
> And it seems that this has fixed something :-)

Yep, no dependency on the value changing, simply set the state to that
indicated in vconfig.  Thanks,

Alex




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-21  5:33           ` Kirti Wankhede
@ 2020-10-22  7:51             ` Cornelia Huck
  2020-10-22 15:42               ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Cornelia Huck @ 2020-10-22  7:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	Dr. David Alan Gilbert, alex.williamson, changpeng.liu, eskultet,
	Ken.Xue, jonathan.davies, pbonzini

On Wed, 21 Oct 2020 11:03:23 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/20/2020 4:21 PM, Cornelia Huck wrote:
> > On Sun, 18 Oct 2020 01:54:56 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote:  
> >>> * Cornelia Huck (cohuck@redhat.com) wrote:  
> >>>> On Wed, 23 Sep 2020 04:54:07 +0530
> >>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:

> >>>>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>>>> +{
> >>>>> +    VFIODevice *vbasedev = opaque;
> >>>>> +
> >>>>> +    if ((vbasedev->vm_running != running)) {
> >>>>> +        int ret;
> >>>>> +        uint32_t value = 0, mask = 0;
> >>>>> +
> >>>>> +        if (running) {
> >>>>> +            value = VFIO_DEVICE_STATE_RUNNING;
> >>>>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> >>>>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;  
> >>>>
> >>>> I've been staring at this for some time and I think that the desired
> >>>> result is
> >>>> - set _RUNNING
> >>>> - if _RESUMING was set, clear it, but leave the other bits intact  
> >>
> >> Upto here, you're correct.
> >>  
> >>>> - if _RESUMING was not set, clear everything previously set
> >>>> This would really benefit from a comment (or am I the only one
> >>>> struggling here?)
> >>>>     
> >>
> >> Here mask should be ~0. Correcting it.  
> > 
> > Hm, now I'm confused. With value == _RUNNING, ~_RUNNING and ~0 as mask
> > should be equivalent, shouldn't they?
> >   
> 
> I too got confused after reading your comment.
> Let's walk through the device states and the transitions that can happen here:
> 
> if running
>   - device state could be _SAVING, _RESUMING or _STOP. _SAVING and
> _RESUMING can't both be set at the same time; that is an error state.
> _STOP means 0.
>   - Transition from _SAVING to _RUNNING can happen if there is a migration
> failure; in that case we have to clear _SAVING.
> - Transition from _RESUMING to _RUNNING can happen on resuming, and we
> have to clear _RESUMING.
> - In both of the above cases, we have to set _RUNNING and clear the other 2 bits.
> Then:
> mask = ~VFIO_DEVICE_STATE_MASK;
> value = VFIO_DEVICE_STATE_RUNNING;

ok

> 
> if !running
> - device state could be either _RUNNING or _SAVING|_RUNNING. Here we 
> have to reset running bit.
> Then:
> mask = ~VFIO_DEVICE_STATE_RUNNING;
> value = 0;

ok

> 
> I'll add comment in the code above.

That will help.

I'm a bit worried though that all that reasoning which flags are set or
cleared when is quite complex, and it's easy to make mistakes.

Can we model this as a FSM, where an event (running state changes)
transitions the device state from one state to another? I (personally)
find FSMs easier to comprehend, but I'm not sure whether that change
would be too invasive. If others can parse the state changes with that
mask/value interface, I won't object to it.

> 
> 
> >>
> >>  
> >>>>> +            }
> >>>>> +        } else {
> >>>>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> >>>>> +        }




* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-22  7:51             ` Cornelia Huck
@ 2020-10-22 15:42               ` Kirti Wankhede
  2020-10-22 15:49                 ` Cornelia Huck
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2020-10-22 15:42 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	Dr. David Alan Gilbert, alex.williamson, changpeng.liu, eskultet,
	Ken.Xue, jonathan.davies, pbonzini



On 10/22/2020 1:21 PM, Cornelia Huck wrote:
> On Wed, 21 Oct 2020 11:03:23 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/20/2020 4:21 PM, Cornelia Huck wrote:
>>> On Sun, 18 Oct 2020 01:54:56 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 9/29/2020 4:33 PM, Dr. David Alan Gilbert wrote:
>>>>> * Cornelia Huck (cohuck@redhat.com) wrote:
>>>>>> On Wed, 23 Sep 2020 04:54:07 +0530
>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>>>>>>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>>>>>> +{
>>>>>>> +    VFIODevice *vbasedev = opaque;
>>>>>>> +
>>>>>>> +    if ((vbasedev->vm_running != running)) {
>>>>>>> +        int ret;
>>>>>>> +        uint32_t value = 0, mask = 0;
>>>>>>> +
>>>>>>> +        if (running) {
>>>>>>> +            value = VFIO_DEVICE_STATE_RUNNING;
>>>>>>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
>>>>>>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
>>>>>>
>>>>>> I've been staring at this for some time and I think that the desired
>>>>>> result is
>>>>>> - set _RUNNING
>>>>>> - if _RESUMING was set, clear it, but leave the other bits intact
>>>>
>>>> Upto here, you're correct.
>>>>   
>>>>>> - if _RESUMING was not set, clear everything previously set
>>>>>> This would really benefit from a comment (or am I the only one
>>>>>> struggling here?)
>>>>>>      
>>>>
>>>> Here mask should be ~0. Correcting it.
>>>
>>> Hm, now I'm confused. With value == _RUNNING, ~_RUNNING and ~0 as mask
>>> should be equivalent, shouldn't they?
>>>    
>>
>> I too got confused after reading your comment.
>> Let's walk through the device states and the transitions that can happen here:
>>
>> if running
>>    - device state could be _SAVING, _RESUMING or _STOP. _SAVING and
>> _RESUMING can't both be set at the same time; that is an error state.
>> _STOP means 0.
>>    - Transition from _SAVING to _RUNNING can happen if there is a migration
>> failure; in that case we have to clear _SAVING.
>> - Transition from _RESUMING to _RUNNING can happen on resuming, and we
>> have to clear _RESUMING.
>> - In both of the above cases, we have to set _RUNNING and clear the other 2 bits.
>> Then:
>> mask = ~VFIO_DEVICE_STATE_MASK;
>> value = VFIO_DEVICE_STATE_RUNNING;
> 
> ok
> 
>>
>> if !running
>> - device state could be either _RUNNING or _SAVING|_RUNNING. Here we
>> have to reset running bit.
>> Then:
>> mask = ~VFIO_DEVICE_STATE_RUNNING;
>> value = 0;
> 
> ok
> 
>>
>> I'll add comment in the code above.
> 
> That will help.
> 
> I'm a bit worried though that all that reasoning which flags are set or
> cleared when is quite complex, and it's easy to make mistakes.
> 
> Can we model this as a FSM, where an event (running state changes)
> transitions the device state from one state to another? I (personally)
> find FSMs easier to comprehend, but I'm not sure whether that change
> would be too invasive. If others can parse the state changes with that
> mask/value interface, I won't object to it.
> 

I agree an FSM would be cleaner and easier to maintain in the long term,
but at the moment it would be an intrusive change. For now we can go ahead
with this code and later change to an FSM model, if all agree on it.

Thanks,
Kirti


>>
>>
>>>>
>>>>   
>>>>>>> +            }
>>>>>>> +        } else {
>>>>>>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
>>>>>>> +        }
> 



* Re: [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM
  2020-10-22 15:42               ` Kirti Wankhede
@ 2020-10-22 15:49                 ` Cornelia Huck
  0 siblings, 0 replies; 73+ messages in thread
From: Cornelia Huck @ 2020-10-22 15:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	Dr. David Alan Gilbert, alex.williamson, changpeng.liu, eskultet,
	Ken.Xue, jonathan.davies, pbonzini

On Thu, 22 Oct 2020 21:12:58 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/22/2020 1:21 PM, Cornelia Huck wrote:

> > I'm a bit worried though that all that reasoning which flags are set or
> > cleared when is quite complex, and it's easy to make mistakes.
> > 
> > Can we model this as a FSM, where an event (running state changes)
> > transitions the device state from one state to another? I (personally)
> > find FSMs easier to comprehend, but I'm not sure whether that change
> > would be too invasive. If others can parse the state changes with that
> > mask/value interface, I won't object to it.
> >   
> 
> I agree an FSM would be cleaner and easier to maintain in the long term,
> but at the moment it would be an intrusive change. For now we can go ahead
> with this code and later change to an FSM model, if all agree on it.

Yes, we can certainly revisit this later.




* [PATCH QEMU v25 00/17] Add migration support for VFIO devices
@ 2020-06-20 20:21 Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hi,

Sorry for the slight delay with this iteration. I have addressed most of the
concerns raised in the v16, v18 and v23 reviews. More details about the changes below.

This Patch set adds migration support for VFIO devices in QEMU.

This Patch set include patches as below:
Patch 1-2:
- Few code refactor

Patch 3:
- Added save and restore functions for PCI configuration space. Used
  pci_device_save() and pci_device_load() so that config space cache is saved
  and restored. Similarly, to keep emulated_config_bits cache and
  vdev->pdev.wmask in sync after migration, these are also saved and restored.

Patch 4-9:
- Generic migration functionality for VFIO device.
  * This patch set adds functionality only for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, i.e. in the stop-and-copy phase,
    reading data from the VFIO device driver is iterated until the
    pending bytes returned by the driver reach zero.

Patch 10
- Set DIRTY_MEMORY_MIGRATION flag in dirty log mask for migration with vIOMMU
  enabled.

Patch 11:
- Get migration capability from kernel module.

Patch 12-14:
- Add function to start and stop dirty pages tracking.
- Add vfio_listener_log_sync to mark dirty pages. Dirty pages bitmap is queried
  per container. All pages pinned by the vendor driver through the vfio_pin_pages
  external API have to be marked as dirty during migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by vendor driver can also be written by
  device. As of now there is no device which has hardware support for
  dirty page tracking. So all pages which are pinned by vendor driver
  should be considered as dirty.
  In QEMU, marking pages dirty is only done when the device is in the
  stop-and-copy phase, because if pages are marked dirty during the pre-copy
  phase and their content is transferred from source to destination, there is
  no way to know which pages were newly dirtied after they were copied until
  the device stops. To avoid repeated copying of the same content, pinned
  pages are marked dirty only during the stop-and-copy phase.

Patch 15:
- With vIOMMU, IO virtual address range can get unmapped while in pre-copy
  phase of migration. In that case, unmap ioctl should return pages pinned
  in that range and QEMU should report corresponding guest physical pages
  dirty.

Patch 16:
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Patch 17:
- Added VFIO device stats to MigrationInfo

Yet TODO:
Since there is no device yet with hardware support for system memory
dirty bitmap tracking, right now there is no other API from the vendor driver
to the VFIO IOMMU module to report dirty pages. In the future, when such
hardware support is implemented, an API will be required in the kernel so that
the vendor driver can report dirty pages to the VFIO module during migration phases.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending()
    if pending bytes >= threshold_size,  call save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate till pending bytes converge and are less than threshold
                        |
    On migration completion, vCPUs stops and calls .save_live_complete_precopy
    for each active device. VFIO device is then transitioned in
     _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For VFIO device, iterate in  .save_live_complete_precopy  until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, STOPPED)
    Migration thread schedules cleanup bottom half and exits

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.

v23 -> 25
- Updated config space save and load to save config cache, emulated bits cache
  and wmask cache.
- Created idr string as suggested by Dr Dave that includes bus path.
- Updated save and load function to read/write data to mixed regions, mapped or
  trapped.
- When vIOMMU is enabled, created a mapped iova range list which also keeps
  the translated address. This list is used to mark dirty pages. This reduces
  downtime significantly with vIOMMU enabled, compared to the migration
  patches from the previous version.
- Removed get_address_limit() function from the v23 patch as this is not
  required now.

v22 -> v23
- Fixed issue reported by Yan:
https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd531c@nvidia.com/
- Sending this version to test v23 kernel version patches:
https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankhede@nvidia.com/

v18 -> v22
- A few fixes from the v18 review; not all concerns are addressed yet. I'll
  address the remaining ones in subsequent iterations.
- Sending this version to test v22 kernel version patches:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhede@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
    https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU sets.
- Dirty pages bitmap is queried from IOMMU container rather than from
  vendor driver for per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence of access to members of
  structure vfio_device_migration_info, to be followed based on Alex's suggestion.
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio device assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read on every iteration before reading data from the migration
  region; this removes the assumption that data extends to the end of the
  migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load, and similarly unmap them in the
  cleanup routines.
- Handled a race condition that caused data corruption in the migration region
  during device state save, by adding a mutex and serializing the save_buffer
  and get_dirty_pages routines.
- Skipped the get_dirty_pages routine for mapped MMIO regions of the device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section of the
  region.
- Moved PCI device functions to pci file.
- Added iteration to get the dirty page bitmap until bitmaps for all requested
  pages are copied.

Thanks,
Kirti




Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  vfio: Get migration capability flags for container
  vfio: Add function to start and stop dirty pages tracking
  vfio: create mapped iova list when vIOMMU is enabled
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  vfio: Make vfio-pci device migration capable
  qapi: Add VFIO devices migration stats in Migration stats

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              | 416 ++++++++++++++++++--
 hw/vfio/migration.c           | 855 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 135 +++++--
 hw/vfio/pci.h                 |   1 -
 hw/vfio/trace-events          |  19 +
 include/hw/vfio/vfio-common.h |  30 ++
 include/qemu/vfio-helpers.h   |   3 +
 memory.c                      |   2 +-
 migration/migration.c         |  14 +
 monitor/hmp-cmds.c            |   6 +
 qapi/migration.json           |  17 +
 12 files changed, 1454 insertions(+), 46 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0




Thread overview: 73+ messages
2020-09-22 23:24 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2020-09-23  6:38   ` Zenghui Yu
2020-09-24 22:49   ` Alex Williamson
2020-10-21  9:30     ` Zenghui Yu
2020-10-21 19:03       ` Alex Williamson
2020-09-22 23:24 ` [PATCH v26 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
2020-09-24 14:08   ` Cornelia Huck
2020-10-17 20:14     ` Kirti Wankhede
2020-09-25 20:20   ` Alex Williamson
2020-09-28  9:39     ` Cornelia Huck
2020-10-17 20:17     ` Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2020-09-24 15:02   ` Cornelia Huck
2020-09-29 11:03     ` Dr. David Alan Gilbert
2020-10-17 20:24       ` Kirti Wankhede
2020-10-20 10:51         ` Cornelia Huck
2020-10-21  5:33           ` Kirti Wankhede
2020-10-22  7:51             ` Cornelia Huck
2020-10-22 15:42               ` Kirti Wankhede
2020-10-22 15:49                 ` Cornelia Huck
2020-09-25 20:20   ` Alex Williamson
2020-10-17 20:30     ` Kirti Wankhede
2020-10-17 23:44       ` Alex Williamson
2020-10-18 17:43         ` Kirti Wankhede
2020-10-19 17:51           ` Alex Williamson
2020-10-20 10:23             ` Cornelia Huck
2020-09-22 23:24 ` [PATCH v26 06/17] vfio: Add migration state change notifier Kirti Wankhede
2020-09-25 20:20   ` Alex Williamson
2020-10-17 20:35     ` Kirti Wankhede
2020-10-19 17:57       ` Alex Williamson
2020-10-20 10:55         ` Cornelia Huck
2020-09-22 23:24 ` [PATCH v26 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2020-09-24 15:15   ` Philippe Mathieu-Daudé
2020-09-29 10:19     ` Dr. David Alan Gilbert
2020-10-17 20:36       ` Kirti Wankhede
2020-09-25 11:53   ` Cornelia Huck
2020-10-18 20:55     ` Kirti Wankhede
2020-10-20 15:51       ` Cornelia Huck
2020-09-25 20:20   ` Alex Williamson
2020-10-18 17:40     ` Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2020-09-23 11:42   ` Wang, Zhi A
2020-10-21 14:30     ` Kirti Wankhede
2020-09-25 21:02   ` Alex Williamson
2020-10-18 18:00     ` Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 09/17] vfio: Add load " Kirti Wankhede
2020-10-01 10:07   ` Cornelia Huck
2020-10-18 20:47     ` Kirti Wankhede
2020-10-20 16:25       ` Cornelia Huck
2020-09-22 23:24 ` [PATCH v26 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 11/17] vfio: Get migration capability flags for container Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
2020-09-25 21:55   ` Alex Williamson
2020-10-18 20:52     ` Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
2020-09-25 22:23   ` Alex Williamson
2020-10-19  6:01     ` Kirti Wankhede
2020-10-19 17:24       ` Alex Williamson
2020-10-19 19:15         ` Kirti Wankhede
2020-10-19 20:07           ` Alex Williamson
2020-09-22 23:24 ` [PATCH v26 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
2020-09-22 23:24 ` [PATCH v26 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
2020-09-25 12:17   ` Cornelia Huck
2020-09-22 23:24 ` [PATCH v26 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
2020-09-24 15:14   ` Eric Blake
2020-09-25 22:55   ` Alex Williamson
2020-09-29 10:40   ` Dr. David Alan Gilbert
2020-09-23  7:06 ` [PATCH QEMU v25 00/17] Add migration support for VFIO devices Zenghui Yu
  -- strict thread matches above, loose matches on Subject: below --
2020-06-20 20:21 Kirti Wankhede
