* [PATCH QEMU v25 00/17] Add migration support for VFIO devices
@ 2020-06-20 20:21 Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
                   ` (16 more replies)
  0 siblings, 17 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hi,

Sorry for the slight delay with this iteration. I have addressed most of the
concerns raised in the v16, v18 and v23 reviews. More details about the changes
are below.

This patch set adds migration support for VFIO devices in QEMU.

This patch set includes the following patches:
Patch 1-2:
- A few code refactors

Patch 3:
- Added save and restore functions for PCI configuration space. Used
  pci_device_save() and pci_device_load() so that the config space cache is
  saved and restored. Similarly, to keep the emulated_config_bits cache and
  vdev->pdev.wmask in sync after migration, these are also saved and restored.

Patch 4-9:
- Generic migration functionality for VFIO devices.
  * This patch set adds functionality only for PCI devices, but it can be
    extended to other VFIO devices.
  * Added all the basic functions required for the pre-copy, stop-and-copy and
    resume phases of migration.
  * Added a state change notifier; from that notifier function, the VFIO
    device's state change is conveyed to the VFIO device driver.
  * During the save setup phase and the resume/load setup phase, the migration
    region is queried and used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    iteration functionality during the pre-copy phase.
  * In .save_live_complete_precopy, that is, in the stop-and-copy phase,
    iteration to read data from the VFIO device driver continues until the
    pending bytes returned by the driver reach zero (a condensed sketch of
    this loop follows below).
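
For reference, a condensed sketch of that stop-and-copy loop, reusing the
helper names introduced in patch 08 (vfio_update_pending() and
vfio_save_buffer()); error handling is simplified and the device-state
transition is omitted, so treat this as illustration rather than a patch hunk:

    static int sketch_stop_and_copy(QEMUFile *f, VFIODevice *vbasedev)
    {
        VFIOMigration *migration = vbasedev->migration;
        int ret = vfio_update_pending(vbasedev);

        while (!ret && migration->pending_bytes > 0) {
            qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
            if (vfio_save_buffer(f, vbasedev) < 0) { /* writes size + data */
                return -EINVAL;
            }
            ret = vfio_update_pending(vbasedev);
        }
        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
        return ret;
    }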

Patch 10:
- Set DIRTY_MEMORY_MIGRATION flag in dirty log mask for migration with vIOMMU
  enabled.

Patch 11:
- Get migration capability from kernel module.

Patch 12-14:
- Add functions to start and stop dirty pages tracking.
- Add vfio_listener_log_sync to mark dirty pages. The dirty pages bitmap is
  queried per container. All pages pinned by the vendor driver through the
  vfio_pin_pages external API have to be marked as dirty during migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by the vendor driver can also be written by the
  device. As of now there is no device which has hardware support for dirty
  page tracking, so all pages pinned by the vendor driver should be considered
  dirty.
  In QEMU, marking pages dirty is only done while the device is in the
  stop-and-copy phase: if pages were marked dirty during the pre-copy phase and
  their content transferred from source to destination, there would be no way
  to know which pages were dirtied again between that copy and the point where
  the device stops. To avoid repeated copies of the same content, pinned pages
  are marked dirty only during the stop-and-copy phase (see the sketch below).
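
A minimal sketch of the predicate this implies (the helper name is
illustrative, not taken from the patches): a device is in stop-and-copy when
_SAVING is set and _RUNNING is clear, and only then are its pinned pages
reported dirty.

    /* Illustrative only: gate dirty-page reporting on stop-and-copy. */
    static bool sketch_device_stopped_and_saving(VFIODevice *vbasedev)
    {
        return (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
               !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING);
    }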

Patch 15:
- With vIOMMU, an IO virtual address range can get unmapped while in the
  pre-copy phase of migration. In that case, the unmap ioctl should return the
  pages pinned in that range and QEMU should report the corresponding guest
  physical pages dirty.

Patch 16:
- Make the VFIO PCI device migration capable. If the migration region is not
  provided by the driver, migration is blocked.

Patch 17:
- Added VFIO device stats to MigrationInfo

Yet TODO:
Since there is no device which has hardware support for system memory
dirty bitmap tracking, right now there is no other API from the vendor driver
to the VFIO IOMMU module to report dirty pages. In future, when such hardware
support is implemented, an API will be required in the kernel so that the
vendor driver can report dirty pages to the VFIO module during the migration
phases.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending();
    if pending bytes >= threshold_size, call .save_live_iterate().
    Data of the VFIO device for the pre-copy phase is copied.
    Iterate until pending bytes converge and are less than the threshold.
                        |
    On migration completion, vCPUs are stopped and .save_live_complete_precopy
    is called for each active device. The VFIO device is then transitioned
    into the _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For the VFIO device, iterate in .save_live_complete_precopy until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
    Migration thread schedules the cleanup bottom half and exits

Live migration resume path:
    Incoming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, _STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, .load_cleanup is called for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.

v23 -> v25
- Updated config space save and load to save the config cache, the emulated
  bits cache and the wmask cache.
- Created the id string, as suggested by Dr. Dave, so that it includes the bus
  path.
- Updated the save and load functions to read/write data to mixed regions,
  mapped or trapped.
- When vIOMMU is enabled, created a mapped iova range list which also keeps the
  translated address. This list is used to mark dirty pages. This reduces
  downtime significantly with vIOMMU enabled, compared to the migration patches
  from the previous version.
- Removed the get_address_limit() function from the v23 patch as it is not
  required now.

v22 -> v23
- Fixed issue reported by Yan:
  https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd531c@nvidia.com/
- Sending this version to test against the v23 kernel patches:
  https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankhede@nvidia.com/

v18 -> v22
- A few fixes from the v18 review, but not all concerns are addressed yet.
  I'll address the remaining concerns in subsequent iterations.
- Sending this version to test against the v22 kernel patches:
  https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhede@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
    https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split the patch set into 2 sets: kernel and QEMU.
- The dirty pages bitmap is queried from the IOMMU container rather than from
  the vendor driver per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added a descriptive comment about the sequence of access of the members of
  structure vfio_device_migration_info, based on Alex's suggestion.
- Updated the get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps for
  get_object, save_config and load_config.
- Fixed multiple nitpicks.
- Tested live migration with multiple VFIO devices assigned to a VM.

v3 -> v4:
- Added one more bit for the _RESUMING flag to be set explicitly.
- The data_offset field is read-only for the user space application.
- data_size is read on every iteration before reading data from the migration
  region; this removes the assumption that data extends to the end of the
  migration region.
- If the vendor driver supports mappable sparse regions, map those regions
  during the setup state of save/load, and similarly unmap them from the
  cleanup routines.
- Handled a race condition that caused data corruption in the migration region
  during device state save, by adding a mutex and serializing the save_buffer
  and get_dirty_pages routines.
- Skipped calling the get_dirty_pages routine for mapped MMIO regions of the
  device.
- Added trace events.
- Split into multiple functional patches.

v2 -> v3:
- Removed the enum of VFIO device states; defined the VFIO device state with
  2 bits.
- Re-structured vfio_device_migration_info to keep it minimal, and defined the
  action on read and write access of its members.

v1 -> v2:
- Defined the MIGRATION region type and sub-type, which should be used with
  the region type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at
  offset 0 of the migration region.
- Replaced ioctl with read/write for the trapped part of the migration region.
- Added support for both types of access, trapped or mmapped, for the data
  section of the region.
- Moved PCI device functions to the pci file.
- Added iteration to get the dirty page bitmap until bitmaps for all requested
  pages are copied.

Thanks,
Kirti




Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  vfio: Get migration capability flags for container
  vfio: Add function to start and stop dirty pages tracking
  vfio: create mapped iova list when vIOMMU is enabled
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  vfio: Make vfio-pci device migration capable
  qapi: Add VFIO devices migration stats in Migration stats

 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/common.c              | 416 ++++++++++++++++++--
 hw/vfio/migration.c           | 855 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 135 +++++--
 hw/vfio/pci.h                 |   1 -
 hw/vfio/trace-events          |  19 +
 include/hw/vfio/vfio-common.h |  30 ++
 include/qemu/vfio-helpers.h   |   3 +
 memory.c                      |   2 +-
 migration/migration.c         |  14 +
 monitor/hmp-cmds.c            |   6 +
 qapi/migration.json           |  17 +
 12 files changed, 1454 insertions(+), 46 deletions(-)
 create mode 100644 hw/vfio/migration.c

-- 
2.7.0




* [PATCH QEMU v25 01/17] vfio: Add function to unmap VFIO region
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

This function will be used for the migration region.
The migration region is mmapped when migration starts and will be unmapped
when migration is complete.
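
An illustrative lifecycle sketch (the real callers arrive with
.save_setup/.save_cleanup in patch 07; error handling omitted):

    /* Illustrative only: mmap the migration region while saving runs. */
    static void sketch_migration_region_lifecycle(VFIORegion *region)
    {
        vfio_region_mmap(region);    /* at .save_setup: migration starts */
        /* ... device data is saved through the region ... */
        vfio_region_unmap(region);   /* at .save_cleanup: migration done */
    }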

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/common.c              | 32 ++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0c4..90e9a854d82c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -925,6 +925,18 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
     return 0;
 }
 
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+                            region->mmaps[index].offset,
+                            region->mmaps[index].offset +
+                            region->mmaps[index].size - 1);
+    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+    object_unparent(OBJECT(&region->mmaps[index].mem));
+    region->mmaps[index].mmap = NULL;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
     int i, prot = 0;
@@ -955,10 +967,7 @@ int vfio_region_mmap(VFIORegion *region)
             region->mmaps[i].mmap = NULL;
 
             for (i--; i >= 0; i--) {
-                memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-                munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-                object_unparent(OBJECT(&region->mmaps[i].mem));
-                region->mmaps[i].mmap = NULL;
+                vfio_subregion_unmap(region, i);
             }
 
             return ret;
@@ -983,6 +992,21 @@ int vfio_region_mmap(VFIORegion *region)
     return 0;
 }
 
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            vfio_subregion_unmap(region, i);
+        }
+    }
+}
+
 void vfio_region_exit(VFIORegion *region)
 {
     int i;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33ffd..8cdc27946cb8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -111,6 +111,7 @@ vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Region %s unmap [0x%lx - 0x%lx]"
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd564209ac71..8d7a0fbb1046 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                       int index, const char *name);
 int vfio_region_mmap(VFIORegion *region);
 void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
+void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-- 
2.7.0




* [PATCH QEMU v25 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Hook up the vfio_get_object callback for PCI devices.
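
A usage sketch of the new callback from generic VFIO code (illustrative;
patch 04 uses exactly this pattern to anchor the migration region):

    /* Illustrative caller: NULL if the device type provides no object. */
    static Object *sketch_get_object(VFIODevice *vbasedev)
    {
        if (!vbasedev->ops->vfio_get_object) {
            return NULL;
        }
        return vbasedev->ops->vfio_get_object(vbasedev);
    }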

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/vfio/pci.c                 | 8 ++++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6838bcc4b307..27f8872db2b1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2400,10 +2400,18 @@ static void vfio_pci_compute_needs_reset(VFIODevice *vbasedev)
     }
 }
 
+static Object *vfio_pci_get_object(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+    return OBJECT(vdev);
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
+    .vfio_get_object = vfio_pci_get_object,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8d7a0fbb1046..74261feaeac9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -119,6 +119,7 @@ struct VFIODeviceOps {
     void (*vfio_compute_needs_reset)(VFIODevice *vdev);
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
+    Object *(*vfio_get_object)(VFIODevice *vdev);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-22 20:28   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

These functions save and restore PCI device specific data, i.e. the config
space of the PCI device.
Save and restore were tested with the MSI and MSI-X interrupt types.
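
For reference, the stream layout produced by vfio_pci_save_config() below,
summarized from the code (a reading aid, not a separate format definition):

    /*
     * emulated_config_bits[config_size]
     * wmask[config_size]
     * pci_device_save() vmstate section
     * be32 interrupt type (vdev->interrupt)
     * msix_save() data, present only when the type is VFIO_INT_MSIX
     */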

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  2 +
 2 files changed, 97 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 27f8872db2b1..5ba340aee1d4 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "migration/qemu-file.h"
 
 #define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
@@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
+static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+
+    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
+    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
+    pci_device_save(pdev, f);
+
+    qemu_put_be32(f, vdev->interrupt);
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        msix_save(pdev, f);
+    }
+}
+
+static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t interrupt_type;
+    uint16_t pci_cmd;
+    int i, ret;
+
+    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
+    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);
+
+    ret = pci_device_load(pdev, f);
+    if (ret) {
+        return ret;
+    }
+
+    /* restore pci bar configuration */
+    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+    vfio_pci_write_config(pdev, PCI_COMMAND,
+                          pci_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY), 2);
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        uint32_t bar = pci_default_read_config(pdev,
+                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
+
+        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
+    }
+
+    interrupt_type = qemu_get_be32(f);
+
+    if (interrupt_type == VFIO_INT_MSI) {
+        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
+        bool msi_64bit;
+
+        /* restore msi configuration */
+        msi_flags = pci_default_read_config(pdev,
+                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
+        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags & ~PCI_MSI_FLAGS_ENABLE, 2);
+
+        msi_addr_lo = pci_default_read_config(pdev,
+                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
+                              msi_addr_lo, 4);
+
+        if (msi_64bit) {
+            msi_addr_hi = pci_default_read_config(pdev,
+                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
+            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
+                                  msi_addr_hi, 4);
+        }
+
+        msi_data = pci_default_read_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                2);
+
+        vfio_pci_write_config(pdev,
+                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
+                msi_data, 2);
+
+        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
+                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
+    } else if (interrupt_type == VFIO_INT_MSIX) {
+        uint16_t offset;
+
+        offset = pci_default_read_config(pdev,
+                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
+        /* load enable bit and maskall bit */
+        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
+                              offset, 2);
+        msix_load(pdev, f);
+    }
+    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
+    return 0;
+}
+
 static VFIODeviceOps vfio_pci_ops = {
     .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
     .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
     .vfio_eoi = vfio_intx_eoi,
     .vfio_get_object = vfio_pci_get_object,
+    .vfio_save_config = vfio_pci_save_config,
+    .vfio_load_config = vfio_pci_load_config,
 };
 
 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 74261feaeac9..d69a7f3ae31e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -120,6 +120,8 @@ struct VFIODeviceOps {
     int (*vfio_hot_reset_multi)(VFIODevice *vdev);
     void (*vfio_eoi)(VFIODevice *vdev);
     Object *(*vfio_get_object)(VFIODevice *vdev);
+    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
+    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
 };
 
 typedef struct VFIOGroup {
-- 
2.7.0




* [PATCH QEMU v25 04/17] vfio: Add migration region initialization and finalize function
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (2 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-23  7:54   ` Cornelia Huck
  2020-06-20 20:21 ` [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query is successful and
migration region initialization succeeds, then migration is supported;
otherwise migration is blocked.
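
A sketch of the expected wiring (patch 16 hooks this up for vfio-pci; the
call sites shown here are assumed for illustration, not quoted from it):

    /* Illustrative: probe at realize; finalize at device teardown. */
    static void sketch_wire_migration(VFIODevice *vbasedev, Error **errp)
    {
        /* Registers a migration blocker itself if no region is found. */
        vfio_migration_probe(vbasedev, errp);
    }

    static void sketch_unwire_migration(VFIODevice *vbasedev)
    {
        vfio_migration_finalize(vbasedev);  /* also drops any blocker */
    }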

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/Makefile.objs         |   2 +-
 hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   3 +
 include/hw/vfio/vfio-common.h |   9 +++
 4 files changed, 155 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/migration.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index 9bb1c09e8477..8b296c889ed9 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,4 +1,4 @@
-obj-y += common.o spapr.o
+obj-y += common.o spapr.o migration.o
 obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
 obj-$(CONFIG_VFIO_CCW) += ccw.o
 obj-$(CONFIG_VFIO_PLATFORM) += platform.o
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
new file mode 100644
index 000000000000..48ac385d80a7
--- /dev/null
+++ b/hw/vfio/migration.c
@@ -0,0 +1,142 @@
+/*
+ * Migration support for VFIO devices
+ *
+ * Copyright NVIDIA, Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "cpu.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/register.h"
+#include "migration/blocker.h"
+#include "migration/misc.h"
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "exec/ram_addr.h"
+#include "pci.h"
+#include "trace.h"
+
+static void vfio_migration_region_exit(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (!migration) {
+        return;
+    }
+
+    if (migration->region.size) {
+        vfio_region_exit(&migration->region);
+        vfio_region_finalize(&migration->region);
+    }
+}
+
+static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    Object *obj = NULL;
+    int ret = -EINVAL;
+
+    if (!vbasedev->ops->vfio_get_object) {
+        return ret;
+    }
+
+    obj = vbasedev->ops->vfio_get_object(vbasedev);
+    if (!obj) {
+        return ret;
+    }
+
+    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
+                            "migration");
+    if (ret) {
+        error_report("%s: Failed to setup VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    if (!migration->region.size) {
+        ret = -EINVAL;
+        error_report("%s: Invalid region size of VFIO migration region %d: %s",
+                     vbasedev->name, index, strerror(-ret));
+        goto err;
+    }
+
+    return 0;
+
+err:
+    vfio_migration_region_exit(vbasedev);
+    return ret;
+}
+
+static int vfio_migration_init(VFIODevice *vbasedev,
+                               struct vfio_region_info *info)
+{
+    int ret;
+
+    vbasedev->migration = g_new0(VFIOMigration, 1);
+
+    ret = vfio_migration_region_init(vbasedev, info->index);
+    if (ret) {
+        error_report("%s: Failed to initialise migration region",
+                     vbasedev->name);
+        g_free(vbasedev->migration);
+        vbasedev->migration = NULL;
+    }
+
+    return ret;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_region_info *info;
+    Error *local_err = NULL;
+    int ret;
+
+    ret = vfio_get_dev_region_info(vbasedev, VFIO_REGION_TYPE_MIGRATION,
+                                   VFIO_REGION_SUBTYPE_MIGRATION, &info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    ret = vfio_migration_init(vbasedev, info);
+    if (ret) {
+        goto add_blocker;
+    }
+
+    trace_vfio_migration_probe(vbasedev->name, info->index);
+    g_free(info);
+    return 0;
+
+add_blocker:
+    error_setg(&vbasedev->migration_blocker,
+               "VFIO device doesn't support migration");
+    g_free(info);
+
+    ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+    return ret;
+}
+
+void vfio_migration_finalize(VFIODevice *vbasedev)
+{
+    if (vbasedev->migration_blocker) {
+        migrate_del_blocker(vbasedev->migration_blocker);
+        error_free(vbasedev->migration_blocker);
+        vbasedev->migration_blocker = NULL;
+    }
+
+    vfio_migration_region_exit(vbasedev);
+    g_free(vbasedev->migration);
+}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8cdc27946cb8..fd034ac53684 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -143,3 +143,6 @@ vfio_display_edid_link_up(void) ""
 vfio_display_edid_link_down(void) ""
 vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
 vfio_display_edid_write_error(void) ""
+
+# migration.c
+vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d69a7f3ae31e..d4b268641173 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,10 @@ typedef struct VFIORegion {
     uint8_t nr; /* cache the region number for debug */
 } VFIORegion;
 
+typedef struct VFIOMigration {
+    VFIORegion region;
+} VFIOMigration;
+
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
     QLIST_HEAD(, VFIOContainer) containers;
@@ -113,6 +117,8 @@ typedef struct VFIODevice {
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
+    VFIOMigration *migration;
+    Error *migration_blocker;
 } VFIODevice;
 
 struct VFIODeviceOps {
@@ -204,4 +210,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
 int vfio_spapr_remove_window(VFIOContainer *container,
                              hwaddr offset_within_address_space);
 
+int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
+void vfio_migration_finalize(VFIODevice *vbasedev);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.7.0




* [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (3 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-22 22:50   ` Alex Williamson
  2020-06-23  8:07   ` Cornelia Huck
  2020-06-20 20:21 ` [PATCH QEMU v25 06/17] vfio: Add migration state change notifier Kirti Wankhede
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

The VM state change handler gets called on a change in the VM's state. This is
used to set the VFIO device state to _RUNNING when the VM starts running, and
to clear it when the VM stops.
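
Summarizing the mask/value pairs the handler below passes to
vfio_migration_set_state() (a reading aid for the code in this patch):

    /*
     * VM starts running:  value = _RUNNING; if _RESUMING was set, mask =
     *                     ~_RESUMING (clear it, keep the rest), otherwise
     *                     mask = 0 so only _RUNNING remains set.
     * VM stops running:   mask = ~_RUNNING, value = 0 (clear _RUNNING,
     *                     keep the remaining bits).
     */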

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  2 +
 include/hw/vfio/vfio-common.h |  4 ++
 3 files changed, 93 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 48ac385d80a7..fcecc0bb0874 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include <linux/vfio.h>
 
+#include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
@@ -74,6 +75,85 @@ err:
     return ret;
 }
 
+static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
+                                    uint32_t value)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint32_t device_state;
+    int ret;
+
+    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to read device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+        return ret;
+    }
+
+    device_state = (device_state & mask) | value;
+
+    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
+        return -EINVAL;
+    }
+
+    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
+                 region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                              device_state));
+    if (ret < 0) {
+        error_report("%s: Failed to set device state %d %s",
+                     vbasedev->name, ret, strerror(errno));
+
+        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                device_state));
+        if (ret < 0) {
+            error_report("%s: On failure, failed to read device state %d %s",
+                    vbasedev->name, ret, strerror(errno));
+            return ret;
+        }
+
+        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
+            error_report("%s: Device is in error state 0x%x",
+                         vbasedev->name, device_state);
+            return -EFAULT;
+        }
+    }
+
+    vbasedev->device_state = device_state;
+    trace_vfio_migration_set_state(vbasedev->name, device_state);
+    return 0;
+}
+
+static void vfio_vmstate_change(void *opaque, int running, RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if ((vbasedev->vm_running != running)) {
+        int ret;
+        uint32_t value = 0, mask = 0;
+
+        if (running) {
+            value = VFIO_DEVICE_STATE_RUNNING;
+            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
+                mask = ~VFIO_DEVICE_STATE_RESUMING;
+            }
+        } else {
+            mask = ~VFIO_DEVICE_STATE_RUNNING;
+        }
+
+        ret = vfio_migration_set_state(vbasedev, mask, value);
+        if (ret) {
+            error_report("%s: Failed to set device state 0x%x",
+                         vbasedev->name, value & mask);
+        }
+        vbasedev->vm_running = running;
+        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+                                  value & mask);
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -87,8 +167,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
                      vbasedev->name);
         g_free(vbasedev->migration);
         vbasedev->migration = NULL;
+        return ret;
     }
 
+    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
+                                                          vbasedev);
     return ret;
 }
 
@@ -131,6 +214,10 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+    if (vbasedev->vm_state) {
+        qemu_del_vm_change_state_handler(vbasedev->vm_state);
+    }
+
     if (vbasedev->migration_blocker) {
         migrate_del_blocker(vbasedev->migration_blocker);
         error_free(vbasedev->migration_blocker);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index fd034ac53684..14b0a86c0035 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index d4b268641173..3d18eb146b33 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
+#include "sysemu/sysemu.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -119,6 +120,9 @@ typedef struct VFIODevice {
     unsigned int flags;
     VFIOMigration *migration;
     Error *migration_blocker;
+    VMChangeStateEntry *vm_state;
+    uint32_t device_state;
+    int vm_running;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [PATCH QEMU v25 06/17] vfio: Add migration state change notifier
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (4 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-23  8:10   ` Cornelia Huck
  2020-06-20 20:21 ` [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added a migration state change notifier to get notifications on migration
state changes. These states are translated to the VFIO device state and
conveyed to the vendor driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  5 +++--
 include/hw/vfio/vfio-common.h |  1 +
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index fcecc0bb0874..e30bd8768701 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -154,6 +154,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
     }
 }
 
+static void vfio_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
+    int ret;
+
+    trace_vfio_migration_state_notifier(vbasedev->name,
+                                        MigrationStatus_str(s->state));
+
+    switch (s->state) {
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        ret = vfio_migration_set_state(vbasedev,
+                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
+                      VFIO_DEVICE_STATE_RUNNING);
+        if (ret) {
+            error_report("%s: Failed to set state RUNNING", vbasedev->name);
+        }
+    }
+}
+
 static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
@@ -172,6 +194,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
+    vbasedev->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&vbasedev->migration_state);
     return ret;
 }
 
@@ -214,6 +238,11 @@ add_blocker:
 
 void vfio_migration_finalize(VFIODevice *vbasedev)
 {
+
+    if (vbasedev->migration_state.notify) {
+        remove_migration_state_change_notifier(&vbasedev->migration_state);
+    }
+
     if (vbasedev->vm_state) {
         qemu_del_vm_change_state_handler(vbasedev->vm_state);
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 14b0a86c0035..bd3d47b005cb 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -146,5 +146,6 @@ vfio_display_edid_write_error(void) ""
 
 # migration.c
 vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
-vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
-vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 3d18eb146b33..28f55f66d019 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -123,6 +123,7 @@ typedef struct VFIODevice {
     VMChangeStateEntry *vm_state;
     uint32_t device_state;
     int vm_running;
+    Notifier migration_state;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.7.0




* [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (5 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 06/17] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-22 22:50   ` Alex Williamson
  2020-06-26 14:31   ` Dr. David Alan Gilbert
  2020-06-20 20:21 ` [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Define flags to be used as delimiters in the migration file stream.
Added .save_setup and .save_cleanup functions. The migration region is mapped
and unmapped from these functions at the source during the saving or pre-copy
phase.
The VFIO device state is set depending on the VM's state: during live
migration the VM is running when .save_setup is called, so the
_SAVING | _RUNNING state is set for the VFIO device; during save-restore the
VM is paused, so only the _SAVING state is set.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |  2 ++
 2 files changed, 94 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e30bd8768701..133bb5b1b3b2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -8,12 +8,15 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+#include "qemu/cutils.h"
 #include <linux/vfio.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "cpu.h"
 #include "migration/migration.h"
+#include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
 #include "migration/blocker.h"
@@ -24,6 +27,17 @@
 #include "pci.h"
 #include "trace.h"
 
+/*
+ * Flags used as delimiter:
+ * 0xffffffff => MSB 32-bit all 1s
+ * 0xef10     => emulated (virtual) function IO
+ * 0x0000     => 16-bits reserved for flags
+ */
+#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
+#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
+#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -126,6 +140,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+/* ---------------------------------------------------------------------- */
+
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    trace_vfio_save_setup(vbasedev->name);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+    if (migration->region.mmaps) {
+        qemu_mutex_lock_iothread();
+        ret = vfio_region_mmap(&migration->region);
+        qemu_mutex_unlock_iothread();
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state SAVING", vbasedev->name);
+        return ret;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return 0;
+}
+
+static void vfio_save_cleanup(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->region.mmaps) {
+        vfio_region_unmap(&migration->region);
+    }
+    trace_vfio_save_cleanup(vbasedev->name);
+}
+
+static SaveVMHandlers savevm_vfio_handlers = {
+    .save_setup = vfio_save_setup,
+    .save_cleanup = vfio_save_cleanup,
+};
+
+/* ---------------------------------------------------------------------- */
+
 static void vfio_vmstate_change(void *opaque, int running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
@@ -180,6 +253,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
                                struct vfio_region_info *info)
 {
     int ret;
+    char id[256] = "";
 
     vbasedev->migration = g_new0(VFIOMigration, 1);
 
@@ -192,6 +266,24 @@ static int vfio_migration_init(VFIODevice *vbasedev,
         return ret;
     }
 
+    if (vbasedev->ops->vfio_get_object) {
+        Object *obj = vbasedev->ops->vfio_get_object(vbasedev);
+
+        if (obj) {
+            DeviceState *dev = DEVICE(obj);
+            char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
+
+            if (oid) {
+                pstrcpy(id, sizeof(id), oid);
+                pstrcat(id, sizeof(id), "/");
+                g_free(oid);
+            }
+        }
+    }
+    pstrcat(id, sizeof(id), "vfio");
+
+    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
+                         vbasedev);
     vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
                                                           vbasedev);
     vbasedev->migration_state.notify = vfio_migration_state_notifier;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bd3d47b005cb..86c18def016e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
 vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
+vfio_save_setup(const char *name) " (%s)"
+vfio_save_cleanup(const char *name) " (%s)"
-- 
2.7.0




* [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (6 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-22 22:50   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 09/17] vfio: Add load " Kirti Wankhede
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.

In _SAVING|_RUNNING device state or pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through the steps below.
- read data_offset - this directs the kernel driver to write data to the
  staging buffer.
- read data_size - the amount of data in bytes written by the vendor driver
  in the migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write data packet to file stream as below:
{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
VFIO_MIG_FLAG_END_OF_STATE }

In _SAVING device state or stop-and-copy phase
a. read config space of device and save to migration file stream. This
   doesn't need to be from vendor driver. Any other special config state
   from driver can be saved as data in following iteration.
b. read pending_bytes. If pending_bytes > 0, go through the steps below.
c. read data_offset - this directs the kernel driver to write data to the
   staging buffer.
d. read data_size - the amount of data in bytes written by the vendor driver
   in the migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}

When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving to the next steps.
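
The resulting stream framing, condensed from the steps above (pre-copy
packets are emitted by .save_live_iterate, the final sequence by
.save_live_complete_precopy):

    /*
     * pre-copy, per iteration:
     *   { DEV_DATA_STATE, data_size, data..., END_OF_STATE }
     *
     * stop-and-copy:
     *   { DEV_CONFIG_STATE, config data, END_OF_STATE }
     *   { DEV_DATA_STATE, data_size, data... }  repeated while pending > 0
     *   { END_OF_STATE }
     */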

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/migration.c           | 283 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   6 +
 include/hw/vfio/vfio-common.h |   1 +
 3 files changed, 290 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 133bb5b1b3b2..ef1150c1ff02 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -140,6 +140,168 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
     return 0;
 }
 
+static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
+                                   uint64_t data_size, uint64_t *size)
+{
+    void *ptr = NULL;
+    int i;
+
+    if (!region->mmaps) {
+        *size = data_size;
+        return ptr;
+    }
+
+    /* check if data_offset is within sparse mmap areas */
+    for (i = 0; i < region->nr_mmaps; i++) {
+        VFIOMmap *map = region->mmaps + i;
+
+        if ((data_offset >= map->offset) &&
+            (data_offset < map->offset + map->size)) {
+            ptr = map->mmap + data_offset - map->offset;
+
+            if (data_offset + data_size <= map->offset + map->size) {
+                *size = data_size;
+            } else {
+                *size = map->offset + map->size - data_offset;
+            }
+            break;
+        }
+    }
+
+    if (!ptr) {
+        uint64_t limit = 0;
+
+        /*
+         * data_offset is not within sparse mmap areas, find size of non-mapped
+         * area. Check through all list since region->mmaps list is not sorted.
+         */
+        for (i = 0; i < region->nr_mmaps; i++) {
+            VFIOMmap *map = region->mmaps + i;
+
+            if ((data_offset < map->offset) &&
+                (!limit || limit > map->offset)) {
+                limit = map->offset;
+            }
+        }
+
+        *size = limit ? limit - data_offset : data_size;
+    }
+    return ptr;
+}
+
+static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t data_offset = 0, data_size = 0, size;
+    int ret;
+
+    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_offset));
+    if (ret != sizeof(data_offset)) {
+        error_report("%s: Failed to get migration buffer data offset %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             data_size));
+    if (ret != sizeof(data_size)) {
+        error_report("%s: Failed to get migration buffer data size %d",
+                     vbasedev->name, ret);
+        return -EINVAL;
+    }
+
+    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
+                           migration->pending_bytes);
+
+    qemu_put_be64(f, data_size);
+    size = data_size;
+
+    while (size) {
+        void *buf = NULL;
+        bool buffer_mmaped;
+        uint64_t sec_size;
+
+        buf = get_data_section_size(region, data_offset, size, &sec_size);
+
+        buffer_mmaped = (buf != NULL);
+
+        if (!buffer_mmaped) {
+            buf = g_try_malloc(sec_size);
+            if (!buf) {
+                error_report("%s: Error allocating buffer ", __func__);
+                return -ENOMEM;
+            }
+
+            ret = pread(vbasedev->fd, buf, sec_size,
+                        region->fd_offset + data_offset);
+            if (ret != sec_size) {
+                error_report("%s: Failed to get migration data %d",
+                             vbasedev->name, ret);
+                g_free(buf);
+                return -EINVAL;
+            }
+        }
+
+        qemu_put_buffer(f, buf, sec_size);
+
+        if (!buffer_mmaped) {
+            g_free(buf);
+        }
+        size -= sec_size;
+        data_offset += sec_size;
+    }
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    return data_size;
+}
+
+static int vfio_update_pending(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIORegion *region = &migration->region;
+    uint64_t pending_bytes = 0;
+    int ret;
+
+    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
+                region->fd_offset + offsetof(struct vfio_device_migration_info,
+                                             pending_bytes));
+    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
+        error_report("%s: Failed to get pending bytes %d",
+                     vbasedev->name, ret);
+        migration->pending_bytes = 0;
+        return (ret < 0) ? ret : -EINVAL;
+    }
+
+    migration->pending_bytes = pending_bytes;
+    trace_vfio_update_pending(vbasedev->name, pending_bytes);
+    return 0;
+}
+
+static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
+
+    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
+        vbasedev->ops->vfio_save_config(vbasedev, f);
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_device_config_state(vbasedev->name);
+
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -192,9 +354,130 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_save_pending(QEMUFile *f, void *opaque,
+                              uint64_t threshold_size,
+                              uint64_t *res_precopy_only,
+                              uint64_t *res_compatible,
+                              uint64_t *res_postcopy_only)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return;
+    }
+
+    *res_precopy_only += migration->pending_bytes;
+
+    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
+                            *res_postcopy_only, *res_compatible);
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret, data_size;
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+
+    if (migration->pending_bytes == 0) {
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+
+        if (migration->pending_bytes == 0) {
+            /* no more data to transfer; move to the completion phase */
+            return 1;
+        }
+    }
+
+    data_size = vfio_save_buffer(f, vbasedev);
+
+    if (data_size < 0) {
+        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
+                     strerror(errno));
+        return data_size;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    trace_vfio_save_iterate(vbasedev->name, data_size);
+
+    return 0;
+}
+
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
+                                   VFIO_DEVICE_STATE_SAVING);
+    if (ret) {
+        error_report("%s: Failed to set state STOP and SAVING",
+                     vbasedev->name);
+        return ret;
+    }
+
+    ret = vfio_save_device_config_state(f, opaque);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_update_pending(vbasedev);
+    if (ret) {
+        return ret;
+    }
+
+    while (migration->pending_bytes > 0) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+        ret = vfio_save_buffer(f, vbasedev);
+        if (ret < 0) {
+            error_report("%s: Failed to save buffer", vbasedev->name);
+            return ret;
+        } else if (ret == 0) {
+            break;
+        }
+
+        ret = vfio_update_pending(vbasedev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
+    if (ret) {
+        error_report("%s: Failed to set state STOPPED", vbasedev->name);
+        return ret;
+    }
+
+    trace_vfio_save_complete_precopy(vbasedev->name);
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .save_live_pending = vfio_save_pending,
+    .save_live_iterate = vfio_save_iterate,
+    .save_live_complete_precopy = vfio_save_complete_precopy,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 86c18def016e..9a1c5e17d97f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -151,3 +151,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
 vfio_save_setup(const char *name) " (%s)"
 vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
+vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
+vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
+vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 28f55f66d019..c78033e4149d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -60,6 +60,7 @@ typedef struct VFIORegion {
 
 typedef struct VFIOMigration {
     VFIORegion region;
+    uint64_t pending_bytes;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (7 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-24 18:54   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Sequence during the _RESUMING device state:
While data for this device is available, repeat the steps below:
a. read data_offset to learn where the user application should write data.
b. write data_size bytes of data to the migration region at data_offset.
c. write data_size, which signals the vendor driver that data has been
   written to the staging buffer.

To the user the data is opaque; it must be written in the same order in
which it was received (a sketch of this sequence follows below).
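
A minimal sketch of the per-chunk resume sequence above (illustrative
only, not part of this patch; assumes fd is the device fd and region_off
the migration region's file offset, with error handling trimmed):

    #include <linux/vfio.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Write one chunk of opaque migration data received from the
     * source into the device's staging buffer. */
    static int resume_write_chunk(int fd, off_t region_off,
                                  const void *data, __u64 data_size)
    {
        __u64 data_offset;

        /* a. read data_offset: where the data must be written */
        if (pread(fd, &data_offset, sizeof(data_offset), region_off +
                  offsetof(struct vfio_device_migration_info, data_offset))
            != sizeof(data_offset)) {
            return -1;
        }

        /* b. write data_size bytes at data_offset in the region */
        if (pwrite(fd, data, data_size, region_off + data_offset)
            != (ssize_t)data_size) {
            return -1;
        }

        /* c. write data_size to tell the vendor driver the staging
         *    buffer now holds data_size bytes to consume */
        if (pwrite(fd, &data_size, sizeof(data_size), region_off +
                   offsetof(struct vfio_device_migration_info, data_size))
            != sizeof(data_size)) {
            return -1;
        }
        return 0;
    }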

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   3 +
 2 files changed, 180 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ef1150c1ff02..faacea5327cb 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -302,6 +302,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    uint64_t data;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
+        int ret;
+
+        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
+        if (ret) {
+            error_report("%s: Failed to load device config space",
+                         vbasedev->name);
+            return ret;
+        }
+    }
+
+    data = qemu_get_be64(f);
+    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        error_report("%s: Failed loading device config space, "
+                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
+        return -EINVAL;
+    }
+
+    trace_vfio_load_device_config_state(vbasedev->name);
+    return qemu_file_get_error(f);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -472,12 +499,162 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }
 
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+
+    if (migration->region.mmaps) {
+        ret = vfio_region_mmap(&migration->region);
+        if (ret) {
+            error_report("%s: Failed to mmap VFIO migration region %d: %s",
+                         vbasedev->name, migration->region.nr,
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
+                                   VFIO_DEVICE_STATE_RESUMING);
+    if (ret) {
+        error_report("%s: Failed to set state RESUMING", vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_load_cleanup(void *opaque)
+{
+    vfio_save_cleanup(opaque);
+    return 0;
+}
+
+static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret = 0;
+    uint64_t data, data_size;
+
+    data = qemu_get_be64(f);
+    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
+        trace_vfio_load_state(vbasedev->name, data);
+
+        switch (data) {
+        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
+        {
+            ret = vfio_load_device_config_state(f, opaque);
+            if (ret) {
+                return ret;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
+        {
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+            break;
+        }
+        case VFIO_MIG_FLAG_DEV_DATA_STATE:
+        {
+            VFIORegion *region = &migration->region;
+            uint64_t data_offset = 0, size;
+
+            data_size = size = qemu_get_be64(f);
+            if (data_size == 0) {
+                break;
+            }
+
+            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
+                        region->fd_offset +
+                        offsetof(struct vfio_device_migration_info,
+                        data_offset));
+            if (ret != sizeof(data_offset)) {
+                error_report("%s:Failed to get migration buffer data offset %d",
+                             vbasedev->name, ret);
+                return -EINVAL;
+            }
+
+            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
+                                              data_size);
+
+            while (size) {
+                void *buf = NULL;
+                uint64_t sec_size;
+                bool buffer_mmaped;
+
+                buf = get_data_section_size(region, data_offset, size,
+                                            &sec_size);
+
+                buffer_mmaped = (buf != NULL);
+
+                if (!buffer_mmaped) {
+                    buf = g_try_malloc(sec_size);
+                    if (!buf) {
+                        error_report("%s: Error allocating buffer ", __func__);
+                        return -ENOMEM;
+                    }
+                }
+
+                qemu_get_buffer(f, buf, sec_size);
+
+                if (!buffer_mmaped) {
+                    ret = pwrite(vbasedev->fd, buf, sec_size,
+                                 region->fd_offset + data_offset);
+                    g_free(buf);
+
+                    if (ret != sec_size) {
+                        error_report("%s: Failed to set migration buffer %d",
+                                vbasedev->name, ret);
+                        return -EINVAL;
+                    }
+                }
+                size -= sec_size;
+                data_offset += sec_size;
+            }
+
+            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
+                         region->fd_offset +
+                       offsetof(struct vfio_device_migration_info, data_size));
+            if (ret != sizeof(data_size)) {
+                error_report("%s: Failed to set migration buffer data size %d",
+                             vbasedev->name, ret);
+                return -EINVAL;
+            }
+            break;
+        }
+
+        default:
+            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
+            return -EINVAL;
+        }
+
+        data = qemu_get_be64(f);
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return ret;
+}
+
 static SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .load_setup = vfio_load_setup,
+    .load_cleanup = vfio_load_cleanup,
+    .load_state = vfio_load_state,
 };
 
 /* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9a1c5e17d97f..4a4bd3ba9a2a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
 vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_complete_precopy(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (8 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 09/17] vfio: Add load " Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container Kirti Wankhede
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask
was not set correctly and the memory listener's log_sync callback never
got called.
This patch returns a log_mask with DIRTY_MEMORY_MIGRATION set when the
IOMMU is enabled.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/memory.c b/memory.c
index 2f15a4b250c8..95a47d2d9533 100644
--- a/memory.c
+++ b/memory.c
@@ -1788,7 +1788,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
-    if (global_dirty_log && mr->ram_block) {
+    if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
         mask |= (1 << DIRTY_MEMORY_MIGRATION);
     }
     return mask;
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (9 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-24  8:43   ` Cornelia Huck
  2020-06-24 18:55   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, Eric Auger, changpeng.liu, eskultet, Shameer Kolothum,
	Ken.Xue, jonathan.davies, pbonzini

Add helper functions to retrieve the IOMMU info capability chain, and a
function to extract the migration capability information from that chain
for the IOMMU container.

Similar change was proposed earlier:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
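
For reference, each capability in the chain is prefixed by the generic
header below (from linux/vfio.h); `next` is the offset of the next
header from the start of the info buffer, with 0 terminating the chain:

    struct vfio_info_cap_header {
        __u16 id;       /* e.g. VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION */
        __u16 version;  /* capability-specific version */
        __u32 next;     /* offset of next capability, 0 if last */
    };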

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c              | 91 +++++++++++++++++++++++++++++++++++++++----
 include/hw/vfio/vfio-common.h |  3 ++
 2 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 90e9a854d82c..e0d3d4585a65 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1229,6 +1229,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     return 0;
 }
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+                               struct vfio_iommu_type1_info **info)
+{
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if ((*info)->argsz > argsz) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto again;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+                                         struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+    if (!hdr) {
+        return;
+    }
+
+    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+                            header);
+
+    container->dirty_pages_supported = true;
+    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+    container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty.
+     */
+    assert(container->dirty_pgsizes & TARGET_PAGE_SIZE);
+}
+
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
@@ -1293,6 +1362,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->space = space;
     container->fd = fd;
     container->error = NULL;
+    container->dirty_pages_supported = false;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
 
@@ -1305,7 +1375,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-        struct vfio_iommu_type1_info info;
+        struct vfio_iommu_type1_info *info;
 
         /*
          * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
@@ -1314,15 +1384,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
          * existing Type1 IOMMUs generally support any IOVA we're
          * going to actually try in practice.
          */
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
-        /* Ignore errors */
-        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
+        ret = vfio_get_iommu_info(container, &info);
+        if (ret) {
+            goto free_container_exit;
+        }
+
+        if (!(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
             /* Assume 4k IOVA page size */
-            info.iova_pgsizes = 4096;
+            info->iova_pgsizes = 4096;
         }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
-        container->pgsizes = info.iova_pgsizes;
+        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+        container->pgsizes = info->iova_pgsizes;
+
+        vfio_get_iommu_info_migration(container, info);
+        g_free(info);
         break;
     }
     case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78033e4149d..5a57a78ec517 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -79,6 +79,9 @@ typedef struct VFIOContainer {
     unsigned iommu_type;
     Error *error;
     bool initialized;
+    bool dirty_pages_supported;
+    uint64_t dirty_pgsizes;
+    uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (10 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-23 10:32   ` Cornelia Huck
  2020-06-24 18:55   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
for VFIO devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index faacea5327cb..e0fbb3a01855 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -11,6 +11,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/cutils.h"
 #include <linux/vfio.h>
+#include <sys/ioctl.h>
 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
@@ -329,6 +330,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }
 
+static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
+{
+    int ret;
+    VFIOContainer *container = vbasedev->group->container;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
+            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+        } else {
+            return -EINVAL;
+        }
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+    return ret;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -360,6 +387,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return ret;
     }
 
+    ret = vfio_start_dirty_page_tracking(vbasedev, true);
+    if (ret) {
+        return ret;
+    }
+
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     ret = qemu_file_get_error(f);
@@ -375,6 +407,8 @@ static void vfio_save_cleanup(void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
+    vfio_start_dirty_page_tracking(vbasedev, false);
+
     if (migration->region.mmaps) {
         vfio_region_unmap(&migration->region);
     }
@@ -706,6 +740,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
         if (ret) {
             error_report("%s: Failed to set state RUNNING", vbasedev->name);
         }
+
+        vfio_start_dirty_page_tracking(vbasedev, false);
     }
 }
 
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (11 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-24 18:55   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Create a list of mapped IOVA ranges when vIOMMU is enabled, saving the
translated address for each mapped IOVA. A node is added to the list on
MAP and removed from the list on UNMAP.
This list is used to track dirty pages during migration.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
 include/hw/vfio/vfio-common.h |  8 ++++++
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e0d3d4585a65..6921a78e9ba5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -408,8 +408,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
 }
 
 /* Called with rcu_read_lock held.  */
-static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
-                           bool *read_only)
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+                               ram_addr_t *ram_addr, bool *read_only)
 {
     MemoryRegion *mr;
     hwaddr xlat;
@@ -440,8 +440,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
         return false;
     }
 
-    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    *read_only = !writable || mr->readonly;
+    if (vaddr) {
+        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    }
+
+    if (ram_addr) {
+        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
+    }
+
+    if (read_only) {
+        *read_only = !writable || mr->readonly;
+    }
 
     return true;
 }
@@ -451,7 +460,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainer *container = giommu->container;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
-    bool read_only;
     void *vaddr;
     int ret;
 
@@ -467,7 +475,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
 
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        ram_addr_t ram_addr;
+        bool read_only;
+
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
             goto out;
         }
         /*
@@ -485,8 +496,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
                          container, iova,
                          iotlb->addr_mask + 1, vaddr, ret);
+        } else {
+            VFIOIovaRange *iova_range;
+
+            iova_range = g_malloc0(sizeof(*iova_range));
+            iova_range->iova = iova;
+            iova_range->size = iotlb->addr_mask + 1;
+            iova_range->ram_addr = ram_addr;
+
+            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
         }
     } else {
+        VFIOIovaRange *iova_range, *tmp;
+
+        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
+            if (iova_range->iova >= iova &&
+                iova_range->iova + iova_range->size <= iova +
+                                                       iotlb->addr_mask + 1) {
+                QLIST_REMOVE(iova_range, next);
+                g_free(iova_range);
+            }
+        }
+
         ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
@@ -643,6 +674,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
             g_free(giommu);
             goto fail;
         }
+        QLIST_INIT(&giommu->iova_list);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
         memory_region_iommu_replay(giommu->iommu, &giommu->n);
 
@@ -741,6 +773,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
             if (MEMORY_REGION(giommu->iommu) == section->mr &&
                 giommu->n.start == section->offset_within_region) {
+                VFIOIovaRange *iova_range, *tmp;
+
+                QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
+                    QLIST_REMOVE(iova_range, next);
+                    g_free(iova_range);
+                }
+
                 memory_region_unregister_iommu_notifier(section->mr,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
@@ -1538,6 +1577,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
         QLIST_REMOVE(container, next);
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+            VFIOIovaRange *iova_range, *itmp;
+
+            QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, itmp) {
+                QLIST_REMOVE(iova_range, next);
+                g_free(iova_range);
+            }
+
             memory_region_unregister_iommu_notifier(
                     MEMORY_REGION(giommu->iommu), &giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5a57a78ec517..56b75e4a8bc4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -89,11 +89,19 @@ typedef struct VFIOContainer {
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
+typedef struct VFIOIovaRange {
+    hwaddr iova;
+    size_t size;
+    ram_addr_t ram_addr;
+    QLIST_ENTRY(VFIOIovaRange) next;
+} VFIOIovaRange;
+
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
     IOMMUMemoryRegion *iommu;
     hwaddr iommu_offset;
     IOMMUNotifier n;
+    QLIST_HEAD(, VFIOIovaRange) iova_list;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (12 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-24 18:55   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

vfio_listener_log_sync gets the dirty page bitmap from the container
using the VFIO_IOMMU_DIRTY_PAGES ioctl (with the
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag) and marks those pages dirty
when all devices are stopped and saving state.
Return early for RAM block sections of mapped MMIO regions.
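
As a worked example of the bitmap sizing in vfio_get_dirty_bitmap()
below (assuming a 4 KiB TARGET_PAGE_SIZE), a 1 GiB section works out to:

    pages       = 0x40000000 >> 12             = 262144
    bitmap.size = ROUND_UP(262144, 64) / 8     = 32768 bytes (32 KiB)

i.e. one bit per target page, rounded up to a whole number of u64 words.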

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c     | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |   1 +
 2 files changed, 131 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6921a78e9ba5..0518cf228ed5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -29,6 +29,7 @@
 #include "hw/vfio/vfio.h"
 #include "exec/address-spaces.h"
 #include "exec/memory.h"
+#include "exec/ram_addr.h"
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -38,6 +39,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/migration.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
 };
 
 /*
+ * Device state interfaces
+ */
+
+static bool vfio_devices_are_stopped_and_saving(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+/*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
@@ -852,9 +876,115 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr)
+{
+    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+    struct vfio_iommu_type1_dirty_bitmap_get *range;
+    uint64_t pages;
+    int ret;
+
+    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+    range->iova = iova;
+    range->size = size;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+    range->bitmap.pgsize = TARGET_PAGE_SIZE;
+
+    pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
+    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    range->bitmap.data = g_try_malloc0(range->bitmap.size);
+    if (!range->bitmap.data) {
+        ret = -ENOMEM;
+        goto err_out;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (ret) {
+        error_report("Failed to get dirty bitmap for iova: 0x%llx "
+                "size: 0x%llx err: %d",
+                range->iova, range->size, errno);
+        goto err_out;
+    }
+
+    cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
+                                            ram_addr, pages);
+
+    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+                                range->bitmap.size, ram_addr);
+err_out:
+    g_free(range->bitmap.data);
+    g_free(dbitmap);
+
+    return ret;
+}
+
+static int vfio_sync_dirty_bitmap(MemoryListener *listener,
+                                 MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOGuestIOMMU *giommu = NULL;
+    ram_addr_t ram_addr;
+    uint64_t iova, size;
+    int ret = 0;
+
+    if (memory_region_is_iommu(section->mr)) {
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (MEMORY_REGION(giommu->iommu) == section->mr &&
+                giommu->n.start == section->offset_within_region) {
+                VFIOIovaRange *iova_range;
+
+                QLIST_FOREACH(iova_range, &giommu->iova_list, next) {
+                    ret = vfio_get_dirty_bitmap(container, iova_range->iova,
+                                        iova_range->size, iova_range->ram_addr);
+                    if (ret) {
+                        break;
+                    }
+                }
+                break;
+            }
+        }
+    } else {
+        iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+        size = int128_get64(section->size);
+
+        ram_addr = memory_region_get_ram_addr(section->mr) +
+                   section->offset_within_region + iova -
+                   TARGET_PAGE_ALIGN(section->offset_within_address_space);
+
+        ret = vfio_get_dirty_bitmap(container, iova, size, ram_addr);
+    }
+
+    return ret;
+}
+
+static void vfio_listener_log_sync(MemoryListener *listener,
+                                   MemoryRegionSection *section)
+{
+    if (vfio_listener_skipped_section(section)) {
+        return;
+    }
+
+    if (vfio_devices_are_stopped_and_saving()) {
+        vfio_sync_dirty_bitmap(listener, section);
+    }
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
+    .log_sync = vfio_listener_log_sync,
 };
 
 static void vfio_listener_release(VFIOContainer *container)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 4a4bd3ba9a2a..c61ae4f3ead8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -160,3 +160,4 @@ vfio_save_complete_precopy(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (13 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-23  8:25   ` Cornelia Huck
  2020-06-24 18:56   ` Alex Williamson
  2020-06-20 20:21 ` [PATCH QEMU v25 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
  2020-06-20 20:21 ` [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  16 siblings, 2 replies; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

With vIOMMU, an IO virtual address range can get unmapped while in the
pre-copy phase of migration. In that case, the unmap ioctl should return
the pages pinned in that range, and QEMU should find their corresponding
guest physical addresses and report those dirty.
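
For reference, the request layout used below comes from the kernel UAPI
proposed alongside this series (linux/vfio.h): when the
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP flag is set, a struct vfio_bitmap
is carried in the unmap request's flexible data[] member:

    struct vfio_bitmap {
        __u64 pgsize;        /* page size (bytes) each bit represents */
        __u64 size;          /* bitmap size in bytes */
        __u64 __user *data;  /* user pointer to the bitmap */
    };

    struct vfio_iommu_type1_dma_unmap {
        __u32 argsz;
        __u32 flags;         /* VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP */
        __u64 iova;          /* IOVA of the range to unmap */
        __u64 size;          /* size of the range in bytes */
        __u8  data[];        /* struct vfio_bitmap when flag is set */
    };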

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 hw/vfio/common.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 81 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0518cf228ed5..a06b8f2f66e2 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -311,11 +311,83 @@ static bool vfio_devices_are_stopped_and_saving(void)
     return true;
 }
 
+static bool vfio_devices_are_running_and_saving(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
+    int ret;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
+     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
+     * TARGET_PAGE_SIZE.
+     */
+
+    bitmap->pgsize = TARGET_PAGE_SIZE;
+    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                   BITS_PER_BYTE;
+
+    if (bitmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    bitmap->data = g_try_malloc0(bitmap->size);
+    if (!bitmap->data) {
+        ret = -ENOMEM;
+        goto unmap_exit;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
+                iotlb->translated_addr, pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+    }
+
+    g_free(bitmap->data);
+unmap_exit:
+    g_free(unmap);
+    return ret;
+}
+
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
 static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -324,6 +396,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    if (iotlb && container->dirty_pages_supported &&
+        vfio_devices_are_running_and_saving()) {
+        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -371,7 +448,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -542,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
             }
         }
 
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -853,7 +930,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 
     if (try_unmap) {
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 16/17] vfio: Make vfio-pci device migration capable
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (14 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-22 16:51   ` Cornelia Huck
  2020-06-20 20:21 ` [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

If the device is not a failover primary device, call vfio_migration_probe()
and vfio_migration_finalize() for the vfio-pci device to enable migration
for vfio PCI devices that support it.
Remove the vfio_pci_vmstate structure.
Remove the migration blocker from the VFIO PCI device specific structure
and use the migration blocker from the generic VFIO device structure.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/vfio/pci.c | 32 +++++++++++---------------------
 hw/vfio/pci.h |  1 -
 2 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5ba340aee1d4..9dc2868993fb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2841,22 +2841,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         return;
     }
 
-    if (!pdev->failover_pair_id) {
-        error_setg(&vdev->migration_blocker,
-                "VFIO device doesn't support migration");
-        ret = migrate_add_blocker(vdev->migration_blocker, &err);
-        if (ret) {
-            error_propagate(errp, err);
-            error_free(vdev->migration_blocker);
-            vdev->migration_blocker = NULL;
-            return;
-        }
-    }
-
     vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
     vdev->vbasedev.ops = &vfio_pci_ops;
     vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
     vdev->vbasedev.dev = DEVICE(vdev);
+    vdev->vbasedev.device_state = 0;
 
     tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
@@ -3120,6 +3109,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (!pdev->failover_pair_id) {
+        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        if (ret) {
+            error_report("%s: Failed to setup for migration",
+                         vdev->vbasedev.name);
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3134,11 +3131,6 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-        vdev->migration_blocker = NULL;
-    }
 }
 
 static void vfio_instance_finalize(Object *obj)
@@ -3150,10 +3142,7 @@ static void vfio_instance_finalize(Object *obj)
     vfio_bars_finalize(vdev);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
-    if (vdev->migration_blocker) {
-        migrate_del_blocker(vdev->migration_blocker);
-        error_free(vdev->migration_blocker);
-    }
+
     /*
      * XXX Leaking igd_opregion is not an oversight, we can't remove the
      * fw_cfg entry therefore leaking this allocation seems like the safest
@@ -3181,6 +3170,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     }
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+    vfio_migration_finalize(&vdev->vbasedev);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 0da7a20a7ec2..b148c937ef72 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -168,7 +168,6 @@ typedef struct VFIOPCIDevice {
     bool no_vfio_ioeventfd;
     bool enable_ramfb;
     VFIODisplay *dpy;
-    Error *migration_blocker;
     Notifier irqchip_change_notifier;
 } VFIOPCIDevice;
 
-- 
2.7.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
                   ` (15 preceding siblings ...)
  2020-06-20 20:21 ` [PATCH QEMU v25 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-06-20 20:21 ` Kirti Wankhede
  2020-06-23  7:21   ` Markus Armbruster
  16 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-20 20:21 UTC (permalink / raw)
  To: alex.williamson, cjia
  Cc: cohuck, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

Add the number of bytes transferred to the target VM by all VFIO devices
to the migration statistics.
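
For illustration (values invented), the new field shows up in
query-migrate output as a fragment like:

    { "return": { "status": "completed", ...,
                  "vfio": { "transferred": 4194304 } } }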

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/common.c            | 20 ++++++++++++++++++++
 hw/vfio/migration.c         | 11 ++++++++++-
 include/qemu/vfio-helpers.h |  3 +++
 migration/migration.c       | 14 ++++++++++++++
 monitor/hmp-cmds.c          |  6 ++++++
 qapi/migration.json         | 17 +++++++++++++++++
 6 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a06b8f2f66e2..ff0a4072107f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
+#include "qemu/vfio-helpers.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -293,6 +294,25 @@ const MemoryRegionOps vfio_region_ops = {
  * Device state interfaces
  */
 
+bool vfio_mig_active(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->migration_blocker) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 static bool vfio_devices_are_stopped_and_saving(void)
 {
     VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e0fbb3a01855..09eec9cafdd5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -27,7 +27,7 @@
 #include "exec/ram_addr.h"
 #include "pci.h"
 #include "trace.h"
-
+#include "qemu/vfio-helpers.h"
 /*
  * Flags used as delimiter:
  * 0xffffffff => MSB 32-bit all 1s
@@ -39,6 +39,8 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
 
+static int64_t bytes_transferred;
+
 static void vfio_migration_region_exit(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -261,6 +263,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
         return ret;
     }
 
+    bytes_transferred += data_size;
     return data_size;
 }
 
@@ -742,6 +745,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
         }
 
         vfio_start_dirty_page_tracking(vbasedev, false);
+        bytes_transferred = 0;
     }
 }
 
@@ -789,6 +793,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
 
 /* ---------------------------------------------------------------------- */
 
+int64_t vfio_mig_bytes_transferred(void)
+{
+    return bytes_transferred;
+}
+
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
     struct vfio_region_info *info;
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index 1f057c2b9e40..26a7df0767b1 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
 int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
                            int irq_type, Error **errp);
 
+bool vfio_mig_active(void);
+int64_t vfio_mig_bytes_transferred(void);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 481a590f7222..8b010a002c3d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -54,6 +54,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "qemu/vfio-helpers.h"
 
 #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
 
@@ -970,6 +971,17 @@ static void populate_disk_info(MigrationInfo *info)
     }
 }
 
+static void populate_vfio_info(MigrationInfo *info)
+{
+#ifdef CONFIG_LINUX
+    if (vfio_mig_active()) {
+        info->has_vfio = true;
+        info->vfio = g_malloc0(sizeof(*info->vfio));
+        info->vfio->transferred = vfio_mig_bytes_transferred();
+    }
+#endif
+}
+
 static void fill_source_migration_info(MigrationInfo *info)
 {
     MigrationState *s = migrate_get_current();
@@ -995,6 +1007,7 @@ static void fill_source_migration_info(MigrationInfo *info)
         populate_time_info(info, s);
         populate_ram_info(info, s);
         populate_disk_info(info);
+        populate_vfio_info(info);
         break;
     case MIGRATION_STATUS_COLO:
         info->has_status = true;
@@ -1003,6 +1016,7 @@ static void fill_source_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_COMPLETED:
         populate_time_info(info, s);
         populate_ram_info(info, s);
+        populate_vfio_info(info);
         break;
     case MIGRATION_STATUS_FAILED:
         info->has_status = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 2b0b58a336e6..29bb4d2d949c 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
         }
         monitor_printf(mon, "]\n");
     }
+
+    if (info->has_vfio) {
+        monitor_printf(mon, "vfio device transferred: %" PRIu64 " kbytes\n",
+                       info->vfio->transferred >> 10);
+    }
+
     qapi_free_MigrationInfo(info);
 }
 
diff --git a/qapi/migration.json b/qapi/migration.json
index d5000558c6c9..952864b05455 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -146,6 +146,18 @@
             'active', 'postcopy-active', 'postcopy-paused',
             'postcopy-recover', 'completed', 'failed', 'colo',
             'pre-switchover', 'device', 'wait-unplug' ] }
+##
+# @VfioStats:
+#
+# Detailed VFIO devices migration statistics
+#
+# @transferred: amount of bytes transferred to the target VM by VFIO devices
+#
+# Since: 5.1
+#
+##
+{ 'struct': 'VfioStats',
+  'data': {'transferred': 'int' } }
 
 ##
 # @MigrationInfo:
@@ -207,11 +219,16 @@
 #
 # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
 #
+# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
+#        only returned if VFIO device is present, migration is supported by all
+#         VFIO devices and status is 'active' or 'completed' (since 5.1)
+#
 # Since: 0.14.0
 ##
 { 'struct': 'MigrationInfo',
   'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
            '*disk': 'MigrationStats',
+           '*vfio': 'VfioStats',
            '*xbzrle-cache': 'XBZRLECacheStats',
            '*total-time': 'int',
            '*expected-downtime': 'int',
-- 
2.7.0
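
For reference, with this series applied the new counter surfaces in QMP's
query-migrate roughly as follows (the values and the set of sibling fields
are illustrative only; other MigrationInfo members are elided):

    -> { "execute": "query-migrate" }
    <- { "return": {
             "status": "completed",
             "vfio": { "transferred": 4193280 }
       } }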




* Re: [PATCH QEMU v25 16/17] vfio: Make vfio-pci device migration capable
  2020-06-20 20:21 ` [PATCH QEMU v25 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
@ 2020-06-22 16:51   ` Cornelia Huck
  0 siblings, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-22 16:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:25 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> If device is not failover primary device call vfio_migration_probe()
> and vfio_migration_finalize() functions for vfio-pci device to enable
> migration for vfio PCI device which support migration.

"If the device is not a failover primary device, call
vfio_migration_probe() and vfio_migration_finalize() to enable
migration support for those devices that support it respectively to
tear it down again."

?

> Removed the vfio_pci_vmstate structure.
> Removed the migration blocker from the VFIO PCI device-specific structure
> and used the migration blocker from the generic VFIO device structure
> instead.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/pci.c | 32 +++++++++++---------------------
>  hw/vfio/pci.h |  1 -
>  2 files changed, 11 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5ba340aee1d4..9dc2868993fb 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2841,22 +2841,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          return;
>      }
>  
> -    if (!pdev->failover_pair_id) {
> -        error_setg(&vdev->migration_blocker,
> -                "VFIO device doesn't support migration");
> -        ret = migrate_add_blocker(vdev->migration_blocker, &err);
> -        if (ret) {
> -            error_propagate(errp, err);
> -            error_free(vdev->migration_blocker);
> -            vdev->migration_blocker = NULL;
> -            return;
> -        }
> -    }
> -
>      vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
>      vdev->vbasedev.ops = &vfio_pci_ops;
>      vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
>      vdev->vbasedev.dev = DEVICE(vdev);
> +    vdev->vbasedev.device_state = 0;
>  
>      tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
>      len = readlink(tmp, group_path, sizeof(group_path));
> @@ -3120,6 +3109,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> +    if (!pdev->failover_pair_id) {
> +        ret = vfio_migration_probe(&vdev->vbasedev, errp);
> +        if (ret) {
> +            error_report("%s: Failed to setup for migration",

"%s: migration not enabled" ?

(Although I wonder how often we need to moan here, given that the
called function already prints error reports.)

> +                         vdev->vbasedev.name);
> +        }
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);

LGTM.




* Re: [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-20 20:21 ` [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
@ 2020-06-22 20:28   ` Alex Williamson
  2020-06-24 14:29     ` Kirti Wankhede
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-22 20:28 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> These functions save and restore PCI device-specific data, i.e. the
> config space of the PCI device.
> Save and restore were tested with the MSI and MSI-X interrupt types.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  2 +
>  2 files changed, 97 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 27f8872db2b1..5ba340aee1d4 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>      return OBJECT(vdev);
>  }
>  
> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> +    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
> +    pci_device_save(pdev, f);
> +
> +    qemu_put_be32(f, vdev->interrupt);
> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> +        msix_save(pdev, f);

msix_save() checks msix_present() so shouldn't we include this
unconditionally?  Can't there also be state in the vector table
regardless of whether we're currently running in MSI-X mode?
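
A minimal sketch of the unconditional variant suggested here (it relies on
msix_save() internally bailing out via msix_present(), as noted above):

    qemu_put_be32(f, vdev->interrupt);
    /* msix_save() is a no-op when the device has no MSI-X capability,
     * so the explicit interrupt-mode guard can be dropped */
    msix_save(pdev, f);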

> +    }
> +}
> +
> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t interrupt_type;
> +    uint16_t pci_cmd;
> +    int i, ret;
> +
> +    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> +    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);

This doesn't seem safe, why is it ok to indiscriminately copy these
arrays that are configured via support or masking of various device
features from the source to the target?

I think this still fails basic feature support negotiation.  For
instance, Intel IGD assignment modifies emulated_config_bits and wmask
to allow the VM BIOS to allocate fake stolen memory for the GPU and
store this value in config space.  This support can be controlled via a
QEMU build-time option, therefore the feature support on the target can
be different from the source.  If this sort of feature set doesn't
match between source and target, I think we'd want to abort the
migration, but we don't have any provisions for that here (a physical
IGD device is obviously just an example as it doesn't support migration
currently).
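
One possible direction, sketched only to make the concern concrete (the
check and its error handling are hypothetical, not part of the patch):
compare the incoming arrays against what this QEMU build computes for the
device and fail the load on mismatch, instead of overwriting local state.

    /* hypothetical check in vfio_pci_load_config() */
    g_autofree uint8_t *cbits = g_malloc(vdev->config_size);

    qemu_get_buffer(f, cbits, vdev->config_size);
    if (memcmp(cbits, vdev->emulated_config_bits, vdev->config_size)) {
        error_report("%s: emulated config bits differ from source",
                     vdev->vbasedev.name);
        return -EINVAL;
    }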

> +
> +    ret = pci_device_load(pdev, f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* restore pci bar configuration */
> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);

s/!/~/?  Extra parenthesis too

> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        uint32_t bar = pci_default_read_config(pdev,
> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> +
> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> +    }
> +
> +    interrupt_type = qemu_get_be32(f);
> +
> +    if (interrupt_type == VFIO_INT_MSI) {
> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> +        bool msi_64bit;
> +
> +        /* restore msi configuration */
> +        msi_flags = pci_default_read_config(pdev,
> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> +

What if I migrate from a device with MSI support to a device without
MSI support, or to a device with MSI support at a different offset, who
is responsible for triggering a migration fault?
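
A sketch of one possible guard, purely to make the question concrete (the
exact failure policy is the open point here):

    if (interrupt_type == VFIO_INT_MSI && !pdev->msi_cap) {
        /* stream says MSI, but the target device has no MSI capability */
        error_report("%s: MSI state in stream but no MSI capability",
                     vdev->vbasedev.name);
        return -EINVAL;
    }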


> +        msi_addr_lo = pci_default_read_config(pdev,
> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> +                              msi_addr_lo, 4);
> +
> +        if (msi_64bit) {
> +            msi_addr_hi = pci_default_read_config(pdev,
> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> +                                  msi_addr_hi, 4);
> +        }
> +
> +        msi_data = pci_default_read_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                2);
> +
> +        vfio_pci_write_config(pdev,
> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> +                msi_data, 2);
> +
> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> +        uint16_t offset;
> +
> +        offset = pci_default_read_config(pdev,
> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> +        /* load enable bit and maskall bit */
> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> +                              offset, 2);
> +        msix_load(pdev, f);

Isn't this ordering backwards, or at least less efficient?  The config
write will cause us to enable MSI-X; presumably we'd have nothing in
the vector table though.  Then msix_load() will write the vector
and pba tables and trigger a use notifier for each vector.  It seems
like that would trigger a bunch of SET_IRQS ioctls as if the guest
wrote individual unmasked vectors to the vector table, whereas if we
set up the vector table and then enable MSI-X, we do it with one ioctl.
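
Roughly, the reordering being suggested (sketch only; the offset read and
the rest of the branch stay as in the patch):

    /* restore the vector table first ... */
    msix_load(pdev, f);
    /* ... then the enable/maskall bits, so that enabling MSI-X can
     * program all unmasked vectors with a single SET_IRQS ioctl */
    vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
                          offset, 2);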

Also same question as above, I'm not sure who is responsible for making
sure both devices support MSI-X and that the capability exists at the
same place on each.  Repeat for essentially every capability.  Are we
leaning on the migration regions to fail these migrations before we get
here?  If so, should we be?

Also, besides BARs, the command register, and MSI & MSI-X, there must
be other places where the guest can write config data through to the
device.  pci_device_{save,load}() only sets QEMU's config space.

A couple more theoretical (probably not too distant) examples related
to that: there's a resizable BAR capability that at some point we'll
probably need to allow the guest to interact with (ie. manipulation of
capability changes the reported region size for a BAR).  How would we
support that with this save/load scheme?  We'll likely also have SR-IOV
PFs assigned where we'll perhaps have support for emulating the SR-IOV
capability to call out to a privileged userspace helper to enable VFs,
how does this get extended to support that type of emulation?

I'm afraid that making carbon copies of emulated_config_bits, wmask,
and invoking pci_device_save/load() doesn't address my concerns that
saving and restoring config space between source and target really
seems like a much more important task than outlined here.  Thanks,

Alex

> +    }
> +    vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> +    return 0;
> +}
> +
>  static VFIODeviceOps vfio_pci_ops = {
>      .vfio_compute_needs_reset = vfio_pci_compute_needs_reset,
>      .vfio_hot_reset_multi = vfio_pci_hot_reset_multi,
>      .vfio_eoi = vfio_intx_eoi,
>      .vfio_get_object = vfio_pci_get_object,
> +    .vfio_save_config = vfio_pci_save_config,
> +    .vfio_load_config = vfio_pci_load_config,
>  };
>  
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 74261feaeac9..d69a7f3ae31e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -120,6 +120,8 @@ struct VFIODeviceOps {
>      int (*vfio_hot_reset_multi)(VFIODevice *vdev);
>      void (*vfio_eoi)(VFIODevice *vdev);
>      Object *(*vfio_get_object)(VFIODevice *vdev);
> +    void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
> +    int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
>  };
>  
>  typedef struct VFIOGroup {




* Re: [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM
  2020-06-20 20:21 ` [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
@ 2020-06-22 22:50   ` Alex Williamson
  2020-06-23 18:55     ` Kirti Wankhede
  2020-06-23  8:07   ` Cornelia Huck
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-22 22:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:14 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> The VM state change handler gets called on a change in the VM's state. This
> is used to set the VFIO device state to _RUNNING.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  2 +
>  include/hw/vfio/vfio-common.h |  4 ++
>  3 files changed, 93 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 48ac385d80a7..fcecc0bb0874 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -10,6 +10,7 @@
>  #include "qemu/osdep.h"
>  #include <linux/vfio.h>
>  
> +#include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> @@ -74,6 +75,85 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> +                                    uint32_t value)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint32_t device_state;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to read device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    device_state = (device_state & mask) | value;
> +
> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to set device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +
> +        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                device_state));
> +        if (ret < 0) {
> +            error_report("%s: On failure, failed to read device state %d %s",
> +                    vbasedev->name, ret, strerror(errno));
> +            return ret;
> +        }
> +
> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> +            error_report("%s: Device is in error state 0x%x",
> +                         vbasedev->name, device_state);
> +            return -EFAULT;
> +        }
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t value = 0, mask = 0;
> +
> +        if (running) {
> +            value = VFIO_DEVICE_STATE_RUNNING;
> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> +            }
> +        } else {
> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> +        if (ret) {
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, value & mask);


Is there nothing more we should do here?  It seems like in either the
case of an outbound migration where we can't stop the device or an
inbound migration where we can't start the device, we'd want this to
trigger an abort of the migration.  Should there at least be a TODO
comment if the reason is that QEMU migration doesn't yet support failure
here?  Thanks,

Alex
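
As a sketch of the stopgap being discussed (whether this should be fatal
is exactly the open question; hw_error() is illustrative here, not a
settled choice):

    ret = vfio_migration_set_state(vbasedev, mask, value);
    if (ret) {
        /*
         * TODO: the migration core cannot yet be failed from a VM state
         * change handler; until it can, treat this as fatal rather than
         * continuing with a device in an unknown state.
         */
        hw_error("vfio %s: failed to set device state 0x%x",
                 vbasedev->name, value & mask);
    }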

> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  value & mask);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -87,8 +167,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                       vbasedev->name);
>          g_free(vbasedev->migration);
>          vbasedev->migration = NULL;
> +        return ret;
>      }
>  
> +    vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> +                                                          vbasedev);
>      return ret;
>  }
>  
> @@ -131,6 +214,10 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +    if (vbasedev->vm_state) {
> +        qemu_del_vm_change_state_handler(vbasedev->vm_state);
> +    }
> +
>      if (vbasedev->migration_blocker) {
>          migrate_del_blocker(vbasedev->migration_blocker);
>          error_free(vbasedev->migration_blocker);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fd034ac53684..14b0a86c0035 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -146,3 +146,5 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> +vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index d4b268641173..3d18eb146b33 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -29,6 +29,7 @@
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
> +#include "sysemu/sysemu.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -119,6 +120,9 @@ typedef struct VFIODevice {
>      unsigned int flags;
>      VFIOMigration *migration;
>      Error *migration_blocker;
> +    VMChangeStateEntry *vm_state;
> +    uint32_t device_state;
> +    int vm_running;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {




* Re: [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-06-20 20:21 ` [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
@ 2020-06-22 22:50   ` Alex Williamson
  2020-06-23 19:21     ` Kirti Wankhede
  2020-06-26 14:31   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-22 22:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Define flags to be used as delimiters in the migration file stream.
> Added .save_setup and .save_cleanup functions. The migration region is
> mapped and unmapped in these functions at the source during the saving or
> pre-copy phase.
> Set the VFIO device state depending on the VM's state. During live
> migration, the VM is running when .save_setup is called, so the
> _SAVING | _RUNNING state is set for the VFIO device. During save-restore,
> the VM is paused, so the _SAVING state is set for the VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 94 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e30bd8768701..133bb5b1b3b2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -24,6 +27,17 @@
>  #include "pci.h"
>  #include "trace.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -126,6 +140,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            return ret;

OTOH to my previous comments, this shouldn't be fatal, right?  mmaps
are optional anyway so it should be sufficient to push an error report
to explain why this might be slower than normal, but we can still
proceed.
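
A sketch of the non-fatal variant, where the mmap failure only costs
performance (the fallback message is illustrative):

    if (migration->region.mmaps) {
        qemu_mutex_lock_iothread();
        ret = vfio_region_mmap(&migration->region);
        qemu_mutex_unlock_iothread();
        if (ret) {
            /* not fatal: fall back to pread()/pwrite() on the region */
            error_report("%s: Failed to mmap VFIO migration region %d: %s",
                         vbasedev->name, migration->region.nr,
                         strerror(-ret));
            error_report("%s: Falling back to slow path", vbasedev->name);
        }
    }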

> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }

We seem to be lacking support in the callers for detecting if the
device is in an error state.  I'm not sure what our options are
though, maybe only a hw_error().

> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -180,6 +253,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
>      int ret;
> +    char id[256] = "";
>  
>      vbasedev->migration = g_new0(VFIOMigration, 1);
>  
> @@ -192,6 +266,24 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    if (vbasedev->ops->vfio_get_object) {

Nit, vfio_migration_region_init() would have failed already if this were
not available.  Perhaps do the test once at the start of this function
instead?  Thanks,

Alex
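
A sketch of that reshuffle (the rest of the function is unchanged from the
patch):

    static int vfio_migration_init(VFIODevice *vbasedev,
                                   struct vfio_region_info *info)
    {
        Object *obj;

        if (!vbasedev->ops->vfio_get_object) {
            return -EINVAL;
        }

        obj = vbasedev->ops->vfio_get_object(vbasedev);
        /* ... proceed with vfio_migration_region_init() and the
         * register_savevm_live() id construction, minus the repeated
         * vfio_get_object checks ... */
    }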

> +        Object *obj = vbasedev->ops->vfio_get_object(vbasedev);
> +
> +        if (obj) {
> +            DeviceState *dev = DEVICE(obj);
> +            char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> +
> +            if (oid) {
> +                pstrcpy(id, sizeof(id), oid);
> +                pstrcat(id, sizeof(id), "/");
> +                g_free(oid);
> +            }
> +        }
> +    }
> +    pstrcat(id, sizeof(id), "vfio");
> +
> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> +                         vbasedev);
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bd3d47b005cb..86c18def016e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> +vfio_save_setup(const char *name) " (%s)"
> +vfio_save_cleanup(const char *name) " (%s)"




* Re: [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-06-20 20:21 ` [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
@ 2020-06-22 22:50   ` Alex Williamson
  2020-06-23 20:34     ` Kirti Wankhede
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-22 22:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:17 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - indicates kernel driver to write data to staging
>    buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>    migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it is the user's responsibility to read
> data_size bytes of data from data_offset before moving to the next steps.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c           | 283 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   6 +
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 290 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 133bb5b1b3b2..ef1150c1ff02 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -140,6 +140,168 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
> +                                   uint64_t data_size, uint64_t *size)
> +{
> +    void *ptr = NULL;
> +    int i;
> +
> +    if (!region->mmaps) {
> +        *size = data_size;
> +        return ptr;
> +    }
> +
> +    /* check if data_offset is within the sparse mmap areas */
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        VFIOMmap *map = region->mmaps + i;
> +
> +        if ((data_offset >= map->offset) &&
> +            (data_offset < map->offset + map->size)) {
> +            ptr = map->mmap + data_offset - map->offset;
> +
> +            if (data_offset + data_size <= map->offset + map->size) {
> +                *size = data_size;
> +            } else {
> +                *size = map->offset + map->size - data_offset;
> +            }

Ultimately we take whichever result is smaller, so we could just use:

*size = MIN(data_size, map->offset + map->size - data_offset);

> +            break;
> +        }
> +    }
> +
> +    if (!ptr) {
> +        uint64_t limit = 0;
> +
> +        /*
> +         * data_offset is not within sparse mmap areas, find size of non-mapped
> +         * area. Check through all list since region->mmaps list is not sorted.
> +         */
> +        for (i = 0; i < region->nr_mmaps; i++) {
> +            VFIOMmap *map = region->mmaps + i;
> +
> +            if ((data_offset < map->offset) &&
> +                (!limit || limit > map->offset)) {
> +                limit = map->offset;
> +            }

We could have done this in an else branch of the previous loop to avoid
walking the entries twice.
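
Combining this with the MIN() suggestion above, the two walks could
collapse into a single pass, roughly (sketch; declarations repeated for
completeness):

    void *ptr = NULL;
    uint64_t limit = 0;
    int i;

    for (i = 0; i < region->nr_mmaps; i++) {
        VFIOMmap *map = region->mmaps + i;

        if ((data_offset >= map->offset) &&
            (data_offset < map->offset + map->size)) {
            ptr = map->mmap + data_offset - map->offset;
            *size = MIN(data_size, map->offset + map->size - data_offset);
            break;
        } else if ((data_offset < map->offset) &&
                   (!limit || limit > map->offset)) {
            /* nearest mapped area above data_offset */
            limit = map->offset;
        }
    }

    if (!ptr) {
        *size = limit ? limit - data_offset : data_size;
    }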

> +        }
> +
> +        *size = limit ? limit - data_offset : data_size;
> +    }
> +    return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t data_offset = 0, data_size = 0, size;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_offset));
> +    if (ret != sizeof(data_offset)) {
> +        error_report("%s: Failed to get migration buffer data offset %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             data_size));
> +    if (ret != sizeof(data_size)) {
> +        error_report("%s: Failed to get migration buffer data size %d",
> +                     vbasedev->name, ret);
> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +                           migration->pending_bytes);
> +
> +    qemu_put_be64(f, data_size);
> +    size = data_size;
> +
> +    while (size) {
> +        void *buf = NULL;
> +        bool buffer_mmaped;
> +        uint64_t sec_size;
> +
> +        buf = get_data_section_size(region, data_offset, size, &sec_size);
> +
> +        buffer_mmaped = (buf != NULL);
> +
> +        if (!buffer_mmaped) {
> +            buf = g_try_malloc(sec_size);
> +            if (!buf) {
> +                error_report("%s: Error allocating buffer ", __func__);
> +                return -ENOMEM;
> +            }
> +
> +            ret = pread(vbasedev->fd, buf, sec_size,
> +                        region->fd_offset + data_offset);

Is the trade-off to allocate this buffer worth it?  I'd be tempted to
iterate with a basic data type here to avoid what could potentially be
a large memory allocation above.  It feels a little more robust, if not
perhaps as fast, but I this will mostly be a fallback or only cover
small ranges in normal operation.  Of course the data stream needs to
be compatible either way we retrieve it.
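
For the non-mmap'd case, the fixed-size iteration hinted at here could look
roughly like this (sketch; the chunk size is just whatever the basic type
provides):

    uint64_t buf;

    while (sec_size) {
        uint64_t chunk = MIN(sec_size, sizeof(buf));

        if (pread(vbasedev->fd, &buf, chunk,
                  region->fd_offset + data_offset) != chunk) {
            error_report("%s: Failed to get migration data",
                         vbasedev->name);
            return -EINVAL;
        }
        qemu_put_buffer(f, (uint8_t *)&buf, chunk);
        sec_size -= chunk;
        data_offset += chunk;
    }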

> +            if (ret != sec_size) {
> +                error_report("%s: Failed to get migration data %d",
> +                             vbasedev->name, ret);
> +                g_free(buf);
> +                return -EINVAL;
> +            }
> +        }
> +
> +        qemu_put_buffer(f, buf, sec_size);
> +
> +        if (!buffer_mmaped) {
> +            g_free(buf);
> +        }
> +        size -= sec_size;
> +        data_offset += sec_size;
> +    }
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return data_size;

This function returns int, data_size is uint64_t.  Thanks,

Alex

> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint64_t pending_bytes = 0;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                             pending_bytes));
> +    if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +        error_report("%s: Failed to get pending bytes %d",
> +                     vbasedev->name, ret);
> +        migration->pending_bytes = 0;
> +        return (ret < 0) ? ret : -EINVAL;
> +    }
> +
> +    migration->pending_bytes = pending_bytes;
> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +    return 0;
> +}
> +
> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
> +        vbasedev->ops->vfio_save_config(vbasedev, f);
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    trace_vfio_save_device_config_state(vbasedev->name);
> +
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -192,9 +354,130 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +                              uint64_t threshold_size,
> +                              uint64_t *res_precopy_only,
> +                              uint64_t *res_compatible,
> +                              uint64_t *res_postcopy_only)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return;
> +    }
> +
> +    *res_precopy_only += migration->pending_bytes;
> +
> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +                            *res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret, data_size;
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
> +    if (migration->pending_bytes == 0) {
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +
> +        if (migration->pending_bytes == 0) {
> +            /* indicates data finished, goto complete phase */
> +            return 1;
> +        }
> +    }
> +
> +    data_size = vfio_save_buffer(f, vbasedev);
> +
> +    if (data_size < 0) {
> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
> +                     strerror(errno));
> +        return data_size;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    trace_vfio_save_iterate(vbasedev->name, data_size);
> +
> +    return 0;
> +}
> +
> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOP and SAVING",
> +                     vbasedev->name);
> +        return ret;
> +    }
> +
> +    ret = vfio_save_device_config_state(f, opaque);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_update_pending(vbasedev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    while (migration->pending_bytes > 0) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +        ret = vfio_save_buffer(f, vbasedev);
> +        if (ret < 0) {
> +            error_report("%s: Failed to save buffer", vbasedev->name);
> +            return ret;
> +        } else if (ret == 0) {
> +            break;
> +        }
> +
> +        ret = vfio_update_pending(vbasedev);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
> +    if (ret) {
> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
> +        return ret;
> +    }
> +
> +    trace_vfio_save_complete_precopy(vbasedev->name);
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .save_live_pending = vfio_save_pending,
> +    .save_live_iterate = vfio_save_iterate,
> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 86c18def016e..9a1c5e17d97f 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -151,3 +151,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>  vfio_save_setup(const char *name) " (%s)"
>  vfio_save_cleanup(const char *name) " (%s)"
> +vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
> +vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
> +vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> +vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
> +vfio_save_complete_precopy(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 28f55f66d019..c78033e4149d 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>  
>  typedef struct VFIOMigration {
>      VFIORegion region;
> +    uint64_t pending_bytes;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {




* Re: [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-06-20 20:21 ` [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
@ 2020-06-23  7:21   ` Markus Armbruster
  2020-06-23 21:16     ` Kirti Wankhede
  0 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2020-06-23  7:21 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, quintela, Ken.Xue,
	jonathan.davies, pbonzini

QAPI review only.

The only changes since I reviewed v23 are the rename of VfioStats member
@bytes to @transferred, and the move of MigrationInfo member @vfio next
to @ram and @disk.  Good.  I'm copying my other questions in the hope of
getting answers :)

Kirti Wankhede <kwankhede@nvidia.com> writes:

> Added amount of bytes transferred to the target VM by all VFIO devices
>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
[...]
> diff --git a/qapi/migration.json b/qapi/migration.json
> index d5000558c6c9..952864b05455 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -146,6 +146,18 @@
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1
> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'transferred': 'int' } }

Pardon my ignorance...  What exactly do VFIO devices transfer to the
target VM?  How is that related to MigrationInfo member @ram?

MigrationStats has much more information, and some of it is pretty
useful to track how migration is doing, in particular whether it
converges, and how fast.  Absent in VfioStats due to "not implemented",
or due to "can't be done"?

Byte counts should use QAPI type 'size'.  Many existing ones don't.
Since MigrationStats uses 'int', I'll let the migration maintainers
decide whether they want 'int' or 'size' here.
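
For comparison, the 'size'-typed variant would read (hypothetical, pending
the maintainers' preference):

    { 'struct': 'VfioStats',
      'data': {'transferred': 'size' } }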

>  ##
>  # @MigrationInfo:
> @@ -207,11 +219,16 @@
>  #
>  # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>  #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
>    'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>             '*disk': 'MigrationStats',
> +           '*vfio': 'VfioStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
>             '*expected-downtime': 'int',




* Re: [PATCH QEMU v25 04/17] vfio: Add migration region initialization and finalize function
  2020-06-20 20:21 ` [PATCH QEMU v25 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
@ 2020-06-23  7:54   ` Cornelia Huck
  0 siblings, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-23  7:54 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:13 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Whether the VFIO device supports migration or not is decided based on the
> migration region query. If the migration region query and the region
> initialization both succeed, then migration is supported; otherwise
> migration is blocked.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/Makefile.objs         |   2 +-
>  hw/vfio/migration.c           | 142 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   3 +
>  include/hw/vfio/vfio-common.h |   9 +++
>  4 files changed, 155 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c

(...)

> +static int vfio_migration_region_init(VFIODevice *vbasedev, int index)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    Object *obj = NULL;
> +    int ret = -EINVAL;
> +
> +    if (!vbasedev->ops->vfio_get_object) {
> +        return ret;
> +    }
> +
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +    if (!obj) {
> +        return ret;
> +    }
> +
> +    ret = vfio_region_setup(obj, vbasedev, &migration->region, index,
> +                            "migration");
> +    if (ret) {
> +        error_report("%s: Failed to setup VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));
> +        goto err;
> +    }
> +
> +    if (!migration->region.size) {
> +        ret = -EINVAL;
> +        error_report("%s: Invalid region size of VFIO migration region %d: %s",
> +                     vbasedev->name, index, strerror(-ret));

Instead of only checking for size != 0, should we also check that the
region has a certain expected size or minimum size?
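
A minimum-size check could be as small as the following sketch (assuming
the region must at least cover the registers in struct
vfio_device_migration_info):

    if (migration->region.size < sizeof(struct vfio_device_migration_info)) {
        ret = -EINVAL;
        error_report("%s: VFIO migration region %d too small: %" PRIu64,
                     vbasedev->name, index, migration->region.size);
        goto err;
    }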

> +        goto err;
> +    }
> +
> +    return 0;
> +
> +err:
> +    vfio_migration_region_exit(vbasedev);
> +    return ret;
> +}
> +

(...)

Else looks good to me.




* Re: [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM
  2020-06-20 20:21 ` [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
  2020-06-22 22:50   ` Alex Williamson
@ 2020-06-23  8:07   ` Cornelia Huck
  1 sibling, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-23  8:07 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:14 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> The VM state change handler gets called on a change in the VM's state. This
> is used to set the VFIO device state to _RUNNING.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  2 +
>  include/hw/vfio/vfio-common.h |  4 ++
>  3 files changed, 93 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 48ac385d80a7..fcecc0bb0874 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -10,6 +10,7 @@
>  #include "qemu/osdep.h"
>  #include <linux/vfio.h>
>  
> +#include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> @@ -74,6 +75,85 @@ err:
>      return ret;
>  }
>  
> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> +                                    uint32_t value)

I commented on this interface already in
https://lore.kernel.org/qemu-devel/20200505120459.62bd0b16.cohuck@redhat.com/,
but I did not see any reply... I guess my comments still apply (here
and below).

> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +    VFIORegion *region = &migration->region;
> +    uint32_t device_state;
> +    int ret;
> +
> +    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to read device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +        return ret;
> +    }
> +
> +    device_state = (device_state & mask) | value;
> +
> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                                              device_state));
> +    if (ret < 0) {
> +        error_report("%s: Failed to set device state %d %s",
> +                     vbasedev->name, ret, strerror(errno));
> +
> +        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> +                device_state));
> +        if (ret < 0) {
> +            error_report("%s: On failure, failed to read device state %d %s",
> +                    vbasedev->name, ret, strerror(errno));
> +            return ret;
> +        }
> +
> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> +            error_report("%s: Device is in error state 0x%x",
> +                         vbasedev->name, device_state);
> +            return -EFAULT;

E.g., why -EFAULT here?

> +        }
> +    }
> +
> +    vbasedev->device_state = device_state;
> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> +    return 0;
> +}
> +
> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if ((vbasedev->vm_running != running)) {
> +        int ret;
> +        uint32_t value = 0, mask = 0;
> +
> +        if (running) {
> +            value = VFIO_DEVICE_STATE_RUNNING;
> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> +            }
> +        } else {
> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> +        }
> +
> +        ret = vfio_migration_set_state(vbasedev, mask, value);
> +        if (ret) {
> +            error_report("%s: Failed to set device state 0x%x",
> +                         vbasedev->name, value & mask);
> +        }
> +        vbasedev->vm_running = running;
> +        trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> +                                  value & mask);
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {




* Re: [PATCH QEMU v25 06/17] vfio: Add migration state change notifier
  2020-06-20 20:21 ` [PATCH QEMU v25 06/17] vfio: Add migration state change notifier Kirti Wankhede
@ 2020-06-23  8:10   ` Cornelia Huck
  0 siblings, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-23  8:10 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added a migration state change notifier to get notifications on migration
> state changes. These states are translated to the VFIO device state and
> conveyed to the vendor driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  5 +++--
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index fcecc0bb0874..e30bd8768701 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -154,6 +154,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    trace_vfio_migration_state_notifier(vbasedev->name,
> +                                        MigrationStatus_str(s->state));
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> +                      VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);

Also see https://lore.kernel.org/qemu-devel/20200505124639.56531df8.cohuck@redhat.com/.

> +        }
> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {




* Re: [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-06-20 20:21 ` [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
@ 2020-06-23  8:25   ` Cornelia Huck
  2020-06-24 18:56   ` Alex Williamson
  1 sibling, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-23  8:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:24 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> With vIOMMU, IO virtual address range can get unmapped while in pre-copy
> phase of migration. In that case, unmap ioctl should return pages pinned
> in that range and QEMU should find the corresponding guest physical
> addresses and report those dirty.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 81 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0518cf228ed5..a06b8f2f66e2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -311,11 +311,83 @@ static bool vfio_devices_are_stopped_and_saving(void)
>      return true;
>  }
>  
> +static bool vfio_devices_are_running_and_saving(void)

I previously asked:
(https://lore.kernel.org/qemu-devel/20200506123125.449dbf42.cohuck@redhat.com/)

"Maybe s/are/all/ to make it sure that the scope is *all* vfio devices
here?

Is there any global state for this which we could use to check this in
a simpler way?"

Any comment?

> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
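
A minimal sketch combining the suggested s/are/all/ rename with a
flattened check; assuming no global state is available, the per-device
walk stays, but the empty continue branch folds into one negative test:

    static bool vfio_devices_all_running_and_saving(void)
    {
        VFIOGroup *group;
        VFIODevice *vbasedev;

        QLIST_FOREACH(group, &vfio_group_list, next) {
            QLIST_FOREACH(vbasedev, &group->device_list, next) {
                /* fail fast unless the device is both SAVING and RUNNING */
                if (!(vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) ||
                    !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
                    return false;
                }
            }
        }
        return true;
    }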




* Re: [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-06-20 20:21 ` [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
@ 2020-06-23 10:32   ` Cornelia Huck
  2020-06-23 11:01     ` Dr. David Alan Gilbert
  2020-06-24 18:55   ` Alex Williamson
  1 sibling, 1 reply; 66+ messages in thread
From: Cornelia Huck @ 2020-06-23 10:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:21 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> for VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index faacea5327cb..e0fbb3a01855 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -11,6 +11,7 @@
>  #include "qemu/main-loop.h"
>  #include "qemu/cutils.h"
>  #include <linux/vfio.h>
> +#include <sys/ioctl.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
> @@ -329,6 +330,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)

I find 'start' functions that may also stop something a bit confusing.
Maybe vfio_toggle_dirty_page_tracking()?

> +{
> +    int ret;
> +    VFIOContainer *container = vbasedev->group->container;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (start) {
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +        } else {
> +            return -EINVAL;
> +        }
> +    } else {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }
> +    return ret;
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -360,6 +387,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
> +    if (ret) {
> +        return ret;
> +    }
> +
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
>      ret = qemu_file_get_error(f);
> @@ -375,6 +407,8 @@ static void vfio_save_cleanup(void *opaque)
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>  
> +    vfio_start_dirty_page_tracking(vbasedev, false);

I suppose we can't do anything useful if stopping dirty page tracking
fails?

> +
>      if (migration->region.mmaps) {
>          vfio_region_unmap(&migration->region);
>      }
> @@ -706,6 +740,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> +
> +        vfio_start_dirty_page_tracking(vbasedev, false);
>      }
>  }
>  




* Re: [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-06-23 10:32   ` Cornelia Huck
@ 2020-06-23 11:01     ` Dr. David Alan Gilbert
  2020-06-23 11:06       ` Cornelia Huck
  0 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-23 11:01 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Cornelia Huck (cohuck@redhat.com) wrote:
> On Sun, 21 Jun 2020 01:51:21 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> > for VFIO devices.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index faacea5327cb..e0fbb3a01855 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -11,6 +11,7 @@
> >  #include "qemu/main-loop.h"
> >  #include "qemu/cutils.h"
> >  #include <linux/vfio.h>
> > +#include <sys/ioctl.h>
> >  
> >  #include "sysemu/runstate.h"
> >  #include "hw/vfio/vfio-common.h"
> > @@ -329,6 +330,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >      return qemu_file_get_error(f);
> >  }
> >  
> > +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> 
> I find 'start' functions that may also stop something a bit confusing.
> Maybe vfio_toggle_dirty_page_tracking()?

I don't think toggle is any better; I always think of toggle as flipping
the state to the other state.
vfio_set_dirty_page_tracking maybe?

Dave


> > +{
> > +    int ret;
> > +    VFIOContainer *container = vbasedev->group->container;
> > +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> > +        .argsz = sizeof(dirty),
> > +    };
> > +
> > +    if (start) {
> > +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> > +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> > +        } else {
> > +            return -EINVAL;
> > +        }
> > +    } else {
> > +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> > +    }
> > +
> > +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> > +    if (ret) {
> > +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> > +                     dirty.flags, errno);
> > +    }
> > +    return ret;
> > +}
> > +
> >  /* ---------------------------------------------------------------------- */
> >  
> >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > @@ -360,6 +387,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
> >          return ret;
> >      }
> >  
> > +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> >      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >  
> >      ret = qemu_file_get_error(f);
> > @@ -375,6 +407,8 @@ static void vfio_save_cleanup(void *opaque)
> >      VFIODevice *vbasedev = opaque;
> >      VFIOMigration *migration = vbasedev->migration;
> >  
> > +    vfio_start_dirty_page_tracking(vbasedev, false);
> 
> I suppose we can't do anything useful if stopping dirty page tracking
> fails?
> 
> > +
> >      if (migration->region.mmaps) {
> >          vfio_region_unmap(&migration->region);
> >      }
> > @@ -706,6 +740,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> >          if (ret) {
> >              error_report("%s: Failed to set state RUNNING", vbasedev->name);
> >          }
> > +
> > +        vfio_start_dirty_page_tracking(vbasedev, false);
> >      }
> >  }
> >  
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-06-23 11:01     ` Dr. David Alan Gilbert
@ 2020-06-23 11:06       ` Cornelia Huck
  0 siblings, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-23 11:06 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

On Tue, 23 Jun 2020 12:01:25 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Cornelia Huck (cohuck@redhat.com) wrote:
> > On Sun, 21 Jun 2020 01:51:21 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> > > for VFIO devices.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 36 insertions(+)
> > > 
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > index faacea5327cb..e0fbb3a01855 100644
> > > --- a/hw/vfio/migration.c
> > > +++ b/hw/vfio/migration.c
> > > @@ -11,6 +11,7 @@
> > >  #include "qemu/main-loop.h"
> > >  #include "qemu/cutils.h"
> > >  #include <linux/vfio.h>
> > > +#include <sys/ioctl.h>
> > >  
> > >  #include "sysemu/runstate.h"
> > >  #include "hw/vfio/vfio-common.h"
> > > @@ -329,6 +330,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> > >      return qemu_file_get_error(f);
> > >  }
> > >  
> > > +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)  
> > 
> > I find 'start' functions that may also stop something a bit confusing.
> > Maybe vfio_toggle_dirty_page_tracking()?  
> 
> I don't think toggle is any better; I always think of toggle as flipping
> the state to the other state.
> vfio_set_dirty_page_tracking maybe?

Sounds good to me.
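
For reference, a minimal sketch of the agreed-upon name. The body is
the one from the patch with the start branch flattened into an early
exit; the assumption is that the bool argument keeps its start/stop
meaning:

    static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
    {
        int ret;
        VFIOContainer *container = vbasedev->group->container;
        struct vfio_iommu_type1_dirty_bitmap dirty = {
            .argsz = sizeof(dirty),
        };

        if (start) {
            if (!(vbasedev->device_state & VFIO_DEVICE_STATE_SAVING)) {
                return -EINVAL;
            }
            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
        } else {
            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
        }

        ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
        if (ret) {
            error_report("Failed to set dirty tracking flag 0x%x errno: %d",
                         dirty.flags, errno);
        }
        return ret;
    }

Callers in vfio_save_setup(), vfio_save_cleanup() and the migration
state notifier would switch to the new name accordingly.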




* Re: [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM
  2020-06-22 22:50   ` Alex Williamson
@ 2020-06-23 18:55     ` Kirti Wankhede
  2020-06-26 14:51       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-23 18:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/23/2020 4:20 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:14 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> VM state change handler gets called on change in VM's state. This is used to set
>> VFIO device state to _RUNNING.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |  2 +
>>   include/hw/vfio/vfio-common.h |  4 ++
>>   3 files changed, 93 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 48ac385d80a7..fcecc0bb0874 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -10,6 +10,7 @@
>>   #include "qemu/osdep.h"
>>   #include <linux/vfio.h>
>>   
>> +#include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "cpu.h"
>>   #include "migration/migration.h"
>> @@ -74,6 +75,85 @@ err:
>>       return ret;
>>   }
>>   
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>> +                                    uint32_t value)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint32_t device_state;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              device_state));
>> +    if (ret < 0) {
>> +        error_report("%s: Failed to read device state %d %s",
>> +                     vbasedev->name, ret, strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    device_state = (device_state & mask) | value;
>> +
>> +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
>> +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                              device_state));
>> +    if (ret < 0) {
>> +        error_report("%s: Failed to set device state %d %s",
>> +                     vbasedev->name, ret, strerror(errno));
>> +
>> +        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                device_state));
>> +        if (ret < 0) {
>> +            error_report("%s: On failure, failed to read device state %d %s",
>> +                    vbasedev->name, ret, strerror(errno));
>> +            return ret;
>> +        }
>> +
>> +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
>> +            error_report("%s: Device is in error state 0x%x",
>> +                         vbasedev->name, device_state);
>> +            return -EFAULT;
>> +        }
>> +    }
>> +
>> +    vbasedev->device_state = device_state;
>> +    trace_vfio_migration_set_state(vbasedev->name, device_state);
>> +    return 0;
>> +}
>> +
>> +static void vfio_vmstate_change(void *opaque, int running, RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    if ((vbasedev->vm_running != running)) {
>> +        int ret;
>> +        uint32_t value = 0, mask = 0;
>> +
>> +        if (running) {
>> +            value = VFIO_DEVICE_STATE_RUNNING;
>> +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
>> +                mask = ~VFIO_DEVICE_STATE_RESUMING;
>> +            }
>> +        } else {
>> +            mask = ~VFIO_DEVICE_STATE_RUNNING;
>> +        }
>> +
>> +        ret = vfio_migration_set_state(vbasedev, mask, value);
>> +        if (ret) {
>> +            error_report("%s: Failed to set device state 0x%x",
>> +                         vbasedev->name, value & mask);
> 
> 
> Is there nothing more we should do here?  It seems like in either the
> case of an outbound migration where we can't stop the device or an
> inbound migration where we can't start the device, we'd want this to
> trigger an abort of the migration.  Should there at least be a TODO
> comment if the reason is that QEMU migration doesn't yet support failure
> here?  Thanks,
> 

I checked other modules in QEMU: in some places an error message is
reported as above, while in others abort() is called (for example
kvmclock_vm_state_change() in hw/i386/kvm/clock.c). abort() kills the
QEMU process, i.e. the VM crashes. Should we abort here in the error
case? Either way, the VM will not recover properly after migration if
such an error occurs.

Thanks,
Kirti
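
For illustration, a sketch of the abort() option being asked about,
hypothetical and following the kvmclock precedent cited above; whether
this is acceptable is exactly the open question:

    ret = vfio_migration_set_state(vbasedev, mask, value);
    if (ret) {
        error_report("%s: Failed to set device state 0x%x",
                     vbasedev->name, value & mask);
        /*
         * Hypothetical hard failure, as in kvmclock_vm_state_change():
         * kills the QEMU process, i.e. crashes the VM.
         */
        abort();
    }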




* Re: [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-06-22 22:50   ` Alex Williamson
@ 2020-06-23 19:21     ` Kirti Wankhede
  2020-06-23 19:50       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-23 19:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/23/2020 4:20 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:16 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Define flags to be used as delimiter in migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 94 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index e30bd8768701..133bb5b1b3b2 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -8,12 +8,15 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/main-loop.h"
>> +#include "qemu/cutils.h"
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "cpu.h"
>>   #include "migration/migration.h"
>> +#include "migration/vmstate.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/register.h"
>>   #include "migration/blocker.h"
>> @@ -24,6 +27,17 @@
>>   #include "pci.h"
>>   #include "trace.h"
>>   
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>> @@ -126,6 +140,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region);
>> +        qemu_mutex_unlock_iothread();
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.nr,
>> +                         strerror(-ret));
>> +            return ret;
> 
> OTOH to my previous comments, this shouldn't be fatal, right?  mmaps
> are optional anyway so it should be sufficient to push an error report
> to explain why this might be slower than normal, but we can still
> proceed.
> 

Right, defining the region as sparse mmap is optional.
migration->region.mmaps is set if the vendor driver defines sparse
mmapable regions, i.e. the VFIO_REGION_INFO_FLAG_MMAP flag is set. If
this flag is set, then an error on mmap() should be fatal.

If there is no mmapable region, then migration will proceed.

>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
>> +                                   VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>> +        return ret;
>> +    }
> 
> We seem to be lacking support in the callers for detecting if the
> device is in an error state.  I'm not sure what our options are
> though, maybe only a hw_error().
> 

Returning an error here fails the migration process. And if the device
is in an error state, any application inside the VM using this device
would fail anyway. I don't think any special action is needed here to
detect the device error state.

>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.mmaps) {
>> +        vfio_region_unmap(&migration->region);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -180,6 +253,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>>       int ret;
>> +    char id[256] = "";
>>   
>>       vbasedev->migration = g_new0(VFIOMigration, 1);
>>   
>> @@ -192,6 +266,24 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return ret;
>>       }
>>   
>> +    if (vbasedev->ops->vfio_get_object) {
> 
> Nit, vfio_migration_region_init() would have failed already if this were
> not available.  Perhaps do the test once at the start of this function
> instead?  Thanks,
> 

Ok, will do that.

Thanks,
Kirti


> Alex
> 
>> +        Object *obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +
>> +        if (obj) {
>> +            DeviceState *dev = DEVICE(obj);
>> +            char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
>> +
>> +            if (oid) {
>> +                pstrcpy(id, sizeof(id), oid);
>> +                pstrcat(id, sizeof(id), "/");
>> +                g_free(oid);
>> +            }
>> +        }
>> +    }
>> +    pstrcat(id, sizeof(id), "vfio");
>> +
>> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>> +                         vbasedev);
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>>       vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index bd3d47b005cb..86c18def016e 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>> +vfio_save_setup(const char *name) " (%s)"
>> +vfio_save_cleanup(const char *name) " (%s)"
> 



* Re: [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-06-23 19:21     ` Kirti Wankhede
@ 2020-06-23 19:50       ` Alex Williamson
  2020-06-26 14:22         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-23 19:50 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 24 Jun 2020 00:51:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/23/2020 4:20 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Define flags to be used as delimiter in migration file stream.
> >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> >> region from these functions at source during saving or pre-copy phase.
> >> Set VFIO device state depending on VM's state. During live migration, VM is
> >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |  2 ++
> >>   2 files changed, 94 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index e30bd8768701..133bb5b1b3b2 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -8,12 +8,15 @@
> >>    */
> >>   
> >>   #include "qemu/osdep.h"
> >> +#include "qemu/main-loop.h"
> >> +#include "qemu/cutils.h"
> >>   #include <linux/vfio.h>
> >>   
> >>   #include "sysemu/runstate.h"
> >>   #include "hw/vfio/vfio-common.h"
> >>   #include "cpu.h"
> >>   #include "migration/migration.h"
> >> +#include "migration/vmstate.h"
> >>   #include "migration/qemu-file.h"
> >>   #include "migration/register.h"
> >>   #include "migration/blocker.h"
> >> @@ -24,6 +27,17 @@
> >>   #include "pci.h"
> >>   #include "trace.h"
> >>   
> >> +/*
> >> + * Flags used as delimiter:
> >> + * 0xffffffff => MSB 32-bit all 1s
> >> + * 0xef10     => emulated (virtual) function IO
> >> + * 0x0000     => 16-bits reserved for flags
> >> + */
> >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> >> +
> >>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
> >>   {
> >>       VFIOMigration *migration = vbasedev->migration;
> >> @@ -126,6 +140,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>       return 0;
> >>   }
> >>   
> >> +/* ---------------------------------------------------------------------- */
> >> +
> >> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret;
> >> +
> >> +    trace_vfio_save_setup(vbasedev->name);
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> >> +
> >> +    if (migration->region.mmaps) {
> >> +        qemu_mutex_lock_iothread();
> >> +        ret = vfio_region_mmap(&migration->region);
> >> +        qemu_mutex_unlock_iothread();
> >> +        if (ret) {
> >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >> +                         vbasedev->name, migration->region.nr,
> >> +                         strerror(-ret));
> >> +            return ret;  
> > 
> > OTOH to my previous comments, this shouldn't be fatal, right?  mmaps
> > are optional anyway so it should be sufficient to push an error report
> > to explain why this might be slower than normal, but we can still
> > proceed.
> >   
> 
> Right, defining the region as sparse mmap is optional.
> migration->region.mmaps is set if the vendor driver defines sparse
> mmapable regions, i.e. the VFIO_REGION_INFO_FLAG_MMAP flag is set. If
> this flag is set, then an error on mmap() should be fatal.
> 
> If there is no mmapable region, then migration will proceed.

It's both optional for the vendor to define sparse mmap support (or any
mmap support) and optional for the user to make use of it.  The user
can recover from an mmap failure by using read/write accesses.  The
vendor MUST support this.  It doesn't make sense to worry about
aborting the VM in replying to comments for 05/17, where it's not clear
how we proceed, yet intentionally cause a fatal error here when there
is a very clear path to proceed.
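
A sketch of the non-fatal variant argued for here, assuming
pread()/pwrite() access to the migration region keeps working when the
mmap fails:

    if (migration->region.mmaps) {
        qemu_mutex_lock_iothread();
        ret = vfio_region_mmap(&migration->region);
        qemu_mutex_unlock_iothread();
        if (ret) {
            /* Not fatal: fall back to (slower) read/write access */
            error_report("%s: Failed to mmap VFIO migration region %d: %s",
                         vbasedev->name, migration->region.nr,
                         strerror(-ret));
        }
    }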

> >> +        }
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> >> +                                   VFIO_DEVICE_STATE_SAVING);
> >> +    if (ret) {
> >> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> >> +        return ret;
> >> +    }  
> > 
> > We seem to be lacking support in the callers for detecting if the
> > device is in an error state.  I'm not sure what our options are
> > though, maybe only a hw_error().
> >   
> 
> Returning an error here fails the migration process. And if the device
> is in an error state, any application inside the VM using this device
> would fail anyway. I don't think any special action is needed here to
> detect the device error state.

If QEMU knows a device has failed, it seems like it would make sense to
stop the VM, otherwise we risk an essentially endless assortment of
ways that the user might notice the guest isn't behaving normally, some
maybe even causing the user to lose data.  Thanks,

Alex
 
> >> +
> >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static void vfio_save_cleanup(void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +
> >> +    if (migration->region.mmaps) {
> >> +        vfio_region_unmap(&migration->region);
> >> +    }
> >> +    trace_vfio_save_cleanup(vbasedev->name);
> >> +}
> >> +
> >> +static SaveVMHandlers savevm_vfio_handlers = {
> >> +    .save_setup = vfio_save_setup,
> >> +    .save_cleanup = vfio_save_cleanup,
> >> +};
> >> +
> >> +/* ---------------------------------------------------------------------- */
> >> +
> >>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>   {
> >>       VFIODevice *vbasedev = opaque;
> >> @@ -180,6 +253,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                  struct vfio_region_info *info)
> >>   {
> >>       int ret;
> >> +    char id[256] = "";
> >>   
> >>       vbasedev->migration = g_new0(VFIOMigration, 1);
> >>   
> >> @@ -192,6 +266,24 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>           return ret;
> >>       }
> >>   
> >> +    if (vbasedev->ops->vfio_get_object) {  
> > 
> > Nit, vfio_migration_region_init() would have failed already if this were
> > not available.  Perhaps do the test once at the start of this function
> > instead?  Thanks,
> >   
> 
> Ok, will do that.
> 
> Thanks,
> Kirti
> 
> 
> > Alex
> >   
> >> +        Object *obj = vbasedev->ops->vfio_get_object(vbasedev);
> >> +
> >> +        if (obj) {
> >> +            DeviceState *dev = DEVICE(obj);
> >> +            char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> >> +
> >> +            if (oid) {
> >> +                pstrcpy(id, sizeof(id), oid);
> >> +                pstrcat(id, sizeof(id), "/");
> >> +                g_free(oid);
> >> +            }
> >> +        }
> >> +    }
> >> +    pstrcat(id, sizeof(id), "vfio");
> >> +
> >> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> >> +                         vbasedev);
> >>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >>                                                             vbasedev);
> >>       vbasedev->migration_state.notify = vfio_migration_state_notifier;
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index bd3d47b005cb..86c18def016e 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> >>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> >>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> >> +vfio_save_setup(const char *name) " (%s)"
> >> +vfio_save_cleanup(const char *name) " (%s)"  
> >   
> 




* Re: [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-06-22 22:50   ` Alex Williamson
@ 2020-06-23 20:34     ` Kirti Wankhede
  2020-06-23 20:40       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-23 20:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/23/2020 4:20 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:17 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handle the pre-copy and stop-and-copy phases.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes. If pending_bytes > 0, go through below steps.
>> - read data_offset - indicates kernel driver to write data to staging
>>    buffer.
>> - read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> - read data_size bytes of data from data_offset in the migration region.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>     doesn't need to be from vendor driver. Any other special config state
>>     from driver can be saved as data in following iteration.
>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
>> c. read data_offset - indicates kernel driver to write data to staging
>>     buffer.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>     migration region.
>> e. read data_size bytes of data from data_offset in the migration region.
>> f. Write data packet as below:
>>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> g. iterate through steps b to f while (pending_bytes > 0)
>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When the data region is mapped, it's the user's responsibility to read
>> data_size bytes of data from data_offset before moving to the next steps.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c           | 283 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   6 +
>>   include/hw/vfio/vfio-common.h |   1 +
>>   3 files changed, 290 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 133bb5b1b3b2..ef1150c1ff02 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -140,6 +140,168 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
>> +                                   uint64_t data_size, uint64_t *size)
>> +{
>> +    void *ptr = NULL;
>> +    int i;
>> +
>> +    if (!region->mmaps) {
>> +        *size = data_size;
>> +        return ptr;
>> +    }
>> +
>> +    /* check if data_offset in within sparse mmap areas */
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        VFIOMmap *map = region->mmaps + i;
>> +
>> +        if ((data_offset >= map->offset) &&
>> +            (data_offset < map->offset + map->size)) {
>> +            ptr = map->mmap + data_offset - map->offset;
>> +
>> +            if (data_offset + data_size <= map->offset + map->size) {
>> +                *size = data_size;
>> +            } else {
>> +                *size = map->offset + map->size - data_offset;
>> +            }
> 
> Ultimately we take whichever result is smaller, so we could just use:
> 
> *size = MIN(data_size, map->offset + map->size - data_offset);
> 
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (!ptr) {
>> +        uint64_t limit = 0;
>> +
>> +        /*
>> +         * data_offset is not within sparse mmap areas, find size of non-mapped
>> +         * area. Check through all list since region->mmaps list is not sorted.
>> +         */
>> +        for (i = 0; i < region->nr_mmaps; i++) {
>> +            VFIOMmap *map = region->mmaps + i;
>> +
>> +            if ((data_offset < map->offset) &&
>> +                (!limit || limit > map->offset)) {
>> +                limit = map->offset;
>> +            }
> 
> We could have done this in an else branch of the previous loop to avoid
> walking the entries twice.
> 

OK, updating with the above two changes.

>> +        }
>> +
>> +        *size = limit ? limit - data_offset : data_size;
>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t data_offset = 0, data_size = 0, size;
>> +    int ret;
>> +
>> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret != sizeof(data_offset)) {
>> +        error_report("%s: Failed to get migration buffer data offset %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret != sizeof(data_size)) {
>> +        error_report("%s: Failed to get migration buffer data size %d",
>> +                     vbasedev->name, ret);
>> +        return -EINVAL;
>> +    }
>> +
>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>> +                           migration->pending_bytes);
>> +
>> +    qemu_put_be64(f, data_size);
>> +    size = data_size;
>> +
>> +    while (size) {
>> +        void *buf = NULL;
>> +        bool buffer_mmaped;
>> +        uint64_t sec_size;
>> +
>> +        buf = get_data_section_size(region, data_offset, size, &sec_size);
>> +
>> +        buffer_mmaped = (buf != NULL);
>> +
>> +        if (!buffer_mmaped) {
>> +            buf = g_try_malloc(sec_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                return -ENOMEM;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, buf, sec_size,
>> +                        region->fd_offset + data_offset);
> 
> Is the trade-off to allocate this buffer worth it?  I'd be tempted to
> iterate with a basic data type here to avoid what could potentially be
> a large memory allocation above.  It feels a little more robust, if not
> perhaps as fast, but I think this will mostly be a fallback or only cover
> small ranges in normal operation.  Of course the data stream needs to
> be compatible either way we retrieve it.
> 

What should the basic data type be here: u8, u16, u32 or u64? We don't
know at what granularity the vendor driver is writing, so I think we
have to go with the smallest, u8, right?


>> +            if (ret != sec_size) {
>> +                error_report("%s: Failed to get migration data %d",
>> +                             vbasedev->name, ret);
>> +                g_free(buf);
>> +                return -EINVAL;
>> +            }
>> +        }
>> +
>> +        qemu_put_buffer(f, buf, sec_size);
>> +
>> +        if (!buffer_mmaped) {
>> +            g_free(buf);
>> +        }
>> +        size -= sec_size;
>> +        data_offset += sec_size;
>> +    }
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return data_size;
> 
> This function returns int, data_size is uint64_t.  Thanks,
> 

Yes, the return values for this function are:
< 0 => error
==0 => no more data to save
data_size => amount of data saved in this function.

Thanks,
Kirti




* Re: [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers
  2020-06-23 20:34     ` Kirti Wankhede
@ 2020-06-23 20:40       ` Alex Williamson
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2020-06-23 20:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 24 Jun 2020 02:04:24 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/23/2020 4:20 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:17 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> >> functions. These functions handle the pre-copy and stop-and-copy phases.
> >>
> >> In _SAVING|_RUNNING device state or pre-copy phase:
> >> - read pending_bytes. If pending_bytes > 0, go through below steps.
> >> - read data_offset - indicates kernel driver to write data to staging
> >>    buffer.
> >> - read data_size - amount of data in bytes written by vendor driver in
> >>    migration region.
> >> - read data_size bytes of data from data_offset in the migration region.
> >> - Write data packet to file stream as below:
> >> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >> VFIO_MIG_FLAG_END_OF_STATE }
> >>
> >> In _SAVING device state or stop-and-copy phase
> >> a. read config space of device and save to migration file stream. This
> >>     doesn't need to be from vendor driver. Any other special config state
> >>     from driver can be saved as data in following iteration.
> >> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> >> c. read data_offset - indicates kernel driver to write data to staging
> >>     buffer.
> >> d. read data_size - amount of data in bytes written by vendor driver in
> >>     migration region.
> >> e. read data_size bytes of data from data_offset in the migration region.
> >> f. Write data packet as below:
> >>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >> g. iterate through steps b to f while (pending_bytes > 0)
> >> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>
> >> When the data region is mapped, it's the user's responsibility to read
> >> data_size bytes of data from data_offset before moving to the next steps.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c           | 283 ++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events          |   6 +
> >>   include/hw/vfio/vfio-common.h |   1 +
> >>   3 files changed, 290 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 133bb5b1b3b2..ef1150c1ff02 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -140,6 +140,168 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >>       return 0;
> >>   }
> >>   
> >> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
> >> +                                   uint64_t data_size, uint64_t *size)
> >> +{
> >> +    void *ptr = NULL;
> >> +    int i;
> >> +
> >> +    if (!region->mmaps) {
> >> +        *size = data_size;
> >> +        return ptr;
> >> +    }
> >> +
> >> +    /* check if data_offset in within sparse mmap areas */
> >> +    for (i = 0; i < region->nr_mmaps; i++) {
> >> +        VFIOMmap *map = region->mmaps + i;
> >> +
> >> +        if ((data_offset >= map->offset) &&
> >> +            (data_offset < map->offset + map->size)) {
> >> +            ptr = map->mmap + data_offset - map->offset;
> >> +
> >> +            if (data_offset + data_size <= map->offset + map->size) {
> >> +                *size = data_size;
> >> +            } else {
> >> +                *size = map->offset + map->size - data_offset;
> >> +            }  
> > 
> > Ultimately we take whichever result is smaller, so we could just use:
> > 
> > *size = MIN(data_size, map->offset + map->size - data_offset);
> >   
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    if (!ptr) {
> >> +        uint64_t limit = 0;
> >> +
> >> +        /*
> >> +         * data_offset is not within sparse mmap areas, find size of non-mapped
> >> +         * area. Check through all list since region->mmaps list is not sorted.
> >> +         */
> >> +        for (i = 0; i < region->nr_mmaps; i++) {
> >> +            VFIOMmap *map = region->mmaps + i;
> >> +
> >> +            if ((data_offset < map->offset) &&
> >> +                (!limit || limit > map->offset)) {
> >> +                limit = map->offset;
> >> +            }  
> > 
> > We could have done this in an else branch of the previous loop to avoid
> > walking the entries twice.
> >   
> 
> OK, updating with the above two changes.
> 
> >> +        }
> >> +
> >> +        *size = limit ? limit - data_offset : data_size;
> >> +    }
> >> +    return ptr;
> >> +}
> >> +
> >> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >> +{
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    VFIORegion *region = &migration->region;
> >> +    uint64_t data_offset = 0, data_size = 0, size;
> >> +    int ret;
> >> +
> >> +    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_offset));
> >> +    if (ret != sizeof(data_offset)) {
> >> +        error_report("%s: Failed to get migration buffer data offset %d",
> >> +                     vbasedev->name, ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> >> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> +                                             data_size));
> >> +    if (ret != sizeof(data_size)) {
> >> +        error_report("%s: Failed to get migration buffer data size %d",
> >> +                     vbasedev->name, ret);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> >> +                           migration->pending_bytes);
> >> +
> >> +    qemu_put_be64(f, data_size);
> >> +    size = data_size;
> >> +
> >> +    while (size) {
> >> +        void *buf = NULL;
> >> +        bool buffer_mmaped;
> >> +        uint64_t sec_size;
> >> +
> >> +        buf = get_data_section_size(region, data_offset, size, &sec_size);
> >> +
> >> +        buffer_mmaped = (buf != NULL);
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            buf = g_try_malloc(sec_size);
> >> +            if (!buf) {
> >> +                error_report("%s: Error allocating buffer ", __func__);
> >> +                return -ENOMEM;
> >> +            }
> >> +
> >> +            ret = pread(vbasedev->fd, buf, sec_size,
> >> +                        region->fd_offset + data_offset);  
> > 
> > Is the trade-off to allocate this buffer worth it?  I'd be tempted to
> > iterate with a basic data type here to avoid what could potentially be
> > a large memory allocation above.  It feels a little more robust, if not
> > perhaps as fast, but I think this will mostly be a fallback or only cover
> > small ranges in normal operation.  Of course the data stream needs to
> > be compatible either way we retrieve it.
> >   
> 
> What should the basic data type be here: u8, u16, u32 or u64? We don't
> know at what granularity the vendor driver is writing, so I think we
> have to go with the smallest, u8, right?

That'd be a little on the ridiculous side.  We could make a helper like
in vfio_pci_rdwr that reads at the largest aligned size up to u64.
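
A sketch of such a helper, with a hypothetical name, reading the
migration region in the largest naturally aligned chunks up to 8 bytes
and streaming each chunk straight into the QEMUFile:

    static int vfio_mig_read_chunked(VFIODevice *vbasedev, QEMUFile *f,
                                     uint64_t off, uint64_t size)
    {
        while (size) {
            uint64_t buf = 0;
            int csize;

            /* largest power-of-two chunk that alignment and residue allow */
            if (!(off % 8) && size >= 8) {
                csize = 8;
            } else if (!(off % 4) && size >= 4) {
                csize = 4;
            } else if (!(off % 2) && size >= 2) {
                csize = 2;
            } else {
                csize = 1;
            }

            if (pread(vbasedev->fd, &buf, csize, off) != csize) {
                return -EINVAL;
            }
            qemu_put_buffer(f, (uint8_t *)&buf, csize);

            off += csize;
            size -= csize;
        }
        return 0;
    }

The caller would pass region->fd_offset + data_offset and sec_size, in
place of the g_try_malloc() path.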

> >> +            if (ret != sec_size) {
> >> +                error_report("%s: Failed to get migration data %d",
> >> +                             vbasedev->name, ret);
> >> +                g_free(buf);
> >> +                return -EINVAL;
> >> +            }
> >> +        }
> >> +
> >> +        qemu_put_buffer(f, buf, sec_size);
> >> +
> >> +        if (!buffer_mmaped) {
> >> +            g_free(buf);
> >> +        }
> >> +        size -= sec_size;
> >> +        data_offset += sec_size;
> >> +    }
> >> +
> >> +    ret = qemu_file_get_error(f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    return data_size;  
> > 
> > This function returns int, data_size is uint64_t.  Thanks,
> >   
> 
> Yes, returns for this function:
> < 0 => error
> ==0 => no more data to save
> data_size => amount of data saved in this function.

So when data_size exceeds INT_MAX, the return value goes negative...

Thanks,
Alex
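
One hypothetical way out of the truncation, sketched as a contract
change only: report the bytes saved through an out parameter, so the
int return carries nothing but 0 or -errno:

    /* returns < 0 on error, 0 on success; *saved = bytes written */
    static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev,
                                uint64_t *saved);

The '== 0 means no more data' convention would then move to checking
*saved == 0 in the callers.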




* Re: [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-06-23  7:21   ` Markus Armbruster
@ 2020-06-23 21:16     ` Kirti Wankhede
  2020-06-25  5:51       ` Markus Armbruster
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-23 21:16 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, eskultet, ziye.yang, mlevitsk, pasic,
	felipe, Ken.Xue, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, quintela, zhi.a.wang,
	jonathan.davies, pbonzini



On 6/23/2020 12:51 PM, Markus Armbruster wrote:
> QAPI review only.
> 
> The only changes since I reviewed v23 are the rename of VfioStats member
> @bytes to @transferred, and the move of MigrationInfo member @vfio next
> to @ram and @disk.  Good.  I'm copying my other questions in the hope of
> getting answers :)
> 
> Kirti Wankhede <kwankhede@nvidia.com> writes:
> 
>> Added amount of bytes transferred to the target VM by all VFIO devices
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> [...]
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index d5000558c6c9..952864b05455 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -146,6 +146,18 @@
>>               'active', 'postcopy-active', 'postcopy-paused',
>>               'postcopy-recover', 'completed', 'failed', 'colo',
>>               'pre-switchover', 'device', 'wait-unplug' ] }
>> +##
>> +# @VfioStats:
>> +#
>> +# Detailed VFIO devices migration statistics
>> +#
>> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
>> +#
>> +# Since: 5.1
>> +#
>> +##
>> +{ 'struct': 'VfioStats',
>> +  'data': {'transferred': 'int' } }
> 
> Pardon my ignorance...  What exactly do VFIO devices transfer to the
> target VM? How is that related to MigrationInfo member @ram? 
> 

Sorry, I missed replying to your question on the earlier version.

A VFIO device transfers the device's state, device data, and guest
memory pages pinned for DMA operations.
For example, in the case of a GPU, the device state is the GPU's
current state, saved so that it can be restored on resume, and the
device data is the contents of the onboard framebuffer. Pinned memory
is marked dirty and transferred to the target VM as part of global
dirty page tracking for RAM.
A VFIO device can add a significant amount of data to the migration
stream (depending on framebuffer size, potentially gigabytes), so the
transferred byte count is an important parameter to monitor.

> MigrationStats has much more information, and some of it is pretty
> useful to track how migration is doing, in particular whether it
> converges, and how fast.  Absent in VfioStats due to "not implemented",
> or due to "can't be done"?
>

The VFIO device migration interface is the same as RAM's migration
interface (using SaveVMHandlers). Convergence is already taken care of
by the .save_live_pending hook, where *res_precopy_only is set to the
VFIO device's pending bytes, migration->pending_bytes.

How fast - I'm not sure how that could be calculated.

Thanks,
Kirti

> Byte counts should use QAPI type 'size'.  Many existing ones don't.
> Since MigrationStats uses 'int', I'll let the migration maintainers
> decide whether they want 'int' or 'size' here.
> 
>>   ##
>>   # @MigrationInfo:
>> @@ -207,11 +219,16 @@
>>   #
>>   # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>>   #
>> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
>> +#        only returned if VFIO device is present, migration is supported by all
>> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
>> +#
>>   # Since: 0.14.0
>>   ##
>>   { 'struct': 'MigrationInfo',
>>     'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>>              '*disk': 'MigrationStats',
>> +           '*vfio': 'VfioStats',
>>              '*xbzrle-cache': 'XBZRLECacheStats',
>>              '*total-time': 'int',
>>              '*expected-downtime': 'int',
> 



* Re: [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container
  2020-06-20 20:21 ` [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container Kirti Wankhede
@ 2020-06-24  8:43   ` Cornelia Huck
  2020-06-24 18:55   ` Alex Williamson
  1 sibling, 0 replies; 66+ messages in thread
From: Cornelia Huck @ 2020-06-24  8:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel, peterx,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert, Eric Auger,
	alex.williamson, changpeng.liu, eskultet, Shameer Kolothum,
	Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:20 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added helper functions to get IOMMU info capability chain.
> Added function to get migration capability information from that
> capability chain for IOMMU container.
> 
> Similar change was proposed earlier:
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> ---
>  hw/vfio/common.c              | 91 +++++++++++++++++++++++++++++++++++++++----
>  include/hw/vfio/vfio-common.h |  3 ++
>  2 files changed, 86 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 90e9a854d82c..e0d3d4585a65 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1229,6 +1229,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      return 0;
>  }
>  
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                               struct vfio_iommu_type1_info **info)
> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +
> +    *info = g_new0(struct vfio_iommu_type1_info, 1);
> +again:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);

Do we need to guard against getting a bogus argsz value causing a huge
allocation that might fail and crash the program?
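
For instance, something like this untested sketch, with an arbitrarily
chosen sanity cap:

    if ((*info)->argsz > argsz) {
        /* don't trust an unbounded argsz reported by the kernel */
        if ((*info)->argsz > 16 * 1024 * 1024) {
            g_free(*info);
            *info = NULL;
            return -E2BIG;
        }
        argsz = (*info)->argsz;
        *info = g_realloc(*info, argsz);
        goto again;
    }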

> +        goto again;
> +    }
> +
> +    return 0;
> +}

(...)

> @@ -1314,15 +1384,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        info.argsz = sizeof(info);
> -        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> -        /* Ignore errors */
> -        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> +        ret = vfio_get_iommu_info(container, &info);

Previously, we ignored errors from the IOMMU_GET_INFO ioctl, now we
error out. Was that change intended?

> +        if (ret) {
> +                goto free_container_exit;
> +        }
> +
> +        if (!(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              /* Assume 4k IOVA page size */
> -            info.iova_pgsizes = 4096;
> +            info->iova_pgsizes = 4096;
>          }
> -        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> -        container->pgsizes = info.iova_pgsizes;
> +        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> +        container->pgsizes = info->iova_pgsizes;
> +
> +        vfio_get_iommu_info_migration(container, info);
> +        g_free(info);
>          break;
>      }
>      case VFIO_SPAPR_TCE_v2_IOMMU:
(...)




* Re: [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-22 20:28   ` Alex Williamson
@ 2020-06-24 14:29     ` Kirti Wankhede
  2020-06-24 19:49       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-24 14:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/23/2020 1:58 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:12 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> These functions save and restore PCI device specific data - config
>> space of PCI device.
>> Tested save and restore with MSI and MSIX type.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-common.h |  2 +
>>   2 files changed, 97 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 27f8872db2b1..5ba340aee1d4 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -41,6 +41,7 @@
>>   #include "trace.h"
>>   #include "qapi/error.h"
>>   #include "migration/blocker.h"
>> +#include "migration/qemu-file.h"
>>   
>>   #define TYPE_VFIO_PCI "vfio-pci"
>>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> @@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>       return OBJECT(vdev);
>>   }
>>   
>> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
>> +    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
>> +    pci_device_save(pdev, f);
>> +
>> +    qemu_put_be32(f, vdev->interrupt);
>> +    if (vdev->interrupt == VFIO_INT_MSIX) {
>> +        msix_save(pdev, f);
> 
> msix_save() checks msix_present() so shouldn't we include this
> unconditionally?  Can't there also be state in the vector table
> regardless of whether we're currently running in MSI-X mode?
> 
>> +    }
>> +}
>> +
>> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>> +{
>> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    uint32_t interrupt_type;
>> +    uint16_t pci_cmd;
>> +    int i, ret;
>> +
>> +    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
>> +    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);
> 
> This doesn't seem safe, why is it ok to indiscriminately copy these
> arrays that are configured via support or masking of various device
> features from the source to the target?
> 

Ideally, the software state at the host should be restored at the
destination - this is an attempt to do that.


> I think this still fails basic feature support negotiation.  For
> instance, Intel IGD assignment modifies emulated_config_bits and wmask
> to allow the VM BIOS to allocate fake stolen memory for the GPU and
> store this value in config space.  This support can be controlled via a
> QEMU build-time option, therefore the feature support on the target can
> be different from the source.  If this sort of feature set doesn't
> match between source and target, I think we'd want to abort the
> migration, but we don't have any provisions for that here (a physical
> IGD device is obviously just an example as it doesn't support migration
> currently).
> 

Then is it ok not to include vdev->pdev.wmask? If yes, I'll remove it.
But we need vdev->emulated_config_bits to be restored.

>> +
>> +    ret = pci_device_load(pdev, f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    /* restore pci bar configuration */
>> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
>> +    vfio_pci_write_config(pdev, PCI_COMMAND,
>> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);
> 
> s/!/~/?  Extra parenthesis too
> 
>> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
>> +        uint32_t bar = pci_default_read_config(pdev,
>> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
>> +
>> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
>> +    }
>> +
>> +    interrupt_type = qemu_get_be32(f);
>> +
>> +    if (interrupt_type == VFIO_INT_MSI) {
>> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
>> +        bool msi_64bit;
>> +
>> +        /* restore msi configuration */
>> +        msi_flags = pci_default_read_config(pdev,
>> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
>> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
>> +
> 
> What if I migrate from a device with MSI support to a device without
> MSI support, or to a device with MSI support at a different offset, who
> is responsible for triggering a migration fault?
> 

The migration compatibility check should take care of that. If there is
such a big difference in hardware, then other things would also fail.

> 
>> +        msi_addr_lo = pci_default_read_config(pdev,
>> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>> +                              msi_addr_lo, 4);
>> +
>> +        if (msi_64bit) {
>> +            msi_addr_hi = pci_default_read_config(pdev,
>> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
>> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>> +                                  msi_addr_hi, 4);
>> +        }
>> +
>> +        msi_data = pci_default_read_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                2);
>> +
>> +        vfio_pci_write_config(pdev,
>> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
>> +                msi_data, 2);
>> +
>> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
>> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
>> +    } else if (interrupt_type == VFIO_INT_MSIX) {
>> +        uint16_t offset;
>> +
>> +        offset = pci_default_read_config(pdev,
>> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
>> +        /* load enable bit and maskall bit */
>> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
>> +                              offset, 2);
>> +        msix_load(pdev, f);
> 
> Isn't this ordering backwards, or at least less efficient?  The config
> write will cause us to enable MSI-X; presumably we'd have nothing in
> the vector table though.  Then msix_load() will write the vector
> and pba tables and trigger a use notifier for each vector.  It seems
> like that would trigger a bunch of SET_IRQS ioctls as if the guest
> wrote individual unmasked vectors to the vector table, whereas if we
> setup the vector table and then enable MSI-X, we do it with one ioctl.
> 

Makes sense. Changing the order here.
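
Something along these lines (untested sketch of the reordered path):

    } else if (interrupt_type == VFIO_INT_MSIX) {
        uint16_t offset;

        /* populate the vector table before enabling MSI-X, so that
         * enabling programs all vectors with a single SET_IRQS ioctl */
        msix_load(pdev, f);

        offset = pci_default_read_config(pdev,
                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
        /* load enable bit and maskall bit */
        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
                              offset, 2);
    }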

> Also same question as above, I'm not sure who is responsible for making
> sure both devices support MSI-X and that the capability exists at the
> same place on each.  Repeat for essentially every capability.  Are we
> leaning on the migration regions to fail these migrations before we get
> here?  If so, should we be?
> 
As I mentioned above, it should be the vendor driver's responsibility to
have a compatibility check in that case.

> Also, besides BARs, the command register, and MSI & MSI-X, there must
> be other places where the guest can write config data through to the
> device.  pci_device_{save,load}() only sets QEMU's config space.
> 

From QEMU we can restore QEMU's software state. For a mediated device,
the emulated state at the vendor driver should be maintained by the
vendor driver, right?

> A couple more theoretical (probably not too distant) examples related
> to that; there's a resizable BAR capability that at some point we'll
> probably need to allow the guest to interact with (ie. manipulation of
> capability changes the reported region size for a BAR).  How would we
> support that with this save/load scheme?

Config space is saved at the start of the stop-and-copy phase, which
means vCPUs are stopped. So QEMU's config space saved in this phase
should include the change. Would there be any other software state that
needs to be saved/loaded?

>  We'll likely also have SR-IOV
> PFs assigned where we'll perhaps have support for emulating the SR-IOV
> capability to call out to a privileged userspace helper to enable VFs,
> how does this get extended to support that type of emulation?
> 
> I'm afraid that making carbon copies of emulated_config_bits, wmask,
> and invoking pci_device_save/load() doesn't address my concerns that
> saving and restoring config space between source and target really
> seems like a much more important task than outlined here.  Thanks,
> 

Are you suggesting loading the config space through
vfio_pci_write_config() from PCI_CONFIG_HEADER_SIZE up to
PCI_CONFIG_SPACE_SIZE/PCIE_CONFIG_SPACE_SIZE? I was trying to avoid
that.
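
If that is the suggestion, a minimal sketch would look like the
following (assuming every emulated offset is safe to replay through
vfio_pci_write_config(), which is exactly what I was unsure about):

    /* untested sketch: replay restored config space into the device */
    for (i = PCI_CONFIG_HEADER_SIZE; i < vdev->config_size; i += 4) {
        uint32_t val = pci_default_read_config(pdev, i, 4);

        vfio_pci_write_config(pdev, i, val, 4);
    }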

Thanks,
Kirti




* Re: [PATCH QEMU v25 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-06-20 20:21 ` [PATCH QEMU v25 09/17] vfio: Add load " Kirti Wankhede
@ 2020-06-24 18:54   ` Alex Williamson
  2020-06-25 14:16     ` Kirti Wankhede
  2020-06-26 14:54     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 18:54 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:18 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Sequence during _RESUMING device state:
> While data for this device is available, repeat the steps below:
> a. read data_offset, from where the user application should write data.
> b. write data of data_size to the migration region from data_offset.
> c. write data_size, which indicates to the vendor driver that data has
>    been written to the staging buffer.
> 
> To the user, the data is opaque. The user should write data in the same
> order as it was received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   3 +
>  2 files changed, 180 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ef1150c1ff02..faacea5327cb 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -302,6 +302,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    uint64_t data;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +        int ret;
> +
> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +        if (ret) {
> +            error_report("%s: Failed to load device config space",
> +                         vbasedev->name);
> +            return ret;
> +        }
> +    }
> +
> +    data = qemu_get_be64(f);
> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +        error_report("%s: Failed loading device config space, "
> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +        return -EINVAL;
> +    }
> +
> +    trace_vfio_load_device_config_state(vbasedev->name);
> +    return qemu_file_get_error(f);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -472,12 +499,162 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
>  
> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +
> +    if (migration->region.mmaps) {
> +        ret = vfio_region_mmap(&migration->region);
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            return ret;


Not fatal.
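
That is, something along the lines of this sketch - warn and fall back
to read/write access to the region instead of failing the load:

    if (migration->region.mmaps) {
        ret = vfio_region_mmap(&migration->region);
        if (ret) {
            /* non-fatal: fall back to slower pread/pwrite access */
            warn_report("%s: Failed to mmap VFIO migration region %d: %s",
                        vbasedev->name, migration->region.nr,
                        strerror(-ret));
        }
    }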


> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_RESUMING);
> +    if (ret) {
> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> +    }
> +    return ret;
> +}
> +
> +static int vfio_load_cleanup(void *opaque)
> +{
> +    vfio_save_cleanup(opaque);
> +    return 0;
> +}
> +
> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret = 0;
> +    uint64_t data, data_size;
> +
> +    data = qemu_get_be64(f);
> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +
> +        trace_vfio_load_state(vbasedev->name, data);
> +
> +        switch (data) {
> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> +        {
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> +        {
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;

This is essentially just a compatibility failure, right?  For instance
some future version of QEMU might include additional data between these
markers that we don't understand and therefore we fail the migration.
Thanks,

Alex

> +            }
> +            break;
> +        }
> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> +        {
> +            VFIORegion *region = &migration->region;
> +            uint64_t data_offset = 0, size;
> +
> +            data_size = size = qemu_get_be64(f);
> +            if (data_size == 0) {
> +                break;
> +            }
> +
> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +                        region->fd_offset +
> +                        offsetof(struct vfio_device_migration_info,
> +                        data_offset));
> +            if (ret != sizeof(data_offset)) {
> +                error_report("%s:Failed to get migration buffer data offset %d",
> +                             vbasedev->name, ret);
> +                return -EINVAL;
> +            }
> +
> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> +                                              data_size);
> +
> +            while (size) {
> +                void *buf = NULL;
> +                uint64_t sec_size;
> +                bool buffer_mmaped;
> +
> +                buf = get_data_section_size(region, data_offset, size,
> +                                            &sec_size);
> +
> +                buffer_mmaped = (buf != NULL);
> +
> +                if (!buffer_mmaped) {
> +                    buf = g_try_malloc(sec_size);
> +                    if (!buf) {
> +                        error_report("%s: Error allocating buffer ", __func__);
> +                        return -ENOMEM;
> +                    }
> +                }
> +
> +                qemu_get_buffer(f, buf, sec_size);
> +
> +                if (!buffer_mmaped) {
> +                    ret = pwrite(vbasedev->fd, buf, sec_size,
> +                                 region->fd_offset + data_offset);
> +                    g_free(buf);
> +
> +                    if (ret != sec_size) {
> +                        error_report("%s: Failed to set migration buffer %d",
> +                                vbasedev->name, ret);
> +                        return -EINVAL;
> +                    }
> +                }
> +                size -= sec_size;
> +                data_offset += sec_size;
> +            }
> +
> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> +                         region->fd_offset +
> +                       offsetof(struct vfio_device_migration_info, data_size));
> +            if (ret != sizeof(data_size)) {
> +                error_report("%s: Failed to set migration buffer data size %d",
> +                             vbasedev->name, ret);
> +                return -EINVAL;
> +            }
> +            break;
> +        }
> +
> +        default:
> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
> +            return -EINVAL;
> +        }
> +
> +        data = qemu_get_be64(f);
> +        ret = qemu_file_get_error(f);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
>  static SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .load_setup = vfio_load_setup,
> +    .load_cleanup = vfio_load_cleanup,
> +    .load_state = vfio_load_state,
>  };
>  
>  /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 9a1c5e17d97f..4a4bd3ba9a2a 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
>  vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>  vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
>  vfio_save_complete_precopy(const char *name) " (%s)"
> +vfio_load_device_config_state(const char *name) " (%s)"
> +vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> +vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64




* Re: [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container
  2020-06-20 20:21 ` [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container Kirti Wankhede
  2020-06-24  8:43   ` Cornelia Huck
@ 2020-06-24 18:55   ` Alex Williamson
  2020-06-25 14:09     ` Kirti Wankhede
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 18:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	Eric Auger, changpeng.liu, eskultet, Shameer Kolothum, Ken.Xue,
	jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:20 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added helper functions to get IOMMU info capability chain.
> Added function to get migration capability information from that
> capability chain for IOMMU container.
> 
> Similar change was proposed earlier:
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> ---
>  hw/vfio/common.c              | 91 +++++++++++++++++++++++++++++++++++++++----
>  include/hw/vfio/vfio-common.h |  3 ++
>  2 files changed, 86 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 90e9a854d82c..e0d3d4585a65 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1229,6 +1229,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      return 0;
>  }
>  
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                               struct vfio_iommu_type1_info **info)
> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +
> +    *info = g_new0(struct vfio_iommu_type1_info, 1);
> +again:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);
> +        goto again;
> +    }
> +
> +    return 0;
> +}
> +
> +static struct vfio_info_cap_header *
> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    void *ptr = info;
> +
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return NULL;
> +    }
> +
> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> +        if (hdr->id == id) {
> +            return hdr;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static void vfio_get_iommu_info_migration(VFIOContainer *container,
> +                                         struct vfio_iommu_type1_info *info)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_cap_migration *cap_mig;
> +
> +    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
> +    if (!hdr) {
> +        return;
> +    }
> +
> +    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
> +                            header);
> +
> +    container->dirty_pages_supported = true;
> +    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> +    container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +     * TARGET_PAGE_SIZE to mark those dirty.
> +     */
> +    assert(container->dirty_pgsizes & TARGET_PAGE_SIZE);

Why assert versus simply not support dirty page tracking and therefore
migration of contained devices?
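
For instance (untested sketch), the tail of the function could be:

    /* leave dirty_pages_supported == false rather than assert */
    if (!(cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE)) {
        return;
    }

    container->dirty_pages_supported = true;
    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
    container->dirty_pgsizes = cap_mig->pgsize_bitmap;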

> +}
> +
>  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>                                    Error **errp)
>  {
> @@ -1293,6 +1362,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container->space = space;
>      container->fd = fd;
>      container->error = NULL;
> +    container->dirty_pages_supported = false;
>      QLIST_INIT(&container->giommu_list);
>      QLIST_INIT(&container->hostwin_list);
>  
> @@ -1305,7 +1375,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      case VFIO_TYPE1v2_IOMMU:
>      case VFIO_TYPE1_IOMMU:
>      {
> -        struct vfio_iommu_type1_info info;
> +        struct vfio_iommu_type1_info *info;
>  
>          /*
>           * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
> @@ -1314,15 +1384,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>           * existing Type1 IOMMUs generally support any IOVA we're
>           * going to actually try in practice.
>           */
> -        info.argsz = sizeof(info);
> -        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> -        /* Ignore errors */
> -        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> +        ret = vfio_get_iommu_info(container, &info);
> +        if (ret) {
> +                goto free_container_exit;

This was previously not fatal, why is it now?  Thanks,

Alex
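
A sketch that keeps the old non-fatal behavior (local variable name
hypothetical):

    uint64_t pgsizes;

    ret = vfio_get_iommu_info(container, &info);
    if (ret) {
        info = NULL;    /* ignore errors, as before */
    }

    if (!info || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
        pgsizes = 4096; /* assume 4k IOVA page size */
    } else {
        pgsizes = info->iova_pgsizes;
    }
    vfio_host_win_add(container, 0, (hwaddr)-1, pgsizes);
    container->pgsizes = pgsizes;

    if (info) {
        vfio_get_iommu_info_migration(container, info);
        g_free(info);
    }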

> +        }
> +
> +        if (!(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
>              /* Assume 4k IOVA page size */
> -            info.iova_pgsizes = 4096;
> +            info->iova_pgsizes = 4096;
>          }
> -        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> -        container->pgsizes = info.iova_pgsizes;
> +        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> +        container->pgsizes = info->iova_pgsizes;
> +
> +        vfio_get_iommu_info_migration(container, info);
> +        g_free(info);
>          break;
>      }
>      case VFIO_SPAPR_TCE_v2_IOMMU:
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c78033e4149d..5a57a78ec517 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -79,6 +79,9 @@ typedef struct VFIOContainer {
>      unsigned iommu_type;
>      Error *error;
>      bool initialized;
> +    bool dirty_pages_supported;
> +    uint64_t dirty_pgsizes;
> +    uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;




* Re: [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking
  2020-06-20 20:21 ` [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
  2020-06-23 10:32   ` Cornelia Huck
@ 2020-06-24 18:55   ` Alex Williamson
  1 sibling, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 18:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:21 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
> for VFIO devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index faacea5327cb..e0fbb3a01855 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -11,6 +11,7 @@
>  #include "qemu/main-loop.h"
>  #include "qemu/cutils.h"
>  #include <linux/vfio.h>
> +#include <sys/ioctl.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
> @@ -329,6 +330,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
>  
> +static int vfio_start_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> +{
> +    int ret;
> +    VFIOContainer *container = vbasedev->group->container;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (start) {
> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +        } else {
> +            return -EINVAL;
> +        }
> +    } else {
> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }
> +    return ret;
> +}

What happens when we have a device that supports a migration region
paired with an IOMMU that doesn't report dirty page tracking, shouldn't
we have tested container->dirty_pages_supported somewhere in the
process of determining if the device is migratable?  Thanks,

Alex
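
One way to do that, as a rough sketch reusing the migration blocker
plumbing this series already adds for devices without a migration
region (field and function names assumed from this series):

    /* e.g. in vfio_migration_probe(): refuse migration up front when
     * the container cannot track dirty pages */
    if (!vbasedev->group->container->dirty_pages_supported) {
        error_setg(&vbasedev->migration_blocker,
                   "%s: IOMMU container lacks dirty page tracking",
                   vbasedev->name);
        return migrate_add_blocker(vbasedev->migration_blocker, errp);
    }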


> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -360,6 +387,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return ret;
>      }
>  
> +    ret = vfio_start_dirty_page_tracking(vbasedev, true);
> +    if (ret) {
> +        return ret;
> +    }
> +
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  
>      ret = qemu_file_get_error(f);
> @@ -375,6 +407,8 @@ static void vfio_save_cleanup(void *opaque)
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>  
> +    vfio_start_dirty_page_tracking(vbasedev, false);
> +
>      if (migration->region.mmaps) {
>          vfio_region_unmap(&migration->region);
>      }
> @@ -706,6 +740,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          if (ret) {
>              error_report("%s: Failed to set state RUNNING", vbasedev->name);
>          }
> +
> +        vfio_start_dirty_page_tracking(vbasedev, false);
>      }
>  }
>  




* Re: [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-06-20 20:21 ` [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
@ 2020-06-24 18:55   ` Alex Williamson
  2020-06-25 14:34     ` Kirti Wankhede
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 18:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:22 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Create a mapped iova list when vIOMMU is enabled. For each mapped iova,
> save the translated address. Add a node to the list on MAP and remove
> the node from the list on UNMAP.
> This list is used to track dirty pages during migration.

This seems like a lot of overhead to support the possibility that the
VM might migrate.  Is there no way we can build this when we start
migration, for example by replaying the mappings at that time?  Thanks,

Alex
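
For reference, a rough sketch of that idea - build the list only when
migration starts, by replaying the current mappings into the already
registered notifier:

    VFIOGuestIOMMU *giommu;

    /* at migration start: the MAP notifier repopulates iova_list */
    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
        memory_region_iommu_replay(giommu->iommu, &giommu->n);
    }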

 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  hw/vfio/common.c              | 58 ++++++++++++++++++++++++++++++++++++++-----
>  include/hw/vfio/vfio-common.h |  8 ++++++
>  2 files changed, 60 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index e0d3d4585a65..6921a78e9ba5 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -408,8 +408,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                           bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +                               ram_addr_t *ram_addr, bool *read_only)
>  {
>      MemoryRegion *mr;
>      hwaddr xlat;
> @@ -440,8 +440,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
>          return false;
>      }
>  
> -    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    *read_only = !writable || mr->readonly;
> +    if (vaddr) {
> +        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +    }
> +
> +    if (ram_addr) {
> +        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +    }
> +
> +    if (read_only) {
> +        *read_only = !writable || mr->readonly;
> +    }
>  
>      return true;
>  }
> @@ -451,7 +460,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>      VFIOContainer *container = giommu->container;
>      hwaddr iova = iotlb->iova + giommu->iommu_offset;
> -    bool read_only;
>      void *vaddr;
>      int ret;
>  
> @@ -467,7 +475,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      rcu_read_lock();
>  
>      if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +        ram_addr_t ram_addr;
> +        bool read_only;
> +
> +        if (!vfio_get_xlat_addr(iotlb, &vaddr, &ram_addr, &read_only)) {
>              goto out;
>          }
>          /*
> @@ -485,8 +496,28 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
>                           container, iova,
>                           iotlb->addr_mask + 1, vaddr, ret);
> +        } else {
> +            VFIOIovaRange *iova_range;
> +
> +            iova_range = g_malloc0(sizeof(*iova_range));
> +            iova_range->iova = iova;
> +            iova_range->size = iotlb->addr_mask + 1;
> +            iova_range->ram_addr = ram_addr;
> +
> +            QLIST_INSERT_HEAD(&giommu->iova_list, iova_range, next);
>          }
>      } else {
> +        VFIOIovaRange *iova_range, *tmp;
> +
> +        QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> +            if (iova_range->iova >= iova &&
> +                iova_range->iova + iova_range->size <= iova +
> +                                                       iotlb->addr_mask + 1) {
> +                QLIST_REMOVE(iova_range, next);
> +                g_free(iova_range);
> +            }
> +        }
> +
>          ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> @@ -643,6 +674,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>              g_free(giommu);
>              goto fail;
>          }
> +        QLIST_INIT(&giommu->iova_list);
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n);
>  
> @@ -741,6 +773,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>              if (MEMORY_REGION(giommu->iommu) == section->mr &&
>                  giommu->n.start == section->offset_within_region) {
> +                VFIOIovaRange *iova_range, *tmp;
> +
> +                QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, tmp) {
> +                    QLIST_REMOVE(iova_range, next);
> +                    g_free(iova_range);
> +                }
> +
>                  memory_region_unregister_iommu_notifier(section->mr,
>                                                          &giommu->n);
>                  QLIST_REMOVE(giommu, giommu_next);
> @@ -1538,6 +1577,13 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +            VFIOIovaRange *iova_range, *itmp;
> +
> +            QLIST_FOREACH_SAFE(iova_range, &giommu->iova_list, next, itmp) {
> +                QLIST_REMOVE(iova_range, next);
> +                g_free(iova_range);
> +            }
> +
>              memory_region_unregister_iommu_notifier(
>                      MEMORY_REGION(giommu->iommu), &giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5a57a78ec517..56b75e4a8bc4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -89,11 +89,19 @@ typedef struct VFIOContainer {
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
>  
> +typedef struct VFIOIovaRange {
> +    hwaddr iova;
> +    size_t size;
> +    ram_addr_t ram_addr;
> +    QLIST_ENTRY(VFIOIovaRange) next;
> +} VFIOIovaRange;
> +
>  typedef struct VFIOGuestIOMMU {
>      VFIOContainer *container;
>      IOMMUMemoryRegion *iommu;
>      hwaddr iommu_offset;
>      IOMMUNotifier n;
> +    QLIST_HEAD(, VFIOIovaRange) iova_list;
>      QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>  } VFIOGuestIOMMU;
>  




* Re: [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-06-20 20:21 ` [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
@ 2020-06-24 18:55   ` Alex Williamson
  2020-06-25 14:43     ` Kirti Wankhede
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 18:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:23 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> vfio_listener_log_sync gets the list of dirty pages from the container
> using the VFIO_IOMMU_DIRTY_PAGES ioctl and marks those pages dirty when
> all devices are stopped and saving state.
> Return early for the RAM block section of a mapped MMIO region.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c     | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |   1 +
>  2 files changed, 131 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6921a78e9ba5..0518cf228ed5 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "hw/vfio/vfio.h"
>  #include "exec/address-spaces.h"
>  #include "exec/memory.h"
> +#include "exec/ram_addr.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> @@ -38,6 +39,7 @@
>  #include "sysemu/reset.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/migration.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>  };
>  
>  /*
> + * Device state interfaces
> + */
> +
> +static bool vfio_devices_are_stopped_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {

Should this be passed the container in order to iterate
container->group_list?
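
That is, something like this sketch:

    static bool vfio_devices_are_stopped_and_saving(VFIOContainer *container)
    {
        VFIOGroup *group;
        VFIODevice *vbasedev;

        QLIST_FOREACH(group, &container->group_list, container_next) {
            QLIST_FOREACH(vbasedev, &group->device_list, next) {
                if (!(vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) ||
                    (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
                    return false;
                }
            }
        }
        return true;
    }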

> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
> +/*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> @@ -852,9 +876,115 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                 uint64_t size, ram_addr_t ram_addr)
> +{
> +    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +    struct vfio_iommu_type1_dirty_bitmap_get *range;
> +    uint64_t pages;
> +    int ret;
> +
> +    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> +
> +    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> +    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> +    range->iova = iova;
> +    range->size = size;
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> +     * TARGET_PAGE_SIZE.
> +     */
> +    range->bitmap.pgsize = TARGET_PAGE_SIZE;
> +
> +    pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> +    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                                         BITS_PER_BYTE;
> +    range->bitmap.data = g_try_malloc0(range->bitmap.size);
> +    if (!range->bitmap.data) {
> +        ret = -ENOMEM;
> +        goto err_out;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +    if (ret) {
> +        error_report("Failed to get dirty bitmap for iova: 0x%llx "
> +                "size: 0x%llx err: %d",
> +                range->iova, range->size, errno);
> +        goto err_out;
> +    }
> +
> +    cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> +                                            ram_addr, pages);
> +
> +    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> +                                range->bitmap.size, ram_addr);
> +err_out:
> +    g_free(range->bitmap.data);
> +    g_free(dbitmap);
> +
> +    return ret;
> +}
> +
> +static int vfio_sync_dirty_bitmap(MemoryListener *listener,
> +                                 MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    VFIOGuestIOMMU *giommu = NULL;
> +    ram_addr_t ram_addr;
> +    uint64_t iova, size;
> +    int ret = 0;
> +
> +    if (memory_region_is_iommu(section->mr)) {
> +
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +                giommu->n.start == section->offset_within_region) {
> +                VFIOIovaRange *iova_range;
> +
> +                QLIST_FOREACH(iova_range, &giommu->iova_list, next) {
> +                    ret = vfio_get_dirty_bitmap(container, iova_range->iova,
> +                                        iova_range->size, iova_range->ram_addr);
> +                    if (ret) {
> +                        break;
> +                    }
> +                }
> +                break;
> +            }
> +        }
> +
> +    } else {
> +        iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +        size = int128_get64(section->size);
> +
> +        ram_addr = memory_region_get_ram_addr(section->mr) +
> +                   section->offset_within_region + iova -
> +                   TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +
> +        ret = vfio_get_dirty_bitmap(container, iova, size, ram_addr);
> +    }
> +
> +    return ret;
> +}
> +
> +static void vfio_listerner_log_sync(MemoryListener *listener,
> +        MemoryRegionSection *section)
> +{
> +    if (vfio_listener_skipped_section(section)) {
> +        return;
> +    }
> +
> +    if (vfio_devices_are_stopped_and_saving()) {
> +        vfio_sync_dirty_bitmap(listener, section);
> +    }


How do we decide that this is the best policy for all devices?  For
example if a device does not support page pinning or some future means
of marking dirtied pages, this is clearly the right choice, but when
these are supported, aren't we deferring all dirty logging info until
the final stage?  Thanks,

Alex

> +}
> +
>  static const MemoryListener vfio_memory_listener = {
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> +    .log_sync = vfio_listerner_log_sync,
>  };
>  
>  static void vfio_listener_release(VFIOContainer *container)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 4a4bd3ba9a2a..c61ae4f3ead8 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -160,3 +160,4 @@ vfio_save_complete_precopy(const char *name) " (%s)"
>  vfio_load_device_config_state(const char *name) " (%s)"
>  vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>  vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64




* Re: [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-06-20 20:21 ` [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
  2020-06-23  8:25   ` Cornelia Huck
@ 2020-06-24 18:56   ` Alex Williamson
  2020-06-25 15:01     ` Kirti Wankhede
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 18:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Sun, 21 Jun 2020 01:51:24 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> With vIOMMU, an IO virtual address range can get unmapped while in the
> pre-copy phase of migration. In that case, the unmap ioctl should return
> the pages pinned in that range, and QEMU should find the corresponding
> guest physical addresses and report those dirty.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/common.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 81 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0518cf228ed5..a06b8f2f66e2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -311,11 +311,83 @@ static bool vfio_devices_are_stopped_and_saving(void)
>      return true;
>  }
>  
> +static bool vfio_devices_are_running_and_saving(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {

Same as previous, I'm curious if we should instead be looking at
container granularity.  It especially seems to make sense here where
we're unmapping from a container, so iterating every device in every
group seems excessive.

> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> +                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> +                continue;
> +            } else {
> +                return false;
> +            }

I'm also not sure about the polarity of this function, should it be if
any device is _SAVING we should report the dirty bitmap?  For example,
what if we have a set of paired failover NICs where we intend to unplug
one just prior to stopping the devices, aren't we going to lose dirtied
pages with this logic that they all must be running and saving?  Thanks,

Alex
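
As a sketch, the "any device saving" variant would be:

    static bool vfio_devices_any_running_and_saving(VFIOContainer *container)
    {
        VFIOGroup *group;
        VFIODevice *vbasedev;

        QLIST_FOREACH(group, &container->group_list, container_next) {
            QLIST_FOREACH(vbasedev, &group->device_list, next) {
                if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
                    (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
                    return true;
                }
            }
        }
        return false;
    }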

> +        }
> +    }
> +    return true;
> +}
> +
> +static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> +                                 hwaddr iova, ram_addr_t size,
> +                                 IOMMUTLBEntry *iotlb)
> +{
> +    struct vfio_iommu_type1_dma_unmap *unmap;
> +    struct vfio_bitmap *bitmap;
> +    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
> +    int ret;
> +
> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> +
> +    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> +    unmap->iova = iova;
> +    unmap->size = size;
> +    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> +    bitmap = (struct vfio_bitmap *)&unmap->data;
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
> +     * TARGET_PAGE_SIZE.
> +     */
> +
> +    bitmap->pgsize = TARGET_PAGE_SIZE;
> +    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> +                   BITS_PER_BYTE;
> +
> +    if (bitmap->size > container->max_dirty_bitmap_size) {
> +        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
> +        ret = -E2BIG;
> +        goto unmap_exit;
> +    }
> +
> +    bitmap->data = g_try_malloc0(bitmap->size);
> +    if (!bitmap->data) {
> +        ret = -ENOMEM;
> +        goto unmap_exit;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
> +    if (!ret) {
> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
> +                iotlb->translated_addr, pages);
> +    } else {
> +        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
> +    }
> +
> +    g_free(bitmap->data);
> +unmap_exit:
> +    g_free(unmap);
> +    return ret;
> +}
> +
>  /*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
>  static int vfio_dma_unmap(VFIOContainer *container,
> -                          hwaddr iova, ram_addr_t size)
> +                          hwaddr iova, ram_addr_t size,
> +                          IOMMUTLBEntry *iotlb)
>  {
>      struct vfio_iommu_type1_dma_unmap unmap = {
>          .argsz = sizeof(unmap),
> @@ -324,6 +396,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    if (iotlb && container->dirty_pages_supported &&
> +        vfio_devices_are_running_and_saving()) {
> +        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> +    }
> +
>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>          /*
>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> @@ -371,7 +448,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>       * the VGA ROM space.
>       */
>      if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
>           ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>          return 0;
>      }
> @@ -542,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>              }
>          }
>  
> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -853,7 +930,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      }
>  
>      if (try_unmap) {
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",




* Re: [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-24 14:29     ` Kirti Wankhede
@ 2020-06-24 19:49       ` Alex Williamson
  2020-06-26 12:16         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-24 19:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Wed, 24 Jun 2020 19:59:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/23/2020 1:58 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:12 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> These functions save and restore PCI device specific data - config
> >> space of PCI device.
> >> Tested save and restore with MSI and MSIX type.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
> >>   include/hw/vfio/vfio-common.h |  2 +
> >>   2 files changed, 97 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 27f8872db2b1..5ba340aee1d4 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -41,6 +41,7 @@
> >>   #include "trace.h"
> >>   #include "qapi/error.h"
> >>   #include "migration/blocker.h"
> >> +#include "migration/qemu-file.h"
> >>   
> >>   #define TYPE_VFIO_PCI "vfio-pci"
> >>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >> @@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> >>       return OBJECT(vdev);
> >>   }
> >>   
> >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +
> >> +    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> >> +    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
> >> +    pci_device_save(pdev, f);
> >> +
> >> +    qemu_put_be32(f, vdev->interrupt);
> >> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> >> +        msix_save(pdev, f);  
> > 
> > msix_save() checks msix_present() so shouldn't we include this
> > unconditionally?  Can't there also be state in the vector table
> > regardless of whether we're currently running in MSI-X mode?
> >   
> >> +    }
> >> +}
> >> +
> >> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> >> +{
> >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +    uint32_t interrupt_type;
> >> +    uint16_t pci_cmd;
> >> +    int i, ret;
> >> +
> >> +    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> >> +    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);  
> > 
> > This doesn't seem safe, why is it ok to indiscriminately copy these
> > arrays that are configured via support or masking of various device
> > features from the source to the target?
> >   
> 
> Ideally, software state at the host should be restored at the destination -
> this is the attempt to do that.

Or is it the case that both source and target should initialize these
and come up with the same result and they should be used for
validation, not just overwriting the target with the source?
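
A rough sketch of that validation alternative, for illustration only (this
is not part of the posted patch):

    /* compare the incoming arrays against locally initialized state
     * instead of overwriting it; a mismatch fails the migration */
    uint8_t *buf = g_malloc(vdev->config_size);

    qemu_get_buffer(f, buf, vdev->config_size);
    if (memcmp(buf, vdev->emulated_config_bits, vdev->config_size)) {
        error_report("%s: emulated_config_bits mismatch", vbasedev->name);
        g_free(buf);
        return -EINVAL;
    }
    g_free(buf);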

> > I think this still fails basic feature support negotiation.  For
> > instance, Intel IGD assignment modifies emulated_config_bits and wmask
> > to allow the VM BIOS to allocate fake stolen memory for the GPU and
> > store this value in config space.  This support can be controlled via a
> > QEMU build-time option, therefore the feature support on the target can
> > be different from the source.  If this sort of feature set doesn't
> > match between source and target, I think we'd want to abort the
> > migration, but we don't have any provisions for that here (a physical
> > IGD device is obviously just an example as it doesn't support migration
> > currently).
> >   
> 
> Then is it ok not to include vdev->pdev.wmask? If yes, I'll remove it.
> But we need vdev->emulated_config_bits to be restored.

It's not clear why we need emulated_config_bits copied or how we'd
handle the example I set forth above.  The existence of emulation
provided by QEMU is also emulation state.


> >> +
> >> +    ret = pci_device_load(pdev, f);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    /* retore pci bar configuration */
> >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> >> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);  
> > 
> > s/!/~/?  Extra parenthesis too
> >   
> >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> >> +        uint32_t bar = pci_default_read_config(pdev,
> >> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> >> +
> >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> >> +    }
> >> +
> >> +    interrupt_type = qemu_get_be32(f);
> >> +
> >> +    if (interrupt_type == VFIO_INT_MSI) {
> >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> >> +        bool msi_64bit;
> >> +
> >> +        /* restore msi configuration */
> >> +        msi_flags = pci_default_read_config(pdev,
> >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> >> +  
> > 
> > What if I migrate from a device with MSI support to a device without
> > MSI support, or to a device with MSI support at a different offset, who
> > is responsible for triggering a migration fault?
> >   
> 
> The migration compatibility check should take care of that. If there is
> such a big difference in hardware, then other things would also fail.


The division between what is our responsibility in QEMU and what we
hope the vendor driver handles is not very clear imo.  How do we avoid
finger pointing when things break?


> >> +        msi_addr_lo = pci_default_read_config(pdev,
> >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> >> +                              msi_addr_lo, 4);
> >> +
> >> +        if (msi_64bit) {
> >> +            msi_addr_hi = pci_default_read_config(pdev,
> >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> >> +                                  msi_addr_hi, 4);
> >> +        }
> >> +
> >> +        msi_data = pci_default_read_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                2);
> >> +
> >> +        vfio_pci_write_config(pdev,
> >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> >> +                msi_data, 2);
> >> +
> >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> >> +        uint16_t offset;
> >> +
> >> +        offset = pci_default_read_config(pdev,
> >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> >> +        /* load enable bit and maskall bit */
> >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> >> +                              offset, 2);
> >> +        msix_load(pdev, f);  
> > 
> > Isn't this ordering backwards, or at least less efficient?  The config
> > write will cause us to enable MSI-X; presumably we'd have nothing in
> > the vector table though.  Then msix_load() will write the vector
> > and pba tables and trigger a use notifier for each vector.  It seems
> > like that would trigger a bunch of SET_IRQS ioctls as if the guest
> > wrote individual unmasked vectors to the vector table, whereas if we
> > setup the vector table and then enable MSI-X, we do it with one ioctl.
> >   
> 
> Makes sense. Changing the order here.
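
For reference, a sketch of the reordered sequence, derived only from the
hunk quoted above (integration details are up to the next revision):

    /* restore the vector and PBA tables first ... */
    msix_load(pdev, f);

    /* ... then write the enable and maskall bits, so the unmasked
     * vectors are programmed with a single SET_IRQS ioctl */
    offset = pci_default_read_config(pdev,
                                     pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
    vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
                          offset, 2);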
> 
> > Also same question as above, I'm not sure who is responsible for making
> > sure both devices support MSI-X and that the capability exists at the
> > same place on each.  Repeat for essentially every capability.  Are we
> > leaning on the migration regions to fail these migrations before we get
> > here?  If so, should we be?
> >   
> As I mentioned above, it should be the vendor driver's responsibility to
> have a compatibility check in that case.


And we'd rather blindly assume the vendor driver included that
requirement than to check for ourselves?


> > Also, besides BARs, the command register, and MSI & MSI-X, there must
> > be other places where the guest can write config data through to the
> > device.  pci_device_{save,load}() only sets QEMU's config space.
> >   
> 
> From QEMU we can restore QEMU's software state. For a mediated device,
> emulated state at the vendor driver should be maintained by the vendor
> driver, right?

In this proposal we've determined that emulated_config_bits, wmask,
emulated config space, and MSI/X state are part of QEMU's state that
need to be transmitted to the target.  It therefore shouldn't be
difficult to imagine that adding support for another capability might
involve QEMU emulation as well.  How does the migration stream we've
constructed here allow such emulation state to be included?  For example
we might have a feature like IGD where we can discern the
incompatibility via differences in the emulated_config_bits and wmask,
but that's not guaranteed.

> > A couple more theoretical (probably not too distant) examples related
> > to that; there's a resizable BAR capability that at some point we'll
> > probably need to allow the guest to interact with (ie. manipulation of
> > capability changes the reported region size for a BAR).  How would we
> > support that with this save/load scheme?  
> 
> Config space is saved at the start of the stop-and-copy phase, which
> means vCPUs are stopped. So QEMU's config space saved in this phase
> should include the change. Will there be any other software state that
> would need to be saved/loaded?


There might be; it seems inevitable that there will eventually be
something that needs emulation state beyond this initial draft.  Is
this resizable BAR example another that we simply hand-wave as the
responsibility of the vendor driver?
 

> >  We'll likely also have SR-IOV
> > PFs assigned where we'll perhaps have support for emulating the SR-IOV
> > capability to call out to a privileged userspace helper to enable VFs,
> > how does this get extended to support that type of emulation?
> > 
> > I'm afraid that making carbon copies of emulated_config_bits, wmask,
> > and invoking pci_device_save/load() doesn't address my concerns that
> > saving and restoring config space between source and target really
> > seems like a much more important task than outlined here.  Thanks,
> >   
> 
> Are you suggesting loading the config space using vfio_pci_write_config()
> from PCI_CONFIG_HEADER_SIZE to
> PCI_CONFIG_SPACE_SIZE/PCIE_CONFIG_SPACE_SIZE? I was kind of avoiding that.

I don't think we can do that, even the save/restore functions in the
kernel only blindly overwrite the standard header and then use
capability specific functions elsewhere.  But I think what is missing
here is the ability to hook in support for manipulating specific
capabilities on save and restore, which might include QEMU emulation
state data outside of what's provided here.  Thanks,

Alex
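
A rough sketch of the per-capability hook idea described above; the names
and signatures below are purely illustrative, not an existing QEMU API:

    typedef struct VFIOConfigCapHandler {
        uint8_t cap_id;                           /* PCI capability ID */
        void (*save)(VFIOPCIDevice *vdev, QEMUFile *f);
        int  (*load)(VFIOPCIDevice *vdev, QEMUFile *f);
    } VFIOConfigCapHandler;

    /* MSI/MSI-X entries would wrap the logic open-coded in this patch;
     * future capabilities (resizable BAR, SR-IOV) would add entries that
     * also carry their QEMU emulation state in the migration stream */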



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats
  2020-06-23 21:16     ` Kirti Wankhede
@ 2020-06-25  5:51       ` Markus Armbruster
  0 siblings, 0 replies; 66+ messages in thread
From: Markus Armbruster @ 2020-06-25  5:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

Kirti Wankhede <kwankhede@nvidia.com> writes:

> On 6/23/2020 12:51 PM, Markus Armbruster wrote:
>> QAPI review only.
>>
>> The only changes since I reviewed v23 is the rename of VfioStats member
>> @bytes to @transferred, and the move of MigrationInfo member @vfio next
>> to @ram and @disk.  Good.  I'm copying my other questions in the hope of
>> getting answers :)
>>
>> Kirti Wankhede <kwankhede@nvidia.com> writes:
>>
>>> Added amount of bytes transferred to the target VM by all VFIO devices
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> [...]
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index d5000558c6c9..952864b05455 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -146,6 +146,18 @@
>>>               'active', 'postcopy-active', 'postcopy-paused',
>>>               'postcopy-recover', 'completed', 'failed', 'colo',
>>>               'pre-switchover', 'device', 'wait-unplug' ] }
>>> +##
>>> +# @VfioStats:
>>> +#
>>> +# Detailed VFIO devices migration statistics
>>> +#
>>> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
>>> +#
>>> +# Since: 5.1
>>> +#
>>> +##
>>> +{ 'struct': 'VfioStats',
>>> +  'data': {'transferred': 'int' } }
>>
>> Pardon my ignorance...  What exactly do VFIO devices transfer to the
>> target VM? How is that related to MigrationInfo member @ram? 
>>
>
> Sorry I missed to reply your question on earlier version.

Happens :)

> VFIO devices transfer the device's state, data from the VFIO device, and
> guest memory pages pinned for DMA operations.
> For example, in the case of a GPU, the VFIO device state is the GPU's
> current state to be saved, which will be restored during resume, and the
> device data is the data from the onboard framebuffer. Pinned memory is
> marked dirty and transferred to the target VM as part of global dirty
> page tracking for RAM.
> A VFIO device can add a significant amount of data to the migration
> stream (depending on FB size, in GBs), so the transferred byte count is
> an important parameter to monitor.

Can we work this into documentation somehow?

Have you considered adding something on VFIO migration to docs/?  Then a
link with a short description could suffice here.

>> MigrationStats has much more information, and some of it is pretty
>> useful to track how migration is doing, in particular whether it
>> converges, and how fast.  Absent in VfioStats due to "not implemented",
>> or due to "can't be done"?
>>
>
> The VFIO device migration interface is the same as RAM's migration
> interface (using SaveVMHandlers). The convergence part is already taken
> care of by the .save_live_pending hook, where *res_precopy_only is set
> from the VFIO device's pending_bytes (migration->pending_bytes).
>
> How fast - I'm not sure how this can be calculated.

My concern is providing management applications the means they need to
monitor migration.  Have you solicited input from management application
developers on what's needed?

"Same as RAM's migration" makes me suspect the same stats are needed.
This may well be a subset of the stats provided for RAM.

Missing stats we need can be added on top, as long as it's done in a
timely manner.  But we better know how to compute them, or how to do
without.
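
An illustrative computation only (no such code exists in QEMU or libvirt
today): a management application sampling query-migrate twice could derive
a VFIO transfer rate from @transferred, e.g.

    /* two MigrationInfo samples taken delta_ms apart */
    uint64_t rate_bytes_per_sec =
        (transferred_now - transferred_prev) * 1000 / delta_ms;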

> Thanks,
> Kirti
>
>> Byte counts should use QAPI type 'size'.  Many existing ones don't.
>> Since MigrationStats uses 'int', I'll let the migration maintainers
>> decide whether they want 'int' or 'size' here.
>>
>>>   ##
>>>   # @MigrationInfo:
>>> @@ -207,11 +219,16 @@
>>>   #
>>>   # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>>>   #
>>> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
>>> +#        only returned if VFIO device is present, migration is supported by all
>>> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
>>> +#
>>>   # Since: 0.14.0
>>>   ##
>>>   { 'struct': 'MigrationInfo',
>>>     'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>>>              '*disk': 'MigrationStats',
>>> +           '*vfio': 'VfioStats',
>>>              '*xbzrle-cache': 'XBZRLECacheStats',
>>>              '*total-time': 'int',
>>>              '*expected-downtime': 'int',
>>



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container
  2020-06-24 18:55   ` Alex Williamson
@ 2020-06-25 14:09     ` Kirti Wankhede
  2020-06-25 14:56       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-25 14:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	Eric Auger, changpeng.liu, eskultet, Shameer Kolothum, Ken.Xue,
	jonathan.davies, pbonzini



On 6/25/2020 12:25 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:20 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added helper functions to get IOMMU info capability chain.
>> Added function to get migration capability information from that
>> capability chain for IOMMU container.
>>
>> Similar change was proposed earlier:
>> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
>> Cc: Eric Auger <eric.auger@redhat.com>
>> ---
>>   hw/vfio/common.c              | 91 +++++++++++++++++++++++++++++++++++++++----
>>   include/hw/vfio/vfio-common.h |  3 ++
>>   2 files changed, 86 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 90e9a854d82c..e0d3d4585a65 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -1229,6 +1229,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>       return 0;
>>   }
>>   
>> +static int vfio_get_iommu_info(VFIOContainer *container,
>> +                               struct vfio_iommu_type1_info **info)
>> +{
>> +
>> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
>> +
>> +    *info = g_new0(struct vfio_iommu_type1_info, 1);
>> +again:
>> +    (*info)->argsz = argsz;
>> +
>> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
>> +        g_free(*info);
>> +        *info = NULL;
>> +        return -errno;
>> +    }
>> +
>> +    if (((*info)->argsz > argsz)) {
>> +        argsz = (*info)->argsz;
>> +        *info = g_realloc(*info, argsz);
>> +        goto again;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static struct vfio_info_cap_header *
>> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
>> +{
>> +    struct vfio_info_cap_header *hdr;
>> +    void *ptr = info;
>> +
>> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
>> +        return NULL;
>> +    }
>> +
>> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
>> +        if (hdr->id == id) {
>> +            return hdr;
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +static void vfio_get_iommu_info_migration(VFIOContainer *container,
>> +                                         struct vfio_iommu_type1_info *info)
>> +{
>> +    struct vfio_info_cap_header *hdr;
>> +    struct vfio_iommu_type1_info_cap_migration *cap_mig;
>> +
>> +    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>> +    if (!hdr) {
>> +        return;
>> +    }
>> +
>> +    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
>> +                            header);
>> +
>> +    container->dirty_pages_supported = true;
>> +    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
>> +    container->dirty_pgsizes = cap_mig->pgsize_bitmap;
>> +
>> +    /*
>> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
>> +     * TARGET_PAGE_SIZE to mark those dirty.
>> +     */
>> +    assert(container->dirty_pgsizes & TARGET_PAGE_SIZE);
> 
> Why assert versus simply not support dirty page tracking and therefore
> migration of contained devices?
> 

Ok, that can be done.
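
Something like this minimal sketch, with placement inside
vfio_get_iommu_info_migration() assumed:

    /*
     * Instead of asserting, leave dirty page tracking (and therefore
     * migration of contained devices) unsupported when the IOMMU cannot
     * report dirty bits at TARGET_PAGE_SIZE granularity.
     */
    if (!(cap_mig->pgsize_bitmap & TARGET_PAGE_SIZE)) {
        return;
    }

    container->dirty_pages_supported = true;
    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
    container->dirty_pgsizes = cap_mig->pgsize_bitmap;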

>> +}
>> +
>>   static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>                                     Error **errp)
>>   {
>> @@ -1293,6 +1362,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       container->space = space;
>>       container->fd = fd;
>>       container->error = NULL;
>> +    container->dirty_pages_supported = false;
>>       QLIST_INIT(&container->giommu_list);
>>       QLIST_INIT(&container->hostwin_list);
>>   
>> @@ -1305,7 +1375,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       case VFIO_TYPE1v2_IOMMU:
>>       case VFIO_TYPE1_IOMMU:
>>       {
>> -        struct vfio_iommu_type1_info info;
>> +        struct vfio_iommu_type1_info *info;
>>   
>>           /*
>>            * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
>> @@ -1314,15 +1384,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>            * existing Type1 IOMMUs generally support any IOVA we're
>>            * going to actually try in practice.
>>            */
>> -        info.argsz = sizeof(info);
>> -        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>> -        /* Ignore errors */
>> -        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>> +        ret = vfio_get_iommu_info(container, &info);
>> +        if (ret) {
>> +                goto free_container_exit;
> 
> This was previously not fatal, why is it now?  Thanks,
> 

Cornelia asked the same question.
Then what should be the action if the ioctl fails? Disable migration?

Thanks,
Kirti


> Alex
> 
>> +        }
>> +
>> +        if (!(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
>>               /* Assume 4k IOVA page size */
>> -            info.iova_pgsizes = 4096;
>> +            info->iova_pgsizes = 4096;
>>           }
>> -        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
>> -        container->pgsizes = info.iova_pgsizes;
>> +        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
>> +        container->pgsizes = info->iova_pgsizes;
>> +
>> +        vfio_get_iommu_info_migration(container, info);
>> +        g_free(info);
>>           break;
>>       }
>>       case VFIO_SPAPR_TCE_v2_IOMMU:
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index c78033e4149d..5a57a78ec517 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -79,6 +79,9 @@ typedef struct VFIOContainer {
>>       unsigned iommu_type;
>>       Error *error;
>>       bool initialized;
>> +    bool dirty_pages_supported;
>> +    uint64_t dirty_pgsizes;
>> +    uint64_t max_dirty_bitmap_size;
>>       unsigned long pgsizes;
>>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>       QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-06-24 18:54   ` Alex Williamson
@ 2020-06-25 14:16     ` Kirti Wankhede
  2020-06-25 14:57       ` Alex Williamson
  2020-06-26 14:54     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-25 14:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/25/2020 12:24 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:18 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Sequence  during _RESUMING device state:
>> While data for this device is available, repeat below steps:
>> a. read data_offset from where user application should write data.
>> b. write data of data_size to migration region from data_offset.
>> c. write data_size which indicates vendor driver that data is written in
>>     staging buffer.
>>
>> For user, data is opaque. User should write data in the same order as
>> received.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   3 +
>>   2 files changed, 180 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index ef1150c1ff02..faacea5327cb 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -302,6 +302,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    uint64_t data;
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>> +        int ret;
>> +
>> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
>> +        if (ret) {
>> +            error_report("%s: Failed to load device config space",
>> +                         vbasedev->name);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    data = qemu_get_be64(f);
>> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +        error_report("%s: Failed loading device config space, "
>> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
>> +        return -EINVAL;
>> +    }
>> +
>> +    trace_vfio_load_device_config_state(vbasedev->name);
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -472,12 +499,162 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       return ret;
>>   }
>>   
>> +static int vfio_load_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +
>> +    if (migration->region.mmaps) {
>> +        ret = vfio_region_mmap(&migration->region);
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.nr,
>> +                         strerror(-ret));
>> +            return ret;
> 
> 
> Not fatal.
>

As discussed on patch 07/17 of this series, it should fall back to
read/write, right?

> 
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
>> +                                   VFIO_DEVICE_STATE_RESUMING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
>> +    }
>> +    return ret;
>> +}
>> +
>> +static int vfio_load_cleanup(void *opaque)
>> +{
>> +    vfio_save_cleanup(opaque);
>> +    return 0;
>> +}
>> +
>> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret = 0;
>> +    uint64_t data, data_size;
>> +
>> +    data = qemu_get_be64(f);
>> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
>> +
>> +        trace_vfio_load_state(vbasedev->name, data);
>> +
>> +        switch (data) {
>> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>> +        {
>> +            ret = vfio_load_device_config_state(f, opaque);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +            break;
>> +        }
>> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>> +        {
>> +            data = qemu_get_be64(f);
>> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
>> +                return ret;
>> +            } else {
>> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
>> +                             vbasedev->name, data);
>> +                return -EINVAL;
> 
> This is essentially just a compatibility failure, right?  For instance
> some future version of QEMU might include additional data between these
> markers that we don't understand and therefore we fail the migration.
> 

Yes.

Thanks,
Kirti

Thanks,
> 
> Alex
> 
>> +            }
>> +            break;
>> +        }
>> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
>> +        {
>> +            VFIORegion *region = &migration->region;
>> +            uint64_t data_offset = 0, size;
>> +
>> +            data_size = size = qemu_get_be64(f);
>> +            if (data_size == 0) {
>> +                break;
>> +            }
>> +
>> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
>> +                        region->fd_offset +
>> +                        offsetof(struct vfio_device_migration_info,
>> +                        data_offset));
>> +            if (ret != sizeof(data_offset)) {
>> +                error_report("%s:Failed to get migration buffer data offset %d",
>> +                             vbasedev->name, ret);
>> +                return -EINVAL;
>> +            }
>> +
>> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
>> +                                              data_size);
>> +
>> +            while (size) {
>> +                void *buf = NULL;
>> +                uint64_t sec_size;
>> +                bool buffer_mmaped;
>> +
>> +                buf = get_data_section_size(region, data_offset, size,
>> +                                            &sec_size);
>> +
>> +                buffer_mmaped = (buf != NULL);
>> +
>> +                if (!buffer_mmaped) {
>> +                    buf = g_try_malloc(sec_size);
>> +                    if (!buf) {
>> +                        error_report("%s: Error allocating buffer ", __func__);
>> +                        return -ENOMEM;
>> +                    }
>> +                }
>> +
>> +                qemu_get_buffer(f, buf, sec_size);
>> +
>> +                if (!buffer_mmaped) {
>> +                    ret = pwrite(vbasedev->fd, buf, sec_size,
>> +                                 region->fd_offset + data_offset);
>> +                    g_free(buf);
>> +
>> +                    if (ret != sec_size) {
>> +                        error_report("%s: Failed to set migration buffer %d",
>> +                                vbasedev->name, ret);
>> +                        return -EINVAL;
>> +                    }
>> +                }
>> +                size -= sec_size;
>> +                data_offset += sec_size;
>> +            }
>> +
>> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
>> +                         region->fd_offset +
>> +                       offsetof(struct vfio_device_migration_info, data_size));
>> +            if (ret != sizeof(data_size)) {
>> +                error_report("%s: Failed to set migration buffer data size %d",
>> +                             vbasedev->name, ret);
>> +                return -EINVAL;
>> +            }
>> +            break;
>> +        }
>> +
>> +        default:
>> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>> +            return -EINVAL;
>> +        }
>> +
>> +        data = qemu_get_be64(f);
>> +        ret = qemu_file_get_error(f);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>>       .save_live_pending = vfio_save_pending,
>>       .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>> +    .load_setup = vfio_load_setup,
>> +    .load_cleanup = vfio_load_cleanup,
>> +    .load_state = vfio_load_state,
>>   };
>>   
>>   /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 9a1c5e17d97f..4a4bd3ba9a2a 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
>>   vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>>   vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
>>   vfio_save_complete_precopy(const char *name) " (%s)"
>> +vfio_load_device_config_state(const char *name) " (%s)"
>> +vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
>> +vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-06-24 18:55   ` Alex Williamson
@ 2020-06-25 14:34     ` Kirti Wankhede
  2020-06-25 17:40       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-25 14:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/25/2020 12:25 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:22 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Create mapped iova list when vIOMMU is enabled. For each mapped iova
>> save translated address. Add node to list on MAP and remove node from
>> list on UNMAP.
>> This list is used to track dirty pages during migration.
> 
> This seems like a lot of overhead to support that the VM might migrate.
> Is there no way we can build this when we start migration, for example
> replaying the mappings at that time?  Thanks,
> 

In my previous version I tried to go through the whole range and find
valid iotlb entries, as below:

+        if (memory_region_is_iommu(section->mr)) {
+            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
+                                                  true, MEMTXATTRS_UNSPECIFIED);

When a mapping doesn't exist, QEMU throws errors such as these:

qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
(iova=0x0, level=0x3, slpte=0x0, write=1)
qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
(dev=00:03:00, iova=0x0)
qemu-system-x86_64: New fault is not recorded due to compression of faults

Secondly, it iterates through the whole range at IOMMU page size
granularity, which is 4K, so it takes a long time, resulting in large
downtime. With this optimization, downtime with vIOMMU is reduced
significantly.

As another option, I will try to create this list only if migration is
supported. A sketch of the list node follows.
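
The field names below follow the usage in patch 14's
vfio_sync_dirty_bitmap(); the rest is an assumption about the eventual
implementation:

    typedef struct VFIOIovaRange {
        hwaddr iova;                       /* mapped IO virtual address */
        uint64_t size;                     /* size of the mapping */
        ram_addr_t ram_addr;               /* translated guest RAM address */
        QLIST_ENTRY(VFIOIovaRange) next;   /* linked on giommu->iova_list */
    } VFIOIovaRange;

    /* nodes would be added from the MAP notifier and removed on UNMAP */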

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-06-24 18:55   ` Alex Williamson
@ 2020-06-25 14:43     ` Kirti Wankhede
  2020-06-25 17:57       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-25 14:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/25/2020 12:25 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:23 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> vfio_listener_log_sync gets list of dirty pages from container using
>> VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
>> devices are stopped and saving state.
>> Return early for the RAM block section of mapped MMIO region.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/common.c     | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |   1 +
>>   2 files changed, 131 insertions(+)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 6921a78e9ba5..0518cf228ed5 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -29,6 +29,7 @@
>>   #include "hw/vfio/vfio.h"
>>   #include "exec/address-spaces.h"
>>   #include "exec/memory.h"
>> +#include "exec/ram_addr.h"
>>   #include "hw/hw.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> @@ -38,6 +39,7 @@
>>   #include "sysemu/reset.h"
>>   #include "trace.h"
>>   #include "qapi/error.h"
>> +#include "migration/migration.h"
>>   
>>   VFIOGroupList vfio_group_list =
>>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
>>   };
>>   
>>   /*
>> + * Device state interfaces
>> + */
>> +
>> +static bool vfio_devices_are_stopped_and_saving(void)
>> +{
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> 
> Should this be passed the container in order to iterate
> container->group_list?
> 
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
>> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
>> +                continue;
>> +            } else {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
>> +/*
>>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>>    */
>>   static int vfio_dma_unmap(VFIOContainer *container,
>> @@ -852,9 +876,115 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       }
>>   }
>>   
>> +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> +                                 uint64_t size, ram_addr_t ram_addr)
>> +{
>> +    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>> +    struct vfio_iommu_type1_dirty_bitmap_get *range;
>> +    uint64_t pages;
>> +    int ret;
>> +
>> +    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
>> +
>> +    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
>> +    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
>> +    range->iova = iova;
>> +    range->size = size;
>> +
>> +    /*
>> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
>> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
>> +     * TARGET_PAGE_SIZE.
>> +     */
>> +    range->bitmap.pgsize = TARGET_PAGE_SIZE;
>> +
>> +    pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
>> +    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>> +                                         BITS_PER_BYTE;
>> +    range->bitmap.data = g_try_malloc0(range->bitmap.size);
>> +    if (!range->bitmap.data) {
>> +        ret = -ENOMEM;
>> +        goto err_out;
>> +    }
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
>> +    if (ret) {
>> +        error_report("Failed to get dirty bitmap for iova: 0x%llx "
>> +                "size: 0x%llx err: %d",
>> +                range->iova, range->size, errno);
>> +        goto err_out;
>> +    }
>> +
>> +    cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
>> +                                            ram_addr, pages);
>> +
>> +    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
>> +                                range->bitmap.size, ram_addr);
>> +err_out:
>> +    g_free(range->bitmap.data);
>> +    g_free(dbitmap);
>> +
>> +    return ret;
>> +}
>> +
>> +static int vfio_sync_dirty_bitmap(MemoryListener *listener,
>> +                                 MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    VFIOGuestIOMMU *giommu = NULL;
>> +    ram_addr_t ram_addr;
>> +    uint64_t iova, size;
>> +    int ret = 0;
>> +
>> +    if (memory_region_is_iommu(section->mr)) {
>> +
>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
>> +                giommu->n.start == section->offset_within_region) {
>> +                VFIOIovaRange *iova_range;
>> +
>> +                QLIST_FOREACH(iova_range, &giommu->iova_list, next) {
>> +                    ret = vfio_get_dirty_bitmap(container, iova_range->iova,
>> +                                        iova_range->size, iova_range->ram_addr);
>> +                    if (ret) {
>> +                        break;
>> +                    }
>> +                }
>> +                break;
>> +            }
>> +        }
>> +
>> +    } else {
>> +        iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +        size = int128_get64(section->size);
>> +
>> +        ram_addr = memory_region_get_ram_addr(section->mr) +
>> +                   section->offset_within_region + iova -
>> +                   TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +
>> +        ret = vfio_get_dirty_bitmap(container, iova, size, ram_addr);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static void vfio_listerner_log_sync(MemoryListener *listener,
>> +        MemoryRegionSection *section)
>> +{
>> +    if (vfio_listener_skipped_section(section)) {
>> +        return;
>> +    }
>> +
>> +    if (vfio_devices_are_stopped_and_saving()) {
>> +        vfio_sync_dirty_bitmap(listener, section);
>> +    }
> 
> 
> How do we decide that this is the best policy for all devices?  For
> example if a device does not support page pinning or some future means
> of marking dirtied pages, this is clearly the right choice, but when
> these are supported, aren't we deferring all dirty logging info until
> the final stage?  Thanks,
> 

Yes, for now we are deferring all dirty logging to the stop-and-copy
phase. In the future, whenever hardware support for dirty page tracking
gets added, a flag will be added to the migration capability in the
VFIO_IOMMU_GET_INFO capability list. Based on that we can decide to get
dirty pages in an earlier phase of migration, as sketched below.
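
This is hypothetical only; the capability flag below is invented for
illustration and is not part of this series:

    VFIOContainer *container = container_of(listener, VFIOContainer,
                                            listener);

    /* with hardware dirty tracking, log_sync could also run during
     * pre-copy instead of only at stop-and-copy */
    if (vfio_devices_are_stopped_and_saving() ||
        container->hw_dirty_tracking /* hypothetical flag */) {
        vfio_sync_dirty_bitmap(listener, section);
    }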

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container
  2020-06-25 14:09     ` Kirti Wankhede
@ 2020-06-25 14:56       ` Alex Williamson
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2020-06-25 14:56 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	Eric Auger, changpeng.liu, eskultet, Shameer Kolothum, Ken.Xue,
	jonathan.davies, pbonzini

On Thu, 25 Jun 2020 19:39:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/25/2020 12:25 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:20 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Added helper functions to get IOMMU info capability chain.
> >> Added function to get migration capability information from that
> >> capability chain for IOMMU container.
> >>
> >> Similar change was proposed earlier:
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> >> Cc: Eric Auger <eric.auger@redhat.com>
> >> ---
> >>   hw/vfio/common.c              | 91 +++++++++++++++++++++++++++++++++++++++----
> >>   include/hw/vfio/vfio-common.h |  3 ++
> >>   2 files changed, 86 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 90e9a854d82c..e0d3d4585a65 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -1229,6 +1229,75 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>       return 0;
> >>   }
> >>   
> >> +static int vfio_get_iommu_info(VFIOContainer *container,
> >> +                               struct vfio_iommu_type1_info **info)
> >> +{
> >> +
> >> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> >> +
> >> +    *info = g_new0(struct vfio_iommu_type1_info, 1);
> >> +again:
> >> +    (*info)->argsz = argsz;
> >> +
> >> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> >> +        g_free(*info);
> >> +        *info = NULL;
> >> +        return -errno;
> >> +    }
> >> +
> >> +    if (((*info)->argsz > argsz)) {
> >> +        argsz = (*info)->argsz;
> >> +        *info = g_realloc(*info, argsz);
> >> +        goto again;
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static struct vfio_info_cap_header *
> >> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> >> +{
> >> +    struct vfio_info_cap_header *hdr;
> >> +    void *ptr = info;
> >> +
> >> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> >> +        return NULL;
> >> +    }
> >> +
> >> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> >> +        if (hdr->id == id) {
> >> +            return hdr;
> >> +        }
> >> +    }
> >> +
> >> +    return NULL;
> >> +}
> >> +
> >> +static void vfio_get_iommu_info_migration(VFIOContainer *container,
> >> +                                         struct vfio_iommu_type1_info *info)
> >> +{
> >> +    struct vfio_info_cap_header *hdr;
> >> +    struct vfio_iommu_type1_info_cap_migration *cap_mig;
> >> +
> >> +    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
> >> +    if (!hdr) {
> >> +        return;
> >> +    }
> >> +
> >> +    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
> >> +                            header);
> >> +
> >> +    container->dirty_pages_supported = true;
> >> +    container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> >> +    container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> >> +
> >> +    /*
> >> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> >> +     * TARGET_PAGE_SIZE to mark those dirty.
> >> +     */
> >> +    assert(container->dirty_pgsizes & TARGET_PAGE_SIZE);  
> > 
> > Why assert versus simply not support dirty page tracking and therefore
> > migration of contained devices?
> >   
> 
> Ok, that can be done.
> 
> >> +}
> >> +
> >>   static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >>                                     Error **errp)
> >>   {
> >> @@ -1293,6 +1362,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >>       container->space = space;
> >>       container->fd = fd;
> >>       container->error = NULL;
> >> +    container->dirty_pages_supported = false;
> >>       QLIST_INIT(&container->giommu_list);
> >>       QLIST_INIT(&container->hostwin_list);
> >>   
> >> @@ -1305,7 +1375,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >>       case VFIO_TYPE1v2_IOMMU:
> >>       case VFIO_TYPE1_IOMMU:
> >>       {
> >> -        struct vfio_iommu_type1_info info;
> >> +        struct vfio_iommu_type1_info *info;
> >>   
> >>           /*
> >>            * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
> >> @@ -1314,15 +1384,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >>            * existing Type1 IOMMUs generally support any IOVA we're
> >>            * going to actually try in practice.
> >>            */
> >> -        info.argsz = sizeof(info);
> >> -        ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
> >> -        /* Ignore errors */
> >> -        if (ret || !(info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >> +        ret = vfio_get_iommu_info(container, &info);
> >> +        if (ret) {
> >> +                goto free_container_exit;  
> > 
> > This was previously not fatal, why is it now?  Thanks,
> >   
> 
> Cornelia asked the same question.
> Then what should be the action if the ioctl fails? Disable migration?

Yes, new features shouldn't impose constraints not previously present.
Thanks,

Alex
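
A minimal sketch of the non-fatal variant, reverting to the old "ignore
errors" behavior of the hunk quoted below when the ioctl fails:

    ret = vfio_get_iommu_info(container, &info);
    if (ret) {
        /* ignore errors as before: assume a 4k IOVA page size and simply
         * skip probing the migration capability */
        vfio_host_win_add(container, 0, (hwaddr)-1, 4096);
        container->pgsizes = 4096;
        break;
    }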
   
> >> +        }
> >> +
> >> +        if (!(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
> >>               /* Assume 4k IOVA page size */
> >> -            info.iova_pgsizes = 4096;
> >> +            info->iova_pgsizes = 4096;
> >>           }
> >> -        vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> >> -        container->pgsizes = info.iova_pgsizes;
> >> +        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> >> +        container->pgsizes = info->iova_pgsizes;
> >> +
> >> +        vfio_get_iommu_info_migration(container, info);
> >> +        g_free(info);
> >>           break;
> >>       }
> >>       case VFIO_SPAPR_TCE_v2_IOMMU:
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index c78033e4149d..5a57a78ec517 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -79,6 +79,9 @@ typedef struct VFIOContainer {
> >>       unsigned iommu_type;
> >>       Error *error;
> >>       bool initialized;
> >> +    bool dirty_pages_supported;
> >> +    uint64_t dirty_pgsizes;
> >> +    uint64_t max_dirty_bitmap_size;
> >>       unsigned long pgsizes;
> >>       QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>       QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;  
> >   
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-06-25 14:16     ` Kirti Wankhede
@ 2020-06-25 14:57       ` Alex Williamson
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2020-06-25 14:57 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Thu, 25 Jun 2020 19:46:22 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/25/2020 12:24 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:18 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Sequence  during _RESUMING device state:
> >> While data for this device is available, repeat below steps:
> >> a. read data_offset from where user application should write data.
> >> b. write data of data_size to migration region from data_offset.
> >> c. write data_size which indicates vendor driver that data is written in
> >>     staging buffer.
> >>
> >> For user, data is opaque. User should write data in the same order as
> >> received.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> ---
> >>   hw/vfio/migration.c  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |   3 +
> >>   2 files changed, 180 insertions(+)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index ef1150c1ff02..faacea5327cb 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -302,6 +302,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >>       return qemu_file_get_error(f);
> >>   }
> >>   
> >> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    uint64_t data;
> >> +
> >> +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> >> +        int ret;
> >> +
> >> +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to load device config space",
> >> +                         vbasedev->name);
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    data = qemu_get_be64(f);
> >> +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +        error_report("%s: Failed loading device config space, "
> >> +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    trace_vfio_load_device_config_state(vbasedev->name);
> >> +    return qemu_file_get_error(f);
> >> +}
> >> +
> >>   /* ---------------------------------------------------------------------- */
> >>   
> >>   static int vfio_save_setup(QEMUFile *f, void *opaque)
> >> @@ -472,12 +499,162 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >>       return ret;
> >>   }
> >>   
> >> +static int vfio_load_setup(QEMUFile *f, void *opaque)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret = 0;
> >> +
> >> +    if (migration->region.mmaps) {
> >> +        ret = vfio_region_mmap(&migration->region);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> >> +                         vbasedev->name, migration->region.nr,
> >> +                         strerror(-ret));
> >> +            return ret;  
> > 
> > 
> > Not fatal.
> >  
> 
> As discussed on patch 07/17 of this series, it should fall back to
> read/write, right?

Yes, it's worth an error_report() but not a migration abort imo.
Thanks,

Alex
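
A minimal sketch of that non-fatal variant of vfio_load_setup()'s mmap
handling, assuming the region read/write path is used when mmaps fail:

    if (migration->region.mmaps) {
        ret = vfio_region_mmap(&migration->region);
        if (ret) {
            /* not fatal: report and fall back to read/write access of
             * the migration region */
            error_report("%s: Failed to mmap VFIO migration region %d: %s",
                         vbasedev->name, migration->region.nr,
                         strerror(-ret));
        }
    }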

> >> +        }
> >> +    }
> >> +
> >> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> >> +                                   VFIO_DEVICE_STATE_RESUMING);
> >> +    if (ret) {
> >> +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> >> +    }
> >> +    return ret;
> >> +}
> >> +
> >> +static int vfio_load_cleanup(void *opaque)
> >> +{
> >> +    vfio_save_cleanup(opaque);
> >> +    return 0;
> >> +}
> >> +
> >> +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +    int ret = 0;
> >> +    uint64_t data, data_size;
> >> +
> >> +    data = qemu_get_be64(f);
> >> +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> >> +
> >> +        trace_vfio_load_state(vbasedev->name, data);
> >> +
> >> +        switch (data) {
> >> +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> >> +        {
> >> +            ret = vfio_load_device_config_state(f, opaque);
> >> +            if (ret) {
> >> +                return ret;
> >> +            }
> >> +            break;
> >> +        }
> >> +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> >> +        {
> >> +            data = qemu_get_be64(f);
> >> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> >> +                return ret;
> >> +            } else {
> >> +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> >> +                             vbasedev->name, data);
> >> +                return -EINVAL;  
> > 
> > This is essentially just a compatibility failure, right?  For instance
> > some future version of QEMU might include additional data between these
> > markers that we don't understand and therefore we fail the migration.
> >   
> 
> Yes.
> 
> Thanks,
> Kirti
> 
> Thanks,
> > 
> > Alex
> >   
> >> +            }
> >> +            break;
> >> +        }
> >> +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> >> +        {
> >> +            VFIORegion *region = &migration->region;
> >> +            uint64_t data_offset = 0, size;
> >> +
> >> +            data_size = size = qemu_get_be64(f);
> >> +            if (data_size == 0) {
> >> +                break;
> >> +            }
> >> +
> >> +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +                        region->fd_offset +
> >> +                        offsetof(struct vfio_device_migration_info,
> >> +                        data_offset));
> >> +            if (ret != sizeof(data_offset)) {
> >> +                error_report("%s:Failed to get migration buffer data offset %d",
> >> +                             vbasedev->name, ret);
> >> +                return -EINVAL;
> >> +            }
> >> +
> >> +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> >> +                                              data_size);
> >> +
> >> +            while (size) {
> >> +                void *buf = NULL;
> >> +                uint64_t sec_size;
> >> +                bool buffer_mmaped;
> >> +
> >> +                buf = get_data_section_size(region, data_offset, size,
> >> +                                            &sec_size);
> >> +
> >> +                buffer_mmaped = (buf != NULL);
> >> +
> >> +                if (!buffer_mmaped) {
> >> +                    buf = g_try_malloc(sec_size);
> >> +                    if (!buf) {
> >> +                        error_report("%s: Error allocating buffer ", __func__);
> >> +                        return -ENOMEM;
> >> +                    }
> >> +                }
> >> +
> >> +                qemu_get_buffer(f, buf, sec_size);
> >> +
> >> +                if (!buffer_mmaped) {
> >> +                    ret = pwrite(vbasedev->fd, buf, sec_size,
> >> +                                 region->fd_offset + data_offset);
> >> +                    g_free(buf);
> >> +
> >> +                    if (ret != sec_size) {
> >> +                        error_report("%s: Failed to set migration buffer %d",
> >> +                                vbasedev->name, ret);
> >> +                        return -EINVAL;
> >> +                    }
> >> +                }
> >> +                size -= sec_size;
> >> +                data_offset += sec_size;
> >> +            }
> >> +
> >> +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> >> +                         region->fd_offset +
> >> +                       offsetof(struct vfio_device_migration_info, data_size));
> >> +            if (ret != sizeof(data_size)) {
> >> +                error_report("%s: Failed to set migration buffer data size %d",
> >> +                             vbasedev->name, ret);
> >> +                return -EINVAL;
> >> +            }
> >> +            break;
> >> +        }
> >> +
> >> +        default:
> >> +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
> >> +            return -EINVAL;
> >> +        }
> >> +
> >> +        data = qemu_get_be64(f);
> >> +        ret = qemu_file_get_error(f);
> >> +        if (ret) {
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >>   static SaveVMHandlers savevm_vfio_handlers = {
> >>       .save_setup = vfio_save_setup,
> >>       .save_cleanup = vfio_save_cleanup,
> >>       .save_live_pending = vfio_save_pending,
> >>       .save_live_iterate = vfio_save_iterate,
> >>       .save_live_complete_precopy = vfio_save_complete_precopy,
> >> +    .load_setup = vfio_load_setup,
> >> +    .load_cleanup = vfio_load_cleanup,
> >> +    .load_state = vfio_load_state,
> >>   };
> >>   
> >>   /* ---------------------------------------------------------------------- */
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 9a1c5e17d97f..4a4bd3ba9a2a 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
> >>   vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> >>   vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
> >>   vfio_save_complete_precopy(const char *name) " (%s)"
> >> +vfio_load_device_config_state(const char *name) " (%s)"
> >> +vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> >> +vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64  
> >   
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-06-24 18:56   ` Alex Williamson
@ 2020-06-25 15:01     ` Kirti Wankhede
  2020-06-25 19:18       ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Kirti Wankhede @ 2020-06-25 15:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini



On 6/25/2020 12:26 AM, Alex Williamson wrote:
> On Sun, 21 Jun 2020 01:51:24 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> With vIOMMU, IO virtual address range can get unmapped while in pre-copy
>> phase of migration. In that case, unmap ioctl should return pages pinned
>> in that range and QEMU should find their corresponding guest physical
>> addresses and report those dirty.
>>
>> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/common.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 81 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 0518cf228ed5..a06b8f2f66e2 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -311,11 +311,83 @@ static bool vfio_devices_are_stopped_and_saving(void)
>>       return true;
>>   }
>>   
>> +static bool vfio_devices_are_running_and_saving(void)
>> +{
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> 
> Same as previous, I'm curious if we should instead be looking at
> container granularity.  It especially seems to make sense here where
> we're unmapping from a container, so iterating every device in every
> group seems excessive.
> 

Changing it to take the container argument.

>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
>> +                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
>> +                continue;
>> +            } else {
>> +                return false;
>> +            }
> 
> I'm also not sure about the polarity of this function, should it be if
> any device is _SAVING we should report the dirty bitmap?  For example,
> what if we have a set of paired failover NICs where we intend to unplug
> one just prior to stopping the devices, aren't we going to lose dirtied
> pages with this logic that they all must be running and saving?  Thanks,
> 

If migration is initiated, is device unplug allowed? Ideally it
shouldn't be. If it is, then how does QEMU handle the data stream of a
device which doesn't exist at the destination?

The _SAVING flag is set during the pre-copy and stop-and-copy phases. Here we
only want to track pages which are unmapped during the pre-copy phase, i.e.
when vCPUs are running. In case of VM suspend/saveVM, there is no pre-copy
phase, but ideally we shouldn't see unmaps when vCPUs are stopped, right?
But still, to be on the safer side, since we know the exact phase, I would
prefer to check for the _SAVING and _RUNNING flags.
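
With the container passed in, it would look roughly like this (untested
sketch; assuming the existing container->group_list / VFIOGroup
container_next fields):

static bool vfio_devices_are_running_and_saving(VFIOContainer *container)
{
    VFIOGroup *group;
    VFIODevice *vbasedev;

    QLIST_FOREACH(group, &container->group_list, container_next) {
        QLIST_FOREACH(vbasedev, &group->device_list, next) {
            /* every device must be in pre-copy: _SAVING and _RUNNING set */
            if (!(vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) ||
                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
                return false;
            }
        }
    }
    return true;
}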

Thanks,
Kirti


> Alex
> 
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
>> +static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>> +                                 hwaddr iova, ram_addr_t size,
>> +                                 IOMMUTLBEntry *iotlb)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap *unmap;
>> +    struct vfio_bitmap *bitmap;
>> +    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
>> +    int ret;
>> +
>> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
>> +
>> +    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
>> +    unmap->iova = iova;
>> +    unmap->size = size;
>> +    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
>> +    bitmap = (struct vfio_bitmap *)&unmap->data;
>> +
>> +    /*
>> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
>> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
>> +     * TARGET_PAGE_SIZE.
>> +     */
>> +
>> +    bitmap->pgsize = TARGET_PAGE_SIZE;
>> +    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>> +                   BITS_PER_BYTE;
>> +
>> +    if (bitmap->size > container->max_dirty_bitmap_size) {
>> +        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
>> +        ret = -E2BIG;
>> +        goto unmap_exit;
>> +    }
>> +
>> +    bitmap->data = g_try_malloc0(bitmap->size);
>> +    if (!bitmap->data) {
>> +        ret = -ENOMEM;
>> +        goto unmap_exit;
>> +    }
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
>> +    if (!ret) {
>> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
>> +                iotlb->translated_addr, pages);
>> +    } else {
>> +        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
>> +    }
>> +
>> +    g_free(bitmap->data);
>> +unmap_exit:
>> +    g_free(unmap);
>> +    return ret;
>> +}
>> +
>>   /*
>>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>>    */
>>   static int vfio_dma_unmap(VFIOContainer *container,
>> -                          hwaddr iova, ram_addr_t size)
>> +                          hwaddr iova, ram_addr_t size,
>> +                          IOMMUTLBEntry *iotlb)
>>   {
>>       struct vfio_iommu_type1_dma_unmap unmap = {
>>           .argsz = sizeof(unmap),
>> @@ -324,6 +396,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>           .size = size,
>>       };
>>   
>> +    if (iotlb && container->dirty_pages_supported &&
>> +        vfio_devices_are_running_and_saving()) {
>> +        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> +    }
>> +
>>       while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>>           /*
>>            * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
>> @@ -371,7 +448,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>        * the VGA ROM space.
>>        */
>>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
>> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
>>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>>           return 0;
>>       }
>> @@ -542,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>               }
>>           }
>>   
>> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
>>           if (ret) {
>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>> @@ -853,7 +930,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>       }
>>   
>>       if (try_unmap) {
>> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
>> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>>           if (ret) {
>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%m)",
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-06-25 14:34     ` Kirti Wankhede
@ 2020-06-25 17:40       ` Alex Williamson
  2020-06-26 14:43         ` Peter Xu
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-25 17:40 UTC (permalink / raw)
  To: Kirti Wankhede, peterx
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk, pasic,
	felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Thu, 25 Jun 2020 20:04:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/25/2020 12:25 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:22 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Create mapped iova list when vIOMMU is enabled. For each mapped iova
> >> save translated address. Add node to list on MAP and remove node from
> >> list on UNMAP.
> >> This list is used to track dirty pages during migration.  
> > 
> > This seems like a lot of overhead to support that the VM might migrate.
> > Is there no way we can build this when we start migration, for example
> > replaying the mappings at that time?  Thanks,
> >   
> 
> In my previous version I tried to go through the whole range and find a
> valid iotlb, as below:
> 
> +        if (memory_region_is_iommu(section->mr)) {
> +            iotlb = address_space_get_iotlb_entry(container->space->as,
> +                                                  iova, true,
> +                                                  MEMTXATTRS_UNSPECIFIED);
> 
> When a mapping doesn't exist, QEMU throws errors as below:
> 
> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> (iova=0x0, level=0x3, slpte=0x0, write=1)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> (dev=00:03:00, iova=0x0)
> qemu-system-x86_64: New fault is not recorded due to compression of faults

My assumption would have been that we use the replay mechanism, which
is known to work because we need to use it when we hot-add a device.
We'd make use of iommu_notifier_init() to create a new handler for this
purpose, then we'd walk our container->giommu_list and call
memory_region_iommu_replay() for each.

Peter, does this sound like the right approach to you?
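
Something along these lines (rough, untested sketch; the map handler name
is made up):

static void vfio_container_replay_iommu(VFIOContainer *container)
{
    VFIOGuestIOMMU *giommu;

    QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
        IOMMUNotifier n;

        /* collect only MAP events into a migration-time handler */
        iommu_notifier_init(&n, vfio_iommu_migration_map_notify,
                            IOMMU_NOTIFIER_MAP, giommu->n.start,
                            giommu->n.end, giommu->n.iommu_idx);
        memory_region_iommu_replay(giommu->iommu, &n);
    }
}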

> Secondly, it iterates through the whole range with IOMMU page-size
> granularity, which is 4K, so it takes a long time, resulting in large
> downtime. With this optimization, downtime with vIOMMU is reduced
> significantly.

Right, but we amortize that overhead and the resulting bloat across the
99.9999% of the time that we're not migrating.  I wonder if we could
startup another thread to handle this when we enable dirty logging.  We
don't really need the result until we start processing the dirty
bitmap, right?  Also, if we're dealing with this many separate pages,
shouldn't we be using a tree rather than a list to give us O(logN)
rather than O(N)?
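
With GLib that could be a GTree keyed on the iova, e.g. (sketch; the
iova_tree field is a made-up name):

static gint vfio_iova_cmp(gconstpointer a, gconstpointer b, gpointer data)
{
    const hwaddr *iova_a = a, *iova_b = b;

    return *iova_a < *iova_b ? -1 : (*iova_a > *iova_b ? 1 : 0);
}

    /* keys and values are freed automatically on removal */
    giommu->iova_tree = g_tree_new_full(vfio_iova_cmp, NULL, g_free, g_free);

    /* O(logN) insert on MAP; g_tree_remove() on UNMAP is O(logN) too */
    g_tree_insert(giommu->iova_tree, g_memdup(&iova, sizeof(iova)), range);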
 
> Another option I will try: check whether migration is supported, and only
> then create this list.

Wouldn't we still have problems if we start with a guest IOMMU domain
with a device that doesn't support migration, hot-add a device that
does support migration, then hot-remove the original device?  Seems
like our list would only be complete from the point the migration device
was added.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages
  2020-06-25 14:43     ` Kirti Wankhede
@ 2020-06-25 17:57       ` Alex Williamson
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2020-06-25 17:57 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Thu, 25 Jun 2020 20:13:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/25/2020 12:25 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:23 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> vfio_listener_log_sync gets list of dirty pages from container using
> >> VFIO_IOMMU_GET_DIRTY_BITMAP ioctl and mark those pages dirty when all
> >> devices are stopped and saving state.
> >> Return early for the RAM block section of mapped MMIO region.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/common.c     | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |   1 +
> >>   2 files changed, 131 insertions(+)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 6921a78e9ba5..0518cf228ed5 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -29,6 +29,7 @@
> >>   #include "hw/vfio/vfio.h"
> >>   #include "exec/address-spaces.h"
> >>   #include "exec/memory.h"
> >> +#include "exec/ram_addr.h"
> >>   #include "hw/hw.h"
> >>   #include "qemu/error-report.h"
> >>   #include "qemu/main-loop.h"
> >> @@ -38,6 +39,7 @@
> >>   #include "sysemu/reset.h"
> >>   #include "trace.h"
> >>   #include "qapi/error.h"
> >> +#include "migration/migration.h"
> >>   
> >>   VFIOGroupList vfio_group_list =
> >>       QLIST_HEAD_INITIALIZER(vfio_group_list);
> >> @@ -288,6 +290,28 @@ const MemoryRegionOps vfio_region_ops = {
> >>   };
> >>   
> >>   /*
> >> + * Device state interfaces
> >> + */
> >> +
> >> +static bool vfio_devices_are_stopped_and_saving(void)
> >> +{
> >> +    VFIOGroup *group;
> >> +    VFIODevice *vbasedev;
> >> +
> >> +    QLIST_FOREACH(group, &vfio_group_list, next) {  
> > 
> > Should this be passed the container in order to iterate
> > container->group_list?
> >   
> >> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> >> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> >> +                !(vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> >> +                continue;
> >> +            } else {
> >> +                return false;
> >> +            }
> >> +        }
> >> +    }
> >> +    return true;
> >> +}
> >> +
> >> +/*
> >>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> >>    */
> >>   static int vfio_dma_unmap(VFIOContainer *container,
> >> @@ -852,9 +876,115 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>       }
> >>   }
> >>   
> >> +static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> >> +                                 uint64_t size, ram_addr_t ram_addr)
> >> +{
> >> +    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> >> +    struct vfio_iommu_type1_dirty_bitmap_get *range;
> >> +    uint64_t pages;
> >> +    int ret;
> >> +
> >> +    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> >> +
> >> +    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> >> +    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> >> +    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> >> +    range->iova = iova;
> >> +    range->size = size;
> >> +
> >> +    /*
> >> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> >> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap's pgsize to
> >> +     * TARGET_PAGE_SIZE.
> >> +     */
> >> +    range->bitmap.pgsize = TARGET_PAGE_SIZE;
> >> +
> >> +    pages = TARGET_PAGE_ALIGN(range->size) >> TARGET_PAGE_BITS;
> >> +    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> >> +                                         BITS_PER_BYTE;
> >> +    range->bitmap.data = g_try_malloc0(range->bitmap.size);
> >> +    if (!range->bitmap.data) {
> >> +        ret = -ENOMEM;
> >> +        goto err_out;
> >> +    }
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> >> +    if (ret) {
> >> +        error_report("Failed to get dirty bitmap for iova: 0x%llx "
> >> +                "size: 0x%llx err: %d",
> >> +                range->iova, range->size, errno);
> >> +        goto err_out;
> >> +    }
> >> +
> >> +    cpu_physical_memory_set_dirty_lebitmap((uint64_t *)range->bitmap.data,
> >> +                                            ram_addr, pages);
> >> +
> >> +    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
> >> +                                range->bitmap.size, ram_addr);
> >> +err_out:
> >> +    g_free(range->bitmap.data);
> >> +    g_free(dbitmap);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static int vfio_sync_dirty_bitmap(MemoryListener *listener,
> >> +                                 MemoryRegionSection *section)
> >> +{
> >> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> >> +    VFIOGuestIOMMU *giommu = NULL;
> >> +    ram_addr_t ram_addr;
> >> +    uint64_t iova, size;
> >> +    int ret = 0;
> >> +
> >> +    if (memory_region_is_iommu(section->mr)) {
> >> +
> >> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >> +            if (MEMORY_REGION(giommu->iommu) == section->mr &&
> >> +                giommu->n.start == section->offset_within_region) {
> >> +                VFIOIovaRange *iova_range;
> >> +
> >> +                QLIST_FOREACH(iova_range, &giommu->iova_list, next) {
> >> +                    ret = vfio_get_dirty_bitmap(container, iova_range->iova,
> >> +                                        iova_range->size, iova_range->ram_addr);
> >> +                    if (ret) {
> >> +                        break;
> >> +                    }
> >> +                }
> >> +                break;
> >> +            }
> >> +        }
> >> +
> >> +    } else {
> >> +        iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> +        size = int128_get64(section->size);
> >> +
> >> +        ram_addr = memory_region_get_ram_addr(section->mr) +
> >> +                   section->offset_within_region + iova -
> >> +                   TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> +
> >> +        ret = vfio_get_dirty_bitmap(container, iova, size, ram_addr);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static void vfio_listerner_log_sync(MemoryListener *listener,
> >> +        MemoryRegionSection *section)
> >> +{
> >> +    if (vfio_listener_skipped_section(section)) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (vfio_devices_are_stopped_and_saving()) {
> >> +        vfio_sync_dirty_bitmap(listener, section);
> >> +    }  
> > 
> > 
> > How do we decide that this is the best policy for all devices?  For
> > example if a device does not support page pinning or some future means
> > of marking dirtied pages, this is clearly the right choice, but when
> > these are supported, aren't we deferring all dirty logging info until
> > the final stage?  Thanks,
> >   
> 
> Yes, for now we are deferring all dirty logging to the stop-and-copy phase.
> In future, whenever hardware support for dirty page tracking gets added,
> we will have a flag added to the migration capability in the
> VFIO_IOMMU_GET_INFO capability list. Based on that we can decide to get
> dirty pages in an earlier phase of migration.

So the flag that you're expecting to add would indicate that the IOMMU
is reporting actual page dirtying, not just assuming pinned pages are
dirty, as we have now.  It's too bad we can't collect the
previously-but-not-currently-pinned pages during the iterative phase.
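
Once such a flag exists, the policy in log_sync could presumably become
something like (the flag and helper names below are made up):

    VFIOContainer *container = container_of(listener, VFIOContainer, listener);

    if (vfio_devices_are_stopped_and_saving() ||
        (container->dirty_pages_hw_tracked &&   /* hypothetical cap flag */
         vfio_devices_are_saving())) {          /* made up: any device _SAVING */
        vfio_sync_dirty_bitmap(listener, section);
    }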
Thanks,

Alex



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-06-25 15:01     ` Kirti Wankhede
@ 2020-06-25 19:18       ` Alex Williamson
  2020-06-26 14:15         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-25 19:18 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao, dgilbert,
	changpeng.liu, eskultet, Ken.Xue, jonathan.davies, pbonzini

On Thu, 25 Jun 2020 20:31:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/25/2020 12:26 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:24 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> With vIOMMU, IO virtual address range can get unmapped while in pre-copy
> >> phase of migration. In that case, unmap ioctl should return pages pinned
> >> in that range and QEMU should find their corresponding guest physical
> >> addresses and report those dirty.
> >>
> >> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/common.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>   1 file changed, 81 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 0518cf228ed5..a06b8f2f66e2 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -311,11 +311,83 @@ static bool vfio_devices_are_stopped_and_saving(void)
> >>       return true;
> >>   }
> >>   
> >> +static bool vfio_devices_are_running_and_saving(void)
> >> +{
> >> +    VFIOGroup *group;
> >> +    VFIODevice *vbasedev;
> >> +
> >> +    QLIST_FOREACH(group, &vfio_group_list, next) {  
> > 
> > Same as previous, I'm curious if we should instead be looking at
> > container granularity.  It especially seems to make sense here where
> > we're unmapping from a container, so iterating every device in every
> > group seems excessive.
> >   
> 
> Changing it to take the container argument.
> 
> >> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> >> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> >> +                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> >> +                continue;
> >> +            } else {
> >> +                return false;
> >> +            }  
> > 
> > I'm also not sure about the polarity of this function, should it be if
> > any device is _SAVING we should report the dirty bitmap?  For example,
> > what if we have a set of paired failover NICs where we intend to unplug
> > one just prior to stopping the devices, aren't we going to lose dirtied
> > pages with this logic that they all must be running and saving?  Thanks,
> >   
> 
> If migration is initiated, is device unplug allowed? Ideally it
> shouldn't be. If it is, then how does QEMU handle the data stream of a
> device which doesn't exist at the destination?

include/hw/qdev-core.h
struct DeviceState {
    ...
    bool allow_unplug_during_migration;

AIUI, the failover_pair_id device is likely to be a vfio-pci NIC,
otherwise they'd simply migrate the primary NIC, so there's a very good
chance that a user would configure a VM with a migratable mdev device
and a failover NIC so that they have high speed networking on either
end of the migration.
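
For reference, such a pairing is configured along these lines (illustrative
command line, IDs and addresses made up):

  -device virtio-net-pci,netdev=hostnet0,id=standby0,failover=on \
  -device vfio-pci,host=0000:5e:00.0,failover_pair_id=standby0

where the vfio-pci primary is the device that gets unplugged around
migration.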
 
> The _SAVING flag is set during the pre-copy and stop-and-copy phases. Here we
> only want to track pages which are unmapped during the pre-copy phase, i.e.
> when vCPUs are running. In case of VM suspend/saveVM, there is no pre-copy
> phase, but ideally we shouldn't see unmaps when vCPUs are stopped, right?
> But still, to be on the safer side, since we know the exact phase, I would
> prefer to check for the _SAVING and _RUNNING flags.

We can't have unmaps while vCPUs are stopped, but I think the failover
code allows that we can be in the pre-copy phase where not all devices
support migration.  As coded here, it appears that dirty tracking of any
unmap while in that phase is lost.  Thanks,

Alex


> >> +        }
> >> +    }
> >> +    return true;
> >> +}
> >> +
> >> +static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> >> +                                 hwaddr iova, ram_addr_t size,
> >> +                                 IOMMUTLBEntry *iotlb)
> >> +{
> >> +    struct vfio_iommu_type1_dma_unmap *unmap;
> >> +    struct vfio_bitmap *bitmap;
> >> +    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
> >> +    int ret;
> >> +
> >> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> >> +
> >> +    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> >> +    unmap->iova = iova;
> >> +    unmap->size = size;
> >> +    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> >> +    bitmap = (struct vfio_bitmap *)&unmap->data;
> >> +
> >> +    /*
> >> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> >> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
> >> +     * TARGET_PAGE_SIZE.
> >> +     */
> >> +
> >> +    bitmap->pgsize = TARGET_PAGE_SIZE;
> >> +    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> >> +                   BITS_PER_BYTE;
> >> +
> >> +    if (bitmap->size > container->max_dirty_bitmap_size) {
> >> +        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
> >> +        ret = -E2BIG;
> >> +        goto unmap_exit;
> >> +    }
> >> +
> >> +    bitmap->data = g_try_malloc0(bitmap->size);
> >> +    if (!bitmap->data) {
> >> +        ret = -ENOMEM;
> >> +        goto unmap_exit;
> >> +    }
> >> +
> >> +    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
> >> +    if (!ret) {
> >> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
> >> +                iotlb->translated_addr, pages);
> >> +    } else {
> >> +        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
> >> +    }
> >> +
> >> +    g_free(bitmap->data);
> >> +unmap_exit:
> >> +    g_free(unmap);
> >> +    return ret;
> >> +}
> >> +
> >>   /*
> >>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> >>    */
> >>   static int vfio_dma_unmap(VFIOContainer *container,
> >> -                          hwaddr iova, ram_addr_t size)
> >> +                          hwaddr iova, ram_addr_t size,
> >> +                          IOMMUTLBEntry *iotlb)
> >>   {
> >>       struct vfio_iommu_type1_dma_unmap unmap = {
> >>           .argsz = sizeof(unmap),
> >> @@ -324,6 +396,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
> >>           .size = size,
> >>       };
> >>   
> >> +    if (iotlb && container->dirty_pages_supported &&
> >> +        vfio_devices_are_running_and_saving()) {
> >> +        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> >> +    }
> >> +
> >>       while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> >>           /*
> >>            * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> >> @@ -371,7 +448,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>        * the VGA ROM space.
> >>        */
> >>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> >> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
> >> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> >>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
> >>           return 0;
> >>       }
> >> @@ -542,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> >>               }
> >>           }
> >>   
> >> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
> >>           if (ret) {
> >>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>                            "0x%"HWADDR_PRIx") = %d (%m)",
> >> @@ -853,7 +930,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>       }
> >>   
> >>       if (try_unmap) {
> >> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
> >> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> >>           if (ret) {
> >>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>                            "0x%"HWADDR_PRIx") = %d (%m)",  
> >   
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-24 19:49       ` Alex Williamson
@ 2020-06-26 12:16         ` Dr. David Alan Gilbert
  2020-06-26 22:44           ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 12:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 24 Jun 2020 19:59:39 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 6/23/2020 1:58 AM, Alex Williamson wrote:
> > > On Sun, 21 Jun 2020 01:51:12 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> These functions save and restore PCI device specific data - config
> > >> space of PCI device.
> > >> Tested save and restore with MSI and MSIX type.
> > >>
> > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >> ---
> > >>   hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
> > >>   include/hw/vfio/vfio-common.h |  2 +
> > >>   2 files changed, 97 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > >> index 27f8872db2b1..5ba340aee1d4 100644
> > >> --- a/hw/vfio/pci.c
> > >> +++ b/hw/vfio/pci.c
> > >> @@ -41,6 +41,7 @@
> > >>   #include "trace.h"
> > >>   #include "qapi/error.h"
> > >>   #include "migration/blocker.h"
> > >> +#include "migration/qemu-file.h"
> > >>   
> > >>   #define TYPE_VFIO_PCI "vfio-pci"
> > >>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > >> @@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> > >>       return OBJECT(vdev);
> > >>   }
> > >>   
> > >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > >> +{
> > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > >> +    PCIDevice *pdev = &vdev->pdev;
> > >> +
> > >> +    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> > >> +    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
> > >> +    pci_device_save(pdev, f);
> > >> +
> > >> +    qemu_put_be32(f, vdev->interrupt);
> > >> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> > >> +        msix_save(pdev, f);  
> > > 
> > > msix_save() checks msix_present() so shouldn't we include this
> > > unconditionally?  Can't there also be state in the vector table
> > > regardless of whether we're currently running in MSI-X mode?
> > >   
> > >> +    }
> > >> +}
> > >> +
> > >> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > >> +{
> > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > >> +    PCIDevice *pdev = &vdev->pdev;
> > >> +    uint32_t interrupt_type;
> > >> +    uint16_t pci_cmd;
> > >> +    int i, ret;
> > >> +
> > >> +    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> > >> +    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);  
> > > 
> > > This doesn't seem safe, why is it ok to indiscriminately copy these
> > > arrays that are configured via support or masking of various device
> > > features from the source to the target?
> > >   
> > 
> > Ideally, software state at the host should be restored at the destination -
> > this is the attempt to do that.
> 
> Or is it the case that both source and target should initialize these
> and come up with the same result and they should be used for
> validation, not just overwriting the target with the source?

Is the request to have something similar to get_pci_config_device's
check where it compares the configs and c/w/w1c masks (see
hw/pci/pci.c:520 ish) - we get errors like:
   Bad config data: i=0x.... read: ... device: ... cmask...

this is pretty good at spotting things where the source and destination
device are configured differently, while still allowing other dynamic
configuration values to be passed through OK.
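
For reference, the core of that check is roughly the following (paraphrased
from memory, not verbatim):

    for (i = 0; i < size; ++i) {
        /* checked bits (cmask) that aren't guest-writable must match */
        if ((config[i] ^ s->config[i]) &
            s->cmask[i] & ~s->wmask[i] & ~s->w1cmask[i]) {
            error_report("%s: Bad config data: i=0x%x read: %x device: %x "
                         "cmask: %x wmask: %x w1cmask:%x", __func__,
                         i, config[i], s->config[i],
                         s->cmask[i], s->wmask[i], s->w1cmask[i]);
            return -EINVAL;
        }
    }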

Dave

> > > I think this still fails basic feature support negotiation.  For
> > > instance, Intel IGD assignment modifies emulated_config_bits and wmask
> > > to allow the VM BIOS to allocate fake stolen memory for the GPU and
> > > store this value in config space.  This support can be controlled via a
> > > QEMU build-time option, therefore the feature support on the target can
> > > be different from the source.  If this sort of feature set doesn't
> > > match between source and target, I think we'd want to abort the
> > > migration, but we don't have any provisions for that here (a physical
> > > IGD device is obviously just an example as it doesn't support migration
> > > currently).
> > >   
> > 
> > Then is it ok not to include vdev->pdev.wmask? If yes, I'll remove it.
> > But we need vdev->emulated_config_bits to be restored.
> 
> It's not clear why we need emulated_config_bits copied or how we'd
> handle the example I set forth above.  The existence of emulation
> provided by QEMU is also emulation state.
> 
> 
> > >> +
> > >> +    ret = pci_device_load(pdev, f);
> > >> +    if (ret) {
> > >> +        return ret;
> > >> +    }
> > >> +
> > >> +    /* restore pci bar configuration */
> > >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > >> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);  
> > > 
> > > s/!/~/?  Extra parenthesis too
> > >   
> > >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >> +        uint32_t bar = pci_default_read_config(pdev,
> > >> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> > >> +
> > >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > >> +    }
> > >> +
> > >> +    interrupt_type = qemu_get_be32(f);
> > >> +
> > >> +    if (interrupt_type == VFIO_INT_MSI) {
> > >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > >> +        bool msi_64bit;
> > >> +
> > >> +        /* restore msi configuration */
> > >> +        msi_flags = pci_default_read_config(pdev,
> > >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > >> +
> > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > >> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> > >> +  
> > > 
> > > What if I migrate from a device with MSI support to a device without
> > > MSI support, or to a device with MSI support at a different offset, who
> > > is responsible for triggering a migration fault?
> > >   
> > 
> > Migration compatibility check should take care of that. If there is such 
> > a big difference in hardware then other things would also fail.
> 
> 
> The division between what is our responsibility in QEMU and what we
> hope the vendor driver handles is not very clear imo.  How do we avoid
> finger pointing when things break?
> 
> 
> > >> +        msi_addr_lo = pci_default_read_config(pdev,
> > >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > >> +                              msi_addr_lo, 4);
> > >> +
> > >> +        if (msi_64bit) {
> > >> +            msi_addr_hi = pci_default_read_config(pdev,
> > >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> > >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > >> +                                  msi_addr_hi, 4);
> > >> +        }
> > >> +
> > >> +        msi_data = pci_default_read_config(pdev,
> > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > >> +                2);
> > >> +
> > >> +        vfio_pci_write_config(pdev,
> > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > >> +                msi_data, 2);
> > >> +
> > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > >> +        uint16_t offset;
> > >> +
> > >> +        offset = pci_default_read_config(pdev,
> > >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > >> +        /* load enable bit and maskall bit */
> > >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > >> +                              offset, 2);
> > >> +        msix_load(pdev, f);  
> > > 
> > > Isn't this ordering backwards, or at least less efficient?  The config
> > > write will cause us to enable MSI-X; presumably we'd have nothing in
> > > the vector table though.  Then msix_load() will write the vector
> > > and pba tables and trigger a use notifier for each vector.  It seems
> > > like that would trigger a bunch of SET_IRQS ioctls as if the guest
> > > wrote individual unmasked vectors to the vector table, whereas if we
> > > setup the vector table and then enable MSI-X, we do it with one ioctl.
> > >   
> > 
> > Makes sense. Changing the order here.
> > 
> > > Also same question as above, I'm not sure who is responsible for making
> > > sure both devices support MSI-X and that the capability exists at the
> > > same place on each.  Repeat for essentially every capability.  Are we
> > > leaning on the migration regions to fail these migrations before we get
> > > here?  If so, should we be?
> > >   
> > As I mentioned above, it should be the vendor driver's responsibility to
> > have a compatibility check in that case.
> 
> 
> And we'd rather blindly assume the vendor driver included that
> requirement than to check for ourselves?
> 
> 
> > > Also, besides BARs, the command register, and MSI & MSI-X, there must
> > > be other places where the guest can write config data through to the
> > > device.  pci_device_{save,load}() only sets QEMU's config space.
> > >   
> > 
> >  From QEMU we can restore QEMU's software state. For a mediated device,
> > the emulated state at the vendor driver should be maintained by the
> > vendor driver, right?
> 
> In this proposal we've determined that emulated_config_bits, wmask,
> emulated config space, and MSI/X state are part of QEMU's state that
> need to be transmitted to the target.  It therefore shouldn't be
> difficult to imagine that adding support for another capability might
> involve QEMU emulation as well.  How does the migration stream we've
> constructed here allow such emulation state to be included?  For example
> we might have a feature like IGD where we can discern the
> incompatibility via differences in the emulated_config_bits and wmask,
> but that's not guaranteed.
> 
> > > A couple more theoretical (probably not too distant) examples related
> > > to that; there's a resizable BAR capability that at some point we'll
> > > probably need to allow the guest to interact with (ie. manipulation of
> > > capability changes the reported region size for a BAR).  How would we
> > > support that with this save/load scheme?  
> > 
> > Config space is saved at the start of the stop-and-copy phase, which means
> > vCPUs are stopped. So QEMU's config space saved in this phase should
> > include the change. Will there be any other software state that would be 
> > required to save/load?
> 
> 
> There might be, it seems inevitable that there would eventually be
> something that needs emulation state beyond this initial draft.  Is
> this resizable BAR example another that we simply hand wave as the
> responsibility of the vendor driver?
>  
> 
> > >  We'll likely also have SR-IOV
> > > PFs assigned where we'll perhaps have support for emulating the SR-IOV
> > > capability to call out to a privileged userspace helper to enable VFs,
> > > how does this get extended to support that type of emulation?
> > > 
> > > I'm afraid that making carbon copies of emulated_config_bits, wmask,
> > > and invoking pci_device_save/load() doesn't address my concerns that
> > > saving and restoring config space between source and target really
> > > seems like a much more important task than outlined here.  Thanks,
> > >   
> > 
> > Are you suggesting to load config space using vfio_pci_write_config() 
> > from PCI_CONFIG_HEADER_SIZE to 
> > PCI_CONFIG_SPACE_SIZE/PCIE_CONFIG_SPACE_SIZE? I was kind of avoiding it.
> 
> I don't think we can do that, even the save/restore functions in the
> kernel only blindly overwrite the standard header and then use
> capability specific functions elsewhere.  But I think what is missing
> here is the ability to hook in support for manipulating specific
> capabilities on save and restore, which might include QEMU emulation
> state data outside of what's provided here.  Thanks,
> 
> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  2020-06-25 19:18       ` Alex Williamson
@ 2020-06-26 14:15         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 14:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Thu, 25 Jun 2020 20:31:12 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 6/25/2020 12:26 AM, Alex Williamson wrote:
> > > On Sun, 21 Jun 2020 01:51:24 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> With vIOMMU, IO virtual address range can get unmapped while in pre-copy
> > >> phase of migration. In that case, unmap ioctl should return pages pinned
> > >> in that range and QEMU should find their corresponding guest physical
> > >> addresses and report those dirty.
> > >>
> > >> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >> ---
> > >>   hw/vfio/common.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > >>   1 file changed, 81 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > >> index 0518cf228ed5..a06b8f2f66e2 100644
> > >> --- a/hw/vfio/common.c
> > >> +++ b/hw/vfio/common.c
> > >> @@ -311,11 +311,83 @@ static bool vfio_devices_are_stopped_and_saving(void)
> > >>       return true;
> > >>   }
> > >>   
> > >> +static bool vfio_devices_are_running_and_saving(void)
> > >> +{
> > >> +    VFIOGroup *group;
> > >> +    VFIODevice *vbasedev;
> > >> +
> > >> +    QLIST_FOREACH(group, &vfio_group_list, next) {  
> > > 
> > > Same as previous, I'm curious if we should instead be looking at
> > > container granularity.  It especially seems to make sense here where
> > > we're unmapping from a container, so iterating every device in every
> > > group seems excessive.
> > >   
> > 
> > Changing it to take the container argument.
> > 
> > >> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> > >> +            if ((vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) &&
> > >> +                (vbasedev->device_state & VFIO_DEVICE_STATE_RUNNING)) {
> > >> +                continue;
> > >> +            } else {
> > >> +                return false;
> > >> +            }  
> > > 
> > > I'm also not sure about the polarity of this function, should it be if
> > > any device is _SAVING we should report the dirty bitmap?  For example,
> > > what if we have a set of paired failover NICs where we intend to unplug
> > > one just prior to stopping the devices, aren't we going to lose dirtied
> > > pages with this logic that they all must be running and saving?  Thanks,
> > >   
> > 
> > If migration is initiated, is device unplug allowed? Ideally it
> > shouldn't be. If it is, then how does QEMU handle the data stream of a
> > device which doesn't exist at the destination?
> 
> include/hw/qdev-core.h
> struct DeviceState {
>     ...
>     bool allow_unplug_during_migration;
> 
> AIUI, the failover_pair_id device is likely to be a vfio-pci NIC,
> otherwise they'd simply migrate the primary NIC, so there's a very good
> chance that a user would configure a VM with a migratable mdev device
> and a failover NIC so that they have high speed networking on either
> end of the migration.

My understanding of that failover code is that it happens right at the
beginning of migration while we're still in MIGRATION_STATUS_SETUP;
whether there's anything that enforces that is a different matter.
But, in that case, I don't think you'd be interested in that dirtying.

Dave

> > The _SAVING flag is set during the pre-copy and stop-and-copy phases. Here we
> > only want to track pages which are unmapped during the pre-copy phase, i.e.
> > when vCPUs are running. In case of VM suspend/saveVM, there is no pre-copy
> > phase, but ideally we shouldn't see unmaps when vCPUs are stopped, right?
> > But still, to be on the safer side, since we know the exact phase, I would
> > prefer to check for the _SAVING and _RUNNING flags.
> 
> We can't have unmaps while vCPUs are stopped, but I think the failover
> code allows that we can be in the pre-copy phase where not all devices
> support migration.  As coded here, it appears that dirty tracking of any
> unmap while in that phase is lost.  Thanks,
> 
> Alex
> 
> 
> > >> +        }
> > >> +    }
> > >> +    return true;
> > >> +}
> > >> +
> > >> +static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> > >> +                                 hwaddr iova, ram_addr_t size,
> > >> +                                 IOMMUTLBEntry *iotlb)
> > >> +{
> > >> +    struct vfio_iommu_type1_dma_unmap *unmap;
> > >> +    struct vfio_bitmap *bitmap;
> > >> +    uint64_t pages = TARGET_PAGE_ALIGN(size) >> TARGET_PAGE_BITS;
> > >> +    int ret;
> > >> +
> > >> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> > >> +
> > >> +    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> > >> +    unmap->iova = iova;
> > >> +    unmap->size = size;
> > >> +    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> > >> +    bitmap = (struct vfio_bitmap *)&unmap->data;
> > >> +
> > >> +    /*
> > >> +     * cpu_physical_memory_set_dirty_lebitmap() expects pages in bitmap of
> > >> +     * TARGET_PAGE_SIZE to mark those dirty. Hence set bitmap_pgsize to
> > >> +     * TARGET_PAGE_SIZE.
> > >> +     */
> > >> +
> > >> +    bitmap->pgsize = TARGET_PAGE_SIZE;
> > >> +    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
> > >> +                   BITS_PER_BYTE;
> > >> +
> > >> +    if (bitmap->size > container->max_dirty_bitmap_size) {
> > >> +        error_report("UNMAP: Size of bitmap too big 0x%llx", bitmap->size);
> > >> +        ret = -E2BIG;
> > >> +        goto unmap_exit;
> > >> +    }
> > >> +
> > >> +    bitmap->data = g_try_malloc0(bitmap->size);
> > >> +    if (!bitmap->data) {
> > >> +        ret = -ENOMEM;
> > >> +        goto unmap_exit;
> > >> +    }
> > >> +
> > >> +    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
> > >> +    if (!ret) {
> > >> +        cpu_physical_memory_set_dirty_lebitmap((uint64_t *)bitmap->data,
> > >> +                iotlb->translated_addr, pages);
> > >> +    } else {
> > >> +        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
> > >> +    }
> > >> +
> > >> +    g_free(bitmap->data);
> > >> +unmap_exit:
> > >> +    g_free(unmap);
> > >> +    return ret;
> > >> +}
> > >> +
> > >>   /*
> > >>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> > >>    */
> > >>   static int vfio_dma_unmap(VFIOContainer *container,
> > >> -                          hwaddr iova, ram_addr_t size)
> > >> +                          hwaddr iova, ram_addr_t size,
> > >> +                          IOMMUTLBEntry *iotlb)
> > >>   {
> > >>       struct vfio_iommu_type1_dma_unmap unmap = {
> > >>           .argsz = sizeof(unmap),
> > >> @@ -324,6 +396,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
> > >>           .size = size,
> > >>       };
> > >>   
> > >> +    if (iotlb && container->dirty_pages_supported &&
> > >> +        vfio_devices_are_running_and_saving()) {
> > >> +        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> > >> +    }
> > >> +
> > >>       while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> > >>           /*
> > >>            * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> > >> @@ -371,7 +448,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> > >>        * the VGA ROM space.
> > >>        */
> > >>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> > >> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 &&
> > >> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> > >>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
> > >>           return 0;
> > >>       }
> > >> @@ -542,7 +619,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> > >>               }
> > >>           }
> > >>   
> > >> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> > >> +        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
> > >>           if (ret) {
> > >>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> > >>                            "0x%"HWADDR_PRIx") = %d (%m)",
> > >> @@ -853,7 +930,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> > >>       }
> > >>   
> > >>       if (try_unmap) {
> > >> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
> > >> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> > >>           if (ret) {
> > >>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> > >>                            "0x%"HWADDR_PRIx") = %d (%m)",  
> > >   
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-06-23 19:50       ` Alex Williamson
@ 2020-06-26 14:22         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 14:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 24 Jun 2020 00:51:06 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 6/23/2020 4:20 AM, Alex Williamson wrote:
> > > On Sun, 21 Jun 2020 01:51:16 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> Define flags to be used as delimiter in migration file stream.
> > >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> > >> region from these functions at source during saving or pre-copy phase.
> > >> Set VFIO device state depending on VM's state. During live migration, VM is
> > >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> > >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> > >>
> > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > >> ---
> > >>   hw/vfio/migration.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>   hw/vfio/trace-events |  2 ++
> > >>   2 files changed, 94 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > >> index e30bd8768701..133bb5b1b3b2 100644
> > >> --- a/hw/vfio/migration.c
> > >> +++ b/hw/vfio/migration.c
> > >> @@ -8,12 +8,15 @@
> > >>    */
> > >>   
> > >>   #include "qemu/osdep.h"
> > >> +#include "qemu/main-loop.h"
> > >> +#include "qemu/cutils.h"
> > >>   #include <linux/vfio.h>
> > >>   
> > >>   #include "sysemu/runstate.h"
> > >>   #include "hw/vfio/vfio-common.h"
> > >>   #include "cpu.h"
> > >>   #include "migration/migration.h"
> > >> +#include "migration/vmstate.h"
> > >>   #include "migration/qemu-file.h"
> > >>   #include "migration/register.h"
> > >>   #include "migration/blocker.h"
> > >> @@ -24,6 +27,17 @@
> > >>   #include "pci.h"
> > >>   #include "trace.h"
> > >>   
> > >> +/*
> > >> + * Flags used as delimiter:
> > >> + * 0xffffffff => MSB 32-bit all 1s
> > >> + * 0xef10     => emulated (virtual) function IO
> > >> + * 0x0000     => 16-bits reserved for flags
> > >> + */
> > >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> > >> +
> > >>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
> > >>   {
> > >>       VFIOMigration *migration = vbasedev->migration;
> > >> @@ -126,6 +140,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> > >>       return 0;
> > >>   }
> > >>   
> > >> +/* ---------------------------------------------------------------------- */
> > >> +
> > >> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > >> +{
> > >> +    VFIODevice *vbasedev = opaque;
> > >> +    VFIOMigration *migration = vbasedev->migration;
> > >> +    int ret;
> > >> +
> > >> +    trace_vfio_save_setup(vbasedev->name);
> > >> +
> > >> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > >> +
> > >> +    if (migration->region.mmaps) {
> > >> +        qemu_mutex_lock_iothread();
> > >> +        ret = vfio_region_mmap(&migration->region);
> > >> +        qemu_mutex_unlock_iothread();
> > >> +        if (ret) {
> > >> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > >> +                         vbasedev->name, migration->region.nr,
> > >> +                         strerror(-ret));
> > >> +            return ret;  
> > > 
> > > OTOH to my previous comments, this shouldn't be fatal, right?  mmaps
> > > are optional anyway so it should be sufficient to push an error report
> > > to explain why this might be slower than normal, but we can still
> > > proceed.
> > >   
> > 
> > Right, defining the region to be sparse mmap is optional.
> > migration->region.mmaps is set if the vendor driver defines sparse
> > mmapable regions and the VFIO_REGION_INFO_FLAG_MMAP flag is set. If this
> > flag is set then an error on mmap() should be fatal.
> > 
> > If there is no mmapable region, then migration will proceed.
> 
> It's both optional for the vendor to define sparse mmap support (or any
> mmap support) and optional for the user to make use of it.  The user
> can recover from an mmap failure by using read/write accesses.  The
> vendor MUST support this.  It doesn't make sense to worry about
> aborting the VM in replying to comments for 05/17, where it's not clear
> how we proceed, yet intentionally cause a fatal error here when there
> is a very clear path to proceed.
> 
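
A minimal sketch of that non-fatal variant (keeping the rest of
vfio_save_setup() as posted, and only demoting the mmap failure):

    if (migration->region.mmaps) {
        qemu_mutex_lock_iothread();
        ret = vfio_region_mmap(&migration->region);
        qemu_mutex_unlock_iothread();
        if (ret) {
            /* Not fatal: fall back to the slower read/write access */
            error_report("%s: Failed to mmap VFIO migration region %d: %s",
                         vbasedev->name, migration->region.nr,
                         strerror(-ret));
        }
    }
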
> > >> +        }
> > >> +    }
> > >> +
> > >> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> > >> +                                   VFIO_DEVICE_STATE_SAVING);
> > >> +    if (ret) {
> > >> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> > >> +        return ret;
> > >> +    }  
> > > 
> > > We seem to be lacking support in the callers for detecting if the
> > > device is in an error state.  I'm not sure what our options are
> > > though, maybe only a hw_error().
> > >   
> > 
> > Returning an error here fails the migration process. And if the device
> > is in an error state, any application running inside the VM using this
> > device would fail anyway. I think there is no need to take any special
> > action here by detecting the device error state.
> 
> If QEMU knows a device has failed, it seems like it would make sense to
> stop the VM, otherwise we risk an essentially endless assortment of
> ways that the user might notice the guest isn't behaving normally, some
> maybe even causing the user to lose data.  Thanks,

With GPUs especially, though, we can get into messy states where only the
GPU is toast; for example you might get this if your GUI
starts/exits/crashes during a migration - that's a bit too common an event
to justify killing a VM that might have useful data on it.

Dave

> Alex
>  
> > >> +
> > >> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > >> +
> > >> +    ret = qemu_file_get_error(f);
> > >> +    if (ret) {
> > >> +        return ret;
> > >> +    }
> > >> +
> > >> +    return 0;
> > >> +}
> > >> +
> > >> +static void vfio_save_cleanup(void *opaque)
> > >> +{
> > >> +    VFIODevice *vbasedev = opaque;
> > >> +    VFIOMigration *migration = vbasedev->migration;
> > >> +
> > >> +    if (migration->region.mmaps) {
> > >> +        vfio_region_unmap(&migration->region);
> > >> +    }
> > >> +    trace_vfio_save_cleanup(vbasedev->name);
> > >> +}
> > >> +
> > >> +static SaveVMHandlers savevm_vfio_handlers = {
> > >> +    .save_setup = vfio_save_setup,
> > >> +    .save_cleanup = vfio_save_cleanup,
> > >> +};
> > >> +
> > >> +/* ---------------------------------------------------------------------- */
> > >> +
> > >>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
> > >>   {
> > >>       VFIODevice *vbasedev = opaque;
> > >> @@ -180,6 +253,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> > >>                                  struct vfio_region_info *info)
> > >>   {
> > >>       int ret;
> > >> +    char id[256] = "";
> > >>   
> > >>       vbasedev->migration = g_new0(VFIOMigration, 1);
> > >>   
> > >> @@ -192,6 +266,24 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> > >>           return ret;
> > >>       }
> > >>   
> > >> +    if (vbasedev->ops->vfio_get_object) {  
> > > 
> > > Nit, vfio_migration_region_init() would have failed already if this were
> > > not available.  Perhaps do the test once at the start of this function
> > > instead?  Thanks,
> > >   
> > 
> > Ok, will do that.
> > 
> > Thanks,
> > Kirti
> > 
> > 
> > > Alex
> > >   
> > >> +        Object *obj = vbasedev->ops->vfio_get_object(vbasedev);
> > >> +
> > >> +        if (obj) {
> > >> +            DeviceState *dev = DEVICE(obj);
> > >> +            char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> > >> +
> > >> +            if (oid) {
> > >> +                pstrcpy(id, sizeof(id), oid);
> > >> +                pstrcat(id, sizeof(id), "/");
> > >> +                g_free(oid);
> > >> +            }
> > >> +        }
> > >> +    }
> > >> +    pstrcat(id, sizeof(id), "vfio");
> > >> +
> > >> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> > >> +                         vbasedev);
> > >>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> > >>                                                             vbasedev);
> > >>       vbasedev->migration_state.notify = vfio_migration_state_notifier;
> > >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > >> index bd3d47b005cb..86c18def016e 100644
> > >> --- a/hw/vfio/trace-events
> > >> +++ b/hw/vfio/trace-events
> > >> @@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> > >>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> > >>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> > >>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> > >> +vfio_save_setup(const char *name) " (%s)"
> > >> +vfio_save_cleanup(const char *name) " (%s)"  
> > >   
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device
  2020-06-20 20:21 ` [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
  2020-06-22 22:50   ` Alex Williamson
@ 2020-06-26 14:31   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 14:31 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	alex.williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Define flags to be used as delimiters in the migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 94 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e30bd8768701..133bb5b1b3b2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -24,6 +27,17 @@
>  #include "pci.h"
>  #include "trace.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static void vfio_migration_region_exit(VFIODevice *vbasedev)
>  {
>      VFIOMigration *migration = vbasedev->migration;
> @@ -126,6 +140,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();
> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -180,6 +253,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
>      int ret;
> +    char id[256] = "";
>  
>      vbasedev->migration = g_new0(VFIOMigration, 1);
>  
> @@ -192,6 +266,24 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    if (vbasedev->ops->vfio_get_object) {
> +        Object *obj = vbasedev->ops->vfio_get_object(vbasedev);
> +
> +        if (obj) {
> +            DeviceState *dev = DEVICE(obj);
> +            char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> +
> +            if (oid) {
> +                pstrcpy(id, sizeof(id), oid);
> +                pstrcat(id, sizeof(id), "/");
> +                g_free(oid);
> +            }
> +        }
> +    }
> +    pstrcat(id, sizeof(id), "vfio");
> +
> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> +                         vbasedev);

Right, so this version has finally changed to using this 'id' string
with a copy of the code from savevm.c; so that should mean it works with
multiple devices now; thanks for making the change.
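
For example, with two assigned devices whose VMState ids come out
(hypothetically) as "0000:00:02.0" and "0000:00:03.0", the code above ends
up registering two distinct section names:

    register_savevm_live("0000:00:02.0/vfio", VMSTATE_INSTANCE_ID_ANY, 1,
                         &savevm_vfio_handlers, vbasedev);
    register_savevm_live("0000:00:03.0/vfio", VMSTATE_INSTANCE_ID_ANY, 1,
                         &savevm_vfio_handlers, vbasedev);

so each device's section can be matched up by name on the destination.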

Dave


>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bd3d47b005cb..86c18def016e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,3 +149,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> +vfio_save_setup(const char *name) " (%s)"
> +vfio_save_cleanup(const char *name) " (%s)"
> -- 
> 2.7.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled
  2020-06-25 17:40       ` Alex Williamson
@ 2020-06-26 14:43         ` Peter Xu
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Xu @ 2020-06-26 14:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang, armbru,
	mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	dgilbert, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

On Thu, Jun 25, 2020 at 11:40:39AM -0600, Alex Williamson wrote:
> On Thu, 25 Jun 2020 20:04:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 6/25/2020 12:25 AM, Alex Williamson wrote:
> > > On Sun, 21 Jun 2020 01:51:22 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> Create mapped iova list when vIOMMU is enabled. For each mapped iova
> > >> save translated address. Add node to list on MAP and remove node from
> > >> list on UNMAP.
> > >> This list is used to track dirty pages during migration.  
> > > 
> > > This seems like a lot of overhead to support that the VM might migrate.
> > > Is there no way we can build this when we start migration, for example
> > > replaying the mappings at that time?  Thanks,
> > >   
> > 
> > In my previous version I tried to go through the whole range and find a
> > valid iotlb, as below:
> > 
> > +        if (memory_region_is_iommu(section->mr)) {
> > +            iotlb = address_space_get_iotlb_entry(container->space->as, iova,
> > +                                                  true, MEMTXATTRS_UNSPECIFIED);
> > 
> > When a mapping doesn't exist, QEMU throws errors as below:
> > 
> > qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> > (iova=0x0, level=0x3, slpte=0x0, write=1)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > (dev=00:03:00, iova=0x0)
> > qemu-system-x86_64: New fault is not recorded due to compression of faults
> 
> My assumption would have been that we use the replay mechanism, which
> is known to work because we need to use it when we hot-add a device.
> We'd make use of iommu_notifier_init() to create a new handler for this
> purpose, then we'd walk our container->giommu_list and call
> memory_region_iommu_replay() for each.
> 
> Peter, does this sound like the right approach to you?

(Sorry I may not have the complete picture of this series, please bear with
 me...)

This seems to be a workable approach to me.  However, then we might have a
similar mapping entry cached a 3rd time...  The VFIO kernel has a copy
initially, then the QEMU vIOMMU has another one (please grep iova_tree in
intel_iommu.c).

My wild guess is that the number of mappings should still be under control
in most cases, so even if we cache them multiple times (for better
layering) it would still be fine.  However, since we're in QEMU right now,
I'm also wondering whether we can share the information with the vIOMMU
somehow, because even if the page table entry is wiped off at that point we
may still have a chance to use the DMAMap object cached in the vIOMMU when
the iommu notify() happens.  Though that may require some vIOMMU changes
too (e.g., vtd_page_walk_one may need to postpone the iova_tree_remove
until after the hook_fn is called, and we may need to pass the DMAMap
object, or at least the previously translated addr, to the hook somehow
before removal), so maybe that can also be done on top.
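
For reference, a rough sketch of the replay idea (vfio_iommu_map_dirty_notify
is an assumed new notifier callback that would populate the list/tree; error
handling and giommu->iommu_offset are ignored here):

    static void vfio_rebuild_iova_list(VFIOContainer *container)
    {
        VFIOGuestIOMMU *giommu;

        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
            IOMMUNotifier n;

            /* Only MAP events matter when rebuilding the mappings */
            iommu_notifier_init(&n, vfio_iommu_map_dirty_notify,
                                IOMMU_NOTIFIER_MAP, 0,
                                memory_region_size(
                                    MEMORY_REGION(giommu->iommu)) - 1, 0);
            memory_region_iommu_replay(giommu->iommu, &n);
        }
    }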

> 
> > Secondly, it iterates through the whole range at IOMMU page size
> > granularity, which is 4K, so it takes a long time, resulting in large
> > downtime. With this optimization, downtime with vIOMMU is reduced
> > significantly.
> 
> Right, but we amortize that overhead and the resulting bloat across the
> 99.9999% of the time that we're not migrating.  I wonder if we could
> startup another thread to handle this when we enable dirty logging.  We
> don't really need the result until we start processing the dirty
> bitmap, right?  Also, if we're dealing with this many separate pages,
> shouldn't we be using a tree rather than a list to give us O(logN)
> rather than O(N)?

Yep I agree.  At least the vIOMMU cache is using gtree.
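
(For the tree, something as simple as a GTree keyed by iova would do; a
sketch, with the value type left open:

    static gint vfio_iova_cmp(gconstpointer a, gconstpointer b, gpointer data)
    {
        hwaddr ia = *(const hwaddr *)a, ib = *(const hwaddr *)b;

        return ia < ib ? -1 : ia > ib ? 1 : 0;
    }

    GTree *maps = g_tree_new_full(vfio_iova_cmp, NULL, g_free, g_free);

g_tree_insert()/g_tree_lookup() then give the O(logN) behaviour.)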

Btw, IIUC we won't always walk the whole range at 4K granularity, at least
not for VT-d emulation, because vtd_page_walk_level() is smart enough to
skip higher-level invalid entries, so it can jump in 2M/1G/... chunks when
the whole chunk is invalid.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM
  2020-06-23 18:55     ` Kirti Wankhede
@ 2020-06-26 14:51       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 14:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, eauger, yi.l.liu, quintela, ziye.yang, armbru, mlevitsk,
	pasic, felipe, zhi.a.wang, kevin.tian, yan.y.zhao,
	Alex Williamson, changpeng.liu, eskultet, Ken.Xue,
	jonathan.davies, pbonzini

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> 
> 
> On 6/23/2020 4:20 AM, Alex Williamson wrote:
> > On Sun, 21 Jun 2020 01:51:14 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> > > VM state change handler gets called on change in VM's state. This is used to set
> > > VFIO device state to _RUNNING.
> > > 
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > ---
> > >   hw/vfio/migration.c           | 87 +++++++++++++++++++++++++++++++++++++++++++
> > >   hw/vfio/trace-events          |  2 +
> > >   include/hw/vfio/vfio-common.h |  4 ++
> > >   3 files changed, 93 insertions(+)
> > > 
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > index 48ac385d80a7..fcecc0bb0874 100644
> > > --- a/hw/vfio/migration.c
> > > +++ b/hw/vfio/migration.c
> > > @@ -10,6 +10,7 @@
> > >   #include "qemu/osdep.h"
> > >   #include <linux/vfio.h>
> > > +#include "sysemu/runstate.h"
> > >   #include "hw/vfio/vfio-common.h"
> > >   #include "cpu.h"
> > >   #include "migration/migration.h"
> > > @@ -74,6 +75,85 @@ err:
> > >       return ret;
> > >   }
> > > +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> > > +                                    uint32_t value)
> > > +{
> > > +    VFIOMigration *migration = vbasedev->migration;
> > > +    VFIORegion *region = &migration->region;
> > > +    uint32_t device_state;
> > > +    int ret;
> > > +
> > > +    ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > +                                              device_state));
> > > +    if (ret < 0) {
> > > +        error_report("%s: Failed to read device state %d %s",
> > > +                     vbasedev->name, ret, strerror(errno));
> > > +        return ret;
> > > +    }
> > > +
> > > +    device_state = (device_state & mask) | value;
> > > +
> > > +    if (!VFIO_DEVICE_STATE_VALID(device_state)) {
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    ret = pwrite(vbasedev->fd, &device_state, sizeof(device_state),
> > > +                 region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > +                                              device_state));
> > > +    if (ret < 0) {
> > > +        error_report("%s: Failed to set device state %d %s",
> > > +                     vbasedev->name, ret, strerror(errno));
> > > +
> > > +        ret = pread(vbasedev->fd, &device_state, sizeof(device_state),
> > > +                region->fd_offset + offsetof(struct vfio_device_migration_info,
> > > +                device_state));
> > > +        if (ret < 0) {
> > > +            error_report("%s: On failure, failed to read device state %d %s",
> > > +                    vbasedev->name, ret, strerror(errno));
> > > +            return ret;
> > > +        }
> > > +
> > > +        if (VFIO_DEVICE_STATE_IS_ERROR(device_state)) {
> > > +            error_report("%s: Device is in error state 0x%x",
> > > +                         vbasedev->name, device_state);
> > > +            return -EFAULT;
> > > +        }
> > > +    }
> > > +
> > > +    vbasedev->device_state = device_state;
> > > +    trace_vfio_migration_set_state(vbasedev->name, device_state);
> > > +    return 0;
> > > +}
> > > +
> > > +static void vfio_vmstate_change(void *opaque, int running, RunState state)
> > > +{
> > > +    VFIODevice *vbasedev = opaque;
> > > +
> > > +    if ((vbasedev->vm_running != running)) {
> > > +        int ret;
> > > +        uint32_t value = 0, mask = 0;
> > > +
> > > +        if (running) {
> > > +            value = VFIO_DEVICE_STATE_RUNNING;
> > > +            if (vbasedev->device_state & VFIO_DEVICE_STATE_RESUMING) {
> > > +                mask = ~VFIO_DEVICE_STATE_RESUMING;
> > > +            }
> > > +        } else {
> > > +            mask = ~VFIO_DEVICE_STATE_RUNNING;
> > > +        }
> > > +
> > > +        ret = vfio_migration_set_state(vbasedev, mask, value);
> > > +        if (ret) {
> > > +            error_report("%s: Failed to set device state 0x%x",
> > > +                         vbasedev->name, value & mask);
> > 
> > 
> > Is there nothing more we should do here?  It seems like in either the
> > case of an outbound migration where we can't stop the device or an
> > inbound migration where we can't start the device, we'd want this to
> > trigger an abort of the migration.  Should there at least be a TODO
> > comment if the reason is that QEMU migration doesn't yet support failure
> > here?  Thanks,
> > 
> 
> I checked other modules in QEMU; in some places an error message is
> reported as above, while in other places abort() is called (for example
> kvmclock_vm_state_change() in hw/i386/kvm/clock.c). abort() kills the QEMU
> process, i.e. crashes the VM. Should we abort here in the error case?
> Anyway, the VM will not recover properly from migration if there is such
> an error.

I prefer to avoid aborts on migration if possible, unless the VM is
really dead.
Failing the migration is preferable.
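
Something along these lines should be enough (a sketch; it assumes the
handler can get at the current outgoing MigrationState):

    ret = vfio_migration_set_state(vbasedev, mask, value);
    if (ret) {
        error_report("%s: Failed to set device state 0x%x",
                     vbasedev->name, value & mask);
        /* Fail the migration cleanly rather than abort()ing the VM */
        qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
    }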

Dave

> Thanks,
> Kirti
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 09/17] vfio: Add load state functions to SaveVMHandlers
  2020-06-24 18:54   ` Alex Williamson
  2020-06-25 14:16     ` Kirti Wankhede
@ 2020-06-26 14:54     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-26 14:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Sun, 21 Jun 2020 01:51:18 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > Sequence  during _RESUMING device state:
> > While data for this device is available, repeat below steps:
> > a. read data_offset from where user application should write data.
> > b. write data of data_size to migration region from data_offset.
> > c. write data_size, which indicates to the vendor driver that data has
> >    been written to the staging buffer.
> > 
> > For the user, the data is opaque. The user should write the data in the
> > same order as it was received.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  hw/vfio/migration.c  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/trace-events |   3 +
> >  2 files changed, 180 insertions(+)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index ef1150c1ff02..faacea5327cb 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -302,6 +302,33 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >      return qemu_file_get_error(f);
> >  }
> >  
> > +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    uint64_t data;
> > +
> > +    if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> > +        int ret;
> > +
> > +        ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> > +        if (ret) {
> > +            error_report("%s: Failed to load device config space",
> > +                         vbasedev->name);
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    data = qemu_get_be64(f);
> > +    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> > +        error_report("%s: Failed loading device config space, "
> > +                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> > +        return -EINVAL;
> > +    }
> > +
> > +    trace_vfio_load_device_config_state(vbasedev->name);
> > +    return qemu_file_get_error(f);
> > +}
> > +
> >  /* ---------------------------------------------------------------------- */
> >  
> >  static int vfio_save_setup(QEMUFile *f, void *opaque)
> > @@ -472,12 +499,162 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >      return ret;
> >  }
> >  
> > +static int vfio_load_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret = 0;
> > +
> > +    if (migration->region.mmaps) {
> > +        ret = vfio_region_mmap(&migration->region);
> > +        if (ret) {
> > +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > +                         vbasedev->name, migration->region.nr,
> > +                         strerror(-ret));
> > +            return ret;
> 
> 
> Not fatal.
> 
> 
> > +        }
> > +    }
> > +
> > +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
> > +                                   VFIO_DEVICE_STATE_RESUMING);
> > +    if (ret) {
> > +        error_report("%s: Failed to set state RESUMING", vbasedev->name);
> > +    }
> > +    return ret;
> > +}
> > +
> > +static int vfio_load_cleanup(void *opaque)
> > +{
> > +    vfio_save_cleanup(opaque);
> > +    return 0;
> > +}
> > +
> > +static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret = 0;
> > +    uint64_t data, data_size;
> > +
> > +    data = qemu_get_be64(f);
> > +    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
> > +
> > +        trace_vfio_load_state(vbasedev->name, data);
> > +
> > +        switch (data) {
> > +        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> > +        {
> > +            ret = vfio_load_device_config_state(f, opaque);
> > +            if (ret) {
> > +                return ret;
> > +            }
> > +            break;
> > +        }
> > +        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> > +        {
> > +            data = qemu_get_be64(f);
> > +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> > +                return ret;
> > +            } else {
> > +                error_report("%s: SETUP STATE: EOS not found 0x%"PRIx64,
> > +                             vbasedev->name, data);
> > +                return -EINVAL;
> 
> This is essentially just a compatibility failure, right?  For instance
> some future version of QEMU might include additional data between these
> markers that we don't understand and therefore we fail the migration.

Or any other screwup in the data layout; we've found that having a canary
at the end of the state is quite useful for when we screw up for one
reason or another.
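
For reference, the framing that canary protects looks roughly like this on
the wire (one possible sequence; each record starts with a flag and the
whole device section is closed by END_OF_STATE):

    be64  VFIO_MIG_FLAG_DEV_SETUP_STATE
    be64  VFIO_MIG_FLAG_END_OF_STATE
    be64  VFIO_MIG_FLAG_DEV_DATA_STATE
    be64  data_size
    ...   data_size bytes of opaque device data
    be64  VFIO_MIG_FLAG_END_OF_STATE

so a reader that gets out of sync fails on the first marker mismatch
instead of silently mis-parsing the rest of the stream.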

Dave

> Thanks,
> 
> Alex
> 
> > +            }
> > +            break;
> > +        }
> > +        case VFIO_MIG_FLAG_DEV_DATA_STATE:
> > +        {
> > +            VFIORegion *region = &migration->region;
> > +            uint64_t data_offset = 0, size;
> > +
> > +            data_size = size = qemu_get_be64(f);
> > +            if (data_size == 0) {
> > +                break;
> > +            }
> > +
> > +            ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> > +                        region->fd_offset +
> > +                        offsetof(struct vfio_device_migration_info,
> > +                        data_offset));
> > +            if (ret != sizeof(data_offset)) {
> > +                error_report("%s:Failed to get migration buffer data offset %d",
> > +                             vbasedev->name, ret);
> > +                return -EINVAL;
> > +            }
> > +
> > +            trace_vfio_load_state_device_data(vbasedev->name, data_offset,
> > +                                              data_size);
> > +
> > +            while (size) {
> > +                void *buf = NULL;
> > +                uint64_t sec_size;
> > +                bool buffer_mmaped;
> > +
> > +                buf = get_data_section_size(region, data_offset, size,
> > +                                            &sec_size);
> > +
> > +                buffer_mmaped = (buf != NULL);
> > +
> > +                if (!buffer_mmaped) {
> > +                    buf = g_try_malloc(sec_size);
> > +                    if (!buf) {
> > +                        error_report("%s: Error allocating buffer ", __func__);
> > +                        return -ENOMEM;
> > +                    }
> > +                }
> > +
> > +                qemu_get_buffer(f, buf, sec_size);
> > +
> > +                if (!buffer_mmaped) {
> > +                    ret = pwrite(vbasedev->fd, buf, sec_size,
> > +                                 region->fd_offset + data_offset);
> > +                    g_free(buf);
> > +
> > +                    if (ret != sec_size) {
> > +                        error_report("%s: Failed to set migration buffer %d",
> > +                                vbasedev->name, ret);
> > +                        return -EINVAL;
> > +                    }
> > +                }
> > +                size -= sec_size;
> > +                data_offset += sec_size;
> > +            }
> > +
> > +            ret = pwrite(vbasedev->fd, &data_size, sizeof(data_size),
> > +                         region->fd_offset +
> > +                       offsetof(struct vfio_device_migration_info, data_size));
> > +            if (ret != sizeof(data_size)) {
> > +                error_report("%s: Failed to set migration buffer data size %d",
> > +                             vbasedev->name, ret);
> > +                return -EINVAL;
> > +            }
> > +            break;
> > +        }
> > +
> > +        default:
> > +            error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
> > +            return -EINVAL;
> > +        }
> > +
> > +        data = qemu_get_be64(f);
> > +        ret = qemu_file_get_error(f);
> > +        if (ret) {
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> >  static SaveVMHandlers savevm_vfio_handlers = {
> >      .save_setup = vfio_save_setup,
> >      .save_cleanup = vfio_save_cleanup,
> >      .save_live_pending = vfio_save_pending,
> >      .save_live_iterate = vfio_save_iterate,
> >      .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .load_setup = vfio_load_setup,
> > +    .load_cleanup = vfio_load_cleanup,
> > +    .load_state = vfio_load_state,
> >  };
> >  
> >  /* ---------------------------------------------------------------------- */
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index 9a1c5e17d97f..4a4bd3ba9a2a 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -157,3 +157,6 @@ vfio_save_device_config_state(const char *name) " (%s)"
> >  vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
> >  vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
> >  vfio_save_complete_precopy(const char *name) " (%s)"
> > +vfio_load_device_config_state(const char *name) " (%s)"
> > +vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
> > +vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-26 12:16         ` Dr. David Alan Gilbert
@ 2020-06-26 22:44           ` Alex Williamson
  2020-06-29  9:59             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2020-06-26 22:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

On Fri, 26 Jun 2020 13:16:13 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Wed, 24 Jun 2020 19:59:39 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > On 6/23/2020 1:58 AM, Alex Williamson wrote:  
> > > > On Sun, 21 Jun 2020 01:51:12 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >     
> > > >> These functions save and restore PCI device specific data - config
> > > >> space of PCI device.
> > > >> Tested save and restore with MSI and MSIX type.
> > > >>
> > > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > >> ---
> > > >>   hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
> > > >>   include/hw/vfio/vfio-common.h |  2 +
> > > >>   2 files changed, 97 insertions(+)
> > > >>
> > > >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > >> index 27f8872db2b1..5ba340aee1d4 100644
> > > >> --- a/hw/vfio/pci.c
> > > >> +++ b/hw/vfio/pci.c
> > > >> @@ -41,6 +41,7 @@
> > > >>   #include "trace.h"
> > > >>   #include "qapi/error.h"
> > > >>   #include "migration/blocker.h"
> > > >> +#include "migration/qemu-file.h"
> > > >>   
> > > >>   #define TYPE_VFIO_PCI "vfio-pci"
> > > >>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > > >> @@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> > > >>       return OBJECT(vdev);
> > > >>   }
> > > >>   
> > > >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > > >> +{
> > > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > >> +    PCIDevice *pdev = &vdev->pdev;
> > > >> +
> > > >> +    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> > > >> +    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
> > > >> +    pci_device_save(pdev, f);
> > > >> +
> > > >> +    qemu_put_be32(f, vdev->interrupt);
> > > >> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> > > >> +        msix_save(pdev, f);    
> > > > 
> > > > msix_save() checks msix_present() so shouldn't we include this
> > > > unconditionally?  Can't there also be state in the vector table
> > > > regardless of whether we're currently running in MSI-X mode?
> > > >     
> > > >> +    }
> > > >> +}
> > > >> +
> > > >> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > > >> +{
> > > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > >> +    PCIDevice *pdev = &vdev->pdev;
> > > >> +    uint32_t interrupt_type;
> > > >> +    uint16_t pci_cmd;
> > > >> +    int i, ret;
> > > >> +
> > > >> +    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> > > >> +    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);    
> > > > 
> > > > This doesn't seem safe, why is it ok to indiscriminately copy these
> > > > arrays that are configured via support or masking of various device
> > > > features from the source to the target?
> > > >     
> > > 
> > > > Ideally, software state at the host should be restored at the
> > > > destination - this is the attempt to do that.
> > 
> > Or is it the case that both source and target should initialize these
> > and come up with the same result and they should be used for
> > validation, not just overwriting the target with the source?  
> 
> Is the request to have something similar to get_pci_config_device's
> check where it compares the configs and c/w/w1c masks (see
> hw/pci/pci.c:520 ish) - we get errors like:
>    Bad config data: i=0x.... read: ... device: ... cmask...
> 
> this is pretty good at spotting things where the source and destination
> device are configured differently, but to allow other dynamic
> configuration values to be passed through OK.

Yeah, except instead of validating we're just overwriting the
destination currently.  Maybe we should make use of that directly.
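
i.e. something like this on the load side instead of the blind copy (a
sketch only):

    uint8_t *cbits = g_malloc(vdev->config_size);

    qemu_get_buffer(f, cbits, vdev->config_size);
    if (memcmp(cbits, vdev->emulated_config_bits, vdev->config_size)) {
        error_report("%s: emulated config bits mismatch", vbasedev->name);
        g_free(cbits);
        return -EINVAL;
    }
    g_free(cbits);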

I'm also not sure what the current best practice is for including
device/feature specific information into the migration stream.  For
example, if a new feature that's potentially only present on some
devices includes emulation state that needs to be migrated we'd need a
way to include that in the migration stream such that a target can fail
if it doesn't understand that data or fail if it requires that data and
it's not present.  What we have here seems very rigid, I don't see how
we iterate on it with any chance of maintaining compatibility.  Any
specific pointers to relevant examples?  Thanks,

Alex

> > > > I think this still fails basic feature support negotiation.  For
> > > > instance, Intel IGD assignment modifies emulated_config_bits and wmask
> > > > to allow the VM BIOS to allocate fake stolen memory for the GPU and
> > > > store this value in config space.  This support can be controlled via a
> > > > QEMU build-time option, therefore the feature support on the target can
> > > > be different from the source.  If this sort of feature set doesn't
> > > > match between source and target, I think we'd want to abort the
> > > > migration, but we don't have any provisions for that here (a physical
> > > > IGD device is obviously just an example as it doesn't support migration
> > > > currently).
> > > >     
> > > 
> > > Then is it ok not to include vdev->pdev.wmask? If yes, I'll remove it.
> > > But we need vdev->emulated_config_bits to be restored.  
> > 
> > It's not clear why we need emulated_config_bits copied or how we'd
> > handle the example I set forth above.  The existence of emulation
> > provided by QEMU is also emulation state.
> > 
> >   
> > > >> +
> > > >> +    ret = pci_device_load(pdev, f);
> > > >> +    if (ret) {
> > > >> +        return ret;
> > > >> +    }
> > > >> +
> > > >> +    /* restore pci bar configuration */
> > > >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > >> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);    
> > > > 
> > > > s/!/~/?  Extra parenthesis too
> > > >     
> > > >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > >> +        uint32_t bar = pci_default_read_config(pdev,
> > > >> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> > > >> +
> > > >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > > >> +    }
> > > >> +
> > > >> +    interrupt_type = qemu_get_be32(f);
> > > >> +
> > > >> +    if (interrupt_type == VFIO_INT_MSI) {
> > > >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > >> +        bool msi_64bit;
> > > >> +
> > > >> +        /* restore msi configuration */
> > > >> +        msi_flags = pci_default_read_config(pdev,
> > > >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > >> +
> > > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > >> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> > > >> +    
> > > > 
> > > > What if I migrate from a device with MSI support to a device without
> > > > MSI support, or to a device with MSI support at a different offset, who
> > > > is responsible for triggering a migration fault?
> > > >     
> > > 
> > > Migration compatibility check should take care of that. If there is such 
> > > a big difference in hardware then other things would also fail.  
> > 
> > 
> > The division between what is our responsibility in QEMU and what we
> > hope the vendor driver handles is not very clear imo.  How do we avoid
> > finger pointing when things break?
> > 
> >   
> > > >> +        msi_addr_lo = pci_default_read_config(pdev,
> > > >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > > >> +                              msi_addr_lo, 4);
> > > >> +
> > > >> +        if (msi_64bit) {
> > > >> +            msi_addr_hi = pci_default_read_config(pdev,
> > > >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> > > >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > >> +                                  msi_addr_hi, 4);
> > > >> +        }
> > > >> +
> > > >> +        msi_data = pci_default_read_config(pdev,
> > > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > >> +                2);
> > > >> +
> > > >> +        vfio_pci_write_config(pdev,
> > > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > >> +                msi_data, 2);
> > > >> +
> > > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > > >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > > >> +        uint16_t offset;
> > > >> +
> > > >> +        offset = pci_default_read_config(pdev,
> > > >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > > >> +        /* load enable bit and maskall bit */
> > > >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > > >> +                              offset, 2);
> > > >> +        msix_load(pdev, f);    
> > > > 
> > > > Isn't this ordering backwards, or at least less efficient?  The config
> > > > write will cause us to enable MSI-X; presumably we'd have nothing in
> > > > the vector table though.  Then msix_load() will write the vector
> > > > and pba tables and trigger a use notifier for each vector.  It seems
> > > > like that would trigger a bunch of SET_IRQS ioctls as if the guest
> > > > wrote individual unmasked vectors to the vector table, whereas if we
> > > > setup the vector table and then enable MSI-X, we do it with one ioctl.
> > > >     
> > > 
> > > Makes sense. Changing the order here.
> > >   
> > > > Also same question as above, I'm not sure who is responsible for making
> > > > sure both devices support MSI-X and that the capability exists at the
> > > > same place on each.  Repeat for essentially every capability.  Are we
> > > > leaning on the migration regions to fail these migrations before we get
> > > > here?  If so, should we be?
> > > >     
> > > As I mentioned above, it should be the vendor driver's responsibility
> > > to have a compatibility check in that case.
> > 
> > 
> > And we'd rather blindly assume the vendor driver included that
> > requirement than to check for ourselves?
> > 
> >   
> > > > Also, besides BARs, the command register, and MSI & MSI-X, there must
> > > > be other places where the guest can write config data through to the
> > > > device.  pci_device_{save,load}() only sets QEMU's config space.
> > > >     
> > > 
> > >  From QEMU we can restore QEMU's software state. For mediated device, 
> > > emulated state at vendor driver should be maintained by vendor driver, 
> > > right?  
> > 
> > In this proposal we've determined that emulated_config_bits, wmask,
> > emulated config space, and MSI/X state are part of QEMU's state that
> > need to be transmitted to the target.  It therefore shouldn't be
> > difficult to imagine that adding support for another capability might
> > involve QEMU emulation as well.  How does the migration stream we've
> > constructed here allow such emulation state to be included?  For example
> > we might have a feature like IGD where we can discern the
> > incompatibility via differences in the emulated_config_bits and wmask,
> > but that's not guaranteed.
> >   
> > > > A couple more theoretical (probably not too distant) examples related
> > > > to that; there's a resizable BAR capability that at some point we'll
> > > > probably need to allow the guest to interact with (ie. manipulation of
> > > > capability changes the reported region size for a BAR).  How would we
> > > > support that with this save/load scheme?    
> > > 
> > > Config space is saved at the start of the stop-and-copy phase, which
> > > means vCPUs are stopped. So QEMU's config space saved in this phase
> > > should include the change. Will there be any other software state that
> > > would be required to save/load?
> > 
> > 
> > There might be, it seems inevitable that there would eventually be
> > something that needs emulation state beyond this initial draft.  Is
> > this resizable BAR example another that we simply hand wave as the
> > responsibility of the vendor driver?
> >  
> >   
> > > >  We'll likely also have SR-IOV
> > > > PFs assigned where we'll perhaps have support for emulating the SR-IOV
> > > > capability to call out to a privileged userspace helper to enable VFs,
> > > > how does this get extended to support that type of emulation?
> > > > 
> > > > I'm afraid that making carbon copies of emulated_config_bits, wmask,
> > > > and invoking pci_device_save/load() doesn't address my concerns that
> > > > saving and restoring config space between source and target really
> > > > seems like a much more important task than outlined here.  Thanks,
> > > >     
> > > 
> > > Are you suggesting to load config space using vfio_pci_write_config() 
> > > from PCI_CONFIG_HEADER_SIZE to 
> > > PCI_CONFIG_SPACE_SIZE/PCIE_CONFIG_SPACE_SIZE? I was kind of avoiding it.  
> > 
> > I don't think we can do that, even the save/restore functions in the
> > kernel only blindly overwrite the standard header and then use
> > capability specific functions elsewhere.  But I think what is missing
> > here is the ability to hook in support for manipulating specific
> > capabilities on save and restore, which might include QEMU emulation
> > state data outside of what's provided here.  Thanks,
> > 
> > Alex  
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices
  2020-06-26 22:44           ` Alex Williamson
@ 2020-06-29  9:59             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-29  9:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: cohuck, cjia, aik, Zhengxiao.zx, shuangtai.tst, qemu-devel,
	peterx, Kirti Wankhede, eauger, yi.l.liu, quintela, ziye.yang,
	armbru, mlevitsk, pasic, felipe, zhi.a.wang, kevin.tian,
	yan.y.zhao, changpeng.liu, eskultet, Ken.Xue, jonathan.davies,
	pbonzini

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Fri, 26 Jun 2020 13:16:13 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Wed, 24 Jun 2020 19:59:39 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > > > On 6/23/2020 1:58 AM, Alex Williamson wrote:  
> > > > > On Sun, 21 Jun 2020 01:51:12 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >     
> > > > >> These functions save and restore PCI device specific data - config
> > > > >> space of PCI device.
> > > > >> Tested save and restore with MSI and MSIX type.
> > > > >>
> > > > >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > >> ---
> > > > >>   hw/vfio/pci.c                 | 95 +++++++++++++++++++++++++++++++++++++++++++
> > > > >>   include/hw/vfio/vfio-common.h |  2 +
> > > > >>   2 files changed, 97 insertions(+)
> > > > >>
> > > > >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > >> index 27f8872db2b1..5ba340aee1d4 100644
> > > > >> --- a/hw/vfio/pci.c
> > > > >> +++ b/hw/vfio/pci.c
> > > > >> @@ -41,6 +41,7 @@
> > > > >>   #include "trace.h"
> > > > >>   #include "qapi/error.h"
> > > > >>   #include "migration/blocker.h"
> > > > >> +#include "migration/qemu-file.h"
> > > > >>   
> > > > >>   #define TYPE_VFIO_PCI "vfio-pci"
> > > > >>   #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > > > >> @@ -2407,11 +2408,105 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
> > > > >>       return OBJECT(vdev);
> > > > >>   }
> > > > >>   
> > > > >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > > > >> +{
> > > > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > > >> +    PCIDevice *pdev = &vdev->pdev;
> > > > >> +
> > > > >> +    qemu_put_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> > > > >> +    qemu_put_buffer(f, vdev->pdev.wmask, vdev->config_size);
> > > > >> +    pci_device_save(pdev, f);
> > > > >> +
> > > > >> +    qemu_put_be32(f, vdev->interrupt);
> > > > >> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> > > > >> +        msix_save(pdev, f);    
> > > > > 
> > > > > msix_save() checks msix_present() so shouldn't we include this
> > > > > unconditionally?  Can't there also be state in the vector table
> > > > > regardless of whether we're currently running in MSI-X mode?
> > > > >     
> > > > >> +    }
> > > > >> +}
> > > > >> +
> > > > >> +static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > > > >> +{
> > > > >> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > > > >> +    PCIDevice *pdev = &vdev->pdev;
> > > > >> +    uint32_t interrupt_type;
> > > > >> +    uint16_t pci_cmd;
> > > > >> +    int i, ret;
> > > > >> +
> > > > >> +    qemu_get_buffer(f, vdev->emulated_config_bits, vdev->config_size);
> > > > >> +    qemu_get_buffer(f, vdev->pdev.wmask, vdev->config_size);    
> > > > > 
> > > > > This doesn't seem safe, why is it ok to indiscriminately copy these
> > > > > arrays that are configured via support or masking of various device
> > > > > features from the source to the target?
> > > > >     
> > > > 
> > > > Ideally, software state at the host should be restored at the
> > > > destination - this is the attempt to do that.
> > > 
> > > Or is it the case that both source and target should initialize these
> > > and come up with the same result and they should be used for
> > > validation, not just overwriting the target with the source?  
> > 
> > Is the request to have something similar to get_pci_config_device's
> > check where it compares the configs and c/w/w1c masks (see
> > hw/pci/pci.c:520 ish) - we get errors like:
> >    Bad config data: i=0x.... read: ... device: ... cmask...
> > 
> > this is pretty good at spotting things where the source and destination
> > device are configured differently, but to allow other dynamic
> > configuration values to be passed through OK.
> 
> Yeah, except instead of validating we're just overwriting the
> destination currently.  Maybe we should make use of that directly.
> 
> I'm also not sure what the current best practice is for including
> device/feature specific information into the migration stream.  For
> example, if a new feature that's potentially only present on some
> devices includes emulation state that needs to be migrated we'd need a
> way to include that in the migration stream such that a target can fail
> if it doesn't understand that data or fail if it requires that data and
> it's not present.  What we have here seems very rigid, I don't see how
> we iterate on it with any chance of maintaining compatibility.  Any
> specific pointers to relevant examples?  Thanks,

That's what the 'subsection' mechanism in vmsd's allows.
It's a named part of a device's migration stream; if the destination
finds itself receiving one without knowing what it is, it fails and
errors out giving the name of the subsection it was surprised by.
(You can also set a flag at the start of a migration and clear it when
you receive the subsection, which allows you to check that the
subsection actually arrived.)
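
A sketch of the shape, with made-up names:

    static bool vfio_new_feature_needed(void *opaque)
    {
        VFIOPCIDevice *vdev = opaque;

        /* Only put the subsection on the wire when the feature is in use */
        return vdev->new_feature_enabled;
    }

    static const VMStateDescription vmstate_vfio_new_feature = {
        .name = "vfio-pci/new-feature",
        .version_id = 1,
        .minimum_version_id = 1,
        .needed = vfio_new_feature_needed,
        .fields = (VMStateField[]) {
            VMSTATE_UINT32(new_feature_state, VFIOPCIDevice),
            VMSTATE_END_OF_LIST()
        }
    };

hung off the device's main vmsd via its .subsections list; an old
destination that has never heard of "vfio-pci/new-feature" then fails
with that name in the error message.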

Note, one thing I'd initially missed: this v25 actually uses
pci_device_load and pci_device_save, so it should already be doing that
'bad config data' check above - so if the c/w/w1c masks are set
correctly, maybe this actually solves your problem?

Dave

> Alex
> 
> > > > > I think this still fails basic feature support negotiation.  For
> > > > > instance, Intel IGD assignment modifies emulated_config_bits and wmask
> > > > > to allow the VM BIOS to allocate fake stolen memory for the GPU and
> > > > > store this value in config space.  This support can be controlled via a
> > > > > QEMU build-time option, therefore the feature support on the target can
> > > > > be different from the source.  If this sort of feature set doesn't
> > > > > match between source and target, I think we'd want to abort the
> > > > > migration, but we don't have any provisions for that here (a physical
> > > > > IGD device is obviously just an example as it doesn't support migration
> > > > > currently).
> > > > >     
> > > > 
> > > > Then is it ok not to include vdev->pdev.wmask? If yes, I'll remove it.
> > > > But we need vdev->emulated_config_bits to be restored.  
> > > 
> > > It's not clear why we need emulated_config_bits copied or how we'd
> > > handle the example I set forth above.  The existence of emulation
> > > provided by QEMU is also emulation state.
> > > 
> > >   
> > > > >> +
> > > > >> +    ret = pci_device_load(pdev, f);
> > > > >> +    if (ret) {
> > > > >> +        return ret;
> > > > >> +    }
> > > > >> +
> > > > >> +    /* restore pci bar configuration */
> > > > >> +    pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > > > >> +    vfio_pci_write_config(pdev, PCI_COMMAND,
> > > > >> +                        pci_cmd & (!(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)), 2);    
> > > > > 
> > > > > s/!/~/?  Extra parenthesis too
> > > > >     
> > > > >> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > > >> +        uint32_t bar = pci_default_read_config(pdev,
> > > > >> +                                               PCI_BASE_ADDRESS_0 + i * 4, 4);
> > > > >> +
> > > > >> +        vfio_pci_write_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, bar, 4);
> > > > >> +    }
> > > > >> +
> > > > >> +    interrupt_type = qemu_get_be32(f);
> > > > >> +
> > > > >> +    if (interrupt_type == VFIO_INT_MSI) {
> > > > >> +        uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > > > >> +        bool msi_64bit;
> > > > >> +
> > > > >> +        /* restore msi configuration */
> > > > >> +        msi_flags = pci_default_read_config(pdev,
> > > > >> +                                            pdev->msi_cap + PCI_MSI_FLAGS, 2);
> > > > >> +        msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > > > >> +
> > > > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > > >> +                              msi_flags & (!PCI_MSI_FLAGS_ENABLE), 2);
> > > > >> +    
> > > > > 
> > > > > What if I migrate from a device with MSI support to a device without
> > > > > MSI support, or to a device with MSI support at a different offset, who
> > > > > is responsible for triggering a migration fault?
> > > > >     
> > > > 
> > > > Migration compatibility check should take care of that. If there is such 
> > > > a big difference in hardware then other things would also fail.  
> > > 
> > > 
> > > The division between what is our responsibility in QEMU and what we
> > > hope the vendor driver handles is not very clear imo.  How do we avoid
> > > finger pointing when things break?
> > > 
> > >   
> > > > >> +        msi_addr_lo = pci_default_read_config(pdev,
> > > > >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_LO, 4);
> > > > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_LO,
> > > > >> +                              msi_addr_lo, 4);
> > > > >> +
> > > > >> +        if (msi_64bit) {
> > > > >> +            msi_addr_hi = pci_default_read_config(pdev,
> > > > >> +                                        pdev->msi_cap + PCI_MSI_ADDRESS_HI, 4);
> > > > >> +            vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI,
> > > > >> +                                  msi_addr_hi, 4);
> > > > >> +        }
> > > > >> +
> > > > >> +        msi_data = pci_default_read_config(pdev,
> > > > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > > >> +                2);
> > > > >> +
> > > > >> +        vfio_pci_write_config(pdev,
> > > > >> +                pdev->msi_cap + (msi_64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32),
> > > > >> +                msi_data, 2);
> > > > >> +
> > > > >> +        vfio_pci_write_config(pdev, pdev->msi_cap + PCI_MSI_FLAGS,
> > > > >> +                              msi_flags | PCI_MSI_FLAGS_ENABLE, 2);
> > > > >> +    } else if (interrupt_type == VFIO_INT_MSIX) {
> > > > >> +        uint16_t offset;
> > > > >> +
> > > > >> +        offset = pci_default_read_config(pdev,
> > > > >> +                                       pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
> > > > >> +        /* load enable bit and maskall bit */
> > > > >> +        vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
> > > > >> +                              offset, 2);
> > > > >> +        msix_load(pdev, f);    
> > > > > 
> > > > > Isn't this ordering backwards, or at least less efficient?  The config
> > > > > write will cause us to enable MSI-X; presumably we'd have nothing in
> > > > > the vector table though.  Then msix_load() will write the vector
> > > > > and pba tables and trigger a use notifier for each vector.  It seems
> > > > > like that would trigger a bunch of SET_IRQS ioctls as if the guest
> > > > > wrote individual unmasked vectors to the vector table, whereas if we
> > > > > setup the vector table and then enable MSI-X, we do it with one ioctl.
> > > > >     
> > > > 
> > > > Makes sense. Changing the order here.
> > > >   
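
(So the reordered restore would look something like the sketch below -
just the ordering, not the final patch: fill the vector and PBA tables
first, then flip the enable/maskall bits, so that enabling MSI-X sets
up the fully-populated table in one go.)

    /* Sketch: restore the MSI-X tables before enabling, so the
     * subsequent config write triggers a single SET_IRQS ioctl for
     * the whole table instead of one per unmasked vector.
     */
    msix_load(pdev, f);

    offset = pci_default_read_config(pdev,
                                     pdev->msix_cap + PCI_MSIX_FLAGS + 1, 2);
    /* write enable bit and maskall bit last */
    vfio_pci_write_config(pdev, pdev->msix_cap + PCI_MSIX_FLAGS + 1,
                          offset, 2);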
> > > > > Also same question as above, I'm not sure who is responsible for making
> > > > > sure both devices support MSI-X and that the capability exists at the
> > > > > same place on each.  Repeat for essentially every capability.  Are we
> > > > > leaning on the migration regions to fail these migrations before we get
> > > > > here?  If so, should we be?
> > > > >     
> > > > As I mentioned above, it should be the vendor driver's responsibility to
> > > > have a compatibility check in that case.  
> > > 
> > > 
> > > And we'd rather blindly assume the vendor driver included that
> > > requirement than to check for ourselves?
> > > 
> > >   
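
(If QEMU did want to check for itself, one hypothetical shape - none of
this is in the series, the helper names are invented - would be to put
the capability layout into the stream and fail the load when the target
disagrees:)

    /* Hypothetical sketch only: record the source's MSI/MSI-X
     * capability offsets (the PCIDevice msi_cap/msix_cap fields,
     * 0 if the capability is absent) and verify them on the target
     * before restoring any interrupt state.
     */
    static void vfio_pci_save_cap_layout(VFIOPCIDevice *vdev, QEMUFile *f)
    {
        qemu_put_byte(f, vdev->pdev.msi_cap);
        qemu_put_byte(f, vdev->pdev.msix_cap);
    }

    static int vfio_pci_check_cap_layout(VFIOPCIDevice *vdev, QEMUFile *f)
    {
        if (qemu_get_byte(f) != vdev->pdev.msi_cap ||
            qemu_get_byte(f) != vdev->pdev.msix_cap) {
            error_report("vfio: MSI/MSI-X capability layout mismatch");
            return -EINVAL;
        }
        return 0;
    }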
> > > > > Also, besides BARs, the command register, and MSI & MSI-X, there must
> > > > > be other places where the guest can write config data through to the
> > > > > device.  pci_device_{save,load}() only sets QEMU's config space.
> > > > >     
> > > > 
> > > >  From QEMU we can restore QEMU's software state. For a mediated device,
> > > > the emulated state at the vendor driver should be maintained by the
> > > > vendor driver, right?  
> > > 
> > > In this proposal we've determined that emulated_config_bits, wmask,
> > > emulated config space, and MSI/X state are part of QEMU's state that
> > > need to be transmitted to the target.  It therefore shouldn't be
> > > difficult to imagine that adding support for another capability might
> > > involve QEMU emulation as well.  How does the migration stream we've
> > > constructed here allow such emulation state to be included?  For example,
> > > we might have a feature like IGD where we can discern the
> > > incompatibility via differences in the emulated_config_bits and wmask,
> > > but that's not guaranteed.
> > >   
> > > > > A couple more theoretical (probably not too distant) examples related
> > > > > to that; there's a resizable BAR capability that at some point we'll
> > > > > probably need to allow the guest to interact with (ie. manipulation of
> > > > > capability changes the reported region size for a BAR).  How would we
> > > > > support that with this save/load scheme?    
> > > > 
> > > > Config space is saved at the start of the stop-and-copy phase, which
> > > > means vCPUs are stopped. So QEMU's config space saved at this phase
> > > > should include the change. Will there be any other software state that
> > > > would need to be saved/loaded?  
> > > 
> > > 
> > > There might be; it seems inevitable that there would eventually be
> > > something that needs emulation state beyond this initial draft.  Is
> > > this resizable BAR example another that we simply hand-wave as the
> > > responsibility of the vendor driver?
> > >  
> > >   
> > > > >  We'll likely also have SR-IOV
> > > > > PFs assigned where we'll perhaps have support for emulating the SR-IOV
> > > > > capability to call out to a privileged userspace helper to enable VFs,
> > > > > how does this get extended to support that type of emulation?
> > > > > 
> > > > > I'm afraid that making carbon copies of emulated_config_bits, wmask,
> > > > > and invoking pci_device_save/load() doesn't address my concerns that
> > > > > saving and restoring config space between source and target really
> > > > > seems like a much more important task than outlined here.  Thanks,
> > > > >     
> > > > 
> > > > Are you suggesting loading the config space using vfio_pci_write_config()
> > > > from PCI_CONFIG_HEADER_SIZE to
> > > > PCI_CONFIG_SPACE_SIZE/PCIE_CONFIG_SPACE_SIZE? I was kind of avoiding that.  
> > > 
> > > I don't think we can do that; even the save/restore functions in the
> > > kernel only blindly overwrite the standard header and then use
> > > capability-specific functions elsewhere.  But I think what is missing
> > > here is the ability to hook in support for manipulating specific
> > > capabilities on save and restore, which might include QEMU emulation
> > > state data outside of what's provided here.  Thanks,
> > > 
> > > Alex  
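
(Something along these lines, presumably - purely a sketch of the shape
such a hook could take, nothing like it exists in v25, and the names are
illustrative only:)

    /* Hypothetical per-capability save/restore hook, so capabilities
     * with QEMU-side emulation state (SR-IOV, resizable BAR, ...) can
     * be handled individually instead of as one opaque config blob.
     */
    typedef struct VFIOConfigCapHandler {
        uint8_t cap_id;                                  /* PCI_CAP_ID_* */
        void (*save)(VFIOPCIDevice *vdev, QEMUFile *f);  /* emit state  */
        int (*load)(VFIOPCIDevice *vdev, QEMUFile *f);   /* check+apply */
    } VFIOConfigCapHandler;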
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



Thread overview: 66+ messages
2020-06-20 20:21 [PATCH QEMU v25 00/17] Add migration support for VFIO devices Kirti Wankhede
2020-06-20 20:21 ` [PATCH QEMU v25 01/17] vfio: Add function to unmap VFIO region Kirti Wankhede
2020-06-20 20:21 ` [PATCH QEMU v25 02/17] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2020-06-20 20:21 ` [PATCH QEMU v25 03/17] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2020-06-22 20:28   ` Alex Williamson
2020-06-24 14:29     ` Kirti Wankhede
2020-06-24 19:49       ` Alex Williamson
2020-06-26 12:16         ` Dr. David Alan Gilbert
2020-06-26 22:44           ` Alex Williamson
2020-06-29  9:59             ` Dr. David Alan Gilbert
2020-06-20 20:21 ` [PATCH QEMU v25 04/17] vfio: Add migration region initialization and finalize function Kirti Wankhede
2020-06-23  7:54   ` Cornelia Huck
2020-06-20 20:21 ` [PATCH QEMU v25 05/17] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2020-06-22 22:50   ` Alex Williamson
2020-06-23 18:55     ` Kirti Wankhede
2020-06-26 14:51       ` Dr. David Alan Gilbert
2020-06-23  8:07   ` Cornelia Huck
2020-06-20 20:21 ` [PATCH QEMU v25 06/17] vfio: Add migration state change notifier Kirti Wankhede
2020-06-23  8:10   ` Cornelia Huck
2020-06-20 20:21 ` [PATCH QEMU v25 07/17] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2020-06-22 22:50   ` Alex Williamson
2020-06-23 19:21     ` Kirti Wankhede
2020-06-23 19:50       ` Alex Williamson
2020-06-26 14:22         ` Dr. David Alan Gilbert
2020-06-26 14:31   ` Dr. David Alan Gilbert
2020-06-20 20:21 ` [PATCH QEMU v25 08/17] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2020-06-22 22:50   ` Alex Williamson
2020-06-23 20:34     ` Kirti Wankhede
2020-06-23 20:40       ` Alex Williamson
2020-06-20 20:21 ` [PATCH QEMU v25 09/17] vfio: Add load " Kirti Wankhede
2020-06-24 18:54   ` Alex Williamson
2020-06-25 14:16     ` Kirti Wankhede
2020-06-25 14:57       ` Alex Williamson
2020-06-26 14:54     ` Dr. David Alan Gilbert
2020-06-20 20:21 ` [PATCH QEMU v25 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled Kirti Wankhede
2020-06-20 20:21 ` [PATCH QEMU v25 11/17] vfio: Get migration capability flags for container Kirti Wankhede
2020-06-24  8:43   ` Cornelia Huck
2020-06-24 18:55   ` Alex Williamson
2020-06-25 14:09     ` Kirti Wankhede
2020-06-25 14:56       ` Alex Williamson
2020-06-20 20:21 ` [PATCH QEMU v25 12/17] vfio: Add function to start and stop dirty pages tracking Kirti Wankhede
2020-06-23 10:32   ` Cornelia Huck
2020-06-23 11:01     ` Dr. David Alan Gilbert
2020-06-23 11:06       ` Cornelia Huck
2020-06-24 18:55   ` Alex Williamson
2020-06-20 20:21 ` [PATCH QEMU v25 13/17] vfio: create mapped iova list when vIOMMU is enabled Kirti Wankhede
2020-06-24 18:55   ` Alex Williamson
2020-06-25 14:34     ` Kirti Wankhede
2020-06-25 17:40       ` Alex Williamson
2020-06-26 14:43         ` Peter Xu
2020-06-20 20:21 ` [PATCH QEMU v25 14/17] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
2020-06-24 18:55   ` Alex Williamson
2020-06-25 14:43     ` Kirti Wankhede
2020-06-25 17:57       ` Alex Williamson
2020-06-20 20:21 ` [PATCH QEMU v25 15/17] vfio: Add ioctl to get dirty pages bitmap during dma unmap Kirti Wankhede
2020-06-23  8:25   ` Cornelia Huck
2020-06-24 18:56   ` Alex Williamson
2020-06-25 15:01     ` Kirti Wankhede
2020-06-25 19:18       ` Alex Williamson
2020-06-26 14:15         ` Dr. David Alan Gilbert
2020-06-20 20:21 ` [PATCH QEMU v25 16/17] vfio: Make vfio-pci device migration capable Kirti Wankhede
2020-06-22 16:51   ` Cornelia Huck
2020-06-20 20:21 ` [PATCH QEMU v25 17/17] qapi: Add VFIO devices migration stats in Migration stats Kirti Wankhede
2020-06-23  7:21   ` Markus Armbruster
2020-06-23 21:16     ` Kirti Wankhede
2020-06-25  5:51       ` Markus Armbruster
