* [PATCH 0/4] Allow passing pre-created VFIO container/group to QEMU
@ 2022-10-17 10:54 Andrey Ryabinin
  2022-10-17 10:54 ` [PATCH 1/4] vfio: add vfio-container user creatable object Andrey Ryabinin
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-17 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Steve Sistare, yc-core, Andrey Ryabinin, Tony Krowiak,
	Halil Pasic, Jason Herne, Cornelia Huck, Thomas Huth,
	Alex Williamson, Eric Farman, Matthew Rosato, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

These patches add the possibility of passing a VFIO device to QEMU using file
descriptors of a VFIO container/group, instead of having QEMU create them.
This makes it possible to take away the permission to open /dev/vfio/* from
QEMU and delegate it to a management layer such as libvirt.

The VFIO API doesn't allow passing just the fd of a device, since we also need
a VFIO container and group. So these patches allow passing a pre-created VFIO
container/group to QEMU via the command line/QMP, e.g. like this:
            -object vfio-container,id=ct,fd=5 \
            -object vfio-group,id=grp,fd=6,container=ct \
            -device vfio-pci,host=05:00.0,group=grp

A more detailed example can be found in the test:
tests/avocado/vfio.py
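To make the intended split of responsibilities concrete, here is a minimal sketch (in Python, matching the avocado test) of what a management layer might do: build the QEMU arguments around fds it opened itself. The fd numbers, device address, and helper name are illustrative, not part of the series.

```python
# Hypothetical management-layer helper: given already-opened container and
# group fds, build the QEMU arguments from the cover letter's example.
def vfio_qemu_args(ct_fd: int, grp_fd: int, host_bdf: str) -> list[str]:
    return [
        "-object", f"vfio-container,id=ct,fd={ct_fd}",
        "-object", f"vfio-group,id=grp,fd={grp_fd},container=ct",
        "-device", f"vfio-pci,host={host_bdf},group=grp",
    ]

# A privileged launcher would open the nodes itself and keep the fds open
# across exec, e.g. (group number 26 is just an example):
#   ct_fd  = os.open("/dev/vfio/vfio", os.O_RDWR)
#   grp_fd = os.open("/dev/vfio/26", os.O_RDWR)
#   subprocess.run(["qemu-system-x86_64",
#                   *vfio_qemu_args(ct_fd, grp_fd, "05:00.0")],
#                  pass_fds=(ct_fd, grp_fd))
```

With `pass_fds`, QEMU inherits the descriptors without ever needing permission to open /dev/vfio/* itself.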

 *Possible future steps*

These patches could also be a step toward local migration (within one host)
of QEMU with VFIO devices.
I've built a prototype on top of these patches to try this idea.
In short, the scheme of such a migration is as follows:
 - migrate the source VM to a file.
 - retrieve the fd numbers of the VFIO container/group/device via a new
   property and the qom-get command
 - get the actual file descriptors via SCM_RIGHTS using a new QMP command
   'returnfd', which sends an fd from QEMU by number:
   { 'command': 'returnfd', 'data': {'fd': 'int'}}
 - shut down the source VM
 - launch the destination VM and plug in the VFIO devices using the obtained
   file descriptors.
 - the PCI device reset during device plug is avoided with the help of a new
   parameter on the vfio-pci device.
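The fd handoff in the scheme above relies on standard SCM_RIGHTS passing over a Unix socket. The 'returnfd' command itself exists only in the prototype, but the underlying mechanics can be sketched with plain Python (here the fd of a pipe stands in for a VFIO fd):

```python
# SCM_RIGHTS demo: hand a file descriptor from a "source" to a "destination"
# over a Unix domain socket. A real setup would use the QMP socket; here a
# socketpair and a pipe fd keep the example self-contained.
import os
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # At least one byte of ordinary payload must accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])

def recv_fd(sock: socket.socket) -> int:
    data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    assert data == b"F" and len(fds) == 1
    return fds[0]

if __name__ == "__main__":
    src, dst = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    r, w = os.pipe()
    os.write(w, b"hello")
    send_fd(src, r)            # source side hands the fd over
    received = recv_fd(dst)    # destination gets a duplicate descriptor
    print(os.read(received, 5).decode())  # prints "hello"
```

`socket.send_fds`/`recv_fds` (Python 3.9+) wrap the same `SCM_RIGHTS` cmsg machinery the kernel uses to duplicate the descriptor into the receiving process.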
This is an alternative to the 'cpr-exec' migration scheme proposed here:
   https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
Unlike cpr-exec, it doesn't require the new kernel flags
VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR and doesn't require a new
migration mode, just some additional steps from the management layer.


Andrey Ryabinin (4):
  vfio: add vfio-container user creatable object
  vfio: add vfio-group user creatable object
  vfio: Add 'group' property to 'vfio-pci' device
  tests/avocado/vfio: add test for vfio devices

 hw/vfio/ap.c                  |   2 +-
 hw/vfio/ccw.c                 |   2 +-
 hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
 hw/vfio/pci.c                 |  10 +-
 hw/vfio/platform.c            |   2 +-
 hw/vfio/trace-events          |   4 +-
 include/hw/vfio/vfio-common.h |  10 +-
 qapi/qom.json                 |  29 +++
 tests/avocado/vfio.py         | 156 +++++++++++
 9 files changed, 525 insertions(+), 161 deletions(-)
 create mode 100644 tests/avocado/vfio.py

-- 
2.37.3




* [PATCH 1/4] vfio: add vfio-container user creatable object
  2022-10-17 10:54 [PATCH 0/4] Allow passing pre-created VFIO container/group to QEMU Andrey Ryabinin
@ 2022-10-17 10:54 ` Andrey Ryabinin
  2022-10-17 10:54 ` [PATCH 2/4] vfio: add vfio-group " Andrey Ryabinin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-17 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Steve Sistare, yc-core, Andrey Ryabinin, Cornelia Huck,
	Thomas Huth, Alex Williamson, Tony Krowiak, Halil Pasic,
	Jason Herne, Eric Farman, Matthew Rosato, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

Add a vfio-container type and allow the user to create a vfio-container
object via the '-object' command line argument or the 'object-add' QMP
command. Add an 'fd' parameter to this type to let the user provide the
file descriptor of a VFIO container.

E.g.
 -object vfio-container,id=ct,fd=5
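At runtime the same object could be created over QMP. Since the 'fd' property goes through monitor_fd_param() (see the diff below), it accepts either a plain fd number or the name of an fd previously passed in with 'getfd'. A hedged sketch of the command sequence a management layer might send (socket handling omitted; the fd name 'ct-fd' is illustrative):

```python
# Build the QMP command pair: pass the container fd via SCM_RIGHTS with
# 'getfd', then create the vfio-container object referencing it by name.
import json

def qmp_add_container(fd_name: str, obj_id: str = "ct") -> list[str]:
    cmds = [
        {"execute": "getfd", "arguments": {"fdname": fd_name}},
        {"execute": "object-add",
         "arguments": {"qom-type": "vfio-container",
                       "id": obj_id, "fd": fd_name}},
    ]
    return [json.dumps(c) for c in cmds]
```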

Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
 hw/vfio/common.c              | 183 ++++++++++++++++++++++------------
 hw/vfio/trace-events          |   2 +-
 include/hw/vfio/vfio-common.h |   4 +
 qapi/qom.json                 |  14 +++
 4 files changed, 137 insertions(+), 66 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6b5d8c0bf69..392057d3025 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,9 +31,11 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "monitor/monitor.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
+#include "qom/object_interfaces.h"
 #include "sysemu/kvm.h"
 #include "sysemu/reset.h"
 #include "sysemu/runstate.h"
@@ -2013,7 +2015,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
     VFIOContainer *container;
-    int ret, fd;
+    int ret;
     VFIOAddressSpace *space;
 
     space = vfio_get_address_space(as);
@@ -2069,31 +2071,15 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
     }
 
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
-        ret = -errno;
-        goto put_space_exit;
-    }
+    container = VFIO_CONTAINER(object_new(TYPE_VFIO_CONTAINER));
+    container->space = space;
 
-    ret = ioctl(fd, VFIO_GET_API_VERSION);
-    if (ret != VFIO_API_VERSION) {
-        error_setg(errp, "supported vfio version: %d, "
-                   "reported version: %d", VFIO_API_VERSION, ret);
-        ret = -EINVAL;
-        goto close_fd_exit;
+    user_creatable_complete(USER_CREATABLE(container), errp);
+    if (*errp) {
+        ret = -1;
+        goto free_container_exit;
     }
 
-    container = g_malloc0(sizeof(*container));
-    container->space = space;
-    container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
-
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
         goto free_container_exit;
@@ -2150,7 +2136,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
          * in this file.
          */
         if (!v2) {
-            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            ret = ioctl(container->fd, VFIO_IOMMU_ENABLE);
             if (ret) {
                 error_setg_errno(errp, errno, "failed to enable container");
                 ret = -errno;
@@ -2171,7 +2157,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
 
         info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
             error_setg_errno(errp, errno,
                              "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
@@ -2209,7 +2195,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
     vfio_kvm_device_add_group(group);
 
-    QLIST_INIT(&container->group_list);
     QLIST_INSERT_HEAD(&space->containers, container, next);
 
     group->container = container;
@@ -2223,29 +2208,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         ret = -1;
         error_propagate_prepend(errp, container->error,
             "memory listener initialization failed: ");
-        goto listener_release_exit;
+        goto free_container_exit;
     }
 
     container->initialized = true;
 
     return 0;
-listener_release_exit:
-    QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
-    vfio_kvm_device_del_group(group);
-    vfio_listener_release(container);
 
 enable_discards_exit:
     vfio_ram_block_discard_disable(container, false);
 
 free_container_exit:
-    g_free(container);
-
-close_fd_exit:
-    close(fd);
-
-put_space_exit:
-    vfio_put_address_space(space);
+    object_unref(OBJECT(container));
 
     return ret;
 }
@@ -2271,32 +2245,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
                      group->groupid);
     }
 
-    if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
-
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
-
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
-
-        trace_vfio_disconnect_container(container->fd);
-        close(container->fd);
-        g_free(container);
-
-        vfio_put_address_space(space);
-    }
+    object_unref(OBJECT(container));
 }
 
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
@@ -2628,3 +2577,107 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
     }
     return vfio_eeh_container_op(container, op);
 }
+
+static void vfio_container_set_fd(Object *obj, const char *value,
+                                  Error **errp)
+{
+    VFIOContainer *ct = VFIO_CONTAINER(obj);
+
+    ct->fd = monitor_fd_param(monitor_cur(), value, errp);
+}
+
+static void vfio_container_complete(UserCreatable *uc, Error **errp)
+{
+    VFIOContainer *container = VFIO_CONTAINER(uc);
+    int ret;
+
+    if (container->fd < 0) {
+        int fd;
+
+        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+        if (fd < 0) {
+            error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
+            return;
+        }
+        container->fd = fd;
+    }
+
+    ret = ioctl(container->fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_setg(errp, "supported vfio version: %d, "
+                   "reported version: %d", VFIO_API_VERSION, ret);
+        return;
+    }
+}
+
+static void vfio_container_class_init(ObjectClass *class, void *data)
+{
+    UserCreatableClass *ucc = USER_CREATABLE_CLASS(class);
+    ucc->complete = vfio_container_complete;
+
+    object_class_property_add_str(class, "fd", NULL, vfio_container_set_fd);
+}
+
+static void vfio_container_instance_init(Object *obj)
+{
+    VFIOContainer *ct = VFIO_CONTAINER(obj);
+
+    ct->dirty_pages_supported = false;
+    ct->dma_max_mappings = 0;
+    ct->fd = -1;
+    QLIST_INIT(&ct->giommu_list);
+    QLIST_INIT(&ct->hostwin_list);
+    QLIST_INIT(&ct->group_list);
+    QLIST_INIT(&ct->vrdl_list);
+}
+
+static void
+vfio_container_instance_finalize(Object *obj)
+{
+    VFIOContainer *container = VFIO_CONTAINER(obj);
+    VFIOAddressSpace *space = container->space;
+    VFIOGuestIOMMU *giommu, *tmp;
+    VFIOHostDMAWindow *hostwin, *next;
+
+    QLIST_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+            MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                       next) {
+        QLIST_REMOVE(hostwin, hostwin_next);
+        g_free(hostwin);
+    }
+
+
+    trace_vfio_container_instance_finalize(container->fd);
+    if (container->fd > 0) {
+        close(container->fd);
+    }
+    if (space) {
+        vfio_put_address_space(space);
+    }
+}
+
+static const TypeInfo vfio_container_info = {
+    .name = TYPE_VFIO_CONTAINER,
+    .parent = TYPE_OBJECT,
+    .class_init = vfio_container_class_init,
+    .instance_size = sizeof(VFIOContainer),
+    .instance_init = vfio_container_instance_init,
+    .instance_finalize = vfio_container_instance_finalize,
+    .interfaces = (InterfaceInfo[]) {
+        {TYPE_USER_CREATABLE},
+        {}
+    },
+};
+
+static void register_vfio_types(void)
+{
+    type_register_static(&vfio_container_info);
+}
+type_init(register_vfio_types)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 73dffe9e00d..8b79cf33a33 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -104,7 +104,7 @@ vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t offset_wi
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
-vfio_disconnect_container(int fd) "close container->fd=%d"
+vfio_container_instance_finalize(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f19..0ab99060e44 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -77,6 +77,7 @@ typedef struct VFIOAddressSpace {
 struct VFIOGroup;
 
 typedef struct VFIOContainer {
+    Object parent;
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
@@ -190,6 +191,9 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
+#define TYPE_VFIO_CONTAINER "vfio-container"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_CONTAINER)
+
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
diff --git a/qapi/qom.json b/qapi/qom.json
index 80dd419b392..d1a88e10b52 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -734,6 +734,18 @@
 { 'struct': 'RemoteObjectProperties',
   'data': { 'fd': 'str', 'devid': 'str' } }
 
+##
+# @VFIOContainerProperties:
+#
+# Properties for vfio-container objects.
+#
+# @fd: file descriptor of vfio container
+#
+# Since: 7.2
+##
+{ 'struct': 'VFIOContainerProperties',
+  'data': { 'fd': 'str' } }
+
 ##
 # @VfioUserServerProperties:
 #
@@ -888,6 +900,7 @@
     'tls-creds-psk',
     'tls-creds-x509',
     'tls-cipher-suites',
+    'vfio-container',
     { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
     { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
   ] }
@@ -953,6 +966,7 @@
       'tls-creds-psk':              'TlsCredsPskProperties',
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
+      'vfio-container':             'VFIOContainerProperties',
       'x-remote-object':            'RemoteObjectProperties',
       'x-vfio-user-server':         'VfioUserServerProperties'
   } }
-- 
2.37.3




* [PATCH 2/4] vfio: add vfio-group user creatable object
  2022-10-17 10:54 [PATCH 0/4] Allow passing pre-created VFIO container/group to QEMU Andrey Ryabinin
  2022-10-17 10:54 ` [PATCH 1/4] vfio: add vfio-container user creatable object Andrey Ryabinin
@ 2022-10-17 10:54 ` Andrey Ryabinin
  2022-10-17 12:37   ` Markus Armbruster
  2022-10-17 10:54 ` [PATCH 3/4] vfio: Add 'group' property to 'vfio-pci' device Andrey Ryabinin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-17 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Steve Sistare, yc-core, Andrey Ryabinin, Cornelia Huck,
	Thomas Huth, Alex Williamson, Tony Krowiak, Halil Pasic,
	Jason Herne, Eric Farman, Matthew Rosato, Eric Blake,
	Markus Armbruster, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Cleber Rosa, Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

Add a vfio-group type and allow the user to create such an object via the
'-object' command line argument or the 'object-add' QMP command.
Parameters are:
 - @fd - file descriptor of the VFIO group
 - @container - id of the vfio-container object to be used for
        this VFIO group
 - @groupid - number of the IOMMU group (not needed if @fd
                                         and @container are provided)
E.g.:
     -object vfio-container,id=ct,fd=5 \
     -object vfio-group,id=group,fd=6,container=ct
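The two ways of specifying a group described above can be summarized with a small helper; the validation logic here is our reading of the commit message, not code from the series:

```python
# Build a -object option string for vfio-group: either an externally-opened
# fd plus its container id, or a groupid for QEMU to open itself.
def vfio_group_opts(obj_id, fd=None, container=None, groupid=None):
    if fd is not None and container is not None:
        return f"vfio-group,id={obj_id},fd={fd},container={container}"
    if groupid is not None:
        return f"vfio-group,id={obj_id},groupid={groupid}"
    raise ValueError("need either fd+container or groupid")
```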

Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
 hw/vfio/common.c              | 267 +++++++++++++++++++++++-----------
 hw/vfio/trace-events          |   2 +-
 include/hw/vfio/vfio-common.h |   4 +
 qapi/qom.json                 |  15 ++
 4 files changed, 205 insertions(+), 83 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 392057d3025..95722ecf96a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1911,31 +1911,40 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
                                Error **errp)
 {
     int iommu_type, ret;
+    struct vfio_group_status status = { .argsz = sizeof(status) };
 
     iommu_type = vfio_get_iommu_type(container, errp);
     if (iommu_type < 0) {
         return iommu_type;
     }
 
-    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
+
+    ret = ioctl(group_fd, VFIO_GROUP_GET_STATUS, &status);
     if (ret) {
-        error_setg_errno(errp, errno, "Failed to set group container");
+        error_setg_errno(errp, errno, "Failed to get group status");
         return -errno;
     }
-
-    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
-        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-            /*
-             * On sPAPR, despite the IOMMU subdriver always advertises v1 and
-             * v2, the running platform may not support v2 and there is no
-             * way to guess it until an IOMMU group gets added to the container.
-             * So in case it fails with v2, try v1 as a fallback.
-             */
-            iommu_type = VFIO_SPAPR_TCE_IOMMU;
-            continue;
+    if (!(status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
+        ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to set group container");
+            return -errno;
+        }
+
+        while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
+            if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+                /*
+                 * On sPAPR, despite the IOMMU subdriver always advertises v1 and
+                 * v2, the running platform may not support v2 and there is no
+                 * way to guess it until an IOMMU group gets added to the container.
+                 * So in case it fails with v2, try v1 as a fallback.
+                 */
+                iommu_type = VFIO_SPAPR_TCE_IOMMU;
+                continue;
+            }
+            error_setg_errno(errp, errno, "Failed to set iommu for container");
+            return -errno;
         }
-        error_setg_errno(errp, errno, "Failed to set iommu for container");
-        return -errno;
     }
 
     container->iommu_type = iommu_type;
@@ -2050,34 +2059,44 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * with some IOMMU types. vfio_ram_block_discard_disable() handles the
      * details once we know which type of IOMMU we are using.
      */
-
-    QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            ret = vfio_ram_block_discard_disable(container, true);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "Cannot set discarding of RAM broken");
-                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
-                          &container->fd)) {
-                    error_report("vfio: error disconnecting group %d from"
-                                 " container", group->groupid);
+    if (!group->container) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+                ret = vfio_ram_block_discard_disable(container, true);
+                if (ret) {
+                    error_setg_errno(errp, -ret,
+                                     "Cannot set discarding of RAM broken");
+                    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+                              &container->fd)) {
+                        error_report("vfio: error disconnecting group %d from"
+                                     " container", group->groupid);
+                    }
+                    return ret;
                 }
-                return ret;
+                group->container = container;
+                QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+                vfio_kvm_device_add_group(group);
+                return 0;
             }
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            vfio_kvm_device_add_group(group);
-            return 0;
         }
-    }
+        container = VFIO_CONTAINER(object_new(TYPE_VFIO_CONTAINER));
+        container->space = space;
 
-    container = VFIO_CONTAINER(object_new(TYPE_VFIO_CONTAINER));
-    container->space = space;
-
-    user_creatable_complete(USER_CREATABLE(container), errp);
-    if (*errp) {
-        ret = -1;
-        goto free_container_exit;
+        user_creatable_complete(USER_CREATABLE(container), errp);
+        if (*errp) {
+            ret = -1;
+            goto free_container_exit;
+        }
+        group->container = container;
+    } else if (group->container->initialized) {
+        object_ref(OBJECT(group->container));
+        QLIST_INSERT_HEAD(&group->container->group_list, group, container_next);
+        vfio_kvm_device_add_group(group);
+        return 0;
+    } else {
+        container = group->container;
+        container->space = space;
+        object_ref(OBJECT(container));
     }
 
     ret = vfio_init_container(container, group->fd, errp);
@@ -2228,6 +2247,10 @@ static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
 
+    if (!group->container) {
+            return;
+    }
+
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
 
@@ -2251,7 +2274,6 @@ static void vfio_disconnect_container(VFIOGroup *group)
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
-    char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
@@ -2267,31 +2289,14 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
         }
     }
 
-    group = g_malloc0(sizeof(*group));
-
-    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
-    if (group->fd < 0) {
-        error_setg_errno(errp, errno, "failed to open %s", path);
-        goto free_group_exit;
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
-        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
-        goto close_fd_exit;
-    }
-
-    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
-        error_setg(errp, "group %d is not viable", groupid);
-        error_append_hint(errp,
-                          "Please ensure all devices within the iommu_group "
-                          "are bound to their vfio bus driver.\n");
-        goto close_fd_exit;
+    group = VFIO_GROUP(object_new(TYPE_VFIO_GROUP));
+    object_property_set_int(OBJECT(group), "groupid", groupid, errp);
+    user_creatable_complete(USER_CREATABLE(group), errp);
+    if (*errp) {
+        object_unref(OBJECT(group));
+        return NULL;
     }
 
-    group->groupid = groupid;
-    QLIST_INIT(&group->device_list);
-
     if (vfio_connect_container(group, as, errp)) {
         error_prepend(errp, "failed to setup container for group %d: ",
                       groupid);
@@ -2302,15 +2307,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
         qemu_register_reset(vfio_reset_handler, NULL);
     }
 
-    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
-
     return group;
 
 close_fd_exit:
-    close(group->fd);
-
-free_group_exit:
-    g_free(group);
+    object_unref(OBJECT(group));
 
     return NULL;
 }
@@ -2321,19 +2321,7 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }
 
-    if (!group->ram_block_discard_allowed) {
-        vfio_ram_block_discard_disable(group->container, false);
-    }
-    vfio_kvm_device_del_group(group);
-    vfio_disconnect_container(group);
-    QLIST_REMOVE(group, next);
-    trace_vfio_put_group(group->fd);
-    close(group->fd);
-    g_free(group);
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_unregister_reset(vfio_reset_handler, NULL);
-    }
+    object_unref(OBJECT(group));
 }
 
 int vfio_get_device(VFIOGroup *group, const char *name,
@@ -2676,8 +2664,123 @@ static const TypeInfo vfio_container_info = {
     },
 };
 
+static void vfio_group_set_fd(Object *obj, const char *value,
+                              Error **errp)
+{
+    VFIOGroup *group = VFIO_GROUP(obj);
+
+    group->fd = monitor_fd_param(monitor_cur(), value, errp);
+}
+
+static void vfio_group_set_groupid(Object *obj, Visitor *v,
+                                   const char *name, void *opaque,
+                                   Error **errp)
+{
+    VFIOGroup *group = VFIO_GROUP(obj);
+    Error *error = NULL;
+    uint32_t groupid;
+
+    visit_type_uint32(v, name, &groupid, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    group->groupid = groupid;
+}
+
+static void vfio_group_complete(UserCreatable *uc, Error **errp)
+{
+    VFIOGroup *group = VFIO_GROUP(uc);
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    if (group->fd < 0 && group->groupid >= 0) {
+        char path[32];
+
+        snprintf(path, sizeof(path), "/dev/vfio/%d", group->groupid);
+
+        group->fd = qemu_open_old(path, O_RDWR);
+        if (group->fd < 0) {
+            error_setg_errno(errp, errno, "failed to open %s", path);
+            return;
+        }
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_setg_errno(errp, errno, "failed to get group %d status", group->groupid);
+        return;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "group %d is not viable", group->groupid);
+        error_append_hint(errp,
+                          "Please ensure all devices within the iommu_group "
+                          "are bound to their vfio bus driver.\n");
+    }
+}
+
+static void vfio_group_class_init(ObjectClass *class, void *data)
+{
+    UserCreatableClass *ucc = USER_CREATABLE_CLASS(class);
+    ucc->complete = vfio_group_complete;
+
+    object_class_property_add_link(class, "container", TYPE_VFIO_CONTAINER,
+                                   offsetof(VFIOGroup, container),
+                                   object_property_allow_set_link, 0);
+    object_class_property_add_str(class, "fd", NULL, vfio_group_set_fd);
+    object_class_property_add(class, "groupid", "int", NULL,
+                              vfio_group_set_groupid,
+                              NULL, NULL);
+}
+
+static void vfio_group_instance_init(Object *obj)
+{
+    VFIOGroup *group = VFIO_GROUP(obj);
+
+    QLIST_INIT(&group->device_list);
+    group->fd = -1;
+    group->groupid = -1;
+    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+}
+
+static void
+vfio_group_instance_finalize(Object *obj)
+{
+    VFIOGroup *group = VFIO_GROUP(obj);
+
+    if (!group->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(group->container, false);
+    }
+
+    vfio_kvm_device_del_group(group);
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    trace_vfio_group_instance_finalize(group->fd);
+    if (group->fd >= 0) {
+        close(group->fd);
+    }
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        qemu_unregister_reset(vfio_reset_handler, NULL);
+    }
+}
+
+static const TypeInfo vfio_group_info = {
+    .name = TYPE_VFIO_GROUP,
+    .parent = TYPE_OBJECT,
+    .class_init = vfio_group_class_init,
+    .instance_size = sizeof(VFIOGroup),
+    .instance_init = vfio_group_instance_init,
+    .instance_finalize = vfio_group_instance_finalize,
+    .interfaces = (InterfaceInfo[]) {
+        {TYPE_USER_CREATABLE},
+        {}
+    },
+};
+
 static void register_vfio_types(void)
 {
     type_register_static(&vfio_container_info);
+    type_register_static(&vfio_group_info);
 }
 type_init(register_vfio_types)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8b79cf33a33..6ae0ed09acd 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -105,7 +105,7 @@ vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t si
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_container_instance_finalize(int fd) "close container->fd=%d"
-vfio_put_group(int fd) "close group->fd=%d"
+vfio_group_instance_finalize(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
 vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0ab99060e44..f2d67093f44 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -156,6 +156,7 @@ struct VFIODeviceOps {
 };
 
 typedef struct VFIOGroup {
+    Object parent;
     int fd;
     int groupid;
     VFIOContainer *container;
@@ -194,6 +195,9 @@ typedef struct VFIODisplay {
 #define TYPE_VFIO_CONTAINER "vfio-container"
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_CONTAINER)
 
+#define TYPE_VFIO_GROUP "vfio-group"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOGroup, VFIO_GROUP)
+
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
diff --git a/qapi/qom.json b/qapi/qom.json
index d1a88e10b52..f46dd6b8034 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -746,6 +746,19 @@
 { 'struct': 'VFIOContainerProperties',
   'data': { 'fd': 'str' } }
 
+##
+# @VFIOGroupProperties:
+#
+# Properties for vfio-group objects.
+#
+# @fd: file descriptor of vfio group
+# @container: container
+#
+# Since: 7.2
+##
+{ 'struct': 'VFIOGroupProperties',
+  'data': { 'fd': 'str', 'container': 'str'} }
+
 ##
 # @VfioUserServerProperties:
 #
@@ -901,6 +914,7 @@
     'tls-creds-x509',
     'tls-cipher-suites',
     'vfio-container',
+    'vfio-group',
     { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
     { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
   ] }
@@ -967,6 +981,7 @@
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
       'vfio-container':             'VFIOContainerProperties',
+      'vfio-group':                 'VFIOGroupProperties',
       'x-remote-object':            'RemoteObjectProperties',
       'x-vfio-user-server':         'VfioUserServerProperties'
   } }
-- 
2.37.3
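Once the descriptors have been handed to QEMU with 'getfd', the new object types are created over QMP with plain object-add commands. A sketch of the wire messages (the fd names "ct_fd"/"group_fd" are hypothetical getfd names, mirroring the -object example in the cover letter):

```python
import json

# QMP messages a management layer might send after passing the file
# descriptors with 'getfd'; object properties are flattened into
# 'arguments' alongside qom-type and id.
msgs = [
    {"execute": "object-add",
     "arguments": {"qom-type": "vfio-container", "id": "ct",
                   "fd": "ct_fd"}},
    {"execute": "object-add",
     "arguments": {"qom-type": "vfio-group", "id": "grp",
                   "fd": "group_fd", "container": "ct"}},
]
for m in msgs:
    print(json.dumps(m))
```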



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/4] vfio: Add 'group' property to 'vfio-pci' device
  2022-10-17 10:54 [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Andrey Ryabinin
  2022-10-17 10:54 ` [PATCH 1/4] vfio: add vfio-container user createable object Andrey Ryabinin
  2022-10-17 10:54 ` [PATCH 2/4] vfio: add vfio-group " Andrey Ryabinin
@ 2022-10-17 10:54 ` Andrey Ryabinin
  2022-10-17 10:54 ` [PATCH 4/4] tests/avocado/vfio: add test for vfio devices Andrey Ryabinin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-17 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Steve Sistare, yc-core, Andrey Ryabinin, Cornelia Huck,
	Thomas Huth, Tony Krowiak, Halil Pasic, Jason Herne,
	Alex Williamson, Eric Farman, Matthew Rosato, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

Add a 'group' property to the 'vfio-pci' device. This allows adding a
vfio-pci device that uses a pre-created VFIO container and group, e.g.:

        -object vfio-container,id=ct,fd=5 \
        -object vfio-group,id=grp,fd=6,container=ct \
        -device vfio-pci,host=05:00.0,group=grp

Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
 hw/vfio/ap.c                  |  2 +-
 hw/vfio/ccw.c                 |  2 +-
 hw/vfio/common.c              | 45 ++++++++++++++++++++---------------
 hw/vfio/pci.c                 | 10 +++++++-
 hw/vfio/platform.c            |  2 +-
 include/hw/vfio/vfio-common.h |  2 +-
 6 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index e0dd561e85a..2a5c322883b 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -81,7 +81,7 @@ static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp)
 
     g_free(group_path);
 
-    return vfio_get_group(groupid, &address_space_memory, errp);
+    return vfio_get_group(&vapdev->vdev, groupid, &address_space_memory, errp);
 }
 
 static void vfio_ap_realize(DeviceState *dev, Error **errp)
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 0354737666a..500f0f62163 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -650,7 +650,7 @@ static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
         return NULL;
     }
 
-    return vfio_get_group(groupid, &address_space_memory, errp);
+    return vfio_get_group(NULL, groupid, &address_space_memory, errp);
 }
 
 static void vfio_ccw_realize(DeviceState *dev, Error **errp)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 95722ecf96a..64ace47822d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -2271,30 +2271,35 @@ static void vfio_disconnect_container(VFIOGroup *group)
     object_unref(OBJECT(container));
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+VFIOGroup *vfio_get_group(VFIODevice *vdev, int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
     struct vfio_group_status status = { .argsz = sizeof(status) };
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        if (group->groupid == groupid) {
-            /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
-                return group;
-            } else {
-                error_setg(errp, "group %d used in multiple address spaces",
-                           group->groupid);
-                return NULL;
+    if (!vdev || !vdev->group) {
+        QLIST_FOREACH(group, &vfio_group_list, next) {
+            if (group->groupid == groupid) {
+                /* Found it.  Now is it already in the right context? */
+                if (group->container->space->as == as) {
+                    return group;
+                } else {
+                    error_setg(errp, "group %d used in multiple address spaces",
+                               group->groupid);
+                    return NULL;
+                }
             }
         }
-    }
-
-    group = VFIO_GROUP(object_new(TYPE_VFIO_GROUP));
-    object_property_set_int(OBJECT(group), "groupid", groupid, errp);
-    user_creatable_complete(USER_CREATABLE(group), errp);
-    if (*errp) {
-        object_unref(OBJECT(group));
-        return NULL;
+        group = VFIO_GROUP(object_new(TYPE_VFIO_GROUP));
+        object_property_set_int(OBJECT(group), "groupid", groupid, errp);
+        user_creatable_complete(USER_CREATABLE(group), errp);
+        if (*errp) {
+            object_unref(OBJECT(group));
+            return NULL;
+        }
+    } else {
+        group = vdev->group;
+        group->groupid = groupid;
+        object_ref(OBJECT(group));
     }
 
     if (vfio_connect_container(group, as, errp)) {
@@ -2388,7 +2393,9 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     if (!vbasedev->group) {
         return;
     }
-    QLIST_REMOVE(vbasedev, next);
+    if (!QLIST_EMPTY(&vbasedev->group->device_list)) {
+        QLIST_REMOVE(vbasedev, next);
+    }
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
     close(vbasedev->fd);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 939dcc3d4a9..d671eb74881 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,7 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -2902,7 +2903,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     trace_vfio_realize(vbasedev->name, groupid);
 
-    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
+    group = vfio_get_group(&vdev->vbasedev, groupid,
+                           pci_device_iommu_address_space(pdev), errp);
     if (!group) {
         goto error;
     }
@@ -3190,6 +3192,7 @@ static void vfio_instance_finalize(Object *obj)
 static void vfio_exitfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
@@ -3204,6 +3207,8 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
     vfio_migration_finalize(&vdev->vbasedev);
+
+    vfio_put_group(group);
 }
 
 static void vfio_pci_reset(DeviceState *dev)
@@ -3330,6 +3335,9 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
     pdc->exit = vfio_exitfn;
     pdc->config_read = vfio_pci_read_config;
     pdc->config_write = vfio_pci_write_config;
+    object_class_property_add_link(klass, "group", TYPE_VFIO_GROUP,
+                                   offsetof(VFIOPCIDevice, vbasedev.group),
+                                   object_property_allow_set_link, 0);
 }
 
 static const TypeInfo vfio_pci_dev_info = {
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 5af73f92876..3a72e085026 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -577,7 +577,7 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 
     trace_vfio_platform_base_device_init(vbasedev->name, groupid);
 
-    group = vfio_get_group(groupid, &address_space_memory, errp);
+    group = vfio_get_group(vbasedev, groupid, &address_space_memory, errp);
     if (!group) {
         return -ENOENT;
     }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f2d67093f44..79b43eb76b3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -216,7 +216,7 @@ void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
+VFIOGroup *vfio_get_group(VFIODevice *vdev, int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/4] tests/avocado/vfio: add test for vfio devices
  2022-10-17 10:54 [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Andrey Ryabinin
                   ` (2 preceding siblings ...)
  2022-10-17 10:54 ` [PATCH 3/4] vfio: Add 'group' property to 'vfio-pci' device Andrey Ryabinin
@ 2022-10-17 10:54 ` Andrey Ryabinin
  2022-10-17 11:05 ` [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Daniel P. Berrangé
  2022-10-17 15:21 ` Alex Williamson
  5 siblings, 0 replies; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-17 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Steve Sistare, yc-core, Andrey Ryabinin, Tony Krowiak,
	Halil Pasic, Jason Herne, Alex Williamson, Cornelia Huck,
	Thomas Huth, Eric Farman, Matthew Rosato, Eric Blake,
	Markus Armbruster, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Cleber Rosa, Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

Add an avocado test for the vfio subsystem. The test uses pci-testdev
as the source PCI device for VFIO, so it is supposed to be launched
inside a QEMU guest that was started with '-device pci-testdev'.

The test creates a VFIO container and group, launches QEMU, and passes
the container/group/device to it.
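The sysfs matching the test uses to locate pci-testdev instances can be sketched standalone. `find_testdevs` is a hypothetical helper mirroring the test's logic; it is exercised here against a fake sysfs tree, since a real one needs a guest with pci-testdev:

```python
import os

PCI_VENDOR_ID = 0x1b36   # Red Hat / QEMU vendor ID
PCI_DEV_ID = 0x0005      # pci-testdev

def find_testdevs(sysfs_root="/sys/bus/pci/devices"):
    # Collect PCI addresses whose sysfs vendor/device files match pci-testdev.
    found = []
    for dev in sorted(os.listdir(sysfs_root)):
        base = os.path.join(sysfs_root, dev)
        with open(os.path.join(base, "vendor")) as f:
            if f.read() != "0x%x\n" % PCI_VENDOR_ID:
                continue
        with open(os.path.join(base, "device")) as f:
            if f.read() != "0x%.4x\n" % PCI_DEV_ID:
                continue
        found.append(dev)
    return found
```

Against a real guest, passing no argument scans the actual sysfs tree.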

Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
 tests/avocado/vfio.py | 156 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)
 create mode 100644 tests/avocado/vfio.py

diff --git a/tests/avocado/vfio.py b/tests/avocado/vfio.py
new file mode 100644
index 00000000000..fa0d2b65dc7
--- /dev/null
+++ b/tests/avocado/vfio.py
@@ -0,0 +1,156 @@
+# tests for QEMU's vfio subsystem
+#
+# Copyright (c) 2022 Yandex N.V.
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or later.
+# See the COPYING file in the top-level directory.
+
+from avocado import skipUnless
+from avocado_qemu import QemuSystemTest
+from avocado_qemu import run_cmd
+import os
+from fcntl import ioctl
+from ctypes import Structure, c_uint32, c_int, pointer
+
+
+VFIO_CMD_PREFIX = ord(';') << (4*2)
+VFIO_GET_API_VERSION = VFIO_CMD_PREFIX | 100
+VFIO_CHECK_EXTENSION = VFIO_CMD_PREFIX | 101
+VFIO_SET_IOMMU = VFIO_CMD_PREFIX | 102
+VFIO_GROUP_GET_STATUS = VFIO_CMD_PREFIX | 103
+VFIO_GROUP_SET_CONTAINER = VFIO_CMD_PREFIX | 104
+VFIO_GROUP_GET_DEVICE_FD = VFIO_CMD_PREFIX | 106
+
+VFIO_TYPE1_IOMMU = 1
+VFIO_SPAPR_TCE_IOMMU = 2
+VFIO_TYPE1v2_IOMMU = 3
+VFIO_SPAPR_TCE_v2_IOMMU = 7
+
+VFIO_API_VERSION = 0
+PCI_VENDOR_ID = 0x1b36
+PCI_DEV_ID = 0x0005
+
+class vfio_group_status(Structure):
+    _fields_ = [("argsz", c_uint32),
+                ("flags", c_uint32)]
+
+class vfio_container(object):
+    def open(self):
+        self.container_fd = os.open("/dev/vfio/vfio", os.O_RDWR)
+        if ioctl(self.container_fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION:
+            raise Exception("VFIO_GET_API_VERSION: unexpected vfio api version")
+        # Probe IOMMU models in order of preference and keep the first hit.
+        self.iommu_type = None
+        for iommu_type in [VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU]:
+            if ioctl(self.container_fd, VFIO_CHECK_EXTENSION, iommu_type):
+                self.iommu_type = iommu_type
+                break
+        if self.iommu_type is None:
+            raise Exception("No available IOMMU models")
+
+    def set_iommu(self):
+        ioctl(self.container_fd, VFIO_SET_IOMMU, self.iommu_type)
+
+class vfio_group(object):
+    def __init__(self, container, group_num):
+        self.ct = container
+        self.group_num = group_num
+    def open(self):
+        self.group_fd = os.open("/dev/vfio/%d" % self.group_num, os.O_RDWR)
+        status = vfio_group_status(0, 0)
+        ioctl(self.group_fd, VFIO_GROUP_GET_STATUS, pointer(status))
+        ioctl(self.group_fd, VFIO_GROUP_SET_CONTAINER,
+              c_int(self.ct.container_fd))
+
+class vfio_device(object):
+    def __init__(self, dev, group):
+        self.dev = dev
+        self.group = group
+
+def pci_testdev_exists():
+    for dev in next(os.walk('/sys/bus/pci/devices'))[1]:
+        with open("/sys/bus/pci/devices/%s/vendor" % dev) as f:
+            if f.read() != '0x%x\n' % PCI_VENDOR_ID:
+                continue
+        with open("/sys/bus/pci/devices/%s/device" % dev) as f:
+            if f.read() != '0x%.4x\n' % PCI_DEV_ID:
+                continue
+        return True
+    return False
+
+class VFIOIOMMUTest(QemuSystemTest):
+    timeout = 900
+
+    def add_group(self, dev):
+        tmp = os.readlink("/sys/bus/pci/devices/%s/iommu_group" % dev)
+        group_num = int(tmp.split('/')[-1])
+
+        for g in self.groups:
+            if g.group_num == group_num:
+                return g
+        group = vfio_group(self.ct, group_num)
+        self.groups.append(group)
+        return group
+
+    def setUp(self):
+        super().setUp()
+        # Keep state per test instance; class-level mutable defaults
+        # would be shared between test instances.
+        self.devices = []
+        self.groups = []
+        self.ct = vfio_container()
+        run_cmd(('modprobe', 'vfio-pci'))
+        try:
+            with open("/sys/bus/pci/drivers/vfio-pci/new_id", "a") as f:
+                f.write("%x %x" % (PCI_VENDOR_ID, PCI_DEV_ID))
+        except (PermissionError, FileExistsError):
+            pass
+        for dev in next(os.walk('/sys/bus/pci/devices'))[1]:
+            with open("/sys/bus/pci/devices/%s/vendor" % dev) as f:
+                if f.read() != '0x%x\n' % PCI_VENDOR_ID:
+                    continue
+            with open("/sys/bus/pci/devices/%s/device" % dev) as f:
+                if f.read() != '0x%.4x\n' % PCI_DEV_ID:
+                    continue
+
+            self.devices.append(vfio_device(dev, self.add_group(dev)))
+
+    def open_dev_fds(self):
+        self.ct.open()
+        for group in self.groups:
+            group.open()
+        self.ct.set_iommu()
+
+    def close_fds(self):
+        for group in self.groups:
+            os.close(group.group_fd)
+        os.close(self.ct.container_fd)
+
+    def hotplug_devices(self, vm):
+        vm._qmp.send_fd_scm(self.devices[0].group.ct.container_fd)
+        vm.command("getfd", fdname="ct")
+        vm.command("object-add", qom_type="vfio-container", id="ct", fd="ct")
+
+        for group in self.groups:
+            vm._qmp.send_fd_scm(group.group_fd)
+            vm.command("getfd", fdname="group_fd")
+            vm.command("object-add", qom_type="vfio-group",
+                       id="group%d" % group.group_num,
+                       fd="group_fd", container="ct")
+
+        for i in range(len(self.devices)):
+            vm.command("device_add", driver="vfio-pci",
+                       host=self.devices[i].dev, id="dev%d" % i,
+                       group="group%d" % self.devices[i].group.group_num)
+
+    @skipUnless(pci_testdev_exists(), "no pci-testdev found")
+    def test_vfio(self):
+        self.open_dev_fds()
+        self.vm.add_args('-nodefaults')
+        self.vm.launch()
+        self.hotplug_devices(self.vm)
+        self.close_fds()
-- 
2.37.3
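For reference, the ioctl request numbers the test hard-codes follow the kernel's _IO(VFIO_TYPE, VFIO_BASE + nr) encoding from linux/vfio.h, where VFIO_TYPE is ';' and VFIO_BASE is 100. Since VFIO ioctls encode no argument size, _IO reduces to (type << 8) | nr. A quick sketch:

```python
VFIO_TYPE = ord(';')   # 0x3b, from linux/vfio.h
VFIO_BASE = 100

def vfio_io(nr):
    # _IO(type, nr) with dir == _IOC_NONE and size == 0.
    return (VFIO_TYPE << 8) | (VFIO_BASE + nr)

VFIO_GET_API_VERSION     = vfio_io(0)   # 0x3b64
VFIO_CHECK_EXTENSION     = vfio_io(1)
VFIO_SET_IOMMU           = vfio_io(2)
VFIO_GROUP_GET_STATUS    = vfio_io(3)
VFIO_GROUP_SET_CONTAINER = vfio_io(4)
VFIO_GROUP_GET_DEVICE_FD = vfio_io(6)

print(hex(VFIO_GET_API_VERSION))   # -> 0x3b64
```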



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
  2022-10-17 10:54 [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Andrey Ryabinin
                   ` (3 preceding siblings ...)
  2022-10-17 10:54 ` [PATCH 4/4] tests/avocado/vfio: add test for vfio devices Andrey Ryabinin
@ 2022-10-17 11:05 ` Daniel P. Berrangé
  2022-10-26 10:44   ` Andrey Ryabinin
  2022-10-17 15:21 ` Alex Williamson
  5 siblings, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2022-10-17 11:05 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: qemu-devel, Steve Sistare, yc-core, Tony Krowiak, Halil Pasic,
	Jason Herne, Cornelia Huck, Thomas Huth, Alex Williamson,
	Eric Farman, Matthew Rosato, Paolo Bonzini, Eduardo Habkost,
	Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
> These patches add the possibility to pass a VFIO device to QEMU using
> file descriptors of the VFIO container/group, instead of having QEMU
> create those. This allows taking away the permission to open
> /dev/vfio/* from QEMU and delegating that to a management layer like
> libvirt.
> 
> The VFIO API doesn't allow passing just the fd of the device, since we
> also need a VFIO container and group. So these patches allow passing a
> created VFIO container/group to QEMU via command line/QMP, e.g. like this:
>             -object vfio-container,id=ct,fd=5 \
>             -object vfio-group,id=grp,fd=6,container=ct \
>             -device vfio-pci,host=05:00.0,group=grp
> 
> A bit more detailed example can be found in the test:
> tests/avocado/vfio.py
> 
>  *Possible future steps*
> 
> Also these patches could be a step for making local migration (within one host)
> of the QEMU with VFIO devices.
> I've built some prototype on top of these patches to try such idea.
> In short, the scheme of such a migration is as follows:
>  - migrate source VM to file.
>  - retrieve fd numbers of VFIO container/group/device via new property and qom-get command
>  - get the actual file descriptor via SCM_RIGHTS using new qmp command 'returnfd' which
>    sends fd from QEMU by the number: { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shutdown source VM
>  - launch destination VM, plug VFIO devices using obtained file descriptors.
>  - PCI device reset during plugging of the device is avoided with the
>    help of a new parameter on the vfio-pci device.

Is there a restriction by VFIO on how many processes can have the FD
open concurrently? I guess concurrent opens must be allowed, as with
SCM_RIGHTS both src QEMU and libvirt will have the FD open concurrently
for at least a short period, since you can't atomically close the FD at
the exact same time as SCM_RIGHTS sends it.
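The handoff being discussed can be sketched with Python's SCM_RIGHTS helpers. A minimal standalone sketch, not QEMU/libvirt code (socket.send_fds/recv_fds need Python >= 3.9):

```python
import os
import socket

def send_fd(sock, fd):
    # One byte of payload carries the descriptor as SCM_RIGHTS ancillary data.
    socket.send_fds(sock, [b"F"], [fd])

def recv_fd(sock):
    # The received fd is a new descriptor referring to the same open file.
    msg, fds, flags, addr = socket.recv_fds(sock, 1, 1)
    return fds[0]

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
send_fd(parent, r)       # "source" side hands over the read end
dup = recv_fd(child)     # "destination" side receives a duplicate
os.close(r)              # sender may close; the passed fd stays valid
os.write(w, b"hello")
os.close(w)
print(os.read(dup, 5).decode())   # -> hello
```

Note that between send_fd() and the sender's close(), both processes hold the descriptor open, which is exactly the window described above.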

With migration it is *highly* desirable to never stop the source VM's
QEMU until the new QEMU has completed migration and got its vCPUs
running, in order to have the best chance of a successful rollback upon
failure.

So assuming both QEMUs can have the FD open, provided they don't
both operate on it concurrently, could src QEMU just pass the FDs
to the target QEMU as part of the migration stream? E.g. use a UNIX
socket between the two QEMUs, and SCM_RIGHTS to pass the FDs across,
avoiding libvirt needing to be in the middle of the FD-passing
dance. Since the target QEMU gets the FDs as part of the migration
stream, it would inherently know that it should skip device reset
in that flow, without requiring any new param.


> This is alternative to 'cpr-exec' migration scheme proposed here:
>    https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
> Unlike cpr-exec it doesn't require new kernel flags VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR
> And doesn't require new migration mode, just some additional steps from management layer.

Avoiding creating a whole new set of mgmt commands in QMP does
make this appealing as an option instead of cpr-exec. If we can
layer FD passing into the migration stream too, that'd be even
more compelling IMHO.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] vfio: add vfio-group user createable object
  2022-10-17 10:54 ` [PATCH 2/4] vfio: add vfio-group " Andrey Ryabinin
@ 2022-10-17 12:37   ` Markus Armbruster
  0 siblings, 0 replies; 11+ messages in thread
From: Markus Armbruster @ 2022-10-17 12:37 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: qemu-devel, Steve Sistare, yc-core, Cornelia Huck, Thomas Huth,
	Alex Williamson, Tony Krowiak, Halil Pasic, Jason Herne,
	Eric Farman, Matthew Rosato, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Cleber Rosa, Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

Andrey Ryabinin <arbn@yandex-team.com> writes:

> Add vfio-group type and allow user to create such object via
> '-object' command line argument or 'object-add' qmp command.
> Parameters are:
>  - @fd - file descriptor
>  - @container - id of vfio-container object which will be used for
>         this VFIO group
>  - @groupid - number representing the IOMMU group (not needed if @fd
>               and @container were provided)
> E.g.:
>      -object vfio-container,id=ct,fd=5 \
>      -object vfio-group,id=group,fd=6,container=ct
>
> Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
> ---
>  hw/vfio/common.c              | 267 +++++++++++++++++++++++-----------
>  hw/vfio/trace-events          |   2 +-
>  include/hw/vfio/vfio-common.h |   4 +
>  qapi/qom.json                 |  15 ++
>  4 files changed, 205 insertions(+), 83 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 392057d3025..95722ecf96a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1911,31 +1911,40 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>                                 Error **errp)
>  {
>      int iommu_type, ret;
> +    struct vfio_group_status status = { .argsz = sizeof(status) };
>  
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
>      }
>  
> -    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
> +
> +    ret = ioctl(group_fd, VFIO_GROUP_GET_STATUS, &status);
>      if (ret) {
> -        error_setg_errno(errp, errno, "Failed to set group container");
> +        error_setg_errno(errp, errno, "Failed to get group status");
>          return -errno;
>      }
> -
> -    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
> -        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -            /*
> -             * On sPAPR, despite the IOMMU subdriver always advertises v1 and
> -             * v2, the running platform may not support v2 and there is no
> -             * way to guess it until an IOMMU group gets added to the container.
> -             * So in case it fails with v2, try v1 as a fallback.
> -             */
> -            iommu_type = VFIO_SPAPR_TCE_IOMMU;
> -            continue;
> +    if (!(status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
> +        ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to set group container");
> +            return -errno;
> +        }
> +
> +        while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
> +            if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +                /*
> +                 * On sPAPR, despite the IOMMU subdriver always advertises v1 and
> +                 * v2, the running platform may not support v2 and there is no
> +                 * way to guess it until an IOMMU group gets added to the container.
> +                 * So in case it fails with v2, try v1 as a fallback.
> +                 */
> +                iommu_type = VFIO_SPAPR_TCE_IOMMU;
> +                continue;
> +            }
> +            error_setg_errno(errp, errno, "Failed to set iommu for container");
> +            return -errno;
>          }
> -        error_setg_errno(errp, errno, "Failed to set iommu for container");
> -        return -errno;
>      }
>  
>      container->iommu_type = iommu_type;
> @@ -2050,34 +2059,44 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * with some IOMMU types. vfio_ram_block_discard_disable() handles the
>       * details once we know which type of IOMMU we are using.
>       */
> -
> -    QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            ret = vfio_ram_block_discard_disable(container, true);
> -            if (ret) {
> -                error_setg_errno(errp, -ret,
> -                                 "Cannot set discarding of RAM broken");
> -                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> -                          &container->fd)) {
> -                    error_report("vfio: error disconnecting group %d from"
> -                                 " container", group->groupid);
> +    if (!group->container) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +                ret = vfio_ram_block_discard_disable(container, true);
> +                if (ret) {
> +                    error_setg_errno(errp, -ret,
> +                                     "Cannot set discarding of RAM broken");
> +                    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> +                              &container->fd)) {
> +                        error_report("vfio: error disconnecting group %d from"
> +                                     " container", group->groupid);
> +                    }
> +                    return ret;
>                  }
> -                return ret;
> +                group->container = container;
> +                QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +                vfio_kvm_device_add_group(group);
> +                return 0;
>              }
> -            group->container = container;
> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -            vfio_kvm_device_add_group(group);
> -            return 0;
>          }
> -    }
> +        container = VFIO_CONTAINER(object_new(TYPE_VFIO_CONTAINER));
> +        container->space = space;
>  
> -    container = VFIO_CONTAINER(object_new(TYPE_VFIO_CONTAINER));
> -    container->space = space;
> -
> -    user_creatable_complete(USER_CREATABLE(container), errp);
> -    if (*errp) {
> -        ret = -1;
> -        goto free_container_exit;
> +        user_creatable_complete(USER_CREATABLE(container), errp);
> +        if (*errp) {
> +            ret = -1;
> +            goto free_container_exit;
> +        }
> +        group->container = container;
> +    } else if (group->container->initialized) {
> +        object_ref(OBJECT(group->container));
> +        QLIST_INSERT_HEAD(&group->container->group_list, group, container_next);
> +        vfio_kvm_device_add_group(group);
> +        return 0;
> +    } else {
> +        container = group->container;
> +        container->space = space;
> +        object_ref(OBJECT(container));
>      }
>  
>      ret = vfio_init_container(container, group->fd, errp);
> @@ -2228,6 +2247,10 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  {
>      VFIOContainer *container = group->container;
>  
> +    if (!group->container) {
> +            return;
> +    }
> +
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
>  
> @@ -2251,7 +2274,6 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  {
>      VFIOGroup *group;
> -    char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
> @@ -2267,31 +2289,14 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>          }
>      }
>  
> -    group = g_malloc0(sizeof(*group));
> -
> -    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> -    if (group->fd < 0) {
> -        error_setg_errno(errp, errno, "failed to open %s", path);
> -        goto free_group_exit;
> -    }
> -
> -    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
> -        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
> -        goto close_fd_exit;
> -    }
> -
> -    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> -        error_setg(errp, "group %d is not viable", groupid);
> -        error_append_hint(errp,
> -                          "Please ensure all devices within the iommu_group "
> -                          "are bound to their vfio bus driver.\n");
> -        goto close_fd_exit;
> +    group = VFIO_GROUP(object_new(TYPE_VFIO_GROUP));
> +    object_property_set_int(OBJECT(group), "groupid", groupid, errp);
> +    user_creatable_complete(USER_CREATABLE(group), errp);
> +    if (*errp) {
> +        object_unref(OBJECT(group));
> +        return NULL;
>      }
>  
> -    group->groupid = groupid;
> -    QLIST_INIT(&group->device_list);
> -
>      if (vfio_connect_container(group, as, errp)) {
>          error_prepend(errp, "failed to setup container for group %d: ",
>                        groupid);
> @@ -2302,15 +2307,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>          qemu_register_reset(vfio_reset_handler, NULL);
>      }
>  
> -    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
> -
>      return group;
>  
>  close_fd_exit:
> -    close(group->fd);
> -
> -free_group_exit:
> -    g_free(group);
> +    object_unref(OBJECT(group));
>  
>      return NULL;
>  }
> @@ -2321,19 +2321,7 @@ void vfio_put_group(VFIOGroup *group)
>          return;
>      }
>  
> -    if (!group->ram_block_discard_allowed) {
> -        vfio_ram_block_discard_disable(group->container, false);
> -    }
> -    vfio_kvm_device_del_group(group);
> -    vfio_disconnect_container(group);
> -    QLIST_REMOVE(group, next);
> -    trace_vfio_put_group(group->fd);
> -    close(group->fd);
> -    g_free(group);
> -
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> -        qemu_unregister_reset(vfio_reset_handler, NULL);
> -    }
> +    object_unref(OBJECT(group));
>  }
>  
>  int vfio_get_device(VFIOGroup *group, const char *name,

Like in the previous patch, you do more than the commit message
says.  Could you split it off into a separate commit?

> @@ -2676,8 +2664,123 @@ static const TypeInfo vfio_container_info = {
>      },
>  };
>  
> +static void vfio_group_set_fd(Object *obj, const char *value,
> +                              Error **errp)
> +{
> +    VFIOGroup *group = VFIO_GROUP(obj);
> +
> +    group->fd = monitor_fd_param(monitor_cur(), value, errp);
> +}
> +
> +static void vfio_group_set_groupid(Object *obj, Visitor *v,
> +                                   const char *name, void *opaque,
> +                                   Error **errp)
> +{
> +    VFIOGroup *group = VFIO_GROUP(obj);
> +    Error *error = NULL;
> +    uint32_t groupid;
> +
> +    visit_type_uint32(v, name, &groupid, &error);
> +    if (error) {
> +        error_propagate(errp, error);
> +        return;
> +    }
> +
> +    group->groupid = groupid;
> +}
> +
> +static void vfio_group_complete(UserCreatable *uc, Error **errp)
> +{
> +    VFIOGroup *group = VFIO_GROUP(uc);
> +    struct vfio_group_status status = { .argsz = sizeof(status) };
> +
> +    if (group->fd < 0 && group->groupid >= 0) {
> +        char path[32];
> +
> +        snprintf(path, sizeof(path), "/dev/vfio/%d", group->groupid);
> +
> +        group->fd = qemu_open_old(path, O_RDWR);
> +        if (group->fd < 0) {
> +            error_setg_errno(errp, errno, "failed to open %s", path);
> +            return;
> +        }
> +    }
> +
> +    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
> +        error_setg_errno(errp, errno, "failed to get group %d status", group->groupid);
> +        return;
> +    }
> +
> +    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> +        error_setg(errp, "group %d is not viable", group->groupid);
> +        error_append_hint(errp,
> +                          "Please ensure all devices within the iommu_group "
> +                          "are bound to their vfio bus driver.\n");
> +    }
> +}
> +
> +static void vfio_group_class_init(ObjectClass *class, void *data)
> +{
> +    UserCreatableClass *ucc = USER_CREATABLE_CLASS(class);
> +    ucc->complete = vfio_group_complete;
> +
> +    object_class_property_add_link(class, "container", TYPE_VFIO_CONTAINER,
> +                                   offsetof(VFIOGroup, container),
> +                                   object_property_allow_set_link, 0);
> +    object_class_property_add_str(class, "fd", NULL, vfio_group_set_fd);
> +    object_class_property_add(class, "groupid", "int", NULL,
> +                              vfio_group_set_groupid,
> +                              NULL, NULL);
> +}
> +
> +static void vfio_group_instance_init(Object *obj)
> +{
> +    VFIOGroup *group = VFIO_GROUP(obj);
> +
> +    QLIST_INIT(&group->device_list);
> +    group->fd = -1;
> +    group->groupid = -1;
> +    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
> +}
> +
> +static void
> +vfio_group_instance_finalize(Object *obj)
> +{
> +    VFIOGroup *group = VFIO_GROUP(obj);
> +
> +    if (!group->ram_block_discard_allowed) {
> +        vfio_ram_block_discard_disable(group->container, false);
> +    }
> +
> +    vfio_kvm_device_del_group(group);
> +    vfio_disconnect_container(group);
> +    QLIST_REMOVE(group, next);
> +    trace_vfio_group_instance_finalize(group->fd);
> +    if (group->fd >= 0) {
> +        close(group->fd);
> +    }
> +
> +    if (QLIST_EMPTY(&vfio_group_list)) {
> +        qemu_unregister_reset(vfio_reset_handler, NULL);
> +    }
> +}
> +
> +static const TypeInfo vfio_group_info = {
> +    .name = TYPE_VFIO_GROUP,
> +    .parent = TYPE_OBJECT,
> +    .class_init = vfio_group_class_init,
> +    .instance_size = sizeof(VFIOGroup),
> +    .instance_init = vfio_group_instance_init,
> +    .instance_finalize = vfio_group_instance_finalize,
> +    .interfaces = (InterfaceInfo[]) {
> +        {TYPE_USER_CREATABLE},
> +        {}
> +    },
> +};
> +
>  static void register_vfio_types(void)
>  {
>      type_register_static(&vfio_container_info);
> +    type_register_static(&vfio_group_info);
>  }
>  type_init(register_vfio_types)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8b79cf33a33..6ae0ed09acd 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -105,7 +105,7 @@ vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t si
>  vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
>  vfio_container_instance_finalize(int fd) "close container->fd=%d"
> -vfio_put_group(int fd) "close group->fd=%d"
> +vfio_group_instance_finalize(int fd) "close group->fd=%d"
>  vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
>  vfio_put_base_device(int fd) "close vdev->fd=%d"
>  vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0ab99060e44..f2d67093f44 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -156,6 +156,7 @@ struct VFIODeviceOps {
>  };
>  
>  typedef struct VFIOGroup {
> +    Object parent;
>      int fd;
>      int groupid;
>      VFIOContainer *container;
> @@ -194,6 +195,9 @@ typedef struct VFIODisplay {
>  #define TYPE_VFIO_CONTAINER "vfio-container"
>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_CONTAINER)
>  
> +#define TYPE_VFIO_GROUP "vfio-group"
> +OBJECT_DECLARE_SIMPLE_TYPE(VFIOGroup, VFIO_GROUP)
> +
>  void vfio_put_base_device(VFIODevice *vbasedev);
>  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);

This morphs struct VFIOGroup into a QOM object.

> diff --git a/qapi/qom.json b/qapi/qom.json
> index d1a88e10b52..f46dd6b8034 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -746,6 +746,19 @@
>  { 'struct': 'VFIOContainerProperties',
>    'data': { 'fd': 'str' } }
>  
> +##
> +# @VFIOGroupProperties:
> +#
> +# Properties for vfio-group objects.
> +#
> +# @fd: file descriptor of vfio group
> +# @container: container

Try again.

If you think this review comment is too terse to be useful, you're
right!  So is your doc comment ;)

> +#
> +# Since: 7.2
> +##
> +{ 'struct': 'VFIOGroupProperties',
> +  'data': { 'fd': 'str', 'container': 'str'} }
> +
>  ##
>  # @VfioUserServerProperties:
>  #
> @@ -901,6 +914,7 @@
>      'tls-creds-x509',
>      'tls-cipher-suites',
>      'vfio-container',
> +    'vfio-group',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
>      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
>    ] }
> @@ -967,6 +981,7 @@
>        'tls-creds-x509':             'TlsCredsX509Properties',
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'vfio-container':             'VFIOContainerProperties',
> +      'vfio-group':                 'VFIOGroupProperties',
>        'x-remote-object':            'RemoteObjectProperties',
>        'x-vfio-user-server':         'VfioUserServerProperties'
>    } }
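For what it's worth, a fuller doc comment along the lines the review asks for might read like this (hypothetical wording, not part of the posted patch):

```
##
# @VFIOGroupProperties:
#
# Properties for vfio-group objects.
#
# @fd: file descriptor name or number of an already opened VFIO group
#      (/dev/vfio/$GROUP), as accepted by the 'getfd' QMP command or
#      the command line
#
# @container: ID of the vfio-container object this group is to be
#             attached to
#
# Since: 7.2
##
```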



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
  2022-10-17 10:54 [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Andrey Ryabinin
                   ` (4 preceding siblings ...)
  2022-10-17 11:05 ` [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Daniel P. Berrangé
@ 2022-10-17 15:21 ` Alex Williamson
  2022-10-26 12:07   ` Andrey Ryabinin
  5 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2022-10-17 15:21 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: qemu-devel, Steve Sistare, yc-core, Tony Krowiak, Halil Pasic,
	Jason Herne, Cornelia Huck, Thomas Huth, Eric Farman,
	Matthew Rosato, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

On Mon, 17 Oct 2022 13:54:03 +0300
Andrey Ryabinin <arbn@yandex-team.com> wrote:

> These patches add the possibility to pass a VFIO device to QEMU using file
> descriptors of the VFIO container/group, instead of having QEMU create those.
> This allows taking away permission to open /dev/vfio/* from QEMU and
> delegating that to a management layer like libvirt.
> 
> The VFIO API doesn't allow passing just a device fd, since we also need to have
> the VFIO container and group. So these patches allow passing a pre-created VFIO
> container/group to QEMU via command line/QMP, e.g. like this:
>             -object vfio-container,id=ct,fd=5 \
>             -object vfio-group,id=grp,fd=6,container=ct \
>             -device vfio-pci,host=05:00.0,group=grp

This suggests that management tools need to become intimately familiar
with container and group association restrictions for implicit
dependencies, such as device AddressSpace.  We had considered this
before and intentionally chosen to allow QEMU to manage that
relationship.  Things like PCI bus type and presence of a vIOMMU factor
into these relationships.

In the above example, what happens in a mixed environment, for example
if we then add '-device vfio-pci,host=06:00.0' to the command line?
Isn't QEMU still going to try to re-use the container if it exists in
the same address space?  Potentially this device could also be a member
of the same group.  How would the management tool know when to expect
the provided fds be released?

We also have an outstanding RFC for iommufd that already proposes an fd
passing interface, where iommufd removes many of the issues of the vfio
container by supporting multiple address spaces within a single fd
context, avoiding the duplicate locked page accounting issues between
containers, and proposing a direct device fd interface for vfio.  Why at
this point in time would we choose to expand the QEMU vfio interface in
this way?  Thanks,

Alex

> A bit more detailed example can be found in the test:
> tests/avocado/vfio.py
> 
>  *Possible future steps*
> 
> These patches could also be a step toward local migration (within one host)
> of a QEMU with VFIO devices.
> I've built a prototype on top of these patches to try this idea.
> In short, the scheme of such a migration is the following:
>  - migrate source VM to file.
>  - retrieve fd numbers of VFIO container/group/device via new property and qom-get command
>  - get the actual file descriptor via SCM_RIGHTS using new qmp command 'returnfd' which
>    sends fd from QEMU by the number: { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shutdown source VM
>  - launch destination VM, plug VFIO devices using obtained file descriptors.
>  - PCI device reset during plugging is avoided with the help of a new parameter
>     on the vfio-pci device.
> This is an alternative to the 'cpr-exec' migration scheme proposed here:
>    https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sistare@oracle.com/
> Unlike cpr-exec, it doesn't require the new kernel flags VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR
> and doesn't require a new migration mode, just some additional steps from the management layer.
> 
> 
> Andrey Ryabinin (4):
>   vfio: add vfio-container user createable object
>   vfio: add vfio-group user createable object
>   vfio: Add 'group' property to 'vfio-pci' device
>   tests/avocado/vfio: add test for vfio devices
> 
>  hw/vfio/ap.c                  |   2 +-
>  hw/vfio/ccw.c                 |   2 +-
>  hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/platform.c            |   2 +-
>  hw/vfio/trace-events          |   4 +-
>  include/hw/vfio/vfio-common.h |  10 +-
>  qapi/qom.json                 |  29 +++
>  tests/avocado/vfio.py         | 156 +++++++++++
>  9 files changed, 525 insertions(+), 161 deletions(-)
>  create mode 100644 tests/avocado/vfio.py
> 




* Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
  2022-10-17 11:05 ` [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU Daniel P. Berrangé
@ 2022-10-26 10:44   ` Andrey Ryabinin
  0 siblings, 0 replies; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-26 10:44 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Steve Sistare, yc-core, Tony Krowiak, Halil Pasic,
	Jason Herne, Cornelia Huck, Thomas Huth, Alex Williamson,
	Eric Farman, Matthew Rosato, Paolo Bonzini, Eduardo Habkost,
	Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

On 10/17/22 14:05, Daniel P. Berrangé wrote:
> On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
>> These patches add the possibility to pass a VFIO device to QEMU using file
>> descriptors of the VFIO container/group, instead of having QEMU create those.
>> This allows taking away permission to open /dev/vfio/* from QEMU and
>> delegating that to a management layer like libvirt.
>>
>> The VFIO API doesn't allow passing just a device fd, since we also need to have
>> the VFIO container and group. So these patches allow passing a pre-created VFIO
>> container/group to QEMU via command line/QMP, e.g. like this:
>>             -object vfio-container,id=ct,fd=5 \
>>             -object vfio-group,id=grp,fd=6,container=ct \
>>             -device vfio-pci,host=05:00.0,group=grp
>>
>> A bit more detailed example can be found in the test:
>> tests/avocado/vfio.py
>>
>>  *Possible future steps*
>>
>> These patches could also be a step toward local migration (within one host)
>> of a QEMU with VFIO devices.
>> I've built a prototype on top of these patches to try this idea.
>> In short, the scheme of such a migration is the following:
>>  - migrate source VM to file.
>>  - retrieve fd numbers of VFIO container/group/device via new property and qom-get command
>>  - get the actual file descriptor via SCM_RIGHTS using new qmp command 'returnfd' which
>>    sends fd from QEMU by the number: { 'command': 'returnfd', 'data': {'fd': 'int'}}
>>  - shutdown source VM
>>  - launch destination VM, plug VFIO devices using obtained file descriptors.
>>  - PCI device reset during plugging is avoided with the help of a new parameter
>>     on the vfio-pci device.
> 
> Is there a restriction by VFIO on how many processes can have the FD
> open concurrently? I guess there must be, as with SCM_RIGHTS, both src
> QEMU and libvirt will have the FD open concurrently for at least a
> short period, as you can't atomically close the FD at the exact same
> time as SCM_RIGHTS sends it.
> 

There is no such restriction. Several open descriptors are what allow us to survive
PCI device reset; the kernel resets the device on the first open.
Obviously we shouldn't use these descriptors concurrently, and in many cases can't
(ioctl()s will fail), but there is no problem with simply saving/passing an FD between processes.
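The saving/passing step can be sketched with Python's standard-library SCM_RIGHTS helpers, `socket.send_fds()`/`socket.recv_fds()` (Python 3.9+). This is only an illustration of the mechanism; the helper names are mine, not the prototype's actual code:

```python
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # SCM_RIGHTS installs a duplicate of the descriptor in the
    # receiver; both processes then refer to the same open file
    # description, so the device is not re-opened (and not reset).
    socket.send_fds(sock, [b"fd"], [fd])

def recv_fd(sock: socket.socket) -> int:
    # Returns the (new) descriptor number in the receiving process.
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return fds[0]
```

Both endpoints of an AF_UNIX socket (e.g. libvirt and QEMU) can hold the fd concurrently for a short period, which is exactly the situation discussed above.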


> With migration it is *highly* desirable to never stop the source VM's
> QEMU until the new QEMU has completed migration and got its vCPUs
> running, in order to have the best chance of successful rollback upon
> failure.
> 
> So assuming both QEMU's can have the FD open, provided they don't
> both concurrently operate on it, could src QEMU just pass the FDs
> to the target QEMU as part of the migration stream. eg use a UNIX
> socket between the 2 QEMUs, and SCM_RIGHTS to pass the FDs across,
> avoiding libvirt needing to be in the middle of the FD passing
> dance. Since target QEMU gets the FDs as part of the migration
> stream, it would inherantly know that it shold skip device reset
> in that flow, without requiring any new param.
> 

Yeah, I had a similar idea, but this would require a lot of rework of the VFIO initialization
phase in QEMU. The main problem here is that all initialization happens on device addition, which will fail
if the device is already used by another QEMU. I guess we would need to move lots of initialization to the
->post_load() hook.



* Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
  2022-10-17 15:21 ` Alex Williamson
@ 2022-10-26 12:07   ` Andrey Ryabinin
  2022-10-26 17:22     ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Ryabinin @ 2022-10-26 12:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Steve Sistare, yc-core, Tony Krowiak, Halil Pasic,
	Jason Herne, Cornelia Huck, Thomas Huth, Eric Farman,
	Matthew Rosato, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal



On 10/17/22 18:21, Alex Williamson wrote:
> On Mon, 17 Oct 2022 13:54:03 +0300
> Andrey Ryabinin <arbn@yandex-team.com> wrote:
> 
>> These patches add the possibility to pass a VFIO device to QEMU using file
>> descriptors of the VFIO container/group, instead of having QEMU create those.
>> This allows taking away permission to open /dev/vfio/* from QEMU and
>> delegating that to a management layer like libvirt.
>>
>> The VFIO API doesn't allow passing just a device fd, since we also need to have
>> the VFIO container and group. So these patches allow passing a pre-created VFIO
>> container/group to QEMU via command line/QMP, e.g. like this:
>>             -object vfio-container,id=ct,fd=5 \
>>             -object vfio-group,id=grp,fd=6,container=ct \
>>             -device vfio-pci,host=05:00.0,group=grp
> 
> This suggests that management tools need to become intimately familiar
> with container and group association restrictions for implicit
> dependencies, such as device AddressSpace.  We had considered this
> before and intentionally chosen to allow QEMU to manage that
> relationship.  Things like PCI bus type and presence of a vIOMMU factor
> into these relationships.
> 

This is already the case; these patches don't change much.
QEMU doesn't allow adding devices from one group to several address spaces.
So the management tool already needs to know whether devices are in the same group
and whether QEMU will create separate address spaces for these devices.

E.g.
qemu-system-x86_64 -nodefaults -M q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,intremap=on,caching-mode=on \
        -device vfio-pci,host=00:1f.3 \
        -device vfio-pci,host=00:1f.4 
qemu-system-x86_64: -device vfio-pci,host=00:1f.4: vfio 0000:00:1f.4: group 14 used in multiple address spaces
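The group membership behind this check is visible to a management tool via sysfs, so it can be validated before QEMU is even launched. The sysfs layout (`/sys/bus/pci/devices/<BDF>/iommu_group` is a symlink to `/sys/kernel/iommu_groups/<N>`) is the standard kernel interface; the helper names below are illustrative:

```python
import os

def iommu_group_of(bdf: str, sysfs: str = "/sys") -> str:
    # The iommu_group symlink points at /sys/kernel/iommu_groups/<N>;
    # the group number is the basename of the link target.
    link = os.readlink(f"{sysfs}/bus/pci/devices/{bdf}/iommu_group")
    return os.path.basename(link)

def devices_in_group(group: str, sysfs: str = "/sys") -> list[str]:
    # Every device in the group appears here, regardless of which
    # driver currently owns it.
    return sorted(os.listdir(f"{sysfs}/kernel/iommu_groups/{group}/devices"))
```

With this, a tool can refuse up front to split devices of one group across address spaces, rather than relying on the QEMU error above.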

> In the above example, what happens in a mixed environment, for example
> if we then add '-device vfio-pci,host=06:00.0' to the command line?
> Isn't QEMU still going to try to re-use the container if it exists in
> the same address space? Potentially this device could also be a member
> of the same group.  How would the management tool know when to expect
> the provided fds be released?
> 

Valid point, the container will indeed be reused and the second device will occupy it.
But we could create a new container instead. Using several containers in one address
space won't be a problem, right?
Of course, several devices from the same group won't be allowed to be added in a mixed way.


> We also have an outstanding RFC for iommufd that already proposes an fd
> passing interface, where iommufd removes many of the issues of the vfio
> container by supporting multiple address spaces within a single fd
> context, avoiding the duplicate locked page accounting issues between
> containers, and proposing a direct device fd interface for vfio.  Why at
> this point in time would we choose to expand the QEMU vfio interface in
> this way?  Thanks,
> 

It sounds nice, but iommufd is a new API which doesn't exist in any kernel yet.
These patches are something that can be used on existing, already deployed kernels.
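On such kernels, the management-layer side of this series could be as small as the following sketch. The fd numbers, group number 14, BDF, and the `build_qemu_cmd` helper are all illustrative, not anything the series mandates:

```python
import os
import subprocess

def build_qemu_cmd(qemu: str, container_fd: int, group_fd: int,
                   bdf: str) -> list[str]:
    # The fds keep their numbers across exec because they are listed
    # in pass_fds below; they must stay open in the parent until then.
    return [
        qemu,
        "-object", f"vfio-container,id=ct,fd={container_fd}",
        "-object", f"vfio-group,id=grp,fd={group_fd},container=ct",
        "-device", f"vfio-pci,host={bdf},group=grp",
    ]

# The launcher, not QEMU, needs permission to open /dev/vfio/*:
# container_fd = os.open("/dev/vfio/vfio", os.O_RDWR)
# group_fd = os.open("/dev/vfio/14", os.O_RDWR)
# cmd = build_qemu_cmd("qemu-system-x86_64", container_fd, group_fd, "05:00.0")
# subprocess.Popen(cmd, pass_fds=[container_fd, group_fd])
```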




* Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU
  2022-10-26 12:07   ` Andrey Ryabinin
@ 2022-10-26 17:22     ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2022-10-26 17:22 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: qemu-devel, Steve Sistare, yc-core, Tony Krowiak, Halil Pasic,
	Jason Herne, Cornelia Huck, Thomas Huth, Eric Farman,
	Matthew Rosato, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Eric Blake, Markus Armbruster, Cleber Rosa,
	Philippe Mathieu-Daudé,
	Wainer dos Santos Moschetta, Beraldo Leal

On Wed, 26 Oct 2022 15:07:32 +0300
Andrey Ryabinin <arbn@yandex-team.com> wrote:

> On 10/17/22 18:21, Alex Williamson wrote:
> > On Mon, 17 Oct 2022 13:54:03 +0300
> > Andrey Ryabinin <arbn@yandex-team.com> wrote:
> >   
> >> These patches add the possibility to pass a VFIO device to QEMU using file
> >> descriptors of the VFIO container/group, instead of having QEMU create those.
> >> This allows taking away permission to open /dev/vfio/* from QEMU and
> >> delegating that to a management layer like libvirt.
> >>
> >> The VFIO API doesn't allow passing just a device fd, since we also need to have
> >> the VFIO container and group. So these patches allow passing a pre-created VFIO
> >> container/group to QEMU via command line/QMP, e.g. like this:
> >>             -object vfio-container,id=ct,fd=5 \
> >>             -object vfio-group,id=grp,fd=6,container=ct \
> >>             -device vfio-pci,host=05:00.0,group=grp  
> > 
> > This suggests that management tools need to become intimately familiar
> > with container and group association restrictions for implicit
> > dependencies, such as device AddressSpace.  We had considered this
> > before and intentionally chosen to allow QEMU to manage that
> > relationship.  Things like PCI bus type and presence of a vIOMMU factor
> > into these relationships.
> >   
> 
> This is already the case; these patches don't change much.
> QEMU doesn't allow adding devices from one group to several address spaces.
> So the management tool already needs to know whether devices are in the same group
> and whether QEMU will create separate address spaces for these devices.
> 
> E.g.
> qemu-system-x86_64 -nodefaults -M q35,accel=kvm,kernel-irqchip=split \
>         -device intel-iommu,intremap=on,caching-mode=on \
>         -device vfio-pci,host=00:1f.3 \
>         -device vfio-pci,host=00:1f.4 
> qemu-system-x86_64: -device vfio-pci,host=00:1f.4: vfio 0000:00:1f.4: group 14 used in multiple address spaces

Obviously QEMU fails this configuration.  It must.  How does that
suggest that a management tool, like libvirt, is already aware of this
requirement?  In fact, libvirt will happily validate XML creating such
a configuration.  The point was that tools like libvirt would need to
provide these group and container file descriptors and they currently
impose no restrictions or working knowledge on the relationship between
devices, groups, containers, and address spaces.

> > In the above example, what happens in a mixed environment, for example
> > if we then add '-device vfio-pci,host=06:00.0' to the command line?
> > Isn't QEMU still going to try to re-use the container if it exists in
> > the same address space? Potentially this device could also be a member
> > of the same group.  How would the management tool know when to expect
> > the provided fds be released?
> >   
> 
> Valid point, the container will indeed be reused and the second device will occupy it.
> But we could create a new container instead. Using several containers in one address
> space won't be a problem, right?
> Of course, several devices from the same group won't be allowed to be added in a mixed way.

Potentially, yes, that is a problem.  Each container represents a
separate IOMMU context, separate DMA map and unmap operations, and
separate locked page accounting.  So if libvirt chooses the more
trivial solution to impose a new container for every group, that
translates to space, time, and process accounting overhead.

> > We also have an outstanding RFC for iommufd that already proposes an fd
> > passing interface, where iommufd removes many of the issues of the vfio
> > container by supporting multiple address spaces within a single fd
> > context, avoiding the duplicate locked page accounting issues between
> > containers, and proposing a direct device fd interface for vfio.  Why at
> > this point in time would we choose to expand the QEMU vfio interface in
> > this way?  Thanks,
> >   
> 
> It sounds nice, but iommufd is a new API which doesn't exist in any kernel yet.
> These patches are something that can be used on existing, already deployed kernels.

OTOH, we expect iommufd in the near term, non-RFC patches are posted.
The vfio kernel modules have undergone significant churn in recent
kernels to align with the development goals of iommufd.  QEMU support for
accepting file descriptors for "legacy" implementations of vfio is only
the beginning, where the next step would require the management tools
to be sufficiently enlightened to implement file descriptor passing.
All of that suggests development and maintenance effort for something
we're actively trying to replace.  Thanks,

Alex



